Gradient-based Optimization

  • Why is optimization important for machine learning? What is special about the machine learning context compared to general optimization?
    • Optimization is important because training a model amounts to finding the parameters that best fit the data, i.e., minimizing or maximizing an objective function. What is special in ML is that an approximate solution is often acceptable, since the problem usually does not require a high degree of accuracy; this opens the door to cheaper optimization strategies. Optimizing well is what makes ML algorithms computationally feasible on large datasets.

In parameter estimation we seek the best possible solution, so we need a way to optimize the search, which we typically formalize as minimizing or maximizing an objective function f.

💡 Examples of f for topics we already covered (written out below)

  • For MLE, f is the negative log-likelihood of the training data
  • For ERM f is the empirical risk
  • For MAP f is the negative log-posterior
  • For RRM f is the regularized risk
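
In generic notation (symbols chosen here for illustration, not taken from the original: N training points, loss L, prediction \hat y_i, regularizer \Omega with weight \lambda), these objectives can be written as:

f_{\text{MLE}}(\boldsymbol\theta) = -\sum_{i=1}^{N} \log p(\boldsymbol x_i \mid \boldsymbol\theta)
f_{\text{ERM}}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat y_i(\boldsymbol\theta)\big)
f_{\text{MAP}}(\boldsymbol\theta) = -\log p(\boldsymbol\theta \mid \mathcal{D})
f_{\text{RRM}}(\boldsymbol\theta) = f_{\text{ERM}}(\boldsymbol\theta) + \lambda\, \Omega(\boldsymbol\theta)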

We will now cover gradient-based methods, which are iterative methods divided into two categories:

  1. First-order methods, e.g., gradient descent or stochastic gradient descent.
  2. Second-order methods, e.g., Newton’s method or quasi-Newton methods, which improve on it by approximating the Hessian.

💡 Newton’s method relies on using both the gradient (first derivative) and the Hessian matrix (second derivative) of the objective function to iteratively update the solution. It can converge quickly but requires the computation of the Hessian, which can be expensive.

Gradient descent techniques

Continuous gradient descent (CGD)

CGD aims to find the global minimum point \boldsymbol x^* of a given function f. First, it selects a starting point \boldsymbol x_0, then it follows the negative gradient to “walk downhill” until it reaches the minimum point.

However, following this continuous path exactly is often impractical, so we approximate it with plain Gradient Descent (GD), which introduces the concept of a learning rate.
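
A common way to formalize this (assuming the usual gradient-flow view of CGD, with the notation above) is the differential equation

\frac{d\boldsymbol x(t)}{dt} = -\nabla f(\boldsymbol x(t)), \qquad \boldsymbol x(0) = \boldsymbol x_0

whose discretization with a finite step size gives the GD learning rule covered next.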

Gradient descent (GD)

GD is based on the same concept as CGD but takes discrete steps (“jumps”) instead of following a continuous path, thus making it more practical in large search spaces.

Gradient descent and logistic regression

GD is often used in logistic regression (LR) to minimize the negative log-likelihood.
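
As an illustration, here is a minimal sketch of full-batch GD for logistic regression (function and parameter names are my own, not from the original):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_logistic_regression(X, y, lr=0.1, n_epochs=100):
    """Full-batch GD on the mean negative log-likelihood.

    X: (n_samples, n_features) feature matrix, y: labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        p = sigmoid(X @ w)              # predicted probabilities
        grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
        w -= lr * grad                  # step along the negative gradient
    return w
```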

Additional terminology

A single gradient computation and the subsequent parameter update together constitute an epoch.

In the n-th epoch, we use the learning rule \boldsymbol x_{n+1} = \boldsymbol x_n - \varepsilon_n \nabla f(\boldsymbol x_n) for some learning rate \varepsilon_n > 0. In practice, parameters are adjusted by moving in the direction of the negative gradient, scaled by the learning rate, to minimize the objective function.

Stochastic gradient descent (SGD)

SGD is based on the same concept as GD, but it stochastically approximates the gradient and asymptotically approaches the same results as the batch algorithms. In practice, it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
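
For an objective of the form f(\boldsymbol x) = \frac{1}{N} \sum_{i=1}^{N} f_i(\boldsymbol x), with one term per training example (notation chosen here for illustration), the update in this spirit reads

\boldsymbol x_{n+1} = \boldsymbol x_n - \varepsilon_n \nabla f_i(\boldsymbol x_n)

where the index i is drawn at random, so that \nabla f_i is an unbiased estimate of the full gradient.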

SGD’s approach involves repeated parameter updates within an epoch. This makes SGD less deterministic and its updates potentially less precise, yet more efficient in practice, especially when dealing with large datasets. Ultimately, SGD’s many “quick and dirty” steps make it suitable for these scenarios.

In practice, we often set a large step size at the beginning of the run, allowing the algorithm to search the space quickly, and then reduce it over time.

In some cases, it might be useful to ensure the step size stays below a specified threshold.

In other cases, it can be adjusted dynamically using the bold driver heuristic, which suggests modifying the step size after each epoch. Specifically, it recommends increasing the learning rate slightly if the objective function has decreased, while sharply reducing the step size if the objective increases significantly.
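
A minimal sketch of the bold driver heuristic (the growth and shrink factors are illustrative assumptions, not values from the original):

```python
def bold_driver(lr, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adjust the step size after each epoch: increase it slightly
    if the objective decreased, cut it sharply otherwise."""
    return lr * grow if curr_loss < prev_loss else lr * shrink
```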

These choices are often exposed as hyperparameters of the model.

A common modification of the SGD algorithm is mini-batching: the data are split into small batches and each update uses the gradient estimated on one batch, which reduces the variance of the updates and allows efficient vectorized computation. A sketch follows below.
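
A minimal mini-batch SGD sketch under these assumptions (names and defaults are mine; grad_fn(x, batch) is assumed to return a gradient estimate computed on the given batch):

```python
import numpy as np

def sgd(grad_fn, x0, data, lr=0.01, n_epochs=10, batch_size=32, seed=0):
    """Mini-batch SGD: many small, noisy updates per epoch.

    data: NumPy array of training examples (supports fancy indexing).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = len(data)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                    # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            x -= lr * grad_fn(x, batch)             # noisy gradient step
    return x
```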

Moreover, it is worth noting that SGD can be efficiently parallelized.

Second-order methods

In order to approximate a function f at a particular point, we can use Taylor’s theorem. It gives us a polynomial approximation of the function around that point, together with the error incurred by accepting such an approximation.
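
In the univariate case, the second-order Taylor approximation around a point x_n reads

f(x) \approx f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2} f''(x_n)\,(x - x_n)^2

with an error term that vanishes as x approaches x_n.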

Newton’s method

With Newton’s method, we update the guess for the minimum point by taking the current guess x_n and adjusting it based on the slope of the function f’(x_n) and its curvature f’'(x_n). This method can converge to the minimum quickly because it takes into account both the slope and the curvature of the function, offering a better guess than simple gradient descent.
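
Concretely, minimizing the quadratic Taylor approximation at each step yields the univariate learning rule

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}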

There is also a multivariate version of this, which combines the function value at x_n, a linear vector term involving the gradient \boldsymbol g_n, and a quadratic term involving the Hessian matrix \bold{H}_n, which accounts for the curvature of the function.
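
In symbols, the quadratic model and the update that minimizes it are

f(\boldsymbol x) \approx f(\boldsymbol x_n) + \boldsymbol g_n^\top (\boldsymbol x - \boldsymbol x_n) + \tfrac{1}{2} (\boldsymbol x - \boldsymbol x_n)^\top \bold{H}_n (\boldsymbol x - \boldsymbol x_n), \qquad \boldsymbol x_{n+1} = \boldsymbol x_n - \bold{H}_n^{-1} \boldsymbol g_n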

This becomes very computationally expensive for high-dimensional data because the Hessian is costly to compute and invert, and if the function isn’t strictly convex, Newton’s method might not converge or may even diverge.

📌 SGD and logistic regression
In logistic regression, SGD calculates the gradient of the negative log-likelihood for each data point and updates the weights accordingly. This enables scalability to large datasets by processing one (or a few) data points at a time, rather than requiring all data in memory. Moreover, LR often incorporates regularization (e.g., L1 or L2) to prevent overfitting, and SGD handles these regularization terms efficiently during the updates.
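
Assuming labels y_i \in \{0, 1\}, the sigmoid \sigma, and an L2 penalty with weight \lambda (notation chosen here for illustration), the per-example SGD update is

\boldsymbol w_{n+1} = \boldsymbol w_n + \varepsilon_n \big( (y_i - \sigma(\boldsymbol w_n^\top \boldsymbol x_i))\, \boldsymbol x_i - \lambda \boldsymbol w_n \big)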

  • What is a gradient?
  • State the general learning rules of (i) gradient descent (GD) and (ii) stochastic gradient descent (SGD). What is the intuition behind these rules?
  • State both rules for logistic regression.
  • Compare GD and SGD.
  • How can we choose the learning rate? Does the choice matter?
    • Bold driver heuristic
  • For logistic regression, what can we say about the score of an example directly before and after an SGD step ?
    • Directly before the step, the score (the linear combination of features before applying the sigmoid) determines the predicted probability for the example. After an SGD step on that example, the weights have moved so that its score shifts toward the correct label: it increases if the label is 1 and decreases if the label is 0 (for a positive learning rate).
  • What are second-order gradient-based methods? How are they motivated? Which information is needed in each step? Why only this information?
  • State the learning rule of Newton’s method in the univariate and in the multivariate case.
    • Univariate: uses the function’s first and second derivatives to update the variable x.
    • Multivariate: generalizes to update a vector of parameters \mathbf{w} using the gradient vector and the Hessian matrix.
  • Compare Newton’s method with GD/SGD.