Introduction and Background

In this article, I will give a brief introduction to quasi-Newton methods for training large-scale machine learning models. AI products are now part of everyday life, from voice assistants and face recognition to powerful chatbots. At the core of an AI application is a well-trained machine learning model that can make predictions or generate responses from previously unseen data.

To train a machine learning model, researchers and engineers usually formulate a loss function tailored to their application and then minimize it over the training dataset with an optimization algorithm. One of the most popular optimization algorithms is gradient descent and its variants. Intuitively, gradient descent repeatedly takes steps in the direction opposite to the gradient to approach a local minimum of the loss function. In particular, let $f(x)$ be the loss function and $x_t$ the iterate at the $t$-th iteration; then the gradient descent update can be written as
$$x_{t+1} = x_t - \eta \nabla f(x_t),$$
where $\eta$ is a carefully chosen positive step size. When the dataset is very large (e.g., the training data for ChatGPT contains billions of documents), computing the exact gradient over the entire dataset at every iteration is too expensive.
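As a minimal illustration, here is gradient descent on a small quadratic problem; the matrix, vector, step size, and iteration count below are illustrative choices, not taken from any particular application.

```python
import numpy as np

# Gradient descent on a toy quadratic f(x) = 0.5 * x^T A x - b^T x,
# whose gradient is A x - b; the minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # illustrative positive definite matrix
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

eta = 0.1            # step size, chosen below 2 / lambda_max(A) for stability
x = np.zeros(2)
for _ in range(500):
    x = x - eta * grad(x)  # x_{t+1} = x_t - eta * grad f(x_t)

x_star = np.linalg.solve(A, b)  # exact minimizer for comparison
print(np.allclose(x, x_star))   # prints True
```

Because the quadratic is strongly convex and the step size is small enough, the iterates contract towards the minimizer at a linear rate.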

To circumvent this issue, stochastic and incremental optimization methods were introduced: they only require a gradient estimate computed from a single sample or a small mini-batch of samples at each round. Among these, stochastic gradient descent (SGD) is perhaps the most popular method for large-scale optimization problems thanks to its cheap per-iteration cost. However, SGD is still not fast enough, in both theory and practice, for ever-growing data sizes.
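The contrast with full-gradient descent can be sketched as follows: each step uses one randomly drawn sample instead of the whole dataset. The least-squares data and the decaying step-size schedule are illustrative assumptions.

```python
import numpy as np

# SGD on a finite-sum least-squares objective
#   f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2,
# using one randomly drawn sample per iteration.
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true                       # noiseless targets, so x_true is the minimizer

x = np.zeros(d)
for t in range(20000):
    i = rng.integers(n)              # draw one sample index
    g = (A[i] @ x - y[i]) * A[i]     # stochastic gradient of the i-th component
    x -= 0.01 / (1 + t / 5000) * g   # decaying step size

print(np.linalg.norm(x - x_true))    # small residual error
```

Each iteration touches a single row of the data, so the per-step cost is $O(d)$ instead of $O(nd)$, which is exactly the appeal of SGD at scale.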

Quasi-Newton Methods

The well-known Newton method incorporates Hessian information at each iteration to accelerate convergence. Specifically, its update rule is
$$x_{t+1} = x_t - H_t^{-1} \nabla f(x_t),$$
where $H_t = \nabla^2 f(x_t)$ is the Hessian matrix; please refer to the Wikipedia page for details of the Hessian matrix. Although Newton's method converges much faster than gradient descent, computing the Hessian and its inverse is very slow when the problem dimension is large.
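For concreteness, here is a sketch of Newton's method on a small separable non-quadratic function whose gradient and Hessian have closed forms; the function and the target vector `c` are illustrative choices.

```python
import numpy as np

# Newton's method on the separable function f(x) = sum_i cosh(x_i - c_i),
# whose unique minimizer is x = c. The gradient is sinh(x - c) and the
# Hessian is the diagonal matrix diag(cosh(x - c)).
c = np.array([0.5, -1.0, 2.0])  # illustrative target vector

def grad(x):
    return np.sinh(x - c)

def hess(x):
    return np.diag(np.cosh(x - c))

x = np.zeros(3)
for _ in range(10):
    # x_{t+1} = x_t - H_t^{-1} grad f(x_t); solve a linear system
    # rather than forming the inverse explicitly.
    x = x - np.linalg.solve(hess(x), grad(x))

print(np.allclose(x, c))  # prints True
```

A handful of iterations already reach machine precision here, reflecting Newton's fast local convergence; the price is forming and solving with a $d \times d$ Hessian at every step.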

To overcome this issue, quasi-Newton methods approximate the Hessian information with gradients at each iteration while maintaining a convergence speed similar to the Newton method.
In particular, the quasi-Newton update can be written as
$$x_{t+1} = x_t - B_t \nabla f(x_t),$$
where $B_t$ is an approximation to the inverse of the Hessian. There are several ways to construct $B_t$, leading to different quasi-Newton methods, including the BFGS, SR1, and DFP methods. Quasi-Newton methods are widely used across machine learning, including in maximum likelihood estimation, unsupervised learning, and black-box parameter estimation problems.
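As a sketch of the BFGS variant, the inverse-Hessian approximation $B_t$ below is built purely from gradient differences; the quadratic test problem and the simple backtracking line search are illustrative choices, not part of any specific reference implementation.

```python
import numpy as np

# BFGS on a strongly convex quadratic f(x) = 0.5 * x^T A x - b^T x.
# B approximates the inverse Hessian using only gradient differences;
# a backtracking (Armijo) line search keeps the iteration stable.
rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)              # positive definite Hessian
b = rng.standard_normal(d)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(d)
B = np.eye(d)                        # initial inverse-Hessian approximation
I = np.eye(d)
for _ in range(100):
    p = -B @ grad(x)                 # quasi-Newton search direction
    step = 1.0
    while f(x + step * p) > f(x) + 1e-4 * step * (grad(x) @ p):
        step *= 0.5                  # Armijo backtracking
    s = step * p
    y = grad(x + s) - grad(x)
    if y @ s > 1e-12:                # curvature condition guard
        rho = 1.0 / (y @ s)
        V = I - rho * np.outer(s, y)
        B = V @ B @ V.T + rho * np.outer(s, s)  # BFGS inverse update
    x = x + s

print(np.allclose(x, np.linalg.solve(A, b)))
```

The update touches no second derivatives at all: each curvature pair $(s, y)$ nudges $B$ towards satisfying the secant equation $B y = s$, which is what lets the method mimic Newton's speed at gradient-only cost.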

Incremental Quasi-Newton Methods

Figure 1: Overview of the IQN method.

Like gradient descent, quasi-Newton methods require the exact gradient to compute their update at each iteration, which can be computationally expensive in the big-data setting. To overcome this issue, several stochastic variants of quasi-Newton methods have been proposed that estimate the Hessian information from a single sample or a mini-batch of samples. However, most of these stochastic methods cannot match the convergence speed of the classical quasi-Newton methods. Mokhtari et al. [1], Lahoti et al. [2], and Liu et al. [3] proposed incremental quasi-Newton (IQN) methods that aggregate historical information to construct accurate gradient and Hessian estimators; these methods match the classical quasi-Newton methods in both theory and practice. In particular, let the objective function be a finite sum of $n$ component functions:
$$f(x) := \frac{1}{n}\sum_{i=1}^n f_i (x).$$
IQN maintains a separate copy of the variable, gradient, and Hessian approximation for each component function, and updates one of these copies at each iteration in a cyclic fashion to keep the algorithm efficient. An overview of the IQN method is presented in Figure 1. Furthermore, we conduct numerical experiments on a quadratic minimization problem to compare the efficiency of IQN and its variants. The experimental results are presented in Figure 2 below.
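To make the cyclic structure concrete, here is a simplified sketch on a finite sum of quadratic components. The data, the scaled-identity initialization of each Hessian approximation, and the direct linear solve (in place of the rank-two update tricks the papers use for efficiency) are all illustrative simplifications, not the authors' exact algorithms.

```python
import numpy as np

# Simplified IQN-style sketch on f(x) = (1/n) sum_i f_i(x) with quadratic
# components f_i(x) = 0.5 * x^T A_i x - b_i^T x. Each component keeps its
# own iterate copy z_i, gradient copy g_i, and Hessian approximation B_i;
# one component is refreshed per iteration in cyclic order.
rng = np.random.default_rng(0)
n, d = 5, 3
As, bs = [], []
for _ in range(n):
    M = rng.standard_normal((d, d))
    As.append(M @ M.T + np.eye(d))          # positive definite component Hessian
    bs.append(rng.standard_normal(d))

def grad_i(i, x):
    return As[i] @ x - bs[i]

z = [np.zeros(d) for _ in range(n)]                            # iterate copies
g = [grad_i(i, z[i]) for i in range(n)]                        # gradient copies
B = [np.linalg.norm(As[i], 2) * np.eye(d) for i in range(n)]   # Hessian approx.

x = np.zeros(d)
for t in range(1000):
    i = t % n                                 # cyclic component selection
    # Aggregated update using all stored copies (direct solve for clarity):
    u = sum(Bj @ zj for Bj, zj in zip(B, z)) - sum(g)
    x = np.linalg.solve(sum(B), u)
    # Refresh the i-th copies; BFGS update of B[i] from the curvature pair.
    s, y = x - z[i], grad_i(i, x) - g[i]
    if s @ s > 1e-20:
        Bs = B[i] @ s
        B[i] = B[i] + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
    z[i], g[i] = x.copy(), grad_i(i, x)

x_star = np.linalg.solve(sum(As), sum(bs))    # minimizer of the average
print(np.linalg.norm(x - x_star))             # residual error
```

Only one component's gradient and Hessian approximation are touched per iteration, so the per-step work is independent of $n$ while the aggregated copies keep the estimators accurate.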

Figure 2: Normalized error vs. the number of effective passes for the quadratic function minimization.

Challenges and Future Work

There are several possible extensions to the IQN method which we have described above, including

• IQN stores a Hessian approximation for each component function, which can consume a lot of memory in the large-data scenario. Can we design a more memory-efficient approach?
• Lahoti et al. [2] and Liu et al. [3] provide convergence rates for some variants of the IQN method, but the explicit convergence rate of the vanilla IQN method is still unknown.
• Current analysis of IQN methods assumes the loss function is strongly convex, which is rather restrictive since the loss functions of most modern machine learning applications are nonconvex. It would be interesting to extend the current theoretical analysis to this more general setting.

Relations to AISG's Pillars and Technical Areas of Interest

Due to the sheer volume of data in modern machine learning applications, the need to design efficient optimization algorithms naturally arises. Designing such algorithms is also an essential and inevitable step towards the overarching goal of resource-efficient AI.

References

1. Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro. IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM Journal on Optimization, 28(2):1670–1698, 2018.
2. Aakash Sunil Lahoti, Spandan Senapati, Ketan Rajawat, and Alec Koppel. Sharpened lazy incremental quasi-Newton method. In International Conference on Artificial Intelligence and Statistics, pages 4735–4743. PMLR, 2024.
3. Zhuanghua Liu, Luo Luo, and Bryan Kian Hsiang Low. Incremental quasi-Newton methods with faster superlinear convergence rates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 14097–14105, 2024.