Acceleration in Fixed Point Iteration and Nesterov;s Gradient Method
Author
Jinxiong Zhang
Last Updated
6 years ago
License
Creative Commons CC BY 4.0
Abstract
A note on gradient-based optimization.
A note on gradient-based optimization.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{url}
\usepackage{hyperref}
\title{Acceleration}
\author{Jinxiong Zhang }
\date{March 2019}
\begin{document}
\maketitle
\section{Introduction}
It is popular to use Nesterov's accelerated gradient method to minimize the cost function $f(x)$ in machine learning specially deep learning, where it is called momentum method. The source of its acceleration is studied from many perspectives such as the blogs at \url{https://blogs.princeton.edu/imabandit/2015/06/30/revisiting-nesterovs-acceleration/}.
If Anderson acceleration for fixed point iteration is applied to the problem
$$x=x-\alpha\nabla f(x)$$
where $\alpha$ is a scalar function, it is still an open problem whether it is equivalent to Nesterov's accelerated gradient method as far as known.
It is worthy of exploring it.
\section{Nesterov's accelerated gradient method}
The general form of Nesterov's accelerated gradient method to minimize the cost function $f(x)$ can be written in the following form:
$$x^{k+1}=x^k-\gamma \nabla f(x^k+\mu(x^k-x^{k-1}))+\mu(x^k-x^{k-1}).$$
The parameters $\gamma$ and $\mu$ are difficult to tune when $f(x)$ is non-convex.
\section{Anderson acceleration}
We apply Anderson acceleration to the problem $x=x-\alpha\nabla f(x)$:
$$x^{k+1}=x^k-\alpha \nabla f(x^k)+(1-\alpha_1)\alpha[\nabla f(x^k)-\nabla f(x^{k-1})]+(1-\alpha_1)(x^{k-1}-x^k)$$
where $\alpha_1=\arg\min_{\alpha_1}{\|\alpha_1\nabla f(x^k)+(1-\alpha_1) \nabla f(x^{k-1})\|}_2^2=-\frac{\|\nabla f(x^k)-\nabla f(x^{k-1})\|_2^2}{\left<\nabla f(x^k)-\nabla f(x^{k-1}), \nabla f(x^{k-1})\right>}$.
There is a difference of previous iteration in Anderson acceleration as well as Nesterov's accelerated gradient method.
It seems that
$$\gamma \nabla f(x^k+\mu(x^k-x^{k-1}))\approx (1-\alpha_1)\alpha[\nabla f(x^k)-\nabla f(x^{k-1})].$$
And what is the connection of these two methods in mathematics?
\end{document}