In this post, we will review the most basic and the most intuitive optimization method – the gradient decent method – in optimization.
The gradient descent algorithm works as follow: The algorithm requires an initial point and step size . Then the algorithm repeats to execute:
until . In the following of this section, we will assume that the gradient of is L-lipschitz, i.e.
we call is L-lipschitz gradient or L-smooth.
Analysis for General Smooth Functions
In order to show the convergence of gradient descent for the general functions, we need the following lemma,
Suppose is L-lipschitz gradient, then
Proof: From the basic calculus and the assumption that is L-lipschitz gradient, we have
The previous lemma shows and upper and lower bound of a point if the gradient of the function is lipschitz. From the previous lemma, we can then show the convergence of gradient descent.
Let be a function with L-Lipschitz gradient and be any minimizer of . The gradient descent with step size outputs a point such that in iterations.
Proof: From the previous lemma, we have the following equation
Then we have
Sum up the equations for , we can get
From this equation, we can see that the gradient descent outputs a point such that in iterations.
Analysis for Convex Functions
In this section, we assume that the function , which we want to optimize, is convex with L-lipschitz gradient. As a result of the convexity, we can show that the difference between and is . Before we state the theorem, we prove an auxilary lemma.
For any convex function , we have
Proof: From the first order condition, if is convex, we have
Arranging the terms and using the basic property of norms, we have
With the help of the previous lemma, we can show the convergence bound for convex functions now.
Let be convex with L-Lipschitz gradient and be any minimizer of . With step size , the sequence in Gradient Descent satis es
Proof: Let . Since is L-lipschitz gradient, we have
Then from the previous lemma, we have
Therefore, we have
Moreover, we can also bound by the following inequality,
Then we have
which completes the proof.
Strongly Convex Functions
In the previous analysis for general smooth functions and convex functions, we show that the gradient descent algorithm needs and iterations to converge to an -approximation(Note: Here, -approximation does not mean the formal definition of -approximation in approximation algorithms. In the sense of convergence of general functions, it means the gradient is small, and in the sense of convergence of convex functions, it means the error to the optimal is small). Now we introduce the strongly convex functions, and analysis the convergence of gradient descent algorithm for strongly convex functions. In particular, we will show that it will use iterations to converge to an -approximation.
Strongly convex: We call a function is -strongly convex if for any ,
The following theorem gives an equivalent definition for strongly convex funtions if the function is twice differentiable.
Strongly convex: Let . Then is -strongly convex if and only if
We will not prove this theorem. This theorem can be shown by the Taylor’s expansion in multi-variable case.
With the definition of strongly convex functions, we can now prove the following theorem.
Let be -strongly convex with L-Lipschitz gradient and be any minimizer of . With step size , the sequence in GradientDescent satis es
Proof: From the previous analysis, since is L-lipschitz gradient, we have
Then since is -strongly convex, we have
Rearranging the terms, we have
Putting the above inequality into the first inequality in the proof, we have