# Theory of Optimization: Gradient Descent

** Published:**

In this post, we will review the most basic and the most intuitive optimization method – the gradient decent method – in optimization.

## Gradient Descent

The gradient descent algorithm works as follow: The algorithm requires an initial point and step size . Then the algorithm repeats to execute:

until . In the following of this section, we will assume that the gradient of is L-lipschitz, i.e.

we call is L-lipschitz gradient or L-smooth.

### Analysis for General Smooth Functions

In order to show the convergence of gradient descent for the general functions, we need the following lemma,

Suppose is L-lipschitz gradient, then

*Proof:* From the basic calculus and the assumption that is L-lipschitz gradient, we have

\begub{align} &|f(y) - f(x) - \nabla f(x)^T(y-x)| \newline =& \int_{0}^1 \nabla f(x + t(y-x))^T(y-x)dt - \nabla f(x)^T(y-x)\newline =& |\int_{0}^1 (\nabla f(x + t(y-x))-\nabla f(x))^T(y-x)dt| \newline \le& \int_{0}^1 ||\nabla f(x + t(y-x))-\nabla f(x)||\cdot ||y-x||dt \newline \le& \int_{0}^1 tL||y-x||^2dt \newline =& \frac{L}{2}||y-x||^2. \end{align}

⬜

The previous lemma shows and upper and lower bound of a point if the gradient of the function is lipschitz. From the previous lemma, we can then show the convergence of gradient descent.

Let be a function with L-Lipschitz gradient and be any minimizer of . The gradient descent with step size outputs a point such that in iterations.

*Proof:* From the previous lemma, we have the following equation

\begub{align} f(x^{(t+1)}) \le& f(x^{(t)}) - \nabla f(x^{(t)})^T(x^{(t+1)} - x^{(t)}) + \frac{L}{2}||x^{(t+1)} - x^{(t)}||^2 \newline \le& f(x^{(t)}) - \nabla f(x^{(t)})^T\cdot \frac{1}{L}\nabla f(x^{(t)}) + \frac{L}{2}\frac{1}{L^2}||\nabla f(x^{(t)})||^2 \newline =& f(x^{(t)}) - \frac{1}{2L}||\nabla f(x^{(t)})||^2. \end{align}

Then we have

Sum up the equations for , we can get

From this equation, we can see that the gradient descent outputs a point such that in iterations.

⬜

### Analysis for Convex Functions

In this section, we assume that the function , which we want to optimize, is convex with L-lipschitz gradient. As a result of the convexity, we can show that the difference between and is . Before we state the theorem, we prove an auxilary lemma.

For any convex function , we have

*Proof:* From the first order condition, if is convex, we have

Arranging the terms and using the basic property of norms, we have

⬜

With the help of the previous lemma, we can show the convergence bound for convex functions now.

Let be convex with L-Lipschitz gradient and be any minimizer of . With step size , the sequence in Gradient Descent satis es

where

*Proof:* Let . Since is L-lipschitz gradient, we have

Then from the previous lemma, we have

Therefore, we have

Moreover, we can also bound by the following inequality,

Then we have

Then

which completes the proof.

⬜

### Strongly Convex Functions

In the previous analysis for general smooth functions and convex functions, we show that the gradient descent algorithm needs and iterations to converge to an -approximation(Note: Here, -approximation does not mean the formal definition of -approximation in approximation algorithms. In the sense of convergence of general functions, it means the gradient is small, and in the sense of convergence of convex functions, it means the error to the optimal is small). Now we introduce the strongly convex functions, and analysis the convergence of gradient descent algorithm for strongly convex functions. In particular, we will show that it will use iterations to converge to an -approximation.

Strongly convex: We call a function is -strongly convex if for any ,

The following theorem gives an equivalent definition for strongly convex funtions if the function is twice differentiable.

Strongly convex: Let . Then is -strongly convex if and only if

We will not prove this theorem. This theorem can be shown by the Taylor’s expansion in multi-variable case.

With the definition of strongly convex functions, we can now prove the following theorem.

Let be -strongly convex with L-Lipschitz gradient and be any minimizer of . With step size , the sequence in GradientDescent satis es

*Proof:* From the previous analysis, since is L-lipschitz gradient, we have

where .

Then since is -strongly convex, we have

Rearranging the terms, we have

\begub{align} f(x^{(k)}) - f(x^*) \le& \nabla f(x^{(k)})^T(x^{(k)} - x^*) - \frac{\mu}{2}||x^* - x^{(k)}||^2 \newline \le& \max_{\Delta}(\nabla f(x^{(k)})^T\Delta - \frac{\mu}{2}\Delta^2)\newline =& \frac{1}{2\mu}||\nabla f(x^{(k)})||^2. \end{align}

Putting the above inequality into the first inequality in the proof, we have

\begub{align} \epsilon_{k+1} \le& \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2 \newline \le& \epsilon_k - \frac{\mu}{L}\epsilon_k \newline \le& \left(1-\frac{\mu}{L}\right)^{k+1}\epsilon_0. \end{align}

⬜