Theory of Optimization: Projected (Sub)Gradient Descent

In this post, we continue our analysis of gradient descent. Unlike the previous post, we will not assume that the function is smooth; we only assume that the function is convex and Lipschitz continuous.

In the previous lecture, we assumed that all of the functions have a $\beta$-Lipschitz gradient. For general $\beta$-smooth functions, the gradient descent algorithm can reach a first-order $\epsilon$-critical point in $O(1/\epsilon^2)$ iterations. When the function is convex, we showed that $O(1/\epsilon)$ iterations suffice to get a solution that differs from the optimum by at most $\epsilon$. When the function is strongly convex and smooth, we showed that the number of iterations can be reduced to $O(\log(1/\epsilon))$.

However, in the previous post we assumed that the function is smooth, which implies that the function has a gradient at every point. In this post, we will instead assume that the function is convex but not necessarily smooth. Moreover, the previous post focused on the unconstrained case, while in this post we will also introduce the analysis for constrained minimization.

Projected Subgradient Descent

In this post, we assume that the convex optimization problem has the following form:

$$\min_{x \in \mathcal{X}} f(x).$$

In the problem, $\mathcal{X} \subseteq \mathbb{R}^n$ is the constraint set, which could be $\mathbb{R}^n$ itself.

Subgradient: Let $\mathcal{X} \subseteq \mathbb{R}^n$ and $f: \mathcal{X} \to \mathbb{R}$. Then $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x \in \mathcal{X}$ if for any $y \in \mathcal{X}$, one has

$$f(x) - f(y) \le g^T(x - y).$$

We use $\partial f(x)$ to denote the set of subgradients of $f$ at $x$, i.e.

$$\partial f(x) = \{g \in \mathbb{R}^n : \forall y \in \mathcal{X},\ f(x) - f(y) \le g^T(x - y)\}.$$

Note that if $f$ is differentiable at a point $x$, then $\partial f(x)$ is just $\{\nabla f(x)\}$. So the notion of subgradient is compatible with that of the original gradient.

We will also introduce the projection operator $\Pi_{\mathcal{X}}$ on $\mathcal{X}$, defined by

$$\Pi_{\mathcal{X}}(x) = \mathop{\mathrm{argmin}}_{y \in \mathcal{X}} ||x - y||.$$

We have the following lemma for the projection operator.

Let $x \in \mathcal{X}$ and $y \in \mathbb{R}^n$, then

$$(\Pi_{\mathcal{X}}(y) - x)^T(\Pi_{\mathcal{X}}(y) - y) \le 0,$$

which also implies

$$||\Pi_{\mathcal{X}}(y) - x||^2 + ||y - \Pi_{\mathcal{X}}(y)||^2 \le ||y - x||^2.$$
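As a concrete illustration, here is a minimal Python sketch of the projection onto a Euclidean ball, one of the few constraint sets where $\Pi_{\mathcal{X}}$ has a closed form. The function name and the choice of constraint set are my own, not from the text:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}:
    rescale y if it lies outside the ball, return it unchanged otherwise."""
    norm = np.linalg.norm(y)
    if norm <= radius:
        return y
    return (radius / norm) * y
```

One can check the lemma numerically: for any $x$ in the ball, $(\Pi_{\mathcal{X}}(y) - x)^T(\Pi_{\mathcal{X}}(y) - y) \le 0$.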

Then, we will introduce the projected subgradient descent algorithm. The algorithm works as follows:

  1. For $s \ge 1$: $y^{(s+1)} = x^{(s)} - \eta g_s$, where $g_s \in \partial f(x^{(s)})$,
  2. and $x^{(s+1)} = \Pi_{\mathcal{X}}(y^{(s+1)})$.

Analysis for Lipschitz Functions

Suppose $f$ is convex. Furthermore, suppose that for any $x \in \mathcal{X}$ and any $g \in \partial f(x)$ we have $||g|| \le L$, and that $||x^{(1)} - x^*|| \le R$. Then the projected subgradient method with $\eta = \frac{R}{L\sqrt{t}}$ satisfies

$$f\left(\frac{1}{t}\sum_{s=1}^{t} x^{(s)}\right) - f(x^*) \le \frac{RL}{\sqrt{t}}.$$

Proof: By the convexity of $f$ and $x^{(s)} - y^{(s+1)} = \eta g_s$, we have

\begin{align} &f(x^{(s)}) - f(x^*) \newline \le& g_s^T(x^{(s)} - x^*) \newline =& \frac{1}{\eta}(x^{(s)} - y^{(s+1)})^T(x^{(s)} - x^*) \newline =& \frac{1}{2\eta}(||x^{(s)} - x^*||^2 + ||x^{(s)} - y^{(s+1)}||^2 - ||x^* - y^{(s+1)}||^2) \newline =& \frac{1}{2\eta}(||x^{(s)} - x^*||^2 - ||x^* - y^{(s+1)}||^2) + \frac{\eta}{2}||g_s||^2. \end{align}

Now, from $||x^* - y^{(s+1)}|| \ge ||x^* - x^{(s+1)}||$ (by the projection lemma) and $||g_s|| \le L$, we have

$$f(x^{(s)}) - f(x^*) \le \frac{1}{2\eta}\left(||x^{(s)} - x^*||^2 - ||x^{(s+1)} - x^*||^2\right) + \frac{\eta L^2}{2}.$$

Summing up these inequalities over $s = 1, \dots, t$, the distance terms telescope, giving

$$\sum_{s=1}^{t}\left(f(x^{(s)}) - f(x^*)\right) \le \frac{R^2}{2\eta} + \frac{\eta L^2 t}{2} = RL\sqrt{t}.$$

Dividing by $t$ and applying Jensen's inequality completes the proof.
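The bound above can be checked numerically. The following sketch (a toy instance of my own choosing, not from the text) runs projected subgradient descent on $f(x) = ||x - c||$ over the unit ball, with $c$ outside the ball so that $L = 1$ and the minimizer is $c/||c||$:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def projected_subgradient(subgrad, project, x0, eta, iters):
    """Projected subgradient descent with constant step eta.
    Returns the average iterate, to which the RL/sqrt(t) bound applies."""
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(iters):
        avg += x
        x = project(x - eta * subgrad(x))  # descend, then project back
    return avg / iters

# Toy instance: f(x) = ||x - c|| over the unit ball, c outside the ball.
c = np.array([2.0, 0.0])
f = lambda x: np.linalg.norm(x - c)
subgrad = lambda x: (x - c) / np.linalg.norm(x - c)  # iterates stay in the ball, so x != c
R, L, t = 2.0, 1.0, 10000                            # R bounds ||x^(1) - x*||
x_avg = projected_subgradient(subgrad, project_ball,
                              np.array([0.0, 1.0]), R / (L * np.sqrt(t)), t)
```

The average iterate stays in the ball (the ball is convex) and its suboptimality is within the $RL/\sqrt{t} = 0.02$ guarantee.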

Analysis for Smooth Functions (Lipschitz Gradient)

In this section, we assume that $f$ is convex and $\beta$-smooth. In the previous post, when the optimization problem is unconstrained, we have the following inequality (after a change of parameters) for $x^+ = x - \frac{1}{\beta}\nabla f(x)$:

$$f(x^+) - f(x) \le -\frac{1}{2\beta}||\nabla f(x)||^2.$$

However, this inequality may not hold in the constrained case, since we need to apply the projection operation. The next lemma shows the 'right' quantity to measure the descent procedure.

Let $x, y \in \mathcal{X}$, $x^+ = \Pi_{\mathcal{X}}\left(x - \frac{1}{\beta}\nabla f(x)\right)$, and define the gradient mapping $g_{\mathcal{X}}(x) = \beta(x - x^+)$. Then the following holds true:

$$f(x^+) - f(y) \le g_{\mathcal{X}}(x)^T(x - y) - \frac{1}{2\beta}||g_{\mathcal{X}}(x)||^2.$$

Proof: We first observe that

$$\nabla f(x)^T(x^+ - y) \le g_{\mathcal{X}}(x)^T(x^+ - y),$$

since the above inequality is equivalent to

$$\left(x^+ - \left(x - \frac{1}{\beta}\nabla f(x)\right)\right)^T(x^+ - y) \le 0,$$

which holds by the projection lemma applied to $y \in \mathcal{X}$ and the point $x - \frac{1}{\beta}\nabla f(x)$.

Then, we have

\begin{align} &f(x^+) - f(y) \newline =& f(x^+) -f(x) + f(x) - f(y) \newline \le& \nabla f(x)^T(x^+ - x) + \frac{\beta}{2}||x^+ - x||^2 + \nabla f(x)^T(x-y) \newline =& \nabla f(x)^T(x^+ - y) + \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 \newline \le& g_{\mathcal X}(x)^T(x^+ - y) + \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 \newline =& g_{\mathcal X}(x)^T(x - y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2. \end{align}

Now we can show the following theorem for the convergence of PGD.

Let $f$ be convex and $\beta$-smooth on $\mathcal{X}$. Then projected gradient descent with $\eta = \frac{1}{\beta}$ satisfies

$$f(x^{(t)}) - f(x^*) \le \frac{3\beta||x^{(1)} - x^*||^2 + f(x^{(1)}) - f(x^*)}{t}.$$

Proof: From the previous lemma with $x = x^{(s)}$ and $y = x^*$, we have

$$f(x^{(s+1)}) - f(x^*) \le g_{\mathcal{X}}(x^{(s)})^T(x^{(s)} - x^*) - \frac{1}{2\beta}||g_{\mathcal{X}}(x^{(s)})||^2 \le ||g_{\mathcal{X}}(x^{(s)})|| \cdot ||x^{(s)} - x^*||.$$


Then we show that $||x^{(s)} - x^*||$ is decreasing with $s$. From the previous inequality and $f(x^{(s+1)}) \ge f(x^*)$, we can also find

$$g_{\mathcal{X}}(x^{(s)})^T(x^{(s)} - x^*) \ge \frac{1}{2\beta}||g_{\mathcal{X}}(x^{(s)})||^2,$$

then we have

\begin{align} &||x^{(s+1)} - x^*||^2 \newline =& ||x^{(s)} - \frac{1}{\beta}g_{\mathcal X}(x^{(s)}) - x^*||^2 \newline =& ||x^{(s)} - x^*||^2 - \frac{2}{\beta}g_{\mathcal X}(x^{(s)})^T(x^{(s)} - x^*) + \frac{1}{\beta^2}||g_{\mathcal X}(x^{(s)})||^2\newline \le& ||x^{(s)} - x^*||^2. \end{align}

Let $\epsilon_s = f(x^{(s)}) - f(x^*)$. Applying the lemma with $x = y = x^{(s)}$, we have

\begin{align} \epsilon_{s+1} \le& \epsilon_s - \frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2 \newline \le& \epsilon_s - \frac{1}{2\beta||x^{(s)} - x^*||^2}\epsilon_{s+1}^2 \newline \le& \epsilon_s - \frac{1}{2\beta||x^{(1)} - x^*||^2}\epsilon_{s+1}^2. \end{align}

Then we can finish the proof by a simple induction on this recursion.
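To make the procedure concrete, here is a minimal Python sketch of projected gradient descent with step $1/\beta$ on a toy constrained problem of my own choosing: a $1$-smooth quadratic over a box (the theory permits any overestimate of the smoothness constant, and the sketch deliberately uses $\beta = 2$):

```python
import numpy as np

def projected_gradient(grad, project, x0, beta, iters):
    """Projected gradient descent: x <- Pi_X(x - (1/beta) grad(x))."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = project(x - grad(x) / beta)
    return x

# Toy smooth problem: f(x) = 0.5||x - c||^2, which is 1-smooth, over the
# box [-1, 1]^2, where the projection is coordinate-wise clipping and the
# problem is separable, so the constrained minimizer is clip(c).
c = np.array([3.0, 0.5])
project_box = lambda y: np.clip(y, -1.0, 1.0)
x_hat = projected_gradient(lambda x: x - c, project_box,
                           np.zeros(2), beta=2.0, iters=200)
```

Because the projection onto a box is a simple clip, each iteration costs only a gradient evaluation plus $O(n)$ work.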

Analysis for Strongly Convex Functions

In this section, we consider the projected subgradient descent with time-varying step size $\eta_s = \frac{2}{\alpha(s+1)}$, that is

  1. For $s \ge 1$: $y^{(s+1)} = x^{(s)} - \eta_s g_s$, where $g_s \in \partial f(x^{(s)})$,
  2. and $x^{(s+1)} = \Pi_{\mathcal{X}}(y^{(s+1)})$.

Then we have the following theorem.

Let $f$ be $\alpha$-strongly convex and $L$-Lipschitz on $\mathcal{X}$. Then projected subgradient descent with $\eta_s = \frac{2}{\alpha(s+1)}$ satisfies

$$f\left(\sum_{s=1}^{t} \frac{2s}{t(t+1)} x^{(s)}\right) - f(x^*) \le \frac{2L^2}{\alpha(t+1)}.$$

Proof: Similar to the analysis of projected subgradient descent for Lipschitz functions, but using the strong convexity inequality $f(x^{(s)}) - f(x^*) \le g_s^T(x^{(s)} - x^*) - \frac{\alpha}{2}||x^{(s)} - x^*||^2$, we have

$$f(x^{(s)}) - f(x^*) \le \frac{\eta_s L^2}{2} + \left(\frac{1}{2\eta_s} - \frac{\alpha}{2}\right)||x^{(s)} - x^*||^2 - \frac{1}{2\eta_s}||x^{(s+1)} - x^*||^2.$$

Multiplying the inequalities by $s$ and plugging in $\eta_s = \frac{2}{\alpha(s+1)}$, we have

$$s\left(f(x^{(s)}) - f(x^*)\right) \le \frac{L^2}{\alpha} + \frac{\alpha}{4}\left(s(s-1)||x^{(s)} - x^*||^2 - s(s+1)||x^{(s+1)} - x^*||^2\right).$$

Summing up the above inequalities (the last terms telescope) and applying Jensen's inequality, we complete the proof.
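The steps above can be sketched in Python; the toy instance below is my own choice, not from the text. It minimizes the $1$-strongly convex function $f(x) = \frac{1}{2}||x - c||^2$ over the unit ball with $c = (0, 2)$, so the minimizer is $(0, 1)$:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def strongly_convex_subgradient(subgrad, project, x0, alpha, iters):
    """Projected subgradient descent with step eta_s = 2/(alpha(s+1)),
    returning the weighted average sum_s 2s/(t(t+1)) x^(s)."""
    x = np.array(x0, dtype=float)
    t = iters
    weighted = np.zeros_like(x)
    for s in range(1, t + 1):
        weighted += 2.0 * s / (t * (t + 1)) * x
        x = project(x - (2.0 / (alpha * (s + 1))) * subgrad(x))
    return weighted

# Toy instance: f(x) = 0.5||x - c||^2 is 1-strongly convex; on the unit
# ball its gradient norm is at most 3, so L = 3 in the theorem.
c = np.array([0.0, 2.0])
x_hat = strongly_convex_subgradient(lambda x: x - c, project_ball,
                                    np.array([1.0, 0.0]), alpha=1.0, iters=2000)
```

Note that the guarantee applies to the weighted average with weights $\frac{2s}{t(t+1)}$, which emphasizes later, more accurate iterates.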

Analysis for Strongly Convex and Smooth Functions

The key improvement compared with the convex and smooth case is that, when $f$ is in addition $\alpha$-strongly convex, the descent lemma can be strengthened to

$$f(x^+) - f(y) \le g_{\mathcal{X}}(x)^T(x - y) - \frac{1}{2\beta}||g_{\mathcal{X}}(x)||^2 - \frac{\alpha}{2}||x - y||^2,$$

which with $y = x^*$ and $f(x^+) \ge f(x^*)$ gives

$$g_{\mathcal{X}}(x)^T(x - x^*) \ge \frac{1}{2\beta}||g_{\mathcal{X}}(x)||^2 + \frac{\alpha}{2}||x - x^*||^2.$$

Based on this result, we have the following theorem.

Let $f$ be $\alpha$-strongly convex and $\beta$-smooth on $\mathcal{X}$. Then projected gradient descent with $\eta = \frac{1}{\beta}$ satisfies

$$||x^{(t+1)} - x^*||^2 \le \left(1 - \frac{\alpha}{\beta}\right)^t ||x^{(1)} - x^*||^2.$$

Proof: Using the previous inequality with $x = x^{(t)}$, we have

\begin{align} & ||x^{(t+1)} - x^*||^2 \newline = & ||x^{(t)} - \frac{1}{\beta}g_{\mathcal X}(x^{(t)}) - x^*||^2 \newline = & ||x^{(t)} - x^*||^2 - \frac{2}{\beta}g_{\mathcal X}(x^{(t)})^T(x^{(t)} - x^*) + \frac{1}{\beta^2}||g_{\mathcal X}(x^{(t)})||^2 \newline \le & ||x^{(t)} - x^*||^2 - \frac{\alpha}{\beta}||x^{(t)} - x^*||^2 \newline = & \left(1-\frac{\alpha}{\beta}\right)||x^{(t)} - x^*||^2, \end{align}

and iterating this contraction completes the proof.
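As a quick numerical sanity check of the linear rate (a toy instance of my own, not from the text), consider a separable quadratic over a box, where the constrained minimizer is known in closed form:

```python
import numpy as np

# Toy instance: f(x) = 0.5 (x - c)^T D (x - c) with D = diag(10, 1), so
# alpha = 1 and beta = 10, over the box [-1, 1]^2. The problem is
# separable, so the constrained minimizer is simply clip(c).
D = np.array([10.0, 1.0])
c = np.array([2.0, -3.0])
alpha, beta = 1.0, 10.0
project_box = lambda y: np.clip(y, -1.0, 1.0)
x_star = project_box(c)

x = np.zeros(2)
dist0 = np.linalg.norm(x - x_star) ** 2
T = 50
for _ in range(T):
    x = project_box(x - D * (x - c) / beta)  # PGD step with eta = 1/beta
dist = np.linalg.norm(x - x_star) ** 2
# Theory predicts dist <= (1 - alpha/beta)^T * dist0.
```

In practice the iterates often converge much faster than the worst-case $(1 - \alpha/\beta)^t$ contraction, which is only an upper bound.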