In this post, we continue our analysis of gradient descent. In the previous lecture, we assumed that every function has a $\beta$-Lipschitz gradient. For general $\beta$-smooth functions, the gradient descent algorithm finds a first-order $\varepsilon$-critical point in $O(1/\varepsilon^2)$ iterations. When the function is convex, we showed that $O(1/\varepsilon)$ iterations suffice to obtain a solution whose value differs from the optimum by at most $\varepsilon$. When the function is strongly convex and smooth, we showed that the number of iterations can be reduced to $O\big(\tfrac{\beta}{\alpha}\log(1/\varepsilon)\big)$.
However, in the previous post we assumed that the function is smooth, which in particular implies that it is differentiable at every point. In this post, we will first assume that the function is convex but not necessarily smooth. Moreover, the previous post focused on the unconstrained case; in this post, we will also introduce the analysis for constrained minimization.
Projected Subgradient Descent
In this post, we assume that the convex optimization problem has the following form:

$$\min_{x \in \mathcal{X}} f(x).$$

In the problem, $\mathcal{X} \subseteq \mathbb{R}^n$ is the constraint set, which could be $\mathbb{R}^n$ itself.
Subgradient: Let $\mathcal{X} \subseteq \mathbb{R}^n$ and $f : \mathcal{X} \to \mathbb{R}$. Then $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x \in \mathcal{X}$ if for any $y \in \mathcal{X}$, one has

$$f(y) \ge f(x) + g^\top (y - x).$$

We use $\partial f(x)$ to denote the set of subgradients of $f$ at $x$, i.e.

$$\partial f(x) = \big\{ g \in \mathbb{R}^n : f(y) \ge f(x) + g^\top (y - x) \ \text{for all } y \in \mathcal{X} \big\}.$$
Note that if $f$ is differentiable at a point $x$, then $\partial f(x)$ is just $\{\nabla f(x)\}$. So the notion of subgradient is compatible with that of the ordinary gradient.
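As a quick illustration (a toy example of mine, not part of the lecture), consider $f(x) = |x|$: every $g \in [-1, 1]$ is a subgradient at $x = 0$, and nothing outside that interval is. A numerical check of the defining inequality in Python:

```python
import numpy as np

# Subgradient inequality check for f(x) = |x| at x = 0:
# g is a subgradient iff |y| >= |0| + g * (y - 0) for all y.
ys = np.linspace(-5.0, 5.0, 1001)

def is_subgradient_at_zero(g):
    """Check f(y) >= f(0) + g * y on a grid, for f = abs."""
    return bool(np.all(np.abs(ys) >= g * ys - 1e-12))

print(is_subgradient_at_zero(0.5))   # → True: inside [-1, 1]
print(is_subgradient_at_zero(-1.0))  # → True: boundary of [-1, 1]
print(is_subgradient_at_zero(1.5))   # → False: inequality fails for y > 0
```

So $\partial f(0) = [-1, 1]$ for the absolute value, while at any $x \ne 0$ the subdifferential is the singleton $\{\mathrm{sign}(x)\}$.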
We will also introduce the projection operator $\Pi_{\mathcal{X}}$ on $\mathcal{X}$, defined by

$$\Pi_{\mathcal{X}}(x) = \operatorname*{argmin}_{y \in \mathcal{X}} \|x - y\|.$$
We have the following lemma for the projection operator.

Let $x \in \mathcal{X}$ and $y \in \mathbb{R}^n$. Then

$$\big(\Pi_{\mathcal{X}}(y) - x\big)^\top \big(\Pi_{\mathcal{X}}(y) - y\big) \le 0,$$

which also implies

$$\|\Pi_{\mathcal{X}}(y) - x\|^2 + \|y - \Pi_{\mathcal{X}}(y)\|^2 \le \|y - x\|^2.$$
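The first inequality says that the vector from $\Pi_{\mathcal{X}}(y)$ to $y$ makes an obtuse angle with the vector to any point of $\mathcal{X}$; the second follows by expanding $\|y - x\|^2$. The lemma is easy to check numerically; here is a small sketch (my own example — the constraint set, a Euclidean ball, is an assumption) testing both inequalities on random points:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

rng = np.random.default_rng(0)
max_inner = -np.inf   # largest observed (Pi(y) - x)^T (Pi(y) - y)
max_gap = -np.inf     # largest observed ||Pi(y)-x||^2 + ||y-Pi(y)||^2 - ||y-x||^2
for _ in range(1000):
    y = 3.0 * rng.normal(size=5)           # arbitrary point, possibly outside the ball
    x = project_ball(rng.normal(size=5))   # arbitrary point inside the ball
    p = project_ball(y)
    max_inner = max(max_inner, (p - x) @ (p - y))
    max_gap = max(max_gap, (p - x) @ (p - x) + (y - p) @ (y - p) - (y - x) @ (y - x))
# Both maxima should be <= 0 (up to floating-point error)
```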
Then, we will introduce the projected (sub)gradient descent algorithm. Starting from some $x_1 \in \mathcal{X}$, the algorithm works as follows:

$$y_{t+1} = x_t - \eta g_t, \quad \text{where } g_t \in \partial f(x_t),$$
$$x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1}).$$
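In code, the iteration is only a few lines. The following Python sketch is a minimal implementation; the toy setup in the usage part (the objective $f(x) = \|x - c\|$ over the unit ball and the specific constants) is my own assumption, not from the lecture:

```python
import numpy as np

def projected_subgradient_descent(subgrad, project, x1, eta, T):
    """Iterate y_{t+1} = x_t - eta * g_t with g_t a subgradient at x_t,
    then x_{t+1} = Pi_X(y_{t+1}); returns the iterates x_1, ..., x_T."""
    xs = [np.asarray(x1, dtype=float)]
    for _ in range(T - 1):
        g = subgrad(xs[-1])
        xs.append(project(xs[-1] - eta * g))
    return xs

# Toy usage (assumed setup): minimize f(x) = ||x - c|| over the unit ball.
c = np.array([2.0, 0.0])                          # minimizer is x* = (1, 0)
subgrad = lambda x: (x - c) / np.linalg.norm(x - c)
project = lambda y: y if np.linalg.norm(y) <= 1.0 else y / np.linalg.norm(y)
xs = projected_subgradient_descent(subgrad, project, np.zeros(2), eta=0.01, T=10000)
x_avg = np.mean(xs, axis=0)                       # averaged iterate
```

With $R = L = 1$ in this toy problem, the theory suggests the step size $\eta = R/(L\sqrt{T}) = 0.01$ for $T = 10^4$, and the averaged iterate indeed lands near the minimizer $x^* = (1, 0)$.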
Analysis for Lipschitz Functions
Suppose $f$ is convex. Furthermore, suppose that for any $x \in \mathcal{X}$ and $g \in \partial f(x)$, we have $\|g\| \le L$. Then the projected subgradient method with $\eta = \frac{R}{L\sqrt{T}}$ satisfies

$$f\Big(\frac{1}{T}\sum_{t=1}^{T} x_t\Big) - f(x^*) \le \frac{RL}{\sqrt{T}},$$

where $R = \|x_1 - x^*\|$ and $x^*$ is a minimizer of $f$ over $\mathcal{X}$.
Proof: By the convexity of $f$, we have

$$f(x_t) - f(x^*) \le g_t^\top (x_t - x^*) = \frac{1}{\eta}(x_t - y_{t+1})^\top (x_t - x^*) = \frac{1}{2\eta}\Big(\|x_t - x^*\|^2 + \|x_t - y_{t+1}\|^2 - \|y_{t+1} - x^*\|^2\Big).$$

Now, from $\|x_t - y_{t+1}\| = \eta \|g_t\| \le \eta L$ and $\|y_{t+1} - x^*\| \ge \|x_{t+1} - x^*\|$ (by the projection lemma), we have

$$f(x_t) - f(x^*) \le \frac{1}{2\eta}\Big(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\Big) + \frac{\eta L^2}{2}.$$

Summing these inequalities over $t = 1, \dots, T$ (the first terms telescope), dividing by $T$, and applying Jensen's inequality, we complete the proof.
Analysis for Smooth Functions (Lipschitz Gradient)
In this section, we assume that $f$ is convex and $\beta$-smooth. In the previous post, when the optimization problem is unconstrained, we had the following inequality (after a change of parameters):

$$f\Big(x - \frac{1}{\beta}\nabla f(x)\Big) - f(x) \le -\frac{1}{2\beta}\|\nabla f(x)\|^2.$$

However, this inequality may not be true in the constrained case, since we need to apply the projection operation. The next lemma shows the 'right' quantity to measure the descent procedure.
Let $x, y \in \mathcal{X}$, let $x^+ = \Pi_{\mathcal{X}}\big(x - \frac{1}{\beta}\nabla f(x)\big)$, and define the gradient mapping $g_{\mathcal{X}}(x) = \beta(x - x^+)$. Then the following holds true:

$$f(x^+) - f(y) \le g_{\mathcal{X}}(x)^\top (x - y) - \frac{1}{2\beta}\|g_{\mathcal{X}}(x)\|^2.$$

Proof: We first observe that

$$\nabla f(x)^\top (x^+ - y) \le g_{\mathcal{X}}(x)^\top (x^+ - y),$$

since the above inequality is equivalent to

$$\Big(x^+ - \big(x - \tfrac{1}{\beta}\nabla f(x)\big)\Big)^\top (x^+ - y) \le 0,$$

which is exactly the projection lemma. Then, using $\beta$-smoothness and convexity, we have

$$\begin{aligned} f(x^+) - f(y) &= f(x^+) - f(x) + f(x) - f(y) \\ &\le \nabla f(x)^\top (x^+ - x) + \frac{\beta}{2}\|x^+ - x\|^2 + \nabla f(x)^\top (x - y) \\ &= \nabla f(x)^\top (x^+ - y) + \frac{1}{2\beta}\|g_{\mathcal{X}}(x)\|^2 \\ &\le g_{\mathcal{X}}(x)^\top (x^+ - y) + \frac{1}{2\beta}\|g_{\mathcal{X}}(x)\|^2 \\ &= g_{\mathcal{X}}(x)^\top (x - y) - \frac{1}{2\beta}\|g_{\mathcal{X}}(x)\|^2. \end{aligned}$$
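One way to build intuition for the gradient mapping $g_{\mathcal{X}}$: when the step $x - \frac{1}{\beta}\nabla f(x)$ stays inside $\mathcal{X}$, the projection does nothing and $g_{\mathcal{X}}(x)$ is exactly $\nabla f(x)$. A tiny Python check (the quadratic objective and unit-ball constraint are my own assumptions):

```python
import numpy as np

beta = 4.0
c = np.array([0.5, 0.0])                      # f(x) = (beta/2) * ||x - c||^2, c inside the ball
grad_f = lambda x: beta * (x - c)
project = lambda y: y if np.linalg.norm(y) <= 1.0 else y / np.linalg.norm(y)

def gradient_mapping(x):
    """g_X(x) = beta * (x - x+), with x+ the projected gradient step."""
    x_plus = project(x - grad_f(x) / beta)
    return beta * (x - x_plus)

x = np.array([0.2, 0.1])                      # here x - grad_f(x)/beta = c, inside the ball
same = np.allclose(gradient_mapping(x), grad_f(x))
print(same)  # → True: the projection is inactive, so g_X coincides with the gradient
```

When the step leaves $\mathcal{X}$, the two differ, and it is $g_{\mathcal{X}}$ rather than $\nabla f$ that drives the descent analysis.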
Now we can show the following theorem for the convergence of PGD.
Let $f$ be convex and $\beta$-smooth on $\mathcal{X}$. Then projected gradient descent with $\eta = \frac{1}{\beta}$ satisfies

$$f(x_T) - f(x^*) \le \frac{3\beta \|x_1 - x^*\|^2 + f(x_1) - f(x^*)}{T}.$$
Proof: Let $\delta_t = f(x_t) - f(x^*)$. From the previous lemma applied with $x = x_t$ and $y = x^*$, we have

$$\delta_{t+1} \le g_{\mathcal{X}}(x_t)^\top (x_t - x^*) - \frac{1}{2\beta}\|g_{\mathcal{X}}(x_t)\|^2 \le \|g_{\mathcal{X}}(x_t)\| \cdot \|x_t - x^*\|.$$

Applying the lemma with $y = x_t$ instead gives $\delta_{t+1} \le \delta_t - \frac{1}{2\beta}\|g_{\mathcal{X}}(x_t)\|^2$, so $\delta_t$ is decreasing with $t$. Then we show that $\|x_t - x^*\|$ is also decreasing with $t$. From the previous lemma (with $y = x^*$, using $\delta_{t+1} \ge 0$), we can also find

$$g_{\mathcal{X}}(x_t)^\top (x_t - x^*) \ge \frac{1}{2\beta}\|g_{\mathcal{X}}(x_t)\|^2,$$

then we have

$$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2}{\beta}\, g_{\mathcal{X}}(x_t)^\top (x_t - x^*) + \frac{1}{\beta^2}\|g_{\mathcal{X}}(x_t)\|^2 \le \|x_t - x^*\|^2.$$

Let $R = \|x_1 - x^*\|$. Combining the two bounds on $\delta_{t+1}$, we have

$$\delta_{t+1} \le \delta_t - \frac{1}{2\beta R^2}\,\delta_{t+1}^2.$$

Then we can finish the proof by simple induction.
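To sanity-check the theorem numerically, here is a small Python sketch; the toy problem (a quadratic $f(x) = \frac12 x^\top A x - b^\top x$ with $A = \mathrm{diag}(1, 10)$ over the unit ball, so $\beta = 10$) and the approximation of $f(x^*)$ by a long run are my own assumptions:

```python
import numpy as np

A = np.array([1.0, 10.0])                 # diagonal of A; beta = largest eigenvalue = 10
b = np.array([5.0, 5.0])
beta = 10.0
f = lambda x: 0.5 * x @ (A * x) - b @ x
grad = lambda x: A * x - b
project = lambda y: y if np.linalg.norm(y) <= 1.0 else y / np.linalg.norm(y)

def pgd(x1, T):
    """Projected gradient descent with eta = 1/beta; returns all iterates."""
    xs = [x1]
    for _ in range(T - 1):
        xs.append(project(xs[-1] - grad(xs[-1]) / beta))
    return xs

f_star = f(pgd(np.zeros(2), 20000)[-1])   # long run to approximate f(x*)
deltas = [f(x) - f_star for x in pgd(np.zeros(2), 101)]
# deltas should be nonincreasing, and the last one should satisfy the
# O(1/T) bound (3*beta*R^2 + delta_1)/T, with R = ||x_1 - x*|| = 1 here.
```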
Analysis for Strongly Convex Functions
In this section, we consider projected subgradient descent with the time-varying step size $\eta_t = \frac{2}{\alpha(t+1)}$, that is,

$$x_{t+1} = \Pi_{\mathcal{X}}\big(x_t - \eta_t g_t\big), \quad g_t \in \partial f(x_t).$$
Then we have the following theorem.

Let $f$ be $\alpha$-strongly convex and $L$-Lipschitz on $\mathcal{X}$. Then projected subgradient descent with $\eta_t = \frac{2}{\alpha(t+1)}$ satisfies

$$f\Big(\sum_{t=1}^{T} \frac{2t}{T(T+1)}\, x_t\Big) - f(x^*) \le \frac{2L^2}{\alpha(T+1)}.$$
Proof: Similar to the analysis of projected subgradient descent for Lipschitz functions, but using strong convexity in place of plain convexity, we have

$$f(x_t) - f(x^*) \le g_t^\top (x_t - x^*) - \frac{\alpha}{2}\|x_t - x^*\|^2 \le \Big(\frac{1}{2\eta_t} - \frac{\alpha}{2}\Big)\|x_t - x^*\|^2 - \frac{1}{2\eta_t}\|x_{t+1} - x^*\|^2 + \frac{\eta_t L^2}{2}.$$

Multiplying the inequalities by $t$ and substituting $\eta_t = \frac{2}{\alpha(t+1)}$, we have

$$t\big(f(x_t) - f(x^*)\big) \le \frac{L^2}{\alpha} + \frac{\alpha}{4}\Big(t(t-1)\|x_t - x^*\|^2 - t(t+1)\|x_{t+1} - x^*\|^2\Big).$$

Summing up the above inequalities (the right-hand side telescopes) and applying Jensen's inequality, we complete the proof.
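The weighted average in the theorem is easy to form in code. Here is a Python sketch on a toy problem of my own (minimizing $f(x) = \frac12\|x - c\|^2$ over the unit ball with $c$ outside it, so $\alpha = 1$, $L = 3$, and $x^* = (1, 0)$):

```python
import numpy as np

# f(x) = 0.5 * ||x - c||^2 is 1-strongly convex; on the unit ball its
# gradient norm is at most 3, and the constrained minimizer is x* = (1, 0).
c = np.array([2.0, 0.0])
alpha, L, T = 1.0, 3.0, 1000
f = lambda x: 0.5 * np.sum((x - c) ** 2)
project = lambda y: y if np.linalg.norm(y) <= 1.0 else y / np.linalg.norm(y)

x = np.zeros(2)
xs = []
for t in range(1, T + 1):
    xs.append(x)
    x = project(x - (2.0 / (alpha * (t + 1))) * (x - c))  # eta_t = 2/(alpha*(t+1))

weights = np.array([2.0 * t / (T * (T + 1)) for t in range(1, T + 1)])
x_avg = weights @ np.array(xs)            # the weighted average from the theorem
gap = f(x_avg) - f(np.array([1.0, 0.0]))
# Theorem predicts gap <= 2*L^2 / (alpha*(T+1)), about 0.018 here
```

Note the weights $\frac{2t}{T(T+1)}$ sum to one and put more mass on later (better) iterates.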
Analysis for Strongly Convex and Smooth Functions
The key improvement compared with the convex and smooth case is that, when $f$ is in addition $\alpha$-strongly convex, the descent lemma above can be strengthened: one can show

$$f(x^+) - f(y) \le g_{\mathcal{X}}(x)^\top (x - y) - \frac{1}{2\beta}\|g_{\mathcal{X}}(x)\|^2 - \frac{\alpha}{2}\|x - y\|^2.$$
Based on this result, we have the following theorem.
Let $f$ be $\alpha$-strongly convex and $\beta$-smooth on $\mathcal{X}$. Then projected gradient descent with $\eta = \frac{1}{\beta}$ satisfies

$$\|x_{T+1} - x^*\|^2 \le \Big(1 - \frac{\alpha}{\beta}\Big)^{T} \|x_1 - x^*\|^2.$$
Proof: Using the previous inequality with $x = x_t$ and $y = x^*$, together with $f(x_{t+1}) - f(x^*) \ge 0$, we have

$$g_{\mathcal{X}}(x_t)^\top (x_t - x^*) \ge \frac{1}{2\beta}\|g_{\mathcal{X}}(x_t)\|^2 + \frac{\alpha}{2}\|x_t - x^*\|^2,$$

and therefore

$$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2}{\beta}\, g_{\mathcal{X}}(x_t)^\top (x_t - x^*) + \frac{1}{\beta^2}\|g_{\mathcal{X}}(x_t)\|^2 \le \Big(1 - \frac{\alpha}{\beta}\Big)\|x_t - x^*\|^2,$$

which gives the claim by induction.
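The per-step contraction by $1 - \alpha/\beta$ can be observed directly. Below is a Python sketch on a toy problem of my own choosing: a separable quadratic over the box $[0,1]^2$, where the constrained minimizer can be verified exactly by the normal-cone condition:

```python
import numpy as np

# f(x) = 0.5*(x1 - 2)^2 + 5*(x2 + 1)^2 over the box [0, 1]^2.
# Here alpha = 1, beta = 10, and the constrained minimizer is exactly
# x* = (1, 0): the negative gradient (1, -10) there lies in the normal cone.
alpha, beta = 1.0, 10.0
grad = lambda x: np.array([x[0] - 2.0, 10.0 * (x[1] + 1.0)])
project = lambda y: np.clip(y, 0.0, 1.0)   # projection onto the box
x_star = np.array([1.0, 0.0])

x = np.array([0.0, 1.0])
ratios = []
for _ in range(50):
    x_next = project(x - grad(x) / beta)   # eta = 1/beta
    d_old = np.sum((x - x_star) ** 2)
    d_new = np.sum((x_next - x_star) ** 2)
    ratios.append(d_new / max(d_old, 1e-300))
    x = x_next
# Every per-step contraction ratio should be at most 1 - alpha/beta = 0.9
```

On this example the iterate actually lands on $x^*$ exactly after a few steps (the clipping snaps it to the boundary), after which the ratio is trivially zero.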