Theory of Optimization: More on Mirror Descent

1 minute read


In this post, we will continue on our discuss of mirror descent. We will present a variant of mirror descent: the lazy mirror descent, also known as Nesterov’s dual averaging.

Lazy Mirror Descent(Nesterov’s Dual Averaging)


In this section, we will provide a more efficient version of mirror descent, named Lazy Mirror Descent. We will show that the convergence rate is the same as that of the original mirror descent(omit constants). The lazy mirror descent changes the update process of from


Moreover, is such that .

The Lazy Mirror Descent algorithm works as follow:

  1. Initial point , step size .
  2. For do
  3. Pick any .


This analysis is adapted from Convex Optimization: Algorithms and Complexity. We have the following theorem.

Let be a mirror map -strongly convex on w.r.t . Let , and be convex and -Lipschitz w.r.t . Then the lazy mirror descent with satisfies

Proof: We define , so that . Since is -strongly convex, we have is -strongly convex.

Then, we can compute

\begin{align} \Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) \le& \nabla \Psi_{t}(x^{(t+1)})^T(x^{(t+1)} - x^{(t)}) - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2 \newline \le& - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2, \end{align}

where the second inequality comes from the first order optimality condition for . Next, we observe that

\begin{align} \Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) =& \Psi_{t-1}(x^{(t+1)}) - \Psi_{t-1}(x^{(t)}) + \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)})\newline \ge& \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)}). \end{align}

Putting together the two above displays and using Cauchy-Schwarz (with the assumption ) one obtains

This shows that and thus with the above display

Then we proof the following result: for every ,

which would conclude the proof.

The above equation is equivalent to

We prove this by induction,

\begin{align} &\sum_{s=0}^{t-1} (g^{(s)})^T x^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \newline \le & (g^{(s)})^Tx^{(t)} + \sum_{s=0}^{t-2} (g^{(s)})^Tx^{(t)} + \frac{\Phi(x^{(t)})}{\eta}\newline \le & \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta}, \end{align}

where the first inequality follows from the induction step and the second follows from the definition of .