Theory of Optimization: More on Mirror Descent

In this post, we continue our discussion of mirror descent. We present a variant of mirror descent: lazy mirror descent, also known as Nesterov’s dual averaging.

Lazy Mirror Descent (Nesterov’s Dual Averaging)


In this section, we present a more efficient variant of mirror descent, named Lazy Mirror Descent. We will show that its convergence rate is the same as that of the original mirror descent, up to constants. Lazy mirror descent changes the update of \(y^{(k+1)}\) from

\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta\cdot g^{(k)}\]

to

\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(y^{(k)}) - \eta\cdot g^{(k)}.\]

Moreover, \(y^{(0)}\) is such that \(\nabla \Phi(y^{(0)}) = 0\).
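As a concrete illustration (this example is ours, not part of the original analysis), take the negative-entropy mirror map \(\Phi(x) = \sum_i x_i\log x_i\) on the simplex, for which \((\nabla\Phi(x))_i = 1 + \log x_i\). The lazy update then accumulates gradients in log-space, i.e. multiplicatively:

\[\log y_i^{(k+1)} = \log y_i^{(k)} - \eta\, g_i^{(k)} \quad\Longleftrightarrow\quad y_i^{(k+1)} = y_i^{(k)}\, e^{-\eta g_i^{(k)}},\]

and the Bregman projection onto the simplex reduces to normalization, \(x_i^{(k+1)} = y_i^{(k+1)}/\sum_j y_j^{(k+1)}\). The original update would instead rebuild \(\nabla\Phi(y^{(k+1)})\) from the projected point \(x^{(k)}\) at every step.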

The Lazy Mirror Descent algorithm works as follows:

  1. Initial point \(x^{(0)} \in \mathbb R^n\), step size \(\eta > 0\).
  2. For \(k=0,1,2,\dots\) do
  3. Pick any \(g^{(k)}\in\partial f(x^{(k)})\).
  4. \[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(y^{(k)}) - \eta\cdot g^{(k)}, \nabla \Phi(y^{(0)}) = 0.\]
  5. \[x^{(k+1)} = \Pi_{\mathcal X}^{\Phi}(y^{(k+1)}) := \arg\min_{x\in \mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(k+1)}).\]
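The steps above can be sketched in code. The following is a minimal illustration (ours, not from the post), assuming the negative-entropy mirror map on the probability simplex, for which the Bregman projection is just normalization:

```python
import numpy as np

# Lazy mirror descent (dual averaging) with the negative-entropy mirror map
# Phi(x) = sum_i x_i log x_i on the simplex. Since grad Phi(y) = 1 + log y,
# the lazy update grad Phi(y_{k+1}) = grad Phi(y_k) - eta * g_k amounts to
# accumulating gradients in log-space; the projection step is normalization.

def lazy_mirror_descent(grad, n, eta, steps):
    """Minimize a convex f over the simplex, given a subgradient oracle."""
    log_y = np.zeros(n)                 # grad Phi(y_0) = 0 (up to a constant)
    x = np.full(n, 1.0 / n)             # projection of y_0: uniform point
    avg = np.zeros(n)
    for _ in range(steps):
        g = grad(x)                     # pick g_k in the subdifferential
        log_y -= eta * g                # lazy (dual-averaging) update
        w = np.exp(log_y - log_y.max()) # stabilized exponentiation
        x = w / w.sum()                 # Bregman projection = normalization
        avg += x
    return avg / steps                  # averaged iterate, as in the theorem

# Example: f(x) = c^T x over the simplex; the minimizer puts all mass on
# the coordinate with the smallest c_i (here, index 2).
c = np.array([0.9, 0.5, 0.1])
x_bar = lazy_mirror_descent(lambda x: c, n=3, eta=0.1, steps=2000)
```

Note that the iterate `x` is only needed to query the subgradient; the state that evolves between steps lives entirely in the dual variable `log_y`, which is what makes the method “lazy.”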


This analysis is adapted from Convex Optimization: Algorithms and Complexity. We have the following theorem.

Let \(\Phi\) be a mirror map that is \(\rho\)-strongly convex on \(\mathcal X\cap\mathcal D\) w.r.t. \(||\cdot||\). Let \(R^2 = \sup_{x\in\mathcal X\cap\mathcal D}\Phi(x) - \Phi(x^{(0)})\), and let \(f\) be convex and \(L\)-Lipschitz w.r.t. \(||\cdot||\). Then lazy mirror descent with \(\eta = \frac{R}{L}\sqrt{\frac{\rho}{2t}}\) satisfies

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le 2RL\sqrt{\frac{2}{\rho t}}.\]

Proof: We define \(\Psi_t(x) = \eta \sum_{s=0}^{t}(g^{(s)})^Tx + \Phi(x)\), so that \(x^{(t)}\in\arg\min_{\mathcal X\cap\mathcal D}\Psi_{t-1}(x)\); indeed, the Bregman projection defining \(x^{(t)}\) minimizes \(\Phi(x) - \nabla\Phi(y^{(t)})^Tx = \Phi(x) + \eta\sum_{s=0}^{t-1}(g^{(s)})^Tx\) up to an additive constant. Since \(\Phi\) is \(\rho\)-strongly convex, \(\Psi_t\) is \(\rho\)-strongly convex as well.

Then, we can compute

\begin{align} \Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) \le& \nabla \Psi_{t}(x^{(t+1)})^T(x^{(t+1)} - x^{(t)}) - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2 \newline \le& - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2, \end{align}

where the second inequality follows from the first-order optimality condition at \(x^{(t+1)}\), which minimizes \(\Psi_t\). Next, we observe that

\begin{align} \Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) =& \Psi_{t-1}(x^{(t+1)}) - \Psi_{t-1}(x^{(t)}) + \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)})\newline \ge& \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)}). \end{align}

Putting together the two displays above and using Cauchy–Schwarz, together with the bound \(||g^{(t)}||_{*} \le L\) implied by \(f\) being \(L\)-Lipschitz, one obtains

\[\frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2\le \eta (g^{(t)})^T(x^{(t)} - x^{(t+1)}) \le \eta L ||x^{(t+1)} - x^{(t)}||.\]

This shows that \(||x^{(t+1)} - x^{(t)}|| \le \frac{2\eta L}{\rho}\) and thus with the above display

\[(g^{(t)})^T(x^{(t)} - x^{(t+1)}) \le \frac{2\eta L^2}{\rho}.\]

We now prove the following claim: for every \(x\in\mathcal X\cap\mathcal D\),

\[\sum_{s=0}^{t-1} (g^{(s)})^T(x^{(s)}-x) \le \sum_{s=0}^{t-1} (g^{(s)})^T(x^{(s)}-x^{(s+1)}) + \frac{\Phi(x) - \Phi(x^{(0)})}{\eta},\]

which would conclude the proof.
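To spell out the concluding computation: taking \(x = x^*\) in the claim and using the per-step bound \((g^{(s)})^T(x^{(s)} - x^{(s+1)}) \le 2\eta L^2/\rho\) together with \(\Phi(x^*) - \Phi(x^{(0)}) \le R^2\) gives

\[\sum_{s=0}^{t-1}(g^{(s)})^T(x^{(s)} - x^*) \le \frac{2\eta L^2 t}{\rho} + \frac{R^2}{\eta},\]

so by convexity of \(f\),

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le \frac{1}{t}\sum_{s=0}^{t-1}(g^{(s)})^T(x^{(s)} - x^*) \le \frac{2\eta L^2}{\rho} + \frac{R^2}{\eta t}.\]

Substituting \(\eta = \frac{R}{L}\sqrt{\frac{\rho}{2t}}\) makes both terms equal to \(RL\sqrt{2/(\rho t)}\), yielding the stated bound of \(2RL\sqrt{\frac{2}{\rho t}}\).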

The above inequality is equivalent to

\[\sum_{s=0}^{t-1} (g^{(s)})^Tx^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \le \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta}.\]

We prove this by induction on \(t\). The base case \(t = 0\) reads \(\Phi(x^{(0)}) \le \Phi(x)\), which holds because \(\nabla\Phi(y^{(0)}) = 0\) implies that \(x^{(0)}\) minimizes \(\Phi\) over \(\mathcal X\cap\mathcal D\). For the inductive step,

\begin{align} &\sum_{s=0}^{t-1} (g^{(s)})^T x^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \newline \le & (g^{(t-1)})^Tx^{(t)} + \sum_{s=0}^{t-2} (g^{(s)})^Tx^{(t)} + \frac{\Phi(x^{(t)})}{\eta}\newline \le & \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta}, \end{align}

where the first inequality follows from the induction hypothesis applied at \(x = x^{(t)}\), and the second from the definition of \(x^{(t)}\) as a minimizer.