Theory of Optimization: More on Mirror Descent

In this post, we continue our discussion of mirror descent. We present a variant of mirror descent: lazy mirror descent, also known as Nesterov’s dual averaging.

Lazy Mirror Descent (Nesterov’s Dual Averaging)


In this section, we present a more efficient variant of mirror descent, named Lazy Mirror Descent. We will show that its convergence rate is the same as that of the original mirror descent, up to constants. Lazy mirror descent changes the update of \(y^{(k+1)}\) from

\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta\cdot g^{(k)}\]

to

\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(y^{(k)}) - \eta\cdot g^{(k)}.\]

Moreover, \(y^{(0)}\) is such that \(\nabla \Phi(y^{(0)}) = 0\).
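As a concrete illustration (this example is ours, not part of the original analysis), take the negative-entropy mirror map \(\Phi(x) = \sum_i x_i\log x_i\) on the simplex, for which \((\nabla\Phi(x))_i = 1 + \log x_i\). The lazy update then accumulates gradients in log-space, i.e. multiplicatively:

\[\log y_i^{(k+1)} = \log y_i^{(k)} - \eta\, g_i^{(k)} \quad\Longleftrightarrow\quad y_i^{(k+1)} = y_i^{(k)}\, e^{-\eta g_i^{(k)}},\]

and the Bregman projection onto the simplex reduces to normalization, \(x_i^{(k+1)} = y_i^{(k+1)}/\sum_j y_j^{(k+1)}\). The original update would instead rebuild \(\nabla\Phi(y^{(k+1)})\) from the projected point \(x^{(k)}\) at every step.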

The Lazy Mirror Descent algorithm works as follows:

  1. Initial point \(x^{(0)} \in \mathbb R^n\), step size \(\eta > 0\).
  2. For \(k=0,1,2,\dots\) do
  3. Pick any \(g^{(k)}\in\partial f(x^{(k)})\).
  4. \[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(y^{(k)}) - \eta\cdot g^{(k)}, \nabla \Phi(y^{(0)}) = 0.\]
  5. \[x^{(k+1)} = \Pi_{\mathcal X}^{\Phi}(y^{(k+1)}) := \arg\min_{x\in \mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(k+1)}).\]
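The steps above can be sketched in code. The following is a minimal illustration (ours, not from the post), assuming the negative-entropy mirror map on the probability simplex, for which the Bregman projection is just normalization:

```python
import numpy as np

# Lazy mirror descent (dual averaging) with the negative-entropy mirror map
# Phi(x) = sum_i x_i log x_i on the simplex. Since grad Phi(y) = 1 + log y,
# the lazy update grad Phi(y_{k+1}) = grad Phi(y_k) - eta * g_k amounts to
# accumulating gradients in log-space; the projection step is normalization.

def lazy_mirror_descent(grad, n, eta, steps):
    """Minimize a convex f over the simplex, given a subgradient oracle."""
    log_y = np.zeros(n)                 # grad Phi(y_0) = 0 (up to a constant)
    x = np.full(n, 1.0 / n)             # projection of y_0: uniform point
    avg = np.zeros(n)
    for _ in range(steps):
        g = grad(x)                     # pick g_k in the subdifferential
        log_y -= eta * g                # lazy (dual-averaging) update
        w = np.exp(log_y - log_y.max()) # stabilized exponentiation
        x = w / w.sum()                 # Bregman projection = normalization
        avg += x
    return avg / steps                  # averaged iterate, as in the theorem

# Example: f(x) = c^T x over the simplex; the minimizer puts all mass on
# the coordinate with the smallest c_i (here, index 2).
c = np.array([0.9, 0.5, 0.1])
x_bar = lazy_mirror_descent(lambda x: c, n=3, eta=0.1, steps=2000)
```

Note that the iterate `x` is only needed to query the subgradient; the state that evolves between steps lives entirely in the dual variable `log_y`, which is what makes the method “lazy.”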


This analysis is adapted from Convex Optimization: Algorithms and Complexity. We have the following theorem.

Let \(\Phi\) be a mirror map that is \(\rho\)-strongly convex on \(\mathcal X\cap\mathcal D\) w.r.t. \(||\cdot||\). Let \(R^2 = \sup_{x\in\mathcal X\cap\mathcal D}\Phi(x) - \Phi(x^{(0)})\), and let \(f\) be convex and \(L\)-Lipschitz w.r.t. \(||\cdot||\). Then lazy mirror descent with \(\eta = \frac{R}{L}\sqrt{\frac{\rho}{2t}}\) satisfies

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le 2RL\sqrt{\frac{2}{\rho t}}.\]

Proof: We define \(\Psi_t(x) = \eta \sum_{s=0}^{t}(g^{(s)})^Tx + \Phi(x)\), so that \(x^{(t)}\in\arg\min_{\mathcal X\cap\mathcal D}\Psi_{t-1}(x)\); indeed, the Bregman projection defining \(x^{(t)}\) minimizes \(\Phi(x) - \nabla\Phi(y^{(t)})^Tx = \Phi(x) + \eta\sum_{s=0}^{t-1}(g^{(s)})^Tx\) up to an additive constant. Since \(\Phi\) is \(\rho\)-strongly convex, \(\Psi_t\) is \(\rho\)-strongly convex as well.

Then, we can compute

\begin{align} \Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) \le& \nabla \Psi_{t}(x^{(t+1)})^T(x^{(t+1)} - x^{(t)}) - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2 \newline \le& - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2, \end{align}

where the second inequality follows from the first-order optimality condition at \(x^{(t+1)}\), which minimizes \(\Psi_t\). Next, we observe that

\begin{align} \Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) =& \Psi_{t-1}(x^{(t+1)}) - \Psi_{t-1}(x^{(t)}) + \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)})\newline \ge& \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)}). \end{align}

Putting together the two displays above and using Cauchy–Schwarz, together with the bound \(||g^{(t)}||_{*} \le L\) implied by \(f\) being \(L\)-Lipschitz, one obtains

\[\frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2\le \eta (g^{(t)})^T(x^{(t)} - x^{(t+1)}) \le \eta L ||x^{(t+1)} - x^{(t)}||.\]

This shows that \(||x^{(t+1)} - x^{(t)}|| \le \frac{2\eta L}{\rho}\) and thus with the above display

\[(g^{(t)})^T(x^{(t)} - x^{(t+1)}) \le \frac{2\eta L^2}{\rho}.\]

We now prove the following claim: for every \(x\in\mathcal X\cap\mathcal D\),

\[\sum_{s=0}^{t-1} (g^{(s)})^T(x^{(s)}-x) \le \sum_{s=0}^{t-1} (g^{(s)})^T(x^{(s)}-x^{(s+1)}) + \frac{\Phi(x) - \Phi(x^{(0)})}{\eta},\]

which would conclude the proof.
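To spell out the concluding computation: taking \(x = x^*\) in the claim and using the per-step bound \((g^{(s)})^T(x^{(s)} - x^{(s+1)}) \le 2\eta L^2/\rho\) together with \(\Phi(x^*) - \Phi(x^{(0)}) \le R^2\) gives

\[\sum_{s=0}^{t-1}(g^{(s)})^T(x^{(s)} - x^*) \le \frac{2\eta L^2 t}{\rho} + \frac{R^2}{\eta},\]

so by convexity of \(f\),

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le \frac{1}{t}\sum_{s=0}^{t-1}(g^{(s)})^T(x^{(s)} - x^*) \le \frac{2\eta L^2}{\rho} + \frac{R^2}{\eta t}.\]

Substituting \(\eta = \frac{R}{L}\sqrt{\frac{\rho}{2t}}\) makes both terms equal to \(RL\sqrt{2/(\rho t)}\), yielding the stated bound of \(2RL\sqrt{\frac{2}{\rho t}}\).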

The above inequality is equivalent to

\[\sum_{s=0}^{t-1} (g^{(s)})^Tx^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \le \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta}.\]

We prove this by induction on \(t\). The base case \(t = 0\) reads \(\Phi(x^{(0)}) \le \Phi(x)\), which holds because \(\nabla\Phi(y^{(0)}) = 0\) implies that \(x^{(0)}\) minimizes \(\Phi\) over \(\mathcal X\cap\mathcal D\). For the inductive step,

\begin{align} &\sum_{s=0}^{t-1} (g^{(s)})^T x^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \newline \le & (g^{(t-1)})^Tx^{(t)} + \sum_{s=0}^{t-2} (g^{(s)})^Tx^{(t)} + \frac{\Phi(x^{(t)})}{\eta}\newline \le & \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta}, \end{align}

where the first inequality follows from the induction hypothesis applied at \(x = x^{(t)}\), and the second from the definition of \(x^{(t)}\) as a minimizer.