Theory of Optimization: Mirror Descent

7 minute read


In this post, we will introduce the Mirror Descent algorithm that solves the convex optimization algorithm.

Mirror Descent

Intuition and Algorithm

In the previous analysis for projected gradient descent, we can find that the PGD works for arbitrary Hilbert space \(\mathcal H\). Suppose that now we want to optimize functions in Banach space \(\mathcal B\), and the gradient descent does not make sence. In fact, \(x^{(t)}\) lies in space \(\mathcal B\) and the gradient \(\nabla f(x^{(t)})\) lies in the dual space \(\mathcal B^*\). There is no problem in the Hilbert space \(\mathcal H\), since by Riesz representation theorem \(\mathcal H^*\) is isometric to \(\mathcal H\).

The great insight of Nemirovski and Yudin(creator of Mirror Descent) is that one can still do a gradient descent by first mapping(by a function \(\Phi\) called mirror map) the point \(x\in\mathcal B\) into the dual space \(\mathcal B^*\), performing the gradient update in dual space, and finally mapping back to \(\mathcal B\) and projecting to the constraint set \(\mathcal X\). We will first introduce the mirror map and then discuss the projection step.

Mirror map: Let \(\mathcal D\subset\mathbb R^n\) be a convex open set such that \(\mathbb X\) is included in its closure, that is \(\mathcal X \subset \mathcal{\bar D}\) and \(\mathcal X\cap\mathcal D\neq\phi\). We say that \(\Phi:\mathcal D\to\mathbb R\) is a mirror map if it satisfies the following properties:

  1. \(\Phi\) is strictly convex and differentiable.
  2. The gradient of \(\Phi\) takes all possible values, that is \(\nabla\Phi(\mathcal D) = \mathbb R^n\).
  3. The gradient of \(\Phi\) diverges on the boundary of \(\mathcal D\).

The projection operator is designed by the following Bergman divergence.

Bregman divergence: For any strictly convex function \(\Phi\), we de fine the Bregman divergence as

\[D_{\Phi}(y,x) := \Phi(y) - \Phi(x) - \langle\nabla \Phi(x),y-x\rangle.\]

And we define the projection of \(y\) as

\[\Pi_{\mathcal X}^{\Phi}(y) = \arg\min_{x\in\mathcal X\cap\mathcal D}D_{\Phi}(x,y).\]

The Mirror Descent algorithm works as follow:

  1. Initial point \(x^{(0)} \in \mathbb R^n\) such that \(\nabla \Phi(x^{(0)}) = 0\), step size \(h > 0\).
  2. For \(k=0,1,2,\dots\) do
  3. Pick any \(g^{(k)}\in\partial f(x^{(k)})\).
  4. \[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta\cdot g^{(k)}.\]
  5. \[x^{(k+1)} = \Pi_{\mathcal X}^{\Phi}(y^{(k+1)}) := \arg\min_{x\in \mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(k+1)}).\]

Below is the illustration for the mirror descent(Picture from Convex Opitmization:Algorithms and Complexity) Alt text

Actually, one can also rewrite mirror descent as follows:

\begin{align} x^{(t+1)} =& \arg\min_{x\in\mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(t+1)}) \newline =& \arg\min_{x\in\mathcal X\cap\mathcal D}(\Phi(x) - \Phi(y^{(t+1)}) - \nabla \Phi(y^{(t+1)})^Tx) \newline =& \arg\min_{x\in\mathcal X\cap\mathcal D}(\Phi(x) - (\nabla \Phi (x^{(t)}) - \eta (g^{(t)})^Tx)) \newline =& \arg\min_{x\in\mathcal X\cap\mathcal D}(\eta (g^{(t)})^Tx + D_{\Phi}(x,x^{(t)})) \end{align}

This expression is often taken as the definition of mirror descent. It gives a proximal point of view on mirror descent.

Analysis of Mirror Descent

We first have the following property:

\[(\nabla f(x) - \nabla f(y))^T(x-z) = D_f(x,y) + D_f(z,x) - D_f(z,y).\]

Proof: The proof will be completed by trivial computation.

\begin{align} RHS =& f(x) - f(y) - \nabla f(y)^T(x-y) + f(z) - f(x) \newline &\quad - \nabla f(x)^T(z-x) - f(z) + f(y) + \nabla f(y)^T(z-y) \newline =& \nabla f(x)^T(x-z) - \nabla f(y)^T(x-z) \newline =& LHS. \end{align}

Then we have the following Pythagorean Theorem.

Pythagorean Theorem: Let \(x\in\mathcal X\cap\mathcal D\) and \(y\in\mathcal D\), then

\[(\nabla\Phi(\Pi_{\mathcal X}^{\Phi}(y)) - \nabla\Phi(y))^T(\Pi_{\mathcal X}^{\Phi}(y)-x)\le 0,\]

which also implies

\[D_{\Phi}(x,\Pi_{\mathcal X}^{\Phi}(y)) + D_{\Phi}(\Pi_{\mathcal X}^{\Phi}(y),y) \le D_{\Phi}(x,y).\]

Proof: First, it is easy to show that

\[\nabla_xD_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y).\]

Then it is equivalent to show that

\[\nabla_xD_{\Phi}(\Pi_{\mathcal X}^{\Phi}(y),y)^T(\Pi_{\mathcal X}^{\Phi}(y)-x)\le 0.\]

Let \(D_{\Phi}(\cdot,y) := f(\cdot)\), then \(\Pi_{\mathcal X}^{\Phi}(y)\) is the minimizer of \(f\) in set \(\mathcal X\cap\mathcal D\). Then from the optimality of convex functions, we have

\[\nabla f(x^*)^T(x^* - x)\le 0,\forall x\in \mathcal X\cap\mathcal D,\]

which is just the inequality

\[\nabla_xD_{\Phi}(\Pi_{\mathcal X}^{\Phi}(y),y)^T(\Pi_{\mathcal X}^{\Phi}(y)-x)\le 0.\]

Then, we have the following theorem:

Let \(\Phi\) be a mirror map \(\rho\)-strongly convex on \(\mathcal X\cap\mathcal D\) w.r.t \(||\cdot||\) . Let \(R^2 = \sup_{x\in\mathcal X\cap\mathcal D}\Phi(x) - \Phi(x^{(0)})\), and \(f\) be convex and \(L\)-Lipschitz w.r.t \(||\cdot||\) . Then the mirror descent with \(\eta = \frac{R}{L}\sqrt{\frac{2\rho}{t}}\) satisfies

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le RL\sqrt{\frac{2}{\rho t}}.\]

Proof: Let \(x\in\mathcal X\cap\mathcal D\). The claimed bound will be obtained by taking a limit \(x\to x^*\). By the convexity of \(f\), the definition of mirror descent, and the previous 2 lemmas, we have

\begin{align} &f(x^{(s)}) - f(x) \newline \le& (g^{(s)})^T(x^{(s)} - x) \newline =& \frac{1}{\eta}(\nabla \Phi(x^{(s)}) - \nabla\Phi(y^{(s+1)}))^T(x^{(s)} - x) \newline =& \frac{1}{\eta}\left(D_{\Phi}(x,x^{(s)})+D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x,y^{(s+1)})\right) \newline \le & \frac{1}{\eta}\left(D_{\Phi}(x,x^{(s)})+D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x,x^{(s+1)}) - D_{\Phi}(x^{(s+1)},y^{(s+1)})\right). \end{align}

The term \(D_{\Phi}(x,x^{(s)})-D_{\Phi}(x,x^{(s+1)})\) will lead to a telescopic sum, and it remains to bound the other term. By the \(\rho\)-strongly convexity of the mirror map and \(az-bz^2 \le \frac{a^2}{4b}\), we have

\begin{align} & D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x^{(s+1)},y^{(s+1)}) \newline =& \Phi(x^{(s)}) - \Phi(x^{(s+1)}) - \nabla \Phi(y^{(s+1)})^T(x^{(s)} - x^{(s+1)}) \newline \le& (\nabla\Phi(x^{(s)})- \nabla \Phi(y^{(s+1)}))^T(x^{(s)} - x^{(s+1)}) - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline =& \eta (g^{(s)})^T(x^{(s)} - x^{(s+1)}) - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline \le& \eta L ||x^{(s)} - x^{(s+1)}|| - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline \le& \frac{(\eta L)^2}{2\rho}. \end{align}

Then we have

\[\sum_{s=0}^{t-1}(f(x^{(s)}) - f(x)) \le \frac{D_{\Phi}(x,x^{(0)}) - D_{\Phi}(x,x^{(t)})}{\eta} + \eta\frac{L^2 t}{2\rho},\]

which concludes the proof up to trivial computation.

Standard Setups for Mirror Descent

Ball setup: Taking \(\Phi(x) = \frac{1}{2}||x||^2\) on \(\mathcal D = \mathbb R^n\), the function \(\Phi\) is a mirror map and strongly convex w.r.t \(||\cdot ||_2\), and furthermore, the associated Bregman divergence is given by

\begin{align} D_{\Phi}(x,y) =& \Phi(x) - \Phi(y) - \nabla\Phi(y)^T(x-y) \newline =& \frac{||x||^2}{2} - \frac{||y||^2}{2} - y^T(x-y) \newline =& \frac{1}{2}||x-y||^2. \end{align}

In this case, the mirror descent is exactly equivalent to projected subgradient descent.

Simplex setup: Now we focus on case where \(\Phi(x) = \sum x_i\log x_i\) and the simplex \(\mathcal D = \{x_i\ge 0, \sum x_i = 1\}\).

Step formula: From computation, we can know that

\[x_i^{(k+1)} = e^{-\eta g_i^{(k)}}x_i^{(k)} / Z,\]

for some normalization constant \(Z\). Note that this update formula is just the multiplicative weight updating.

Strong convexity: From computation, we can know that \(\Phi(x)\) is 1-strongly convex w.r.t 1-norm, i.e.

\[\Phi(y) - \Phi(x) - \nabla\Phi(x)^T(y-x) \ge \frac{1}{2}||x-y||_1^2.\]

Hence, we have \(\rho = 1\).

Diameter: Direct computation will show that \(-\log n\le\Phi(x)\le 0\). Hence, \(R^2 \le \log n\).

Final result: We have the following theorem.

Let \(f\) be 1-Lipschitz functions on \(||\cdot||_1\). Then, the mirror descent with the mirror map discussed above outputs \(x\) in \(T\) iterations such that

\[f(x) - \min_x f(x) \le \sqrt{\frac{2\log n}{T}}.\]

In comparison, projected gradient descent gives \(\sqrt{\frac{n}{T}}\), which is much larger.

Note that the proofs of projected subgradient descent in the previous post and the mirror descent in this post do not require that we are solving the same convex function \(f\). Based on this observation, we can easily extend the previous results to the online setting. For the online mirror descent(online subgradient descent is a special case of online mirror descent), we have

\[\sum_{s=0}^{t-1}(f(x^{(s)}) - f(x)) \le RL\sqrt{\frac{2t}{\rho}},\]

which shows that the regret is \(O(\sqrt{t})\).