Theory of Optimization: Mirror Descent

5 minute read


In this post, we will introduce the Mirror Descent algorithm that solves the convex optimization algorithm.

Mirror Descent

Intuition and Algorithm

In the previous analysis for projected gradient descent, we can find that the PGD works for arbitrary Hilbert space . Suppose that now we want to optimize functions in Banach space , and the gradient descent does not make sence. In fact, lies in space and the gradient lies in the dual space . There is no problem in the Hilbert space , since by Riesz representation theorem is isometric to .

The great insight of Nemirovski and Yudin(creator of Mirror Descent) is that one can still do a gradient descent by first mapping(by a function called mirror map) the point into the dual space , performing the gradient update in dual space, and finally mapping back to and projecting to the constraint set . We will first introduce the mirror map and then discuss the projection step.

Mirror map: Let be a convex open set such that is included in its closure, that is and . We say that is a mirror map if it satisfies the following properties:

  1. is strictly convex and differentiable.
  2. The gradient of takes all possible values, that is .
  3. The gradient of diverges on the boundary of .

The projection operator is designed by the following Bergman divergence.

Bregman divergence: For any strictly convex function , we de fine the Bregman divergence as

And we define the projection of as

The Mirror Descent algorithm works as follow:

  1. Initial point such that , step size .
  2. For do
  3. Pick any .

Below is the illustration for the mirror descent(Picture from Convex Opitmization:Algorithms and Complexity) Alt text

Actually, one can also rewrite mirror descent as follows:

\begin{align} x^{(t+1)} =& \arg\min_{x\in\mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(t+1)}) \newline =& \arg\min_{x\in\mathcal X\cap\mathcal D}(\Phi(x) - \Phi(y^{(t+1)}) - \nabla \Phi(y^{(t+1)})^Tx) \newline =& \arg\min_{x\in\mathcal X\cap\mathcal D}(\Phi(x) - (\nabla \Phi (x^{(t)}) - \eta (g^{(t)})^Tx)) \newline =& \arg\min_{x\in\mathcal X\cap\mathcal D}(\eta (g^{(t)})^Tx + D_{\Phi}(x,x^{(t)})) \end{align}

This expression is often taken as the definition of mirror descent. It gives a proximal point of view on mirror descent.

Analysis of Mirror Descent

We first have the following property:

Proof: The proof will be completed by trivial computation.

\begin{align} RHS =& f(x) - f(y) - \nabla f(y)^T(x-y) + f(z) - f(x) \newline &\quad - \nabla f(x)^T(z-x) - f(z) + f(y) + \nabla f(y)^T(z-y) \newline =& \nabla f(x)^T(x-z) - \nabla f(y)^T(x-z) \newline =& LHS. \end{align}

Then we have the following Pythagorean Theorem.

Pythagorean Theorem: Let and , then

which also implies

Proof: First, it is easy to show that

Then it is equivalent to show that

Let , then is the minimizer of in set . Then from the optimality of convex functions, we have

which is just the inequality

Then, we have the following theorem:

Let be a mirror map -strongly convex on w.r.t . Let , and be convex and -Lipschitz w.r.t . Then the mirror descent with satisfies

Proof: Let . The claimed bound will be obtained by taking a limit . By the convexity of , the definition of mirror descent, and the previous 2 lemmas, we have

\begin{align} &f(x^{(s)}) - f(x) \newline \le& (g^{(s)})^T(x^{(s)} - x) \newline =& \frac{1}{\eta}(\nabla \Phi(x^{(s)}) - \nabla\Phi(y^{(s+1)}))^T(x^{(s)} - x) \newline =& \frac{1}{\eta}\left(D_{\Phi}(x,x^{(s)})+D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x,y^{(s+1)})\right) \newline \le & \frac{1}{\eta}\left(D_{\Phi}(x,x^{(s)})+D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x,x^{(s+1)}) - D_{\Phi}(x^{(s+1)},y^{(s+1)})\right). \end{align}

The term will lead to a telescopic sum, and it remains to bound the other term. By the -strongly convexity of the mirror map and , we have

\begin{align} & D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x^{(s+1)},y^{(s+1)}) \newline =& \Phi(x^{(s)}) - \Phi(x^{(s+1)}) - \nabla \Phi(y^{(s+1)})^T(x^{(s)} - x^{(s+1)}) \newline \le& (\nabla\Phi(x^{(s)})- \nabla \Phi(y^{(s+1)}))^T(x^{(s)} - x^{(s+1)}) - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline =& \eta (g^{(s)})^T(x^{(s)} - x^{(s+1)}) - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline \le& \eta L ||x^{(s)} - x^{(s+1)}|| - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline \le& \frac{(\eta L)^2}{2\rho}. \end{align}

Then we have

which concludes the proof up to trivial computation.

Standard Setups for Mirror Descent

Ball setup: Taking on , the function is a mirror map and strongly convex w.r.t , and furthermore, the associated Bregman divergence is given by

\begin{align} D_{\Phi}(x,y) =& \Phi(x) - \Phi(y) - \nabla\Phi(y)^T(x-y) \newline =& \frac{||x||^2}{2} - \frac{||y||^2}{2} - y^T(x-y) \newline =& \frac{1}{2}||x-y||^2. \end{align}

In this case, the mirror descent is exactly equivalent to projected subgradient descent.

Simplex setup: Now we focus on case where and the simplex .

Step formula: From computation, we can know that

for some normalization constant . Note that this update formula is just the multiplicative weight updating.

Strong convexity: From computation, we can know that is 1-strongly convex w.r.t 1-norm, i.e.

Hence, we have .

Diameter: Direct computation will show that . Hence, .

Final result: We have the following theorem.

Let be 1-Lipschitz functions on . Then, the mirror descent with the mirror map discussed above outputs in iterations such that

In comparison, projected gradient descent gives , which is much larger.

Note that the proofs of projected subgradient descent in the previous post and the mirror descent in this post do not require that we are solving the same convex function . Based on this observation, we can easily extend the previous results to the online setting. For the online mirror descent(online subgradient descent is a special case of online mirror descent), we have

which shows that the regret is .