Theory of Optimization: Frank-Wolfe Algorithm

Published:

In this post, we describe a new geometry dependent algorithm that relies on different set of assumptions. The algorithm is called conditional gradient descent, aka Frank-Wolfe.

Frank-Wolfe

Algorithm

Frank-Wolfe algorithm solves the following convex optimization problem

for $f$ such that $\nabla f(x)$ is `Lipschitz’ in a certain sense. The algorithm reduces the problem into a sequence of linear optimization problems.

Frank-Wolfe algorithm has the following procedure:

1. Initial point $x^{(0)}\in\mathbb R^n$, step size $h>0$.
2. For $k=0,1,\dots$ do
3. Compute $y^{(k)} = \arg\min_{y\in\mathcal D}\langle y, \nabla f(x^{(k)})\rangle$.
4. $x^{(k+1)} \leftarrow (1-h_k)x^{(k)} + h_ky^{(k)}$ with $h_k = \frac{2}{k+2}$.

Analysis

We have the following theorem for Frank-Wolfe algorithm.

Given a convex function $f$ on a convex set $\mathcal D$ and a constant $C_f$ such that

for any $x,y\in\mathcal D$ and $h\in [0,1]$, we have

Proof: By the definition of $C_f$, we have

Note that from the convexity of $f$ and the definition of $y^{(k)}$, we have

Hence, we have

Let $\epsilon_k = f(x^{(k)}) - f(x^*)$, we have

Note that $\epsilon_0 = f^{(0)} - f(x^*) \le \frac{1}{2}C_f$, we can prove the theorem by induction.

Note that if $\nabla f(x)$ is L-Lipschitz with respect to $||\cdot||$, over the domain $\mathcal D$, then $C_f \le L\cdot diam_{||\cdot||}(\mathcal D)^2$.

Sparsity Analysis

For the domain $\mathcal D$ is a simplex, the step $y^{(k)}$ is always a vertex of the simplex, and hence, each step of Frank-Wolfe increases the sparsity by at most $1$. We can think that the convergence result of Frank Wolfe proves that there is an approximate sparse solution of the optimization problem.

Given a polytope $\mathcal P = conv(v_i)$ lies inside an unit ball. For any $u\in\mathcal P$, there are $k = O(\frac{1}{\epsilon^2} )$ vertices $v_1,\dots, v_k$ of $\mathcal P$ such that

for some $\sum_i \lambda_i = 1, \lambda_i \ge 0$.

Proof: Run Frank-Wolfe algorithm with $f(x) = ||x-u||_2^2$. Note that $f$ is 1-Lipschitz with respect to $||x||_2$ and that the diameter of $\mathcal P$ is bounded by $1$. Hence, we have $C_f \le 1$.

Therefore, Frank-Wolfe algorithm shows that

Since $f(u) = 0$, we have $||x^{(k)} - u||^2 \le \frac{4}{k+2}$, and this gives the result.

Tags: