# Theory of Optimization: Frank-Wolfe Algorithm

In this post, we describe a new geometry-dependent algorithm that relies on a different set of assumptions. The algorithm is called conditional gradient descent, also known as Frank-Wolfe.

## Frank-Wolfe

### Algorithm

The Frank-Wolfe algorithm solves the convex optimization problem

$\min_{x\in\mathcal D}f(x),$

for $f$ such that $\nabla f(x)$ is Lipschitz in a certain sense. The algorithm reduces the problem to a sequence of linear optimization problems over $\mathcal D$.

The Frank-Wolfe algorithm proceeds as follows:

1. Choose an initial point $x^{(0)}\in\mathcal D$.
2. For $k=0,1,\dots$ do
3. Compute $y^{(k)} = \arg\min_{y\in\mathcal D}\langle y, \nabla f(x^{(k)})\rangle$.
4. $x^{(k+1)} \leftarrow (1-h_k)x^{(k)} + h_ky^{(k)}$ with $h_k = \frac{2}{k+2}$.
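To make the procedure concrete, here is a minimal Python sketch under an assumed setup: the domain is the probability simplex (where the linear subproblem is solved by picking the vertex with the smallest gradient coordinate) and the objective is a quadratic. Both are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters=1000):
    """Generic Frank-Wolfe loop. `grad(x)` returns the gradient at x and
    `lmo(g)` solves the linear subproblem min_{y in D} <y, g>."""
    x = x0.copy()
    for k in range(num_iters):
        g = grad(x)
        y = lmo(g)                   # linear optimization step
        h = 2.0 / (k + 2)            # step size h_k = 2/(k+2)
        x = (1 - h) * x + h * y      # convex combination stays in D
    return x

# Illustrative instance: minimize ||x - u||_2^2 over the probability
# simplex, where the LMO returns the vertex e_i minimizing <e_i, g>.
rng = np.random.default_rng(0)
n = 5
u = rng.random(n)

def lmo_simplex(g):
    y = np.zeros_like(g)
    y[np.argmin(g)] = 1.0
    return y

x = frank_wolfe(lambda x: 2 * (x - u), lmo_simplex, np.ones(n) / n)
```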

### Analysis

We have the following theorem for the Frank-Wolfe algorithm.

Given a convex function $f$ on a convex set $\mathcal D$ and a constant $C_f$ such that

$f((1-h)x + hy)\le f(x) + h\langle \nabla f(x), y-x\rangle + \frac{1}{2}C_f h^2$

for any $x,y\in\mathcal D$ and $h\in [0,1]$, we have, for every $k\ge 1$,

$f(x^{(k)}) - f(x^*) \le \frac{2C_f}{k+2},$

where $x^*\in\arg\min_{x\in\mathcal D}f(x)$.

Proof: By the definition of $C_f$, we have

$f(x^{(k+1)})\le f(x^{(k)}) + h_k\langle \nabla f(x^{(k)}), y^{(k)}-x^{(k)}\rangle + \frac{1}{2}C_f h_k^2.$

Note that from the convexity of $f$ and the definition of $y^{(k)}$, we have

$f(x^*) \ge f(x^{(k)}) + \langle \nabla f(x^{(k)}), x^* - x^{(k)}\rangle \ge f(x^{(k)})+\langle \nabla f(x^{(k)}), y^{(k)} - x^{(k)}\rangle.$

Hence, we have

$f(x^{(k+1)})\le f(x^{(k)}) - h_k(f(x^{(k)}) - f(x^*)) + \frac{1}{2}C_f h_k^2.$

Letting $\epsilon_k = f(x^{(k)}) - f(x^*)$, we have

$\epsilon_{k+1} \le (1-h_k)\epsilon_k + \frac{1}{2}C_f h_k^2.$

Since $h_0 = 1$, the recursion gives $\epsilon_1 \le \frac{1}{2}C_f \le \frac{2C_f}{3}$ regardless of $\epsilon_0$, which is the base case. For the inductive step, if $\epsilon_k \le \frac{2C_f}{k+2}$, then

$\epsilon_{k+1} \le \left(1-\frac{2}{k+2}\right)\frac{2C_f}{k+2} + \frac{2C_f}{(k+2)^2} = \frac{2C_f(k+1)}{(k+2)^2} \le \frac{2C_f}{k+3},$

where the last inequality follows from $(k+1)(k+3)\le (k+2)^2$. This proves the theorem.

Note that if $\nabla f(x)$ is $L$-Lipschitz with respect to $||\cdot||$ over the domain $\mathcal D$, then $C_f \le L\cdot \mathrm{diam}_{||\cdot||}(\mathcal D)^2$.
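To see the bound in action, the following sketch (an illustrative setup, not from the analysis above) runs the iteration on the probability simplex with $f(x) = ||x-u||_2^2$: here $\nabla f$ is $2$-Lipschitz, the simplex has $\ell_2$-diameter $\sqrt 2$, so $C_f \le 4$, and the observed gap should stay below $\frac{2C_f}{k+2}$.

```python
import numpy as np

# Empirical check of the 2*C_f/(k+2) bound (illustrative setup):
# f(x) = ||x - u||_2^2 over the probability simplex, with u inside the
# simplex so that f(x*) = 0. Here grad f is 2-Lipschitz and the simplex
# has l2-diameter sqrt(2), so C_f <= 2 * 2 = 4.
rng = np.random.default_rng(0)
n = 20
u = rng.random(n)
u /= u.sum()                                 # a point inside the simplex

C_f = 4.0
x = np.ones(n) / n                           # start at the barycenter
for k in range(100):
    g = 2 * (x - u)                          # gradient of ||x - u||^2
    y = np.zeros(n)
    y[np.argmin(g)] = 1.0                    # LMO: best simplex vertex
    h = 2.0 / (k + 2)
    x = (1 - h) * x + h * y
    gap = np.sum((x - u) ** 2)               # f(x^{(k+1)}) - f(x*)
    assert gap <= 2 * C_f / (k + 3) + 1e-12  # theorem bound at step k+1
```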

### Sparsity Analysis

When the domain $\mathcal D$ is a simplex, the point $y^{(k)}$ is always a vertex of the simplex, so each Frank-Wolfe step increases the support of the iterate by at most $1$. The convergence result can therefore be read as proving that the optimization problem admits an approximately optimal sparse solution.
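A quick numerical check of this claim, under the same illustrative simplex setup as above: starting from a vertex, the number of nonzero coordinates of the iterate grows by at most one per step.

```python
import numpy as np

# Sparsity check (illustrative setup): each FW step mixes in a single
# simplex vertex, so x^{(k)} has at most k nonzeros when we start
# from a vertex.
rng = np.random.default_rng(1)
n = 50
u = rng.random(n)
u /= u.sum()

x = np.zeros(n)
x[0] = 1.0                                   # start at a vertex
for k in range(10):
    g = 2 * (x - u)
    y = np.zeros(n)
    y[np.argmin(g)] = 1.0
    h = 2.0 / (k + 2)
    x = (1 - h) * x + h * y                  # adds at most one new vertex
    print(k + 1, np.count_nonzero(x))        # support size <= k + 1
```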

Let $\mathcal P = \mathrm{conv}(v_1,\dots,v_m)$ be a polytope contained in the unit ball. For any $u\in\mathcal P$, there are $k = O(\frac{1}{\epsilon^2})$ vertices $v_{i_1},\dots, v_{i_k}$ of $\mathcal P$ such that

$||\sum_{j=1}^k\lambda_j v_{i_j} -u||_2 \le \epsilon,$

for some $\lambda_j \ge 0$ with $\sum_j \lambda_j = 1$. (This is the approximate Carathéodory theorem.)

Proof: Run the Frank-Wolfe algorithm with $f(x) = ||x-u||_2^2$, starting from a vertex of $\mathcal P$. Note that $\nabla f$ is $2$-Lipschitz with respect to $||\cdot||_2$ and that the diameter of $\mathcal P$ is bounded by $2$, since $\mathcal P$ lies in the unit ball. Hence, we have $C_f \le 2\cdot 2^2 = 8$.

Therefore, the Frank-Wolfe algorithm gives

$f(x^{(k)}) - f(x^*) \le \frac{16}{k+2}.$

Since $u\in\mathcal P$ and $f(u) = 0$, the minimum value is $f(x^*) = 0$, so $||x^{(k)} - u||_2^2 \le \frac{16}{k+2}$. Each iteration adds at most one new vertex to the convex combination, so after $k = O(\frac{1}{\epsilon^2})$ iterations, $x^{(k)}$ is a convex combination of $O(\frac{1}{\epsilon^2})$ vertices within distance $\epsilon$ of $u$.
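The proof translates directly into code. The sketch below uses made-up test data (random unit vertices and a dense target $u$); it runs Frank-Wolfe for $k = 16/\epsilon^2$ iterations, matching the corrected constants above, and checks that the output is within $\epsilon$ of $u$.

```python
import numpy as np

# Approximate Caratheodory via Frank-Wolfe (illustrative data): vertices
# on the unit sphere, target u a dense convex combination of them.
rng = np.random.default_rng(2)
m, d = 200, 30
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # vertices in the unit ball
w = rng.random(m)
w /= w.sum()
u = w @ V                                      # target point in conv(V)

eps = 0.1
k_max = int(16 / eps**2)                       # k = O(1/eps^2) iterations
x = V[0].copy()                                # start at a vertex
for k in range(k_max):
    g = 2 * (x - u)                            # gradient of ||x - u||^2
    y = V[np.argmin(V @ g)]                    # LMO over the vertex list
    h = 2.0 / (k + 2)
    x = (1 - h) * x + h * y
print(np.linalg.norm(x - u))                   # <= eps by the analysis
```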
