# Posts by Tags

## Theory of Optimization: More on Mirror Descent

Published:

In this post, we will continue our discussion of mirror descent. We will present a variant of mirror descent: lazy mirror descent, also known as Nesterov's dual averaging. Read more

## Theory of Optimization: Frank-Wolfe Algorithm

Published:

In this post, we describe a new geometry-dependent algorithm that relies on a different set of assumptions. The algorithm is called conditional gradient descent, also known as Frank-Wolfe. Read more

## Theory of Optimization: Mirror Descent

Published:

In this post, we will introduce the Mirror Descent algorithm for solving convex optimization problems. Read more

## Theory of Optimization: Projected (Sub)Gradient Descent

Published:

In this post, we will continue our analysis of gradient descent. In the previous lecture, we assumed that every function has an $L$-Lipschitz gradient. For general $L$-smooth functions, gradient descent reaches a first-order $\epsilon$-critical point in $O(\frac{1}{\epsilon^2})$ iterations. When the function is convex, we showed that $O(\frac{1}{\epsilon})$ iterations suffice to obtain a solution that differs from the optimum by at most $\epsilon$. When the function is both strongly convex and smooth, the number of iterations drops to $O(\mathrm{poly}(\log \frac{1}{\epsilon}))$. However, the previous post assumed that the function is smooth, which implies that it is differentiable everywhere. In this post, we will instead assume that the function is convex but not necessarily smooth. Moreover, the previous post focused on the unconstrained case; here we will also analyze constrained minimization. Read more
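As a concrete illustration of the constrained, nonsmooth setting described above, here is a minimal NumPy sketch of projected subgradient descent with iterate averaging. The step schedule $\eta_t \propto 1/\sqrt{t}$ and the averaged output follow the standard analysis; the function names, the constants `R` and `G`, and the $\ell_1$-over-the-unit-ball example are illustrative assumptions, not taken from the post.

```python
import numpy as np

def projected_subgradient(subgrad_f, project, x0, steps, R=1.0, G=1.0):
    """Projected subgradient descent with averaging:
    take a subgradient step, project back onto the feasible set,
    and return the average of the projected iterates."""
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    for t in range(steps):
        eta = R / (G * np.sqrt(t + 1))   # decaying step size ~ 1/sqrt(t)
        x = project(x - eta * subgrad_f(x))
        avg += x
    return avg / steps

# Hypothetical example: minimize f(x) = |x_1| + |x_2| over the unit ball.
# A subgradient of the l1-norm is sign(x); projection rescales onto the ball.
proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
x_bar = projected_subgradient(np.sign, proj_ball, x0=[0.9, -0.9], steps=2000)
```

The averaging step matters: individual iterates of subgradient descent can oscillate around the minimizer, while the running average enjoys the $O(1/\sqrt{T})$ suboptimality guarantee.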

## Theory of Optimization: Gradient Descent

Published:

In this post, we will review the most basic and most intuitive optimization method: gradient descent.

The gradient descent algorithm works as follows. The algorithm requires an initial point $x^{(0)}\in\mathbb R^n$ and a step size $h > 0$. It then repeatedly executes the update $x^{(t+1)} = x^{(t)} - h\nabla f(x^{(t)})$ until $\|\nabla f(x^{(t)})\| \le \epsilon$. In the rest of this post, we will assume that the gradient of $f$ is $L$-Lipschitz; in that case we say that $f$ has an $L$-Lipschitz gradient, or that $f$ is $L$-smooth. Read more
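The loop just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the post's implementation; the function names and the quadratic test function are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, h, eps=1e-6, max_iter=10_000):
    """Fixed-step gradient descent: repeat x <- x - h * grad_f(x)
    until ||grad_f(x)|| <= eps (or max_iter is reached)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion from the text
            break
        x = x - h * g
    return x

# Hypothetical example: f(x) = ||x||^2 / 2, so grad f(x) = x and the
# minimizer is the origin.
x_star = gradient_descent(lambda x: x, x0=[3.0, -4.0], h=0.5)
```

For this quadratic, any step size $h < 2/L$ with $L = 1$ converges; the choice of $h$ is exactly where the $L$-smoothness assumption enters the analysis.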