Theory of Optimization: Preliminaries and Basic Properties

7 minute read


Recently, I find an interesting course taught by Prof. Yin Tat Lee at UW. The course is called `Theory of Optimization and Continuous Algorithms’, and the lecture notes are available under the homepage of this courseuw-cse535-winter19. As a great fan of optimization theory and algorithm design, I think I will follow this course and write a bunch of blogs to record my study of this course.

Most of the materials in this series of blogs will follow the lecture notes of the course, and and interesting optimization book Convex Optimization: Algorithms and Complexity by Sebastien Bubeck.

Since this is the first blog about this course, I will present the preliminaries of the optimization theory, and some basic knowledge about convex optimization, including some basic properties of convex functions.


First, we will introduce some of the preliminaries about theory of optimization. All of the definitions will be used later.


First, we will introduce the notion of lipschitz and differentiable

Lipschitz: A function \(f:V\to W\) is L-lipschitz if

\[||f(x) - f(y)||_W \le L\cdot ||x - y||_V,\]

where the norms \(||\cdot||_{W}\) and \(||\cdot||_{V}\) are \(l_2\) norms if unspecified.

Differentiable: A function \(f\in\mathcal C^k(\mathbb R^n)\) is k-differentiable if its \(k^{th}\) derivative is continuous.

Linear Algebra

The notion of positive semi-definite and the norm of matrices are also important in the world of optimization. We will introduce these notions.

Positive semi-definite: A symmetric matrix \(A\) is positive semi-defi nite (PSD) if \(x^T Ax \ge 0\) for all \(x \in \mathbb R^n\). We write \(A \succeq B\) if \(A - B\) is PSD.

Trace, norm of matrix: For any matrix A, we defi ne

\[trA = \sum A_{ii}, ||A||_F^2 = tr(A^TA), ||A||_{op} = \max_{||x||_2\le 1}||Ax||_2.\]

For symmetric matrix \(A\), we have \(trA = \sum \lambda_i, ||A||_F = \sum \lambda_i^2\) and \(||A||_{op} = \max_i | \lambda_i |\), where \(\lambda_i\) are the eigenvalues of matrix \(A\).


As far as I know, we can get some intuitions of the design and proofs of some optimization algorithm through the Taylor’s expansion.

Taylor’s Remainder Theorem: For any \(g \in \mathcal C^{k+1}(\mathbb R)\), and any \(x\) and \(y\), there is a \(\xi\in [x, y]\) such that

\[g(y) = \sum_{j=0}^k g^{(j)}(x)\frac{(y-x)^j}{j!} + g^{(k+1)}(\xi)\frac{(y-x)^{k+1}}{(k+1)!}.\]


KL-divergence: The KL-divergence of a density \(\rho\) with respect to another density \(\nu\) is defined as

\[D_{KL}(\rho || \nu ) = \int \rho(x) \log \frac{\rho(x)}{\nu(x)}dx.\]

Ito’s lemma: For any process \(x_t \in \mathbb R^n\) satisfying \(dx_i = \mu(x_t)dt + \sigma(x_i)dW_t\), where \(\mu(x_t)\in\mathbb R^n\) and \(\sigma(x_t)\in\mathbb R^{n\times m}\), we have that

\[df(x_t) = \nabla f(x_t)^T\mu(x_t)dt + \nabla f(x_t)^T\sigma(x_t)dW_t + \frac{1}{2}tr(\sigma(x_t)^T\nabla^2f(x_t)\sigma(x_t))dt.\]

Introduction–Basic properties of convex functions

In this section, we will introduce some basic properties of convex functions, and we will also give some examples.


Here will briefly define the convexity of sets and functions.

Convex set: A set \(K\) in \(\mathbb R^n\) is convex if for every pair of points \(x,y\in K\), we have \([x,y] \subseteq K\).

Convex function: A function \(f:\mathbb R^n\to\mathbb R\) is

  1. Convex: if for any \(\lambda\in [0,1]\), we have \(f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)\).
  2. Concave: if for any \(\lambda\in [0,1]\), we have \(f(\lambda x + (1-\lambda) y) \ge \lambda f(x) + (1-\lambda) f(y)\).
  3. Logconcave: if \(f\) is nonnegative and \(\log f\) is concave.
  4. Quasiconvex: if the level sets of \(f\), defined by \(\{x:f(x)\le t\}\) are convex for all \(t\).

Seperation theorems

Why is convex functions so useful? Maybe convexity is not only general enough to model different kinds of problem, and also simple enough to be solved efficiently. In this section, we will introduce the seperation theorem for convex sets, which allows us to do binary search to find a point in a convex set. This is the basis for polynomial-time algorithms for optimizing general convex functions.

Seperation of a point and a closed convex set: Let \(K\) be a closed convex set in \(\mathbb R^n\) and \(y\notin K\). There is a non-zero \(\theta\in\mathbb R^n\) such that

\[\langle \theta, y \rangle > \max_{x\in K}\langle \theta, x \rangle\]

Proof: Let \(x^*\) be a point in \(K\) closest to \(y\), i.e.

\[x^* \in \arg\min_{x\in K}||x-y||^2.\]

(Such a minimizer always exists for closed convex sets, this is sometimes calle Hilbert’s projection theorem). Using convexity of \(K\), for any \(x \in K\) and any \(0 \le t \le 1\) , we have that

\[||(1 - t)x^* + tx - y||^2 \ge ||x^* - y||^2.\]

Expand the LHS, we have

\begin{align} ||(1 - t)x^* + tx - y||^2 =& ||x^* - y + t(x - x^)||^2 \newline =& ||x^ - y||^2 + 2t\langle x^* - y, x - x^*\rangle +O(t^2). \end{align}

Taking \(t\to 0^{+}\), we have \(\langle x^* - y, x - x^*\rangle\) for all \(x\in K\).

Taking \(\theta = y - x^*\), we have

\[\langle \theta, y \rangle >\langle \theta, x^* \rangle \ge \langle \theta, x \rangle,\forall x\in K.\]

From the above theorem, we know that a polytope is as general as a convex set. As a corollary, we have

Intersection of halfspaces: Any closed convex set \(K\) can be written as the intersection of halfspaces as follows:

\[K = \bigcap_{\theta\in\mathbb R^n}\left\{x:\langle \theta,x\rangle\le\max_{y\in K}\langle\theta,y\rangle\right\}.\]

In other words, any convex set is a limit of a sequence of polyhedra. Next, we will prove a `seperation theorem’ for convex functions. Based on this theorem, one can design polynomial algorithms for optimizing convex functions.

First order definition: Let \(f\in\mathcal C^1(\mathbb R^n)\) be convex. Then for any \(x,y\in\mathbb R^n\), we have

\[f(y) \ge f(x) + \nabla f(x)^T(y-x).\]

Proof: Fix any \(x,y\in\mathbb R^n\). Let \(g(t) = f((1-t)x + ty)\). Since \(f\) is convex, so is \(g\)(any convex function is also convex when restricted to a line), and we have

\[g(t) \le (1-t)g(0) + tg(1),\]

which implies that

\[g(1) \ge g(0) + \frac{g(t) - g(0)}{t}.\]

Taking \(t\to 0^{+}\), we have \(g(1) \ge g(0) + g'(0)\). Applying the chain rule, we have

\[f(y) \ge f(x) + \nabla f(x)^T(y-x).\]

Actually, the above theorem can be viewed as an equivalent definition of the convex functions. It is easy to show that, when

\[f(y) \ge f(x) + \nabla f(x)^T(y-x),\]

holds for all \(x,y\in K\)(\(K = \text{dom}f\), which is the domain of \(f\), is convex), we have \(f\) is convex.

Then from the previous theorem, we have the first order optimality condition, which is shown as follow

First order condition for unconstraint problems: Let \(f\in\mathcal C^1(\mathbb R^n)\) be convex. Then \(x\) is the minimizer of \(\min_{x\in\mathbb R^n}f(x)\) if and only if \(\nabla f(x) = 0\).

Proof: If \(\nabla f(x) = 0\), then

\[f(x - \epsilon \nabla f(x)) = f(x) - (\epsilon + o(\epsilon))||f(x)||^2 < f(x),\]

for small enough \(\epsilon\). Hence, such a point cannot be the minimizer.

On the other hand, if \(\nabla f(x) = 0\), the previous theorem shows that

\[f(y) \ge f(x) + \nabla f(x)^T(y - x) = f(x), \text{for all } y.\]