<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://haoyuzhao123.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://haoyuzhao123.github.io/" rel="alternate" type="text/html" /><updated>2026-02-14T20:59:30-08:00</updated><id>https://haoyuzhao123.github.io/feed.xml</id><title type="html">Haoyu Zhao’s Website</title><subtitle>An amazing website</subtitle><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><entry><title type="html">Deep Reinforcement Learning: Model Based Reinforcement Learning</title><link href="https://haoyuzhao123.github.io/deeprl/drl6-mbrl/" rel="alternate" type="text/html" title="Deep Reinforcement Learning: Model Based Reinforcement Learning" /><published>2020-07-18T00:00:00-07:00</published><updated>2020-07-18T00:00:00-07:00</updated><id>https://haoyuzhao123.github.io/deeprl/drl6-mbrl</id><content type="html" xml:base="https://haoyuzhao123.github.io/deeprl/drl6-mbrl/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mermaid/8.6.0/mermaid.min.js"></script>

<!--more-->

<div class="mermaid">
graph TD;
    id1[Time of Planning]--&gt;id2[Decision Time<br />Planning];
    id1--&gt;id3[Background<br />Planning];
    id2--&gt;id4[Continuous<br />Actions];
    id2--&gt;id5[Discrete<br />Actions];
    id4--&gt;id6[Shooting];
    id4--&gt;id7[Collocation];
    id3--&gt;id8[Simulate<br />Environment];
    id3--&gt;id9[Assist Learning<br />Algorithm];
    id6--&gt;id10[iLQR<br />DDP]:::methods;
    id7--&gt;id11[Direct collocation<br />STOMP]:::methods;
    id5--&gt;id12[Heuristic search<br />MCTS]:::methods;
    id8--&gt;id13[DYNA<br />MVE<br />MBPO]:::methods;
    id9--&gt;id14[Policy backprop<br />SVG<br />Dreamer]:::methods;
    classDef methods fill:#f96;
</div>

<h1 id="optimal-control-and-planning">Optimal Control and Planning</h1>
<h2 id="what-if-we-knew-the-transition-dynamics">What if we knew the transition dynamics</h2>
<p>Often we do know the dynamics:</p>
<ol>
  <li>Games (e.g., Go)</li>
  <li>Easily modeled systems (e.g., navigating a car)</li>
  <li>Simulated environments (e.g., simulated robots, video games)</li>
</ol>

<p>Often we learn the dynamics:</p>
<ol>
  <li>System identification: fit unknown parameters of a known model</li>
  <li>Learning: fit a general-purpose model to observed transition data</li>
</ol>

<p><strong>Model-based reinforcement learning</strong>: learn the transition dynamics, then figure out how to choose actions.</p>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mermaid/8.6.0/mermaid.min.js"></script>

<p>In this post, we review the basic policy gradient algorithm for deep reinforcement learning and the actor-critic algorithm. Most of the contents are derived from <a href="http://rail.eecs.berkeley.edu/deeprlcourse/" title="CS 285 at UC Berkeley">CS 285 at UC Berkeley</a>.</p>

<!--more-->

<div class="mermaid">
graph TD;
    id1[fit a model/estimate the return]--&gt;id2[improve the policy];
    id2--&gt;id3[generate samples];
    id3--&gt;id1;
</div>

<h1 id="policy-gradient-introduction">Policy Gradient Introduction</h1>
<p>In this part, we first focus on the fully observable model with a finite horizon, i.e., \(o_t = s_{t}\) and \(t\) is bounded. At a very high level, the policy gradient method is very simple: it parametrizes the policy \(\pi_{\theta}(a|s)\) by \(\theta\) and uses samples (interactions with the environment) to update the parameter \(\theta\).</p>

<p>Next, we derive the policy gradient algorithm (REINFORCE) and introduce some techniques to improve the performance of policy gradient.</p>

<h2 id="evaluating-the-objective">Evaluating the objective</h2>
<p>First, we use \(\tau = \{s_1,a_1,s_2,a_2,\dots,s_T,a_T\}\) to denote the trajectory of a single trial, \(r(\tau)\) to denote the total reward of trajectory \(\tau\), and \(p_{\theta}(\tau),\pi_{\theta}(\tau)\) to denote the probability that trajectory \(\tau\) appears under the policy \(\pi_{\theta}(a|s)\).</p>

<p>Recall that our reinforcement learning objective can be written as</p>

\[\theta^* = \arg\max_{\theta} \mathbb E_{\tau\sim p_{\theta}(\tau)}\left[\sum_t r(s_{t},a_{t})\right].\]

<p>We define \(J(\theta) = \mathbb E_{\tau\sim p_{\theta}(\tau)}\left[\sum_t r(s_{t},a_{t})\right]\), and the objective becomes \(\theta^* = \arg\max_{\theta} J(\theta)\). We want to compute the gradient of \(J(\theta)\), and then we can update our policy \(\pi_{\theta}\).</p>

<p>First, we have</p>

\[J(\theta) = \mathbb E_{\tau\sim p_{\theta}(\tau)}\left[\sum_t r(s_{t},a_{t})\right] = \int \pi_{\theta}(\tau)r(\tau)d\tau.\]

<p>From the basic calculus computations, we have</p>

\[\pi_{\theta}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau) = \pi_{\theta}(\tau)\frac{\nabla_{\theta}\pi_{\theta}(\tau)}{\pi_{\theta}(\tau)} = \nabla_{\theta}\pi_{\theta}(\tau).\]

<p>Plugging this into the derivative of \(J(\theta)\), we get</p>

\[\nabla_{\theta}J(\theta) = \int\nabla_{\theta}\pi_{\theta}(\tau)r(\tau)d\tau = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)\right].\]

<p>Then, recall that from the Markovian assumption, we have</p>

\[\pi_{\theta}(s_1,a_1,\dots,s_T,a_T) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t} | s_{t})p(s_{t+1} | s_{t},a_{t}).\]

<p>Taking the logarithms on both sides, we can get</p>

\[\log\pi_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^T(\log\pi_{\theta}(a_{t} | s_{t})+\log p(s_{t+1}|s_{t},a_{t})).\]

<p>Taking the gradient, the first and third terms vanish because they do not depend on \(\theta\), and we obtain the following formula for the policy gradient.</p>

\[\nabla_{\theta} J(\theta) = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\right)\left(\sum_{t=1}^T r(s_{t},a_{t})\right)\right]\]

<h2 id="evaluating-the-policy-graident-and-reinforce-algorithm">Evaluating the policy gradient and the REINFORCE algorithm</h2>
<p>Now, given the formula for the policy gradient, we can estimate the expectation by samples. Specifically, we have the following estimate,</p>

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right),\]

<p>where \(i\) denotes the index of the trajectory.</p>

<p>Then, given an estimate of the policy gradient, we can improve our policy by gradient ascent: \(\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\).</p>

<p>Putting them all together, we have a simple <strong>REINFORCE</strong> algorithm (Monte-Carlo policy gradient).</p>

<blockquote>

  <ol>
    <li>For \(k=1,2,\dots\)
      <ol>
        <li>sample \(\{\tau^i\}\) from \(\pi_{\theta}(a_{t}|s_{t})\) (run the policy)</li>
        <li>
\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right)\]
        </li>
        <li>
\[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]
        </li>
      </ol>
    </li>
  </ol>
</blockquote>
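<p>As a toy illustration of the loop above, here is a minimal numerical sketch of REINFORCE (assumptions: a two-armed bandit, i.e., a single state with horizon 1, a softmax policy, and hypothetical reward values).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(2)                  # policy parameters (logits)
arm_rewards = np.array([0.2, 1.0])   # hypothetical expected rewards
alpha = 0.1                          # step size
N = 10                               # samples per gradient estimate

for _ in range(500):
    probs = softmax(theta)
    grad = np.zeros(2)
    for _ in range(N):
        a = rng.choice(2, p=probs)               # run the policy
        r = arm_rewards[a] + 0.1 * rng.normal()  # noisy reward
        glog = -probs.copy(); glog[a] += 1.0     # grad log pi(a) for softmax
        grad += glog * r
    theta += alpha * grad / N        # gradient ascent on J(theta)

print(softmax(theta))                # the better arm should dominate
```

<p>With a single state this is the simplest possible setting, but the three numbered steps above appear exactly: sample, estimate the gradient, take a gradient step.</p>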

<p>Note that this algorithm may not behave well in practice because of its high variance, and we need variance reduction techniques to improve it. Before introducing those techniques, we first look at an example of the REINFORCE algorithm and its generalization to the partially observable case.</p>

<p>Now, we compute the policy gradient for Gaussian policies</p>

<p>\begin{align}
\pi_{\theta}(a_{t}|s_{t}) =&amp; \mathcal N(f_{\theta}(s_{t});\Sigma), \newline
\log\pi_{\theta}(a_{t}|s_{t}) =&amp; -\frac{1}{2}(f_{\theta}(s_{t})-a_{t})^T\Sigma^{-1}(f_{\theta}(s_{t})-a_{t}) + const, \newline
\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) =&amp; -\Sigma^{-1}(f_{\theta}(s_{t})-a_{t})\frac{df}{d\theta}.
\end{align}</p>

<p>The REINFORCE algorithm simply tends to increase the probability of a <em>good</em> trajectory with high reward, and decrease the probability of a <em>bad</em> trajectory with low reward. The intuition behind REINFORCE can be summarized as: good stuff is made more likely, bad stuff is made less likely; it simply formalizes the notion of “trial and error”.</p>

<p>It is also very easy to generalize the REINFORCE algorithm to the partially observable case: we just change \(s_{t}\) to \(o_t\) in the policy. The estimate of the policy gradient is given as follows,</p>

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|o_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t},a_{i,t})\right).\]

<h2 id="variance-reduction-technqiues">Variance Reduction Techniques</h2>
<p>What is the problem with the policy gradient? As mentioned before, the policy gradient suffers from very high variance.</p>

<p>Recall that the policy gradient is</p>

\[\nabla_{\theta}J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\nabla_{\theta}\log\pi_{\theta}(\tau_i)r(\tau_i).\]

<p>The variance is <em>very large</em> because \(r(\tau)\) can be very large. Moreover, shifting the reward function by a constant changes the behavior: for example, if we add a large constant so that all rewards are positive, then in each iteration the policy gradient method will try to increase the probability of every sampled trajectory.</p>

<p>There are two techniques that reduce the variance of the policy gradient method, <em>causality</em> and <em>baselines</em>, and we will introduce them one by one.</p>

<h3 id="causality">Causality</h3>
<p>The intuition of <em>causality</em> is very simple: the policy at time \(t'\) cannot affect the reward at time \(t &lt; t'\). Then, instead of using the whole \(r(\tau)\) to update \(\pi_{\theta}\) at time \(t\), we can focus only on the reward obtained after time \(t\). Formally, we have the following formula,</p>

<p>\begin{align}
\nabla_{\theta} J(\theta)\approx &amp; \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\sum_{t'=1}^T r(s_{i,t'},a_{i,t'})\right) \newline
= &amp; \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)
\end{align}</p>

<p>The last term can be viewed as “reward to go”, and we use \(\hat Q_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\) to denote that quantity.</p>

<p>It is easy to see that <em>causality</em> helps reduce the variance: summing fewer reward terms yields smaller numbers, and smaller numbers lead to smaller variance. In practice this essentially always helps.</p>
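<p>The reward-to-go quantities \(\hat Q_{i,t}\) are cheap to compute for a sampled trajectory with a reverse cumulative sum; a tiny sketch with hypothetical rewards:</p>

```python
import numpy as np

# "Reward to go" Q_hat_t = sum_{t' >= t} r_{t'} for one trajectory,
# computed with a reverse cumulative sum (rewards are hypothetical).
rewards = np.array([1.0, 0.0, 2.0, -1.0])
reward_to_go = np.cumsum(rewards[::-1])[::-1]
print(reward_to_go)  # Q_hat_t for t = 1, ..., T
```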

<h3 id="baselines">Baselines</h3>
<p>The <em>baselines</em> technique simply subtracts a constant from the total reward.</p>

<p>We want</p>

\[\nabla_{\theta}J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\nabla_{\theta}\log\pi_{\theta}(\tau_i)[r(\tau_i)-b],\]

<p>where \(b\) is some constant. We may simply choose \(b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)\).</p>

<p>Then, we show that subtracting a constant from the reward does not bias the policy gradient. We have
\begin{align}
\mathbb E\left[\nabla_{\theta}\log\pi_{\theta}(\tau)b\right] = &amp; \int\pi_{\theta}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau)bd\tau \newline
= &amp; b\nabla_{\theta}\int \pi_{\theta}(\tau)d\tau \newline
= &amp; b\nabla_{\theta}1 = 0.
\end{align}</p>

<p>The previous argument shows that subtracting a baseline keeps the gradient estimate <em>unbiased</em>. The average reward is not the best baseline, but it often works well in practice. A more careful analysis shows that \(b = \frac{\mathbb E[g(\tau)^2r(\tau)]}{\mathbb E[g(\tau)^2]}\) is the variance-minimizing baseline, where \(g(\tau) = \nabla_{\theta}\log\pi_{\theta}(\tau)\).</p>
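<p>A quick numerical sanity check of the two claims above, unbiasedness and variance reduction (a single-state softmax policy; the reward values are hypothetical and deliberately offset by a large constant to exaggerate the effect):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(2)
probs = np.exp(theta) / np.exp(theta).sum()
rewards = np.array([100.2, 101.0])       # large offset, small difference

per_sample = []                          # per-sample gradient contributions
for _ in range(20000):
    a = rng.choice(2, p=probs)
    g = -probs.copy(); g[a] += 1.0       # grad log pi(a) for a softmax policy
    per_sample.append((g, rewards[a]))

grads = np.array([g * r for g, r in per_sample])
b = np.mean([r for _, r in per_sample])  # average-reward baseline
grads_b = np.array([g * (r - b) for g, r in per_sample])

print("variance without baseline:", grads.var(axis=0).sum())
print("variance with baseline:   ", grads_b.var(axis=0).sum())
```

<p>Both estimators agree in the mean (up to sampling noise), but the baselined one has a variance several orders of magnitude smaller here.</p>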

<h2 id="off-policy-learning-and-importance-sampling">Off-policy learning and importance sampling</h2>
<p>Policy gradient is on-policy:
\(\nabla_{\theta} J(\theta) = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\right)r(\tau)\right].\)</p>

<p>Changing the policy changes the probability distribution under the expectation. Neural networks change only a little bit with each gradient step, so on-policy learning can be extremely inefficient.</p>

<p>Now recall that \(J(\theta) = \mathbb E_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)]\). What if we don’t have samples from \(\pi_{\theta}(\tau)\) but have samples from \(\bar\pi(\tau)\) instead?</p>

<blockquote>

  <p><strong>Importance sampling</strong>
\begin{align}
\mathbb E_{x\sim p(x)}[f(x)] =&amp; \int p(x)f(x)dx \newline
 =&amp; \int\frac{q(x)}{q(x)}p(x)f(x)dx \newline
 =&amp; \int q(x)\frac{p(x)}{q(x)}f(x)dx \newline
 =&amp; \mathbb E_{x\sim q(x)}\left[\frac{p(x)}{q(x)}f(x)\right]
\end{align}</p>
</blockquote>

<p>Now using the importance sampling,</p>

<p>\(J(\theta) = \mathbb E_{\tau\sim\bar\pi(\tau)}\left[\frac{\pi_{\theta}(\tau)}{\bar\pi(\tau)}r(\tau)\right]\).</p>

<p>Now recall that \(\pi_{\theta}(\tau) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})\), so we know that</p>

\[\frac{\pi_{\theta}(\tau)}{\bar\pi(\tau)} = \frac{p(s_1)\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})}{p(s_1)\prod_{t=1}^T \bar\pi(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})} = \frac{\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})}{\prod_{t=1}^T\bar\pi(a_{t}|s_{t})}\]
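<p>The importance-sampling identity itself is easy to verify numerically; below is a sketch with a hypothetical discrete target distribution \(p\), behavior distribution \(q\), and test function \(f\):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.7, 0.2, 0.1])   # target distribution
q = np.array([1/3, 1/3, 1/3])   # behavior (sampling) distribution
f = np.array([1.0, 5.0, -2.0])  # test function on {0, 1, 2}

exact = float((p * f).sum())     # E_{x ~ p}[f(x)]

xs = rng.choice(3, size=200000, p=q)
weights = p[xs] / q[xs]          # importance weights p(x)/q(x)
estimate = float((weights * f[xs]).mean())

print(exact, estimate)           # the two numbers should be close
```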

<p>We now derive the policy gradient with importance sampling. Suppose our samples come from \(\pi_{\theta}(\tau)\), so that</p>

\[J(\theta) = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}[r(\tau)].\]

<p>Can we estimate the value of some new parameter \(\theta'\)? Using importance sampling,</p>

\[J(\theta') = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\prod_{t=1}^T \pi_{\theta'}(a_{t}|s_{t})}{\prod_{t=1}^T \pi_{\theta}(a_{t}|s_{t})}r(\tau)\right]\]

<p>Taking the gradient,</p>

\[\nabla_{\theta'}J(\theta') = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\nabla_{\theta'}\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}r(\tau)\right] = \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)r(\tau)\right]\]

<p>When \(\theta = \theta'\), it is exactly the policy gradient. If \(\theta \neq\theta'\),</p>

<p>\begin{align}
\nabla_{\theta'}J(\theta') =&amp; \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_{\theta}(\tau)}\nabla_{\theta'}\log\pi_{\theta'}(\tau)r(\tau)\right] \newline
=&amp; \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\left(\prod_{t=1}^T\frac{\pi_{\theta'}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\right)\left(\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_{t}|s_{t})\right)\left(\sum_{t=1}^T r(s_{t},a_{t})\right)\right] \newline
=&amp; \mathbb E_{\tau\sim\pi_{\theta}(\tau)}\left[\sum_{t=1}^T\nabla_{\theta'}\log\pi_{\theta'}(a_{t}|s_{t})\left(\prod_{t'=1}^t\frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_{\theta}(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'},a_{t'})\left(\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_{\theta}(a_{t''}|s_{t''})}\right)\right)\right],
\end{align}
where the last line uses <em>causality</em>.</p>

<h2 id="policy-gradient-with-automatic-differentiation">Policy gradient with automatic differentiation</h2>

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\hat Q_{i,t}\]

<p>Just implement a “pseudo-loss” as a weighted maximum likelihood, whose gradient is exactly the policy gradient estimate above:</p>

\[\tilde J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\log\pi_{\theta}(a_{i,t}|s_{i,t})\hat Q_{i,t}\]
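<p>To see why the pseudo-loss trick works, here is a sketch (a softmax policy over three actions; the sampled actions and reward-to-go weights are hypothetical) that checks via finite differences that the gradient of the pseudo-loss equals the analytic policy gradient estimate:</p>

```python
import numpy as np

def log_softmax(theta):
    z = theta - theta.max()
    return z - np.log(np.exp(z).sum())

actions = np.array([0, 2, 1, 0])         # hypothetical sampled actions
q_hat = np.array([1.0, -0.5, 2.0, 0.3])  # hypothetical reward-to-go weights

def pseudo_loss(theta):
    # (1/N) * sum_i log pi_theta(a_i) * Q_hat_i
    return np.mean(log_softmax(theta)[actions] * q_hat)

theta = np.array([0.1, -0.2, 0.3])

# analytic estimate: (1/N) * sum_i grad log pi(a_i) * Q_hat_i
probs = np.exp(log_softmax(theta))
analytic = np.zeros(3)
for a, q in zip(actions, q_hat):
    g = -probs.copy(); g[a] += 1.0       # grad log softmax at action a
    analytic += q * g
analytic /= len(actions)

# central finite differences of the pseudo-loss
eps = 1e-6
fd = np.array([(pseudo_loss(theta + eps * e) - pseudo_loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(3)])

print(fd)
print(analytic)
```

<p>In an autodiff framework, one simply minimizes \(-\tilde J(\theta)\) and lets backpropagation produce this same gradient.</p>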

<h1 id="actor-critic-introduction">Actor-Critic Introduction</h1>

<h2 id="review-of-policy-gradient-and-intuition-for-actor-critic">Review of policy gradient and intuition for actor-critic</h2>

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\hat Q_{i,t}\]

<p>\(\hat Q_{i,t}\): estimate of expected reward if we take action \(a_{i,t}\) in state \(s_{i,t}\). Can we get a better estimate?</p>

<p>\(Q(s_{t},a_{t}) = \sum_{t'=t}^{T}\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_{t},a_{t}]\) is the true <em>expected</em> reward to go.</p>

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})Q(s_{i,t},a_{i,t})\]

<p>What about the baseline? Previously, we set \(b = \frac{1}{N}\sum_i\hat Q_{i,t}\), and now we want to set the baseline \(b_t = \frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})\).</p>

\[V(s_{t}) = \mathbb E_{a_{t}\sim\pi_{\theta}(a_{t}|s_{t})}[Q(s_{t},a_{t})]\]

<p>Note that the Q-function and the value function are both defined relative to a policy \(\pi\), and we have</p>

\[Q^{\pi}(s_{t},a_{t}) = \sum_{t'=t}^T\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_{t},a_{t}],\]

\[V^{\pi}(s_{t}) = \mathbb E_{a_{t}\sim \pi_{\theta}(a_{t}|s_{t})}[Q^{\pi}(s_{t},a_{t})],\]

<p>\(A^{\pi}(s_{t},a_{t}) = Q^{\pi}(s_{t},a_{t}) - V^{\pi}(s_{t})\): the advantage, i.e., how much better \(a_{t}\) is than the average action under the policy.</p>

<p>Then,</p>

\[\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})A^{\pi}(s_{i,t},a_{i,t})\]

<p>The better the estimate, the lower the variance.</p>

<p><strong>Value function fitting.</strong> What should we fit: \(Q^{\pi}\), \(V^{\pi}\), or \(A^{\pi}\)?</p>

\[Q^{\pi}(s_{t},a_{t}) = r(s_{t},a_{t}) + \sum_{t'=t+1}^T\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_{t},a_{t}]\approx r(s_{t},a_{t}) + V^{\pi}(s_{t+1}),\]

<p>where the last step (using the sampled next state \(s_{t+1}\)) is unbiased. Then, we can write out the advantage \(A^{\pi}\) as</p>

\[A^{\pi}(s_{t},a_{t}) \approx r(s_{t},a_{t}) + V^{\pi}(s_{t+1})-V^{\pi}(s_{t}).\]

<p>Let’s fit \(V^{\pi}\). We use a neural network to represent \(\hat V^{\pi}(s)\) with parameter \(\phi\). Currently \(\theta\) (the policy parameters) and \(\phi\) (the value function parameters) are separate; later we may combine them.</p>

<h2 id="policy-evaluation">Policy evaluation</h2>

\[V^{\pi}(s_{t}) = \sum_{t'=t}^T\mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_{t}],\]

\[J(\theta) = \mathbb E_{s_1\sim p(s_1)}[V^{\pi}(s_1)]\]

<p>How can we perform policy evaluation? One option is Monte Carlo policy evaluation (this is what policy gradient does):</p>

\[V^{\pi}(s_{t})\approx\sum_{t'=t}^Tr(s_{t'},a_{t'})\]

\[V^{\pi}(s_{t})\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\]

<p>The multi-sample estimate requires the ability to “reset” the simulator to the state \(s_{t}\).</p>

<p>Monte Carlo evaluation with function approximation: fit \(\hat V^{\pi}_{\phi}\) by regression on single-rollout targets. This is not as good as the multi-sample estimate \(V^{\pi}(s_{t})\approx\frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\), but it is still pretty good because the function approximator generalizes across nearby states.</p>

<p>Training data: \(\{(s_{i,t},\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}))\} = \{(s_{i,t},y_{i,t})\}\)</p>

<p>Supervised regression: \(\mathcal L(\phi) = \frac{1}{2}\sum_{j}\left\|\hat V^{\pi}_{\phi}(s_j)-y_j\right\|^2\).</p>

<p>Can we do better?</p>

<p>Ideal target: \(y_{i,t} = \sum_{t'=t}^T \mathbb E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t},a_{i,t}) + V^{\pi}(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat V^{\pi}_{\phi}(s_{i,t+1})\)</p>

<p>The last step directly plugs in the previously fitted value function, which gives a smaller variance.</p>

<p>Monte Carlo target: \(y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\)</p>

<p>Training data: \(\{(s_{i,t},r(s_{i,t},a_{i,t}) + \hat V^{\pi}_{\phi}(s_{i,t+1}))\} = \{(s_{i,t},y_{i,t})\}\)</p>

<p>Supervised regression: \(\mathcal L(\phi) = \frac{1}{2}\sum_{j}\left\|\hat V^{\pi}_{\phi}(s_j)-y_j\right\|^2\).</p>

<p>This is sometimes referred to as a “bootstrapped” estimate.</p>
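<p>A minimal sketch of bootstrapped value fitting on a tiny Markov reward process (the transition matrix, rewards, and discount are hypothetical; a tabular \(V\) plays the role of \(\hat V^{\pi}_{\phi}\), and the exact expectation over next states stands in for sampled regression targets):</p>

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],    # hypothetical transition matrix
              [0.0, 0.5, 0.5],    # (the policy is already folded in)
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])     # hypothetical rewards

# exact solution of V = r + gamma * P V
V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

# repeated fitting to the bootstrapped target y(s) = r(s) + gamma * E[V(s')]
V = np.zeros(3)
for _ in range(200):
    V = r + gamma * P @ V

print(V, V_exact)                  # the sweeps converge to the exact values
```

<p>Because the bootstrapped update is a \(\gamma\)-contraction, the repeated sweeps converge to the true value function of the process.</p>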

<h2 id="actor-critic-algorithm">Actor-critic algorithm</h2>
<p>batch actor-critic algorithm:</p>
<ol>
  <li>sample \(\{s_i,a_i\}\) from \(\pi_{\theta}(a|s)\) (run the policy)</li>
  <li>fit \(\hat V^{\pi}_{\phi}(s)\) to sampled reward sums</li>
  <li>evaluate \(\hat A^{\pi}(s_i,a_i) = r(s_i,a_i) + \hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\)</li>
  <li>
\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\hat A^{\pi}(s_{i},a_{i})\]
  </li>
  <li>
\[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]
  </li>
</ol>

<p>Here, recall that we have training data: \(\{(s_{i,t},r(s_{i,t},a_{i,t}) + \hat V^{\pi}_{\phi}(s_{i,t+1}))\} = \{(s_{i,t},y_{i,t})\}\)</p>

<p>Supervised regression: \(\mathcal L(\phi) = \frac{1}{2}\sum_{j}\left\|\hat V^{\pi}_{\phi}(s_j)-y_j\right\|^2\).</p>

<p><strong>Aside: discount factors.</strong></p>

<p>When trajectories can be very long (or infinite), the reward sums can blow up, so we use a discount factor. We set</p>

<p>\(y_{i,t} \approx r(s_{i,t},a_{i,t}) + \gamma\hat V^{\pi}_{\phi}(s_{i,t+1})\), with discount factor \(\gamma\in [0,1]\) (\(\gamma = 0.99\) usually works well).</p>

<p>With a critic, the gradient becomes</p>

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\left(r(s_i,a_i) + \gamma\hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\right)\]

<p>batch actor-critic algorithm with discount:</p>
<ol>
  <li>sample \(\{s_i,a_i\}\) from \(\pi_{\theta}(a|s)\) (run the policy)</li>
  <li>fit \(\hat V^{\pi}_{\phi}(s)\) to sampled reward sums</li>
  <li>evaluate \(\hat A^{\pi}(s_i,a_i) = r(s_i,a_i) + \gamma\hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\)</li>
  <li>
\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\hat A^{\pi}(s_{i},a_{i})\]
  </li>
  <li>
\[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]
  </li>
</ol>

<p>online actor-critic algorithm with discount:</p>
<ol>
  <li>take action \(a\sim \pi_{\theta}(a|s)\), get transition \((s,a,s',r)\)</li>
  <li>update \(\hat V^{\pi}_{\phi}\) using target \(r+\gamma \hat V^{\pi}_{\phi}(s')\)</li>
  <li>evaluate \(\hat A^{\pi}(s,a) = r(s,a) + \gamma\hat V^{\pi}_{\phi}(s') - \hat V^{\pi}_{\phi}(s)\)</li>
  <li>
\[\nabla_{\theta} J(\theta)\approx \left(\nabla_{\theta}\log\pi_{\theta}(a|s)\right)\hat A^{\pi}(s,a)\]
  </li>
  <li>
\[\theta\leftarrow\theta + \alpha\nabla_{\theta}J(\theta)\]
  </li>
</ol>

<p><strong>Architecture design.</strong></p>

<p>Two-network design: simple and stable, but the actor and critic share no features.</p>

<p>One-network design with two output heads: shares features, but harder to tune the hyper-parameters.</p>

<p><strong>Online actor-critic in practice.</strong> The fully online algorithm does not work well with deep neural networks: in step 2, we would like to update the critic with a batch of data rather than a single transition. In practice, the data are collected by multiple workers in parallel.</p>

<p><strong>Critics as state-dependent baselines.</strong></p>

<p>For the actor-critic algorithm, the gradient is estimated by</p>

\[\nabla_{\theta} J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\left(\nabla_{\theta}\log\pi_{\theta}(a_{i}|s_{i})\right)\left(r(s_i,a_i) + \gamma\hat V^{\pi}_{\phi}(s'_i) - \hat V^{\pi}_{\phi}(s_i)\right)\]

<p>This estimator has low variance (thanks to the critic), but it is biased whenever the critic is imperfect.</p>

<p>For the original policy gradient</p>

\[\nabla_{\theta} J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\right)-b\right)\]

<p>This method is unbiased, but has higher variance (because of the single-sample estimate of the reward to go). We can combine the two:</p>

\[\nabla_{\theta} J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\left(\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\right)-\hat V^{\pi}_{\phi}(s_{i,t})\right)\]

<p>This estimator remains unbiased and has lower variance, because the baseline is now state-dependent.</p>

<p><strong>n-step estimation.</strong></p>

\[\hat A^{\pi}_C(s_{t},a_{t}) = r(s_{t},a_{t}) + \gamma \hat V^{\pi}_{\phi}(s_{t+1}) - \hat V^{\pi}_{\phi}(s_{t})\]

\[\hat A^{\pi}_{MC}(s_{t},a_{t}) = \sum_{t'=t}^{\infty}\gamma^{t'-t}r(s_{t'},a_{t'}) - \hat V^{\pi}_{\phi}(s_{t})\]

\[\hat A^{\pi}_{n}(s_{t},a_{t}) = \sum_{t'=t}^{t+n-1}\gamma^{t'-t}r(s_{t'},a_{t'}) + \gamma^n \hat V^{\pi}_{\phi}(s_{t+n}) - \hat V^{\pi}_{\phi}(s_{t})\]

<p>choosing \(n &gt; 1\) often works better</p>

<p><strong>Generalized advantage estimation.</strong></p>

\[\hat A^{\pi}_{n}(s_{t},a_{t}) = \sum_{t'=t}^{t+n-1}\gamma^{t'-t}r(s_{t'},a_{t'}) + \gamma^n \hat V^{\pi}_{\phi}(s_{t+n}) - \hat V^{\pi}_{\phi}(s_{t})\]

<p>\(\hat A^{\pi}_{GAE}(s_{t},a_{t}) = \sum_{n=1}^{\infty}w_n\hat A^{\pi}_{n}(s_{t},a_{t})\),</p>

<p>where we can choose \(w_n\propto \lambda^{n-1}\), an exponentially decaying weight with parameter \(\lambda\in[0,1]\).</p>
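<p>With exponential weights, GAE can be computed by the equivalent backward recursion \(\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}\), where \(\delta_t = r_t + \gamma\hat V(s_{t+1}) - \hat V(s_{t})\). A sketch with hypothetical rewards and values:</p>

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-recursion GAE; values has length T+1 (includes final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.5, -0.2, 2.0])     # hypothetical
values = np.array([0.3, 0.1, 0.4, 0.0, 0.2])  # hypothetical V-hat

a0 = gae(rewards, values, gamma=0.9, lam=0.0)  # one-step TD advantage
a1 = gae(rewards, values, gamma=0.9, lam=1.0)  # Monte Carlo advantage
print(a0)
print(a1)
```

<p>Setting \(\lambda = 0\) recovers the one-step advantage \(\hat A^{\pi}_C\), and \(\lambda = 1\) recovers the Monte Carlo advantage \(\hat A^{\pi}_{MC}\) with the tail bootstrapped at the final state.</p>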

<h1 id="references">References</h1>
<p><a href="http://rail.eecs.berkeley.edu/deeprlcourse/" title="CS 285 at UC Berkeley">[1] CS 285 at UC Berkeley</a></p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="deeprl" /><category term="reinforcementlearning" /><category term="deeplearning" /><summary type="html"><![CDATA[In this post, we review the basic policy gradient algorithm for deep reinforcement learning and the actor-critic algorithm. Most of the contents are derived from CS 285 at UC Berkeley.]]></summary></entry><entry><title type="html">Theory of Optimization: More on Mirror Descent</title><link href="https://haoyuzhao123.github.io/opttheory/opttheory6/" rel="alternate" type="text/html" title="Theory of Optimization: More on Mirror Descent" /><published>2019-02-15T00:00:00-08:00</published><updated>2019-02-15T00:00:00-08:00</updated><id>https://haoyuzhao123.github.io/opttheory/opttheory6</id><content type="html" xml:base="https://haoyuzhao123.github.io/opttheory/opttheory6/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>In this post, we will continue our discussion of mirror descent. We will present a variant of mirror descent: lazy mirror descent, also known as Nesterov’s dual averaging.</p>

<!--more-->

<h2 id="lazy-mirror-descentnesterovs-dual-averaging">Lazy Mirror Descent (Nesterov’s Dual Averaging)</h2>
<h3 id="algorithm">Algorithm</h3>
<p>In this section, we provide a more efficient version of mirror descent, named lazy mirror descent. We will show that its convergence rate is the same as that of the original mirror descent (up to constants). Lazy mirror descent changes the update of \(y^{(k)}\) from</p>

\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta\cdot g^{(k)},\]

<p>into</p>

\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(y^{(k)}) - \eta\cdot g^{(k)}.\]

<p>Moreover, \(y^{(0)}\) is such that \(\nabla \Phi(y^{(0)}) = 0\).</p>

<p>The <strong>Lazy Mirror Descent</strong> algorithm works as follow:</p>
<ol>
  <li>Initial point \(x^{(0)} \in \mathcal X\cap\mathcal D\), step size \(\eta &gt; 0\).</li>
  <li><strong>For</strong> \(k=0,1,2,\dots\) <strong>do</strong></li>
  <li>Pick any \(g^{(k)}\in\partial f(x^{(k)})\).</li>
  <li>
\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(y^{(k)}) - \eta\cdot g^{(k)}, \nabla \Phi(y^{(0)}) = 0.\]
  </li>
  <li>
\[x^{(k+1)} = \Pi_{\mathcal X}^{\Phi}(y^{(k+1)}) := \arg\min_{x\in \mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(k+1)}).\]
  </li>
</ol>
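<p>As a concrete illustration, here is a numerical sketch of lazy mirror descent on the probability simplex with the negative-entropy mirror map \(\Phi(x) = \sum_i x_i\log x_i\), for which the projection step has a closed form: \(x^{(k)}\) is a softmax of the accumulated dual variable. The linear objective \(f(x) = c^Tx\) and all constants are hypothetical.</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

c = np.array([0.9, 0.1, 0.5])   # hypothetical linear objective f(x) = c @ x
eta = 0.1
z = np.zeros(3)                 # accumulated dual variable: -eta * sum of subgradients

for _ in range(2000):
    x = softmax(z)              # x^{(k)}: entropic projection of y^{(k)} onto the simplex
    g = c                       # subgradient of the linear objective
    z = z - eta * g             # "lazy" update, entirely in the dual space

x = softmax(z)
print(x)                        # mass should concentrate on argmin_i c_i
```

<p>The laziness is visible in the loop: the dual variable simply accumulates subgradients, and the primal iterate is only materialized by the projection.</p>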

<h3 id="analysis">Analysis</h3>
<p>This analysis is adapted from <a href="https://arxiv.org/abs/1405.4980">Convex Optimization: Algorithms and Complexity</a>. We have the following theorem.</p>

<blockquote>
  <p>Let \(\Phi\) be a mirror map \(\rho\)-strongly convex on \(\mathcal X\cap\mathcal D\) w.r.t 
\(||\cdot||\)
. Let \(R^2 = \sup_{x\in\mathcal X\cap\mathcal D}\Phi(x) - \Phi(x^{(0)})\), and \(f\) be convex and \(L\)-Lipschitz w.r.t 
\(||\cdot||\)
. Then the lazy mirror descent with \(\eta = \frac{R}{L}\sqrt{\frac{\rho}{2t}}\) satisfies</p>

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le 2RL\sqrt{\frac{2}{\rho t}}.\]
</blockquote>

<p><em>Proof:</em> We define \(\Psi_t(x) = \eta \sum_{s=0}^{t}(g^{(s)})^Tx + \Phi(x)\), so that \(x^{(t+1)}\in\arg\min_{\mathcal X\cap\mathcal D}\Psi_{t}(x)\). Since \(\Phi\) is \(\rho\)-strongly convex, \(\Psi_t\) is \(\rho\)-strongly convex.</p>

<p>Then, we can compute</p>

<p>\begin{align}
\Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) \le&amp; \nabla \Psi_{t}(x^{(t+1)})^T(x^{(t+1)} - x^{(t)}) - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2 \newline
\le&amp; - \frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2,
\end{align}</p>

<p>where the second inequality comes from the first order optimality condition for \(x^{(t+1)}\). Next, we observe that</p>

<p>\begin{align}
\Psi_t(x^{(t+1)}) - \Psi_t(x^{(t)}) =&amp; \Psi_{t-1}(x^{(t+1)}) - \Psi_{t-1}(x^{(t)}) + \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)})\newline
\ge&amp; \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)}).
\end{align}</p>

<p>Putting together the two above displays and using Cauchy-Schwarz (with the assumption 
\(||g^{(t)}||_{*} \le L\)
) one obtains</p>

\[\frac{\rho}{2}||x^{(t+1)} - x^{(t)}||^2\le \eta (g^{(t)})^T(x^{(t+1)} - x^{(t)}) \le \eta L ||x^{(t+1)} - x^{(t)}||.\]

<p>This shows that 
\(||x^{(t+1)} - x^{(t)}|| \le \frac{2\eta L}{\rho}\)
 and thus with the above display</p>

\[\eta (g^{(t)})^T(x^{(t+1)} - x^{(t)}) \le \frac{2\eta L^2}{\rho}.\]

<p>Then we prove the following result: for every \(x\in\mathcal X\cap\mathcal D\),</p>

\[\sum_{s=0}^{t-1} (g^{(s)})^T(x^{(s)}-x) \le \sum_{s=0}^{t-1} (g^{(s)})^T(x^{(s)}-x^{(s+1)}) + \frac{\Phi(x) - \Phi(x^{(0)})}{\eta},\]

<p>which would conclude the proof.</p>

<p>The above equation is equivalent to</p>

\[\sum_{s=0}^{t-1} (g^{(s)})^Tx^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \le \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta}.\]

<p>We prove this by induction,</p>

<p>\begin{align}
&amp;\sum_{s=0}^{t-1} (g^{(s)})^T x^{(s+1)} + \frac{\Phi(x^{(0)})}{\eta} \newline
\le &amp; (g^{(t-1)})^Tx^{(t)} + \sum_{s=0}^{t-2} (g^{(s)})^Tx^{(t)} + \frac{\Phi(x^{(t)})}{\eta}\newline
\le &amp; \sum_{s=0}^{t-1} (g^{(s)})^Tx + \frac{\Phi(x)}{\eta},
\end{align}</p>

<p>where the first inequality follows from the induction hypothesis (applied at the point \(x^{(t)}\)) and the second follows from the definition of \(x^{(t)}\) as a minimizer.</p>

<p align="right">&#11036;</p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="opttheory" /><category term="optimization" /><summary type="html"><![CDATA[In this post, we will continue on our discuss of mirror descent. We will present a variant of mirror descent: the lazy mirror descent, also known as Nesterov’s dual averaging.]]></summary></entry><entry><title type="html">Theory of Optimization: Frank-Wolfe Algorithm</title><link href="https://haoyuzhao123.github.io/opttheory/opttheory5/" rel="alternate" type="text/html" title="Theory of Optimization: Frank-Wolfe Algorithm" /><published>2019-02-13T00:00:00-08:00</published><updated>2019-02-13T00:00:00-08:00</updated><id>https://haoyuzhao123.github.io/opttheory/opttheory5</id><content type="html" xml:base="https://haoyuzhao123.github.io/opttheory/opttheory5/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>In this post, we describe a new geometry-dependent algorithm that relies on a different set of assumptions. The algorithm is called conditional gradient descent, also known as Frank-Wolfe.</p>

<!--more-->

<h2 id="frank-wolfe">Frank-Wolfe</h2>
<h3 id="algorithm">Algorithm</h3>
<p>Frank-Wolfe algorithm solves the following convex optimization problem</p>

\[\min_{x\in\mathcal D}f(x),\]

<p>for \(f\) such that \(\nabla f(x)\) is ‘Lipschitz’ in a certain sense. The algorithm reduces the problem to a sequence of linear optimization problems.</p>

<p>Frank-Wolfe algorithm has the following procedure:</p>

<ol>
  <li>Initial point \(x^{(0)}\in\mathcal D\).</li>
  <li><strong>For</strong> \(k=0,1,\dots\) <strong>do</strong></li>
  <li>Compute \(y^{(k)} = \arg\min_{y\in\mathcal D}\langle y, \nabla f(x^{(k)})\rangle\).</li>
  <li>\(x^{(k+1)} \leftarrow (1-h_k)x^{(k)} + h_ky^{(k)}\) with \(h_k = \frac{2}{k+2}\).</li>
</ol>
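<p>As a minimal numerical sketch of this loop (assuming, for illustration, that \(\mathcal D\) is the probability simplex, where the linear subproblem is solved exactly by the vertex \(e_i\) with \(i = \arg\min_i \nabla f(x)_i\)):</p>

```python
import numpy as np

def frank_wolfe(grad_f, x0, steps):
    """Frank-Wolfe over the probability simplex.

    The linear subproblem min_{y in simplex} <y, g> is solved exactly
    by a vertex: the standard basis vector e_i with i = argmin_i g_i.
    """
    x = x0.copy()
    for k in range(steps):
        g = grad_f(x)
        i = int(np.argmin(g))           # vertex minimizing <y, g>
        y = np.zeros_like(x)
        y[i] = 1.0
        h = 2.0 / (k + 2)               # step size h_k = 2/(k+2)
        x = (1 - h) * x + h * y         # convex combination stays feasible
    return x

# Hypothetical example: minimize f(x) = ||x - u||_2^2 over the simplex.
u = np.array([0.6, 0.3, 0.1])
x = frank_wolfe(lambda x: 2.0 * (x - u), np.ones(3) / 3, steps=500)
```

<p>Since every update moves toward a single vertex, the iterate after \(k\) steps is supported on few coordinates, which is the sparsity property discussed below.</p>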

<h3 id="analysis">Analysis</h3>
<p>We have the following theorem for Frank-Wolfe algorithm.</p>

<blockquote>
  <p>Given a convex function \(f\) on a convex set \(\mathcal D\) and a constant \(C_f\) such that</p>

\[f((1-h)x + hy)\le f(x) + h\langle \nabla f(x), y-x\rangle + \frac{1}{2}C_f h^2\]

  <p>for any \(x,y\in\mathcal D\) and \(h\in [0,1]\), we have</p>

\[f(x^{(k)}) - f(x^*) \le \frac{2C_f}{k+2}.\]
</blockquote>

<p><em>Proof:</em> By the definition of \(C_f\), we have</p>

\[f(x^{(k+1)})\le f(x^{(k)}) + h_k\langle \nabla f(x^{(k)}), y^{(k)}-x^{(k)}\rangle + \frac{1}{2}C_f h_k^2.\]

<p>Note that from the convexity of \(f\) and the definition of \(y^{(k)}\), we have</p>

\[f(x^*) \ge f(x^{(k)}) + \langle \nabla f(x^{(k)}), x^* - x^{(k)}\rangle \ge f(x^{(k)})+\langle \nabla f(x^{(k)}), y^{(k)} - x^{(k)}\rangle.\]

<p>Hence, we have</p>

\[f(x^{(k+1)})\le f(x^{(k)}) - h_k(f(x^{(k)}) - f(x^*)) + \frac{1}{2}C_f h_k^2.\]

<p>Let \(\epsilon_k = f(x^{(k)}) - f(x^*)\), we have</p>

\[\epsilon_{k+1} \le (1-h_k)\epsilon_k + \frac{1}{2}C_f h_k^2.\]

<p>Note that \(h_0 = 1\), so the recursion gives \(\epsilon_1 \le \frac{1}{2}C_f \le \frac{2C_f}{3}\) regardless of \(\epsilon_0\), and the theorem (for \(k\ge 1\)) follows by induction.</p>

<p align="right">&#11036;</p>

<p>Note that if \(\nabla f(x)\) is L-Lipschitz with respect to 
\(||\cdot||\)
 over the domain \(\mathcal D\), then 
\(C_f \le L\cdot \mathrm{diam}_{||\cdot||}(\mathcal D)^2\).</p>

<h3 id="sparsity-analysis">Sparsity Analysis</h3>
<p>When the domain \(\mathcal D\) is a simplex, the point \(y^{(k)}\) is always a vertex of the simplex, and hence each step of Frank-Wolfe increases the sparsity of the iterate by at most \(1\). The convergence result of Frank-Wolfe can therefore be read as proving that the optimization problem admits an approximately optimal sparse solution.</p>

<blockquote>
  <p>Let \(\mathcal P = \mathrm{conv}(v_i)\) be a polytope contained in a unit ball. For any \(u\in\mathcal P\), there exist \(k = O(\frac{1}{\epsilon^2} )\) vertices \(v_1,\dots, v_k\) of \(\mathcal P\) such that</p>

\[||\sum_{i=1}^k\lambda_i v_i -u||_2 \le \epsilon,\]

  <p>for some \(\sum_i \lambda_i = 1, \lambda_i \ge 0\).</p>
</blockquote>

<p><em>Proof:</em> Run the Frank-Wolfe algorithm with 
\(f(x) = ||x-u||_2^2\). 
Note that \(\nabla f(x) = 2(x-u)\) is 2-Lipschitz with respect to 
\(||\cdot||_2\) and that the diameter of \(\mathcal P\) is at most \(2\), since \(\mathcal P\) lies inside a unit ball. Hence, we have \(C_f \le 2\cdot 2^2 = 8\).</p>

<p>Therefore, the Frank-Wolfe guarantee shows that</p>

\[f(x^{(k)}) - f(x^*) \le \frac{2C_f}{k+2} \le \frac{16}{k+2}.\]

<p>Since \(f(u) = 0\), we have
\(||x^{(k)} - u||_2^2 \le \frac{16}{k+2}\). As \(h_0 = 1\), the iterate \(x^{(k)}\) is a convex combination of at most \(k\) vertices, so taking \(k = O(\frac{1}{\epsilon^2})\) gives the result.</p>

<p align="right">&#11036;</p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="opttheory" /><category term="optimization" /><summary type="html"><![CDATA[In this post, we describe a new geometry dependent algorithm that relies on different set of assumptions. The algorithm is called conditional gradient descent, aka Frank-Wolfe.]]></summary></entry><entry><title type="html">Theory of Optimization: Mirror Descent</title><link href="https://haoyuzhao123.github.io/opttheory/opttheory4/" rel="alternate" type="text/html" title="Theory of Optimization: Mirror Descent" /><published>2019-02-06T00:00:00-08:00</published><updated>2019-02-06T00:00:00-08:00</updated><id>https://haoyuzhao123.github.io/opttheory/opttheory4</id><content type="html" xml:base="https://haoyuzhao123.github.io/opttheory/opttheory4/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>In this post, we will introduce the Mirror Descent algorithm for solving convex optimization problems.</p>

<!--more-->

<h2 id="mirror-descent">Mirror Descent</h2>
<h3 id="intuition-and-algorithm">Intuition and Algorithm</h3>
<p>From the previous analysis of projected gradient descent, we can see that PGD works in an arbitrary Hilbert space \(\mathcal H\). Suppose now that we want to optimize a function over a Banach space \(\mathcal B\); then gradient descent does not even make sense: \(x^{(t)}\) lies in the space \(\mathcal B\), while the gradient \(\nabla f(x^{(t)})\) lies in the dual space \(\mathcal B^*\). This issue does not arise in a Hilbert space \(\mathcal H\), since by the Riesz representation theorem \(\mathcal H^*\) is isometric to \(\mathcal H\).</p>

<p>The great insight of Nemirovski and Yudin (the creators of mirror descent) is that one can still perform a gradient step by first mapping the point \(x\in\mathcal B\) into the dual space \(\mathcal B^*\) (via a function \(\Phi\) called the mirror map), performing the gradient update in the dual space, mapping back to \(\mathcal B\), and finally projecting onto the constraint set \(\mathcal X\). We will first introduce the mirror map and then discuss the projection step.</p>

<p><strong>Mirror map</strong>: Let \(\mathcal D\subset\mathbb R^n\) be a convex open set such that \(\mathcal X\) is included in its closure, that is \(\mathcal X \subset \mathcal{\bar D}\), and \(\mathcal X\cap\mathcal D\neq\emptyset\). We say that \(\Phi:\mathcal D\to\mathbb R\) is a mirror map if it satisfies the following properties:</p>
<ol>
  <li>\(\Phi\) is strictly convex and differentiable.</li>
  <li>The gradient of \(\Phi\) takes all possible values, that is \(\nabla\Phi(\mathcal D) = \mathbb R^n\).</li>
  <li>The  gradient of \(\Phi\) diverges on the boundary of \(\mathcal D\).</li>
</ol>

<p>The projection operator is defined via the following Bregman divergence.</p>

<blockquote>
  <p><strong>Bregman divergence</strong>: For any strictly convex function \(\Phi\), we define the Bregman divergence as</p>

\[D_{\Phi}(y,x) := \Phi(y) - \Phi(x) - \langle\nabla \Phi(x),y-x\rangle.\]
</blockquote>

<p>And we define the projection of \(y\) as</p>

\[\Pi_{\mathcal X}^{\Phi}(y) = \arg\min_{x\in\mathcal X\cap\mathcal D}D_{\Phi}(x,y).\]
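<p>As a concrete sanity check of these definitions (a sketch; the probability vectors below are arbitrary examples): for the negative entropy \(\Phi(x)=\sum_i x_i\log x_i\), the Bregman divergence between two points of the simplex is exactly the KL divergence \(\sum_i y_i\log(y_i/x_i)\).</p>

```python
import numpy as np

def bregman(phi, grad_phi, y, x):
    """D_Phi(y, x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>."""
    return phi(y) - phi(x) - grad_phi(x) @ (y - x)

neg_entropy = lambda x: float(np.sum(x * np.log(x)))
grad_neg_entropy = lambda x: np.log(x) + 1.0

x = np.array([0.5, 0.3, 0.2])           # arbitrary example points on the simplex
y = np.array([0.2, 0.5, 0.3])
d = bregman(neg_entropy, grad_neg_entropy, y, x)
kl = float(np.sum(y * np.log(y / x)))   # KL(y || x)
```
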

<p>The <strong>Mirror Descent</strong> algorithm works as follow:</p>
<ol>
  <li>Initial point \(x^{(0)} \in \mathbb R^n\) such that \(\nabla \Phi(x^{(0)}) = 0\), step size \(\eta &gt; 0\).</li>
  <li><strong>For</strong> \(k=0,1,2,\dots\) <strong>do</strong></li>
  <li>Pick any \(g^{(k)}\in\partial f(x^{(k)})\).</li>
  <li>
\[\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta\cdot g^{(k)}.\]
  </li>
  <li>
\[x^{(k+1)} = \Pi_{\mathcal X}^{\Phi}(y^{(k+1)}) := \arg\min_{x\in \mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(k+1)}).\]
  </li>
</ol>
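<p>The steps above can be sketched for the entropy mirror map on the simplex, where \(\nabla\Phi(x) = \log x + 1\) coordinate-wise and the Bregman projection onto the simplex reduces to a plain normalization. A minimal illustration, assuming a hypothetical linear loss \(f(x) = c^Tx\):</p>

```python
import numpy as np

def mirror_descent_simplex(c, n, eta, steps):
    """Mirror descent with Phi(x) = sum_i x_i log x_i, minimizing the
    linear loss f(x) = <c, x> over the simplex."""
    x = np.ones(n) / n                     # minimizer of Phi on the simplex
    for _ in range(steps):
        g = c                              # (sub)gradient of <c, x>
        theta = np.log(x) + 1.0 - eta * g  # step 4: grad Phi(x) - eta * g
        y = np.exp(theta - 1.0)            # invert grad Phi: y = e^(theta - 1)
        x = y / y.sum()                    # step 5: Bregman projection = normalize
    return x

c = np.array([0.9, 0.1, 0.5])              # hypothetical cost vector
x = mirror_descent_simplex(c, n=3, eta=0.1, steps=300)
```

<p>The combined update is \(x_i \leftarrow x_i e^{-\eta c_i}/Z\), i.e. multiplicative weights; for a linear loss the mass concentrates on the coordinate with the smallest cost.</p>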

<p>Below is an illustration of mirror descent (picture from Convex Optimization: Algorithms and Complexity).
<img src="/assets/images/opttheory4-mirror-descent.PNG" alt="Alt text" /></p>

<p>Actually, one can also rewrite mirror descent as follows:</p>

<p>\begin{align}
    x^{(t+1)} =&amp; \arg\min_{x\in\mathcal X\cap\mathcal D}D_{\Phi}(x,y^{(t+1)}) \newline
    =&amp; \arg\min_{x\in\mathcal X\cap\mathcal D}(\Phi(x) - \Phi(y^{(t+1)}) - \nabla \Phi(y^{(t+1)})^Tx) \newline
    =&amp; \arg\min_{x\in\mathcal X\cap\mathcal D}(\Phi(x) - (\nabla \Phi (x^{(t)}) - \eta g^{(t)})^Tx) \newline
    =&amp; \arg\min_{x\in\mathcal X\cap\mathcal D}(\eta (g^{(t)})^Tx + D_{\Phi}(x,x^{(t)}))
\end{align}</p>

<p>This expression is often taken as the definition of mirror descent. It gives a proximal point of view on mirror descent.</p>

<h3 id="analysis-of-mirror-descent">Analysis of Mirror Descent</h3>
<p>We first have the following property:</p>
<blockquote>
\[(\nabla f(x) - \nabla f(y))^T(x-z) = D_f(x,y) + D_f(z,x) - D_f(z,y).\]
</blockquote>

<p><em>Proof:</em> The proof is a direct computation.</p>

<p>\begin{align}
RHS =&amp; f(x) - f(y) - \nabla f(y)^T(x-y) + f(z) - f(x) \newline
&amp;\quad - \nabla f(x)^T(z-x) - f(z) + f(y) + \nabla f(y)^T(z-y) \newline
=&amp; \nabla f(x)^T(x-z) - \nabla f(y)^T(x-z) \newline
=&amp; LHS.
\end{align}</p>
<p align="right">&#11036;</p>

<p>Then we have the following Pythagorean Theorem.</p>

<blockquote>
  <p><strong>Pythagorean Theorem</strong>: Let \(x\in\mathcal X\cap\mathcal D\) and \(y\in\mathcal D\), then</p>

\[(\nabla\Phi(\Pi_{\mathcal X}^{\Phi}(y)) - \nabla\Phi(y))^T(\Pi_{\mathcal X}^{\Phi}(y)-x)\le 0,\]

  <p>which also implies</p>

\[D_{\Phi}(x,\Pi_{\mathcal X}^{\Phi}(y)) + D_{\Phi}(\Pi_{\mathcal X}^{\Phi}(y),y) \le D_{\Phi}(x,y).\]
</blockquote>

<p><em>Proof:</em> First, it is easy to show that</p>

\[\nabla_xD_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y).\]

<p>Then it is equivalent to show that</p>

\[\nabla_xD_{\Phi}(\Pi_{\mathcal X}^{\Phi}(y),y)^T(\Pi_{\mathcal X}^{\Phi}(y)-x)\le 0.\]

<p>Let \(D_{\Phi}(\cdot,y) := f(\cdot)\), then \(\Pi_{\mathcal X}^{\Phi}(y)\) is the minimizer of \(f\) in set \(\mathcal X\cap\mathcal D\). Then from the optimality of convex functions, we have</p>

\[\nabla f(x^*)^T(x^* - x)\le 0,\forall x\in \mathcal X\cap\mathcal D,\]

<p>which is just the inequality</p>

\[\nabla_xD_{\Phi}(\Pi_{\mathcal X}^{\Phi}(y),y)^T(\Pi_{\mathcal X}^{\Phi}(y)-x)\le 0.\]

<p align="right">&#11036;</p>

<p>Then, we have the following theorem:</p>

<blockquote>
  <p>Let \(\Phi\) be a mirror map \(\rho\)-strongly convex on \(\mathcal X\cap\mathcal D\) w.r.t 
\(||\cdot||\)
. Let \(R^2 = \sup_{x\in\mathcal X\cap\mathcal D}\Phi(x) - \Phi(x^{(0)})\), and \(f\) be convex and \(L\)-Lipschitz w.r.t 
\(||\cdot||\)
. Then the mirror descent with \(\eta = \frac{R}{L}\sqrt{\frac{2\rho}{t}}\) satisfies</p>

\[f\left(\frac{1}{t}\sum_{s=0}^{t-1}x^{(s)}\right) - f(x^*) \le RL\sqrt{\frac{2}{\rho t}}.\]
</blockquote>

<p><em>Proof:</em> Let \(x\in\mathcal X\cap\mathcal D\). The claimed bound will be obtained by taking a limit \(x\to x^*\). By the convexity of \(f\), the definition of mirror descent, and the previous 2 lemmas, we have</p>

<p>\begin{align}
    &amp;f(x^{(s)}) - f(x) \newline
    \le&amp; (g^{(s)})^T(x^{(s)} - x) \newline
    =&amp; \frac{1}{\eta}(\nabla \Phi(x^{(s)}) - \nabla\Phi(y^{(s+1)}))^T(x^{(s)} - x) \newline
    =&amp; \frac{1}{\eta}\left(D_{\Phi}(x,x^{(s)})+D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x,y^{(s+1)})\right) \newline
    \le &amp; \frac{1}{\eta}\left(D_{\Phi}(x,x^{(s)})+D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x,x^{(s+1)}) - D_{\Phi}(x^{(s+1)},y^{(s+1)})\right).
\end{align}</p>

<p>The term \(D_{\Phi}(x,x^{(s)})-D_{\Phi}(x,x^{(s+1)})\) will lead to a telescopic sum, and it remains to bound the other term. By the \(\rho\)-strongly convexity of the mirror map and \(az-bz^2 \le \frac{a^2}{4b}\), we have</p>

<p>\begin{align}
    &amp; D_{\Phi}(x^{(s)},y^{(s+1)}) - D_{\Phi}(x^{(s+1)},y^{(s+1)}) \newline
    =&amp; \Phi(x^{(s)}) - \Phi(x^{(s+1)}) - \nabla \Phi(y^{(s+1)})^T(x^{(s)} - x^{(s+1)}) \newline
    \le&amp; (\nabla\Phi(x^{(s)})- \nabla \Phi(y^{(s+1)}))^T(x^{(s)} - x^{(s+1)}) - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline
    =&amp; \eta (g^{(s)})^T(x^{(s)} - x^{(s+1)}) - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline
    \le&amp; \eta L ||x^{(s)} - x^{(s+1)}|| - \frac{\rho}{2}||x^{(s)} - x^{(s+1)}||^2 \newline
    \le&amp; \frac{(\eta L)^2}{2\rho}.
\end{align}</p>

<p>Then we have</p>

\[\sum_{s=0}^{t-1}(f(x^{(s)}) - f(x)) \le \frac{D_{\Phi}(x,x^{(0)}) - D_{\Phi}(x,x^{(t)})}{\eta} + \eta\frac{L^2 t}{2\rho},\]

<p>Dividing by \(t\), applying Jensen's inequality to the average iterate, bounding \(D_{\Phi}(x,x^{(0)}) \le R^2\) (which uses \(\nabla\Phi(x^{(0)}) = 0\)), and plugging in the choice of \(\eta\) concludes the proof.</p>

<p align="right">&#11036;</p>

<h3 id="standard-setups-for-mirror-descent">Standard Setups for Mirror Descent</h3>
<p><strong>Ball setup:</strong> Taking \(\Phi(x) = \frac{1}{2}||x||^2\) on \(\mathcal D = \mathbb R^n\), the function \(\Phi\) is a mirror map and 1-strongly convex w.r.t.
\(||\cdot ||_2\),
and furthermore, the associated Bregman divergence is given by</p>

<p>\begin{align}
    D_{\Phi}(x,y) =&amp; \Phi(x) - \Phi(y) - \nabla\Phi(y)^T(x-y) \newline
    =&amp; \frac{||x||^2}{2} - \frac{||y||^2}{2} - y^T(x-y) \newline
    =&amp; \frac{1}{2}||x-y||^2.
\end{align}</p>

<p>In this case, the mirror descent is exactly equivalent to projected subgradient descent.</p>

<p><strong>Simplex setup:</strong> Now we focus on the case where \(\Phi(x) = \sum_i x_i\log x_i\) on \(\mathcal D = \{x : x_i &gt; 0\}\), with \(\mathcal X = \{x : x_i\ge 0, \sum_i x_i = 1\}\) the simplex.</p>

<p><em>Step formula:</em> A direct computation shows that</p>

\[x_i^{(k+1)} = e^{-\eta g_i^{(k)}}x_i^{(k)} / Z,\]

<p>for some normalization constant \(Z\). Note that this update formula is just the multiplicative weight updating.</p>

<p><em>Strong convexity:</em> By Pinsker's inequality, \(\Phi(x)\) is 1-strongly convex on the simplex w.r.t. the 1-norm, i.e.</p>

\[\Phi(y) - \Phi(x) - \nabla\Phi(x)^T(y-x) \ge \frac{1}{2}||x-y||_1^2.\]

<p>Hence, we have \(\rho = 1\).</p>

<p><em>Diameter:</em> A direct computation shows that \(-\log n\le\Phi(x)\le 0\) on the simplex. Hence, \(R^2 \le \log n\).</p>

<p><em>Final result:</em> We have the following theorem.</p>

<blockquote>
  <p>Let \(f\) be a 1-Lipschitz function w.r.t. 
\(||\cdot||_1\). Then, mirror descent with the mirror map discussed above outputs a point \(x\) in \(T\) iterations such that</p>

\[f(x) - \min_x f(x) \le \sqrt{\frac{2\log n}{T}}.\]
</blockquote>

<p>In comparison, projected gradient descent gives a bound of \(\sqrt{\frac{n}{T}}\), which is much larger when \(n\) is large.</p>

<h2 id="related-to-online-convex-optimization">Relation to Online Convex Optimization</h2>

<p>Note that the proofs of projected subgradient descent in the previous post and of mirror descent in this post do not require the iterations to use the same convex function \(f\). Based on this observation, we can easily extend the previous results to the online setting. For online mirror descent (online subgradient descent is a special case of online mirror descent), we have</p>

\[\sum_{s=0}^{t-1}(f(x^{(s)}) - f(x)) \le RL\sqrt{\frac{2t}{\rho}},\]

<p>which shows that the regret is \(O(\sqrt{t})\).</p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="opttheory" /><category term="optimization" /><summary type="html"><![CDATA[In this post, we will introduce the Mirror Descent algorithm that solves the convex optimization algorithm.]]></summary></entry><entry><title type="html">Theory of Optimization: Projected (Sub)Gradient Descent</title><link href="https://haoyuzhao123.github.io/opttheory/opttheory3/" rel="alternate" type="text/html" title="Theory of Optimization: Projected (Sub)Gradient Descent" /><published>2019-02-04T00:00:00-08:00</published><updated>2019-02-04T00:00:00-08:00</updated><id>https://haoyuzhao123.github.io/opttheory/opttheory3</id><content type="html" xml:base="https://haoyuzhao123.github.io/opttheory/opttheory3/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>In this post, we will continue our analysis for gradient descent. Different from the previous post, we will not assume that the function is smooth. We will only assume that the function is convex and has some Lipschitz constant.</p>

<!--more-->

<p>In the previous post, we assumed that every function has an \(L\)-Lipschitz gradient. For general \(L\)-smooth functions, gradient descent reaches a first-order \(\epsilon\)-critical point in \(O(\frac{1}{\epsilon^2})\) iterations. When the function is convex, we showed that \(O(\frac{1}{\epsilon})\) iterations suffice to get a solution within \(\epsilon\) of the optimum. When the function is strongly convex and smooth, the number of iterations drops to \(O(poly(\log \frac{1}{\epsilon}))\).</p>

<p>However, the smoothness assumption in the previous post implies that the function is differentiable everywhere. In this post, we only assume that the function is convex, not necessarily smooth. Moreover, while the previous post focused on the unconstrained case, this post also covers constrained minimization.</p>

<h2 id="projected-subgradient-descent">Projected Subgradient Descent</h2>
<p>In this post, we assume that the convex optimization problem has the following form:</p>

\[\min_{x\in\mathcal X}f(x).\]

<p>In the problem, \(\mathcal X\) is the constraint set, which could be \(\mathbb R^n\).</p>

<blockquote>
  <p><strong>Subgradient</strong>: Let \(\mathcal X\subset \mathbb R^n\), and \(f:\mathcal X\to \mathbb R\). Then \(g\in\mathbb R^n\) is a subgradient of \(f\) at \(x\in\mathcal X\) if for any \(y\in\mathcal X\), one has</p>

\[f(y) \ge f(x) + g^T(y-x).\]

  <p>We use \(\partial f(x)\) to denote the set of subgradient at \(x\), i.e.</p>

\[\partial f(x) := \{g\in\mathbb R^n: g \text{ is a subgradient of } f \text{ at } x\}.\]
</blockquote>

<p>Note that if \(f\) is differentiable at a point \(x\), then \(\partial f(x) = \{\nabla f(x)\}\), so the notion of subgradient is compatible with that of the gradient.</p>

<p>We will also introduce the projection operator \(\Pi_{\mathcal X}\) on \(\mathcal X\) by</p>

\[\Pi_{\mathcal X}(x) = \arg\min_{y\in\mathcal X}||x-y||^2.\]

<p>We have the following lemma for projection operator.</p>

<blockquote>
  <p>Let \(x\in\mathcal X\) and \(y\in\mathbb R^n\), then</p>

\[(\Pi_{\mathcal X}(y) - x)^T(\Pi_{\mathcal X}(y) - y)\le 0,\]

  <p>which also implies
\(||\Pi_{\mathcal X}(y) - x||^2 + ||y-\Pi_{\mathcal X}(y)||^2 \le ||y-x||^2.\)</p>
</blockquote>

<p>Then, we will introduce the projected subgradient descent algorithm. The algorithm works as follows:</p>

<ol>
  <li>For \(t = 1,2,\dots\)</li>
  <li>\(y^{(t+1)} = x^{(t)} - \eta g_t, g_t\in\partial f(x^{(t)})\) and \(x^{(t+1)} = \Pi_{\mathcal X}(y^{(t+1)})\)</li>
</ol>
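<p>A minimal numerical sketch of these two steps (assuming, for illustration, that \(\mathcal X\) is the Euclidean unit ball, whose projection is \(y\mapsto y/\max(1,||y||_2)\), and a hypothetical objective \(f(x) = ||x-a||_2\)):</p>

```python
import numpy as np

def projected_subgradient(subgrad, project, x0, eta, steps):
    """Projected subgradient descent, returning the average iterate
    (the quantity bounded by the theorem in the next section)."""
    x = x0.copy()
    avg = np.zeros_like(x0)
    for _ in range(steps):
        x = project(x - eta * subgrad(x))   # subgradient step, then project
        avg += x
    return avg / steps

# Hypothetical setup: minimize f(x) = ||x - a||_2 over the unit ball;
# the constrained minimizer is a/||a|| = (1, 0).
a = np.array([2.0, 0.0])
subgrad = lambda x: (x - a) / np.linalg.norm(x - a)
project = lambda y: y / max(1.0, np.linalg.norm(y))
x_avg = projected_subgradient(subgrad, project, np.zeros(2), eta=0.05, steps=2000)
```
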

<h3 id="analysis-for-lipschitz-functions">Analysis for Lipschitz Functions</h3>
<blockquote>
  <p>Suppose \(f\) is convex and let \(R = ||x^{(1)} - x^*||\). Furthermore, suppose that for any 
\(x\in\mathcal X, g\in\partial f(x)\), we have 
\(||g|| \le L\). Then the projected subgradient method with \(\eta = \frac{R}{L\sqrt{t}}\) satisfies</p>

\[f\left(\frac{1}{t}\sum_{s=1}^t x^{(s)}\right) - f(x^*) \le \frac{RL}{\sqrt{t}}.\]
</blockquote>

<p><em>Proof:</em> By the convexity of \(f\), we have</p>

<p>\begin{align}
&amp;f(x^{(s)}) - f(x^*) \newline
\le&amp; g_s^T(x^{(s)} - x^*) \newline
=&amp; \frac{1}{\eta}(x^{(s)} - y^{(s+1)})^T(x^{(s)} - x^*) \newline
=&amp; \frac{1}{2\eta}(||x^{(s)} - x^*||^2 + ||x^{(s)} - y^{(s+1)}||^2 - ||x^{*} - y^{(s+1)}||^2) \newline
=&amp; \frac{1}{2\eta}(||x^{(s)} - x^*||^2 - ||x^{*} - y^{(s+1)}||^2) + \frac{\eta}{2}||g_s||^2.
\end{align}</p>

<p>Now, from \(||g_s|| \le L\) and 
\(||x^{*} - y^{(s+1)}||^2 \ge ||x^{*} - x^{(s+1)}||^2\), 
we have</p>

\[f(x^{(s)}) - f(x^*) \le \frac{1}{2\eta}(||x^{(s)} - x^*||^2 - ||x^{*} - x^{(s+1)}||^2) + \frac{\eta}{2}L^2.\]

<p>Summing the inequalities over \(s\), telescoping, applying Jensen's inequality to the average iterate, and plugging in \(\eta\), we complete the proof.</p>

<p align="right">&#11036;</p>

<h3 id="analysis-for-smooth-functionslipschitz-gradient">Analysis for Smooth Functions (Lipschitz Gradient)</h3>
<p>In this section, we assume that \(f\) is convex and \(\beta\)-smooth. In the previous post, when the optimization problem was unconstrained, we had the following inequality (with parameters renamed)</p>

\[f(x^{(k+1)}) \le f(x^{(k)}) - \frac{1}{2\beta}||\nabla f(x^{(k)})||_2^2.\]

<p>However, this inequality may not hold in the constrained case, since we need to apply the projection operation. The next lemma identifies the ‘right’ quantity to measure the descent procedure.</p>

<blockquote>
  <p>Let \(x,y\in\mathcal X,x^{+} = \Pi_{\mathcal X}(x-\frac{1}{\beta}\nabla f(x))\) and \(g_{\mathcal X}(x) = \beta (x-x^{+})\). Then the following holds true:</p>

\[f(x^+) - f(y) \le g_{\mathcal X}(x)^T(x-y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2.\]
</blockquote>

<p><em>Proof:</em> We first observe that</p>

\[\nabla f(x)^T(x^+ - y) \le g_{\mathcal X}(x)^T(x^+ - y),\]

<p>since the above inequality is equivalent to</p>

\[\left(x^+ - \left(x - \frac{1}{\beta}\nabla f(x)\right)\right)^T(x^+ - y) \le 0.\]

<p>Then, we have</p>

<p>\begin{align}
    &amp;f(x^+) - f(y) \newline
    =&amp; f(x^+) -f(x) + f(x) - f(y) \newline
    \le&amp; \nabla f(x)^T(x^+ - x) + \frac{\beta}{2}||x^+ - x||^2 + \nabla f(x)^T(x-y) \newline
    =&amp; \nabla f(x)^T(x^+ - y) + \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 \newline
    \le&amp; g_{\mathcal X}(x)^T(x^+ - y) + \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 \newline
    =&amp; g_{\mathcal X}(x)^T(x - y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2.
\end{align}</p>

<p align="right">&#11036;</p>

<p>Now we can show the following theorem for the convergence of PGD.</p>

<blockquote>
  <p>Let \(f\) be convex and \(\beta\)-smooth on \(\mathcal X\). Then projected gradient descent with \(\eta = \frac{1}{\beta}\) satisfies</p>

\[f(x^{(t)}) - f(x^*) \le \frac{3\beta ||x^{(1)} - x^*||^2 + f(x^{(1)}) - f(x^*)}{t}.\]
</blockquote>

<p><em>Proof:</em> From the previous lemma, we have</p>

\[f(x^{(s+1)}) - f(x^{(s)}) \le -\frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2,\]

<p>and</p>

\[f(x^{(s+1)}) - f(x^*) \le ||g_{\mathcal X}(x^{(s)})||\cdot ||x^{(s)} - x^*||.\]

<p>Then we show that 
\(||x^{(s)} - x^*||\)
 is decreasing with \(s\). From the previous lemma, we can also find</p>

\[g_{\mathcal X}(x^{(s)})^T(x^{(s)} - x^*) \ge \frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2,\]

<p>then we have</p>

<p>\begin{align}
    &amp;||x^{(s+1)} - x^*||^2 \newline
    =&amp; ||x^{(s)} - \frac{1}{\beta}g_{\mathcal X}(x^{(s)}) - x^*||^2 \newline
    =&amp; ||x^{(s)} - x^*||^2 - \frac{2}{\beta}g_{\mathcal X}(x^{(s)})^T(x^{(s)} - x^*) + \frac{1}{\beta^2}||g_{\mathcal X}(x^{(s)})||^2\newline
    \le&amp; ||x^{(s)} - x^*||^2.
\end{align}</p>

<p>Let \(\epsilon_s = f(x^{(s)}) - f(x^*)\), we have</p>

<p>\begin{align}
\epsilon_{s+1} \le&amp; \epsilon_s - \frac{1}{2\beta}||g_{\mathcal X}(x^{(s)})||^2 \newline
\le&amp; \epsilon_s - \frac{1}{2\beta||x^{(s)} - x^*||^2}\epsilon_{s+1}^2 \newline
\le&amp; \epsilon_s - \frac{1}{2\beta||x^{(1)} - x^*||^2}\epsilon_{s+1}^2.
\end{align}</p>

<p>Then we can finish the proof by simple induction.</p>

<p align="right">&#11036;</p>

<h3 id="analysis-for-strongly-convex-functions">Analysis for Strongly Convex Functions</h3>
<p>In this section, we consider the projected gradient descent with time-varying step size \((\eta_t)_{t\ge 1}\), that is</p>
<ol>
  <li>For \(t = 1,2,\dots\)</li>
  <li>\(y^{(t+1)} = x^{(t)} - \eta_t g_t, g_t\in\partial f(x^{(t)})\) and \(x^{(t+1)} = \Pi_{\mathcal X}(y^{(t+1)})\)</li>
</ol>
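<p>A minimal numerical sketch of this time-varying-step variant, including the weighted average \(\sum_{s=1}^t\frac{2s}{t(t+1)}x^{(s)}\) that the analysis below bounds (the objective \(f(x)=||x-a||_2^2\), which is strongly convex with \(\alpha = 2\), and the box constraint are hypothetical examples):</p>

```python
import numpy as np

def pgd_strongly_convex(grad, project, x0, alpha, t):
    """Projected gradient descent with eta_s = 2/(alpha*(s+1)),
    returning the weighted average sum_s 2s/(t(t+1)) * x^(s)."""
    x = x0.copy()
    out = np.zeros_like(x0)
    for s in range(1, t + 1):
        eta = 2.0 / (alpha * (s + 1))
        x = project(x - eta * grad(x))
        out += 2.0 * s / (t * (t + 1)) * x   # weights 2s/(t(t+1)) sum to 1
    return out

# Hypothetical setup: f(x) = ||x - a||_2^2 over the box [0, 1]^2,
# whose constrained minimizer is (1, 0.5).
a = np.array([2.0, 0.5])
grad = lambda x: 2.0 * (x - a)
project = lambda y: np.clip(y, 0.0, 1.0)
x_out = pgd_strongly_convex(grad, project, np.zeros(2), alpha=2.0, t=500)
```
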

<p>Then we have the following theorem</p>

<blockquote>
  <p>Let \(f\) be \(\alpha\)-strongly convex and \(L\)-Lipschitz on \(\mathcal X\). Then projected subgradient descent with \(\eta_s = \frac{2}{\alpha (s+1)}\) satisfies,</p>

\[f\left(\sum_{s=1}^t\frac{2s}{t(t+1)}x_s\right) - f(x^*) \le \frac{2L^2}{\alpha(t+1)}.\]
</blockquote>

<p><em>Proof:</em> As in the analysis of projected subgradient descent for Lipschitz functions, but now also using strong convexity, we have</p>

\[f(x^{(s)}) - f(x^*) \le \left(\frac{1}{2\eta_s} - \frac{\alpha}{2}\right)||x^{(s)} - x^*||^2 - \frac{1}{2\eta_s}||x^{*} - x^{(s+1)}||^2 + \frac{\eta_s}{2}L^2.\]

<p>Multiplying the inequalities by \(s\), we have</p>

\[s(f(x^{(s)}) - f(x^*)) \le \frac{\alpha}{4}\left(s(s-1)||x^{(s)} - x^*||^2 - s(s+1)||x^{*} - x^{(s+1)}||^2\right) + \frac{L^2}{\alpha}.\]

<p>Summing up the above inequalities and applying Jensen's inequality, we complete the proof.</p>

<p align="right">&#11036;</p>

<h3 id="analysis-for-strongly-convex-and-smooth-functions">Analysis for Strongly Convex and Smooth Functions</h3>
<p>The key improvement over the convex and smooth case is that one can show</p>

\[f(x^+) - f(y) \le g_{\mathcal X}(x)^T(x-y) - \frac{1}{2\beta}||g_{\mathcal X}(x)||^2 - \frac{\alpha}{2}||x-y||^2.\]

<p>Based on this result, we have the following theorem</p>

<blockquote>
  <p>Let \(f\) be \(\alpha\)-strongly convex and \(\beta\)-smooth on \(\mathcal X\). Then projected gradient descent with \(\eta = \frac{1}{\beta}\) satisfies,</p>

\[||x^{(t+1)} - x^*||^2 \le \left(1-\frac{\alpha}{\beta}\right)^t||x^{(1)} - x^*||^2.\]
</blockquote>

<p><em>Proof:</em> Using the previous inequality with \(y = x^*\), we have</p>

<p>\begin{align}
    &amp; ||x^{(t+1)} - x^*||^2 \newline
    = &amp; ||x^{(t)} - \frac{1}{\beta}g_{\mathcal X}(x^{(t)}) - x^*||^2 \newline
    = &amp; ||x^{(t)} - x^*||^2 - \frac{2}{\beta}g_{\mathcal X}(x^{(t)})^T(x^{(t)} - x^*) + \frac{1}{\beta^2}||g_{\mathcal X}(x^{(t)})||^2 \newline
    \le &amp; \left(1-\frac{\alpha}{\beta}\right)||x^{(t)} - x^*||^2 \newline
    \le &amp; \left(1-\frac{\alpha}{\beta}\right)^t||x^{(1)} - x^*||^2,
\end{align}</p>

<p>where the last inequality follows by induction.</p>

<p align="right">&#11036;</p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="opttheory" /><category term="optimization" /><summary type="html"><![CDATA[In this post, we will continue our analysis for gradient descent. Different from the previous post, we will not assume that the function is smooth. We will only assume that the function is convex and has some Lipschitz constant.]]></summary></entry><entry><title type="html">Theory of Optimization: Gradient Descent</title><link href="https://haoyuzhao123.github.io/opttheory/opttheory2/" rel="alternate" type="text/html" title="Theory of Optimization: Gradient Descent" /><published>2019-02-03T00:00:00-08:00</published><updated>2019-02-03T00:00:00-08:00</updated><id>https://haoyuzhao123.github.io/opttheory/opttheory2</id><content type="html" xml:base="https://haoyuzhao123.github.io/opttheory/opttheory2/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>In this post, we will review the most basic and intuitive optimization method: gradient descent.</p>

<!--more-->

<h2 id="gradient-descent">Gradient Descent</h2>
<p>The gradient descent algorithm works as follows. The algorithm requires an initial point \(x^{(0)}\in\mathbb R^n\) and a step size \(h &gt; 0\). Then the algorithm repeatedly executes</p>

\[x^{(t+1)} = x^{(t)} - h\cdot\nabla f(x^{(t)}),\]

<p>until 
\(||\nabla f(x^{(t)})|| \le \epsilon\). 
Throughout this section, we will assume that the gradient of \(f\) is \(L\)-Lipschitz, i.e.</p>

\[||\nabla f(x) - \nabla f(y)|| \le L||x-y||,\]

<p>in which case we say that \(f\) has an \(L\)-Lipschitz gradient, or is \(L\)-smooth.</p>
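<p>As a minimal sketch of this loop and its stopping rule (the quadratic objective \(f(x)=\frac{1}{2}x^TAx\) is a hypothetical example; its gradient \(Ax\) is \(L\)-Lipschitz with \(L\) the largest eigenvalue of \(A\)):</p>

```python
import numpy as np

def gradient_descent(grad, x0, h, eps):
    """Repeat x <- x - h * grad(x) until ||grad(x)|| <= eps."""
    x = x0.copy()
    while np.linalg.norm(grad(x)) > eps:
        x = x - h * grad(x)
    return x

# Hypothetical example: f(x) = 0.5 * x^T A x with A diagonal, so the
# gradient A x is L-Lipschitz with L = 2 (the largest eigenvalue).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
x = gradient_descent(lambda x: A @ x, np.array([1.0, 1.0]), h=1.0 / 2.0, eps=1e-6)
```
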

<h3 id="analysis-for-general-smooth-functions">Analysis for General Smooth Functions</h3>
<p>In order to show the convergence of gradient descent for the general functions, we need the following lemma,</p>

<blockquote>
  <p>Suppose \(f\) is L-lipschitz gradient, then</p>

\[|f(y) - f(x) - \nabla f(x)^T(y-x)| \le \frac{L}{2}||x - y||^2.\]
</blockquote>

<p><em>Proof:</em> From the basic calculus and the assumption that \(f\) is L-lipschitz gradient, we have</p>

<p>\begin{align}
    &amp;|f(y) - f(x) - \nabla f(x)^T(y-x)| \newline
    =&amp; \left|\int_{0}^1 \nabla f(x + t(y-x))^T(y-x)dt - \nabla f(x)^T(y-x)\right|\newline
    =&amp; \left|\int_{0}^1 (\nabla f(x + t(y-x))-\nabla f(x))^T(y-x)dt\right| \newline
    \le&amp; \int_{0}^1 ||\nabla f(x + t(y-x))-\nabla f(x)||\cdot ||y-x||dt \newline
    \le&amp; \int_{0}^1 tL||y-x||^2dt \newline
    =&amp; \frac{L}{2}||y-x||^2.
\end{align}</p>

<p align="right">&#11036;</p>

<p>The previous lemma gives an upper and a lower bound on the function value around a point when the gradient is Lipschitz. From it, we can show the convergence of gradient descent.</p>

<blockquote>
  <p>Let \(f\) be a function with L-Lipschitz gradient and \(x^*\) be any minimizer of \(f\). The gradient descent with step size \(h = \frac{1}{L}\) outputs a point \(x\) such that
\(||\nabla f(x)||\le \epsilon\)
in \(\frac{2L}{\epsilon^2}(f(x^{(0)}) - f(x^*))\) iterations.</p>
</blockquote>

<p><em>Proof:</em> From the previous lemma, we have the following equation</p>

<p>\begin{align}
f(x^{(t+1)}) \le&amp; f(x^{(t)}) + \nabla f(x^{(t)})^T(x^{(t+1)} - x^{(t)}) + \frac{L}{2}||x^{(t+1)} - x^{(t)}||^2 \newline
=&amp; f(x^{(t)}) - \nabla f(x^{(t)})^T\cdot \frac{1}{L}\nabla f(x^{(t)}) + \frac{L}{2}\frac{1}{L^2}||\nabla f(x^{(t)})||^2 \newline
=&amp; f(x^{(t)}) - \frac{1}{2L}||\nabla f(x^{(t)})||^2.
\end{align}</p>

<p>Then we have
\(||\nabla f(x^{(t)})||^2 \le 2L (f(x^{(t)}) - f(x^{(t+1)})).\)</p>

<p>Summing up these inequalities for \(t=0,1,\dots,T\), we get</p>

\[\sum_{t=0}^T ||\nabla f(x^{(t)})||^2 \le 2L(f(x^{(0)}) - f(x^{(T+1)})) \le 2L(f(x^{(0)}) - f(x^{*})).\]

<p>From this equation, we can see that the gradient descent outputs a point \(x\) such that
\(||\nabla f(x)||\le \epsilon\)
in \(\frac{2L}{\epsilon^2}(f(x^{(0)}) - f(x^*))\) iterations.</p>

<p align="right">&#11036;</p>

<h3 id="analysis-for-convex-functions">Analysis for Convex Functions</h3>
<p>In this section, we assume that the function \(f\) we want to optimize is convex with an \(L\)-Lipschitz gradient. As a result of convexity, we can show that \(f(x^{(t)}) - f(x^*)\) is \(O(\frac{1}{t})\). Before we state the theorem, we prove an auxiliary lemma.</p>

<blockquote>
  <p>For any convex function \(f\in\mathcal C^1(\mathbb R^n)\), we have</p>

\[f(x) - f(y) \le ||\nabla f(x)||_2\cdot ||x-y||_2.\]
</blockquote>

<p><em>Proof:</em> From the first order condition, if \(f\) is convex, we have</p>

\[f(y) \ge f(x) + \nabla f(x)^T(y-x).\]

<p>Arranging the terms and using the basic property of norms, we have</p>

\[f(x) - f(y) \le \nabla f(x)^T(x-y) \le ||\nabla f(x)||\cdot ||x-y||.\]

<p align="right">&#11036;</p>

<p>With the help of the previous lemma, we can show the convergence bound for convex functions now.</p>

<blockquote>
  <p>Let \(f\in\mathcal C^2(\mathbb R^n)\) be convex with an L-Lipschitz gradient and \(x^*\) be any minimizer of \(f\). With step size \(h = \frac{1}{L}\), the sequence \(x^{(k)}\) in gradient descent satisfies</p>

\[f(x^{(k)}) - f(x^{*}) \le \frac{2LD^2}{k+4},\]

  <p>where 
\(D = \max_{f(x) \le f(x^{(0)})}||x-x^*||_2.\)</p>
</blockquote>

<p><em>Proof:</em> Let \(\epsilon_k = f(x^{(k)}) - f(x^*)\). Since \(f\) has an L-Lipschitz gradient, we have</p>

\[\epsilon_{k+1} \le \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2.\]

<p>Then from the previous lemma, we have</p>

\[\epsilon_k \le ||\nabla f(x^{(k)})||_2\cdot ||x^{(k)} - x^*||_2 \le ||\nabla f(x^{(k)})||_2\cdot D.\]

<p>Therefore, we have</p>

\[\epsilon_{k+1} \le \epsilon_k - \frac{1}{2L}\left(\frac{\epsilon_k}{D}\right)^2.\]

<p>Moreover, we can also bound \(\epsilon_0\) by the following inequality,</p>

\[\epsilon_0 = f(x^{(0)}) - f(x^*) \le \nabla f(x^*)^T(x^{(0)} - x^*) + \frac{L}{2}||x^{(0)} - x^*||^2 \le \frac{LD^2}{2}.\]

<p>Then we have</p>

\[\frac{1}{\epsilon_{k+1}} - \frac{1}{\epsilon_k} = \frac{\epsilon_k-\epsilon_{k+1}}{\epsilon_{k+1}\epsilon_k} \ge \frac{\epsilon_k-\epsilon_{k+1}}{\epsilon_k^2} \ge \frac{1}{2LD^2}.\]

<p>Then</p>

\[\frac{1}{\epsilon_k} \ge \frac{1}{\epsilon_0} + \frac{k}{2LD^2} \ge \frac{k+4}{2LD^2},\]

<p>which completes the proof.</p>

<p align="right">&#11036;</p>

<h3 id="strongly-convex-functions">Strongly Convex Functions</h3>
<p>In the previous analyses for general smooth functions and for convex functions, we showed that gradient descent needs \(O(\frac{1}{\epsilon^2})\) and \(O(\frac{1}{\epsilon})\) iterations, respectively, to converge to an \(\epsilon\)-approximation. (Note: here, \(\epsilon\)-approximation does not refer to the formal definition used in approximation algorithms. For general functions it means the gradient is small, and for convex functions it means the error relative to the optimum is small.) We now introduce strongly convex functions and analyze the convergence of gradient descent for them. In particular, we will show that \(poly(\log\frac{1}{\epsilon})\) iterations suffice to reach an \(\epsilon\)-approximation.</p>

<blockquote>
  <p><strong>Strongly convex</strong>: We call a function \(f \in \mathcal C^1(\mathbb R^n)\) is \(\mu\)-strongly convex if for any \(x, y \in \mathbb R^n\),</p>

\[f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}||y-x||^2.\]
</blockquote>

<p>The following theorem gives an equivalent definition of strongly convex functions when the function is twice differentiable.</p>

<blockquote>
  <p><strong>Strongly convex</strong>: Let \(f \in \mathcal C^2(\mathbb R^n)\). Then \(f\) is \(\mu\)-strongly convex if and only if</p>

\[\nabla^2 f(x)\succeq \mu\cdot I,\forall x\in\mathbb R^n.\]
</blockquote>

<p>We will not prove this theorem. This theorem can be shown by the Taylor’s expansion in multi-variable case.</p>

<p>With the definition of strongly convex functions, we can now prove the following theorem.</p>

<blockquote>
  <p>Let \(f\in\mathcal C^2(\mathbb R^n)\) be \(\mu\)-strongly convex with L-Lipschitz gradient and \(x^*\) be any minimizer of \(f\). With step size \(h = \frac{1}{L}\), the sequence \(x^{(k)}\) in GradientDescent satisfies</p>

\[f(x^{(k)}) - f(x^{*}) \le \left(1-\frac{\mu}{L}\right)^k(f(x^{(0)}) - f(x^{*})).\]
</blockquote>
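<p>Before the proof, here is a quick numerical illustration (an assumed example): on the quadratic \(f(x) = \frac{1}{2}x^TAx\) with \(A = \mathrm{diag}(1, 10)\), we have \(\mu = 1\) and \(L = 10\), and the error indeed contracts by a factor \(1-\frac{\mu}{L}\) per step, as the theorem asserts.</p>

```python
import numpy as np

# Numerical illustration (assumed example): f(x) = 0.5 x^T A x with
# A = diag(1, 10) is 1-strongly convex with 10-Lipschitz gradient, so
# the theorem predicts eps_k <= (1 - 1/10)^k * eps_0.
A = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ (A @ x)           # minimizer x* = 0, f(x*) = 0

x = np.array([3.0, 1.0])
eps0 = f(x)
for k in range(1, 101):
    x = x - (A @ x) / L                   # gradient step, h = 1/L
    assert f(x) <= (1.0 - mu / L) ** k * eps0 + 1e-12

print("linear rate verified over 100 iterations")
```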

<p><em>Proof:</em> From the previous analysis, since \(f\) is L-lipschitz gradient, we have</p>

\[\epsilon_{k+1} \le \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2,\]

<p>where \(\epsilon_k = f(x^{(k)}) - f(x^*)\).</p>

<p>Then since \(f\) is \(\mu\)-strongly convex, we have</p>

\[f(x^*) \ge f(x^{(k)}) + \nabla f(x^{(k)})^T(x^* - x^{(k)}) + \frac{\mu}{2}||x^* - x^{(k)}||^2.\]

<p>Rearranging the terms, we have</p>

<p>\begin{align}
f(x^{(k)}) - f(x^*) \le&amp; \nabla f(x^{(k)})^T(x^{(k)} - x^*) - \frac{\mu}{2}||x^* - x^{(k)}||^2 \newline
\le&amp; \max_{\Delta}\left(\nabla f(x^{(k)})^T\Delta - \frac{\mu}{2}||\Delta||^2\right)\newline
=&amp; \frac{1}{2\mu}||\nabla f(x^{(k)})||^2.
\end{align}</p>

<p>Putting the above inequality into the first inequality in the proof, we have</p>

<p>\begin{align}
\epsilon_{k+1} \le&amp; \epsilon_k - \frac{1}{2L}||\nabla f(x^{(k)})||_2^2 \newline
\le&amp; \epsilon_k - \frac{\mu}{L}\epsilon_k = \left(1-\frac{\mu}{L}\right)\epsilon_k \newline
\le&amp; \left(1-\frac{\mu}{L}\right)^{k+1}\epsilon_0,
\end{align}</p>

<p>where the last inequality follows by applying the recursion \(k+1\) times.</p>

<p align="right">&#11036;</p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="opttheory" /><category term="optimization" /><summary type="html"><![CDATA[In this post, we will review the most basic and the most intuitive optimization method – the gradient decent method – in optimization.]]></summary></entry><entry><title type="html">Theory of Optimization: Preliminaries and Basic Properties</title><link href="https://haoyuzhao123.github.io/opttheory/opttheory1/" rel="alternate" type="text/html" title="Theory of Optimization: Preliminaries and Basic Properties" /><published>2019-01-31T00:00:00-08:00</published><updated>2019-01-31T00:00:00-08:00</updated><id>https://haoyuzhao123.github.io/opttheory/opttheory1</id><content type="html" xml:base="https://haoyuzhao123.github.io/opttheory/opttheory1/"><![CDATA[<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>Recently, I found an interesting course taught by Prof. Yin Tat Lee at UW. The course is called `Theory of Optimization and Continuous Algorithms’, and the lecture notes are available on the course homepage: <a href="http://yintat.com/teaching/cse535-winter19/">uw-cse535-winter19</a>. As a great fan of optimization theory and algorithm design, I will follow this course and write a series of blog posts to record my study.</p>

<p>Most of the material in this series will follow the lecture notes of the course and an interesting optimization book, <a href="https://arxiv.org/abs/1405.4980">Convex Optimization: Algorithms and Complexity</a> by Sebastien Bubeck.</p>

<p>Since this is the first blog about this course, I will present the preliminaries of the optimization theory, and some basic knowledge about convex optimization, including some basic properties of convex functions.</p>

<!--more-->
<h2 id="preliminaries">Preliminaries</h2>
<p>First, we will introduce some of the preliminaries about theory of optimization. All of the definitions will be used later.</p>
<h3 id="functions">Functions</h3>
<p>First, we introduce the notions of Lipschitz continuity and differentiability.</p>
<blockquote>
  <p><strong>Lipschitz</strong>: A function \(f:V\to W\) is L-lipschitz if</p>

\[||f(x) - f(y)||_W \le L\cdot ||x - y||_V,\]

  <p>where the norms 
\(||\cdot||_{W}\)
and
\(||\cdot||_{V}\)
are \(l_2\) norms if unspecified.</p>
</blockquote>

<blockquote>
  <p><strong>Differentiable</strong>: A function \(f\) belongs to \(\mathcal C^k(\mathbb R^n)\) if it is \(k\)-times differentiable and its \(k^{th}\) derivative is continuous.</p>
</blockquote>

<h3 id="linear-algebra">Linear Algebra</h3>

<p>The notion of positive semi-definite and the norm of matrices are also important in the world of optimization. We will introduce these notions.</p>

<blockquote>
  <p><strong>Positive semi-definite</strong>: A symmetric matrix \(A\) is positive semi-definite (PSD) if \(x^T Ax \ge 0\) for all \(x \in \mathbb R^n\). We write \(A \succeq B\) if \(A - B\) is PSD.</p>
</blockquote>

<blockquote>
  <p><strong>Trace, norm of matrix</strong>: For any matrix A, we define</p>

\[trA = \sum A_{ii}, ||A||_F^2 = tr(A^TA), ||A||_{op} = \max_{||x||_2\le 1}||Ax||_2.\]

  <p>For symmetric matrix \(A\), we have 
\(trA = \sum \lambda_i, ||A||_F^2 = \sum \lambda_i^2\)
and 
\(||A||_{op} = \max_i | \lambda_i |\), 
where 
\(\lambda_i\) 
are the eigenvalues of matrix \(A\).</p>
</blockquote>
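<p>These identities are easy to verify numerically. The following sketch (an assumed example) checks them for a random symmetric matrix.</p>

```python
import numpy as np

# A quick numerical check (assumed example) of the identities above for
# a random symmetric matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2.0                       # symmetrize to make A symmetric

lam = np.linalg.eigvalsh(A)               # eigenvalues of A

tr = np.trace(A)
fro2 = np.trace(A.T @ A)                  # ||A||_F^2 = tr(A^T A)
op = np.linalg.norm(A, 2)                 # operator norm

assert np.isclose(tr, lam.sum())          # tr A = sum lambda_i
assert np.isclose(fro2, (lam**2).sum())   # ||A||_F^2 = sum lambda_i^2
assert np.isclose(op, np.abs(lam).max())  # ||A||_op = max |lambda_i|
print("identities verified")
```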

<h3 id="calculus">Calculus</h3>
<p>As far as I know, we can gain intuition for the design and proofs of many optimization algorithms through Taylor’s expansion.</p>
<blockquote>
  <p><strong>Taylor’s Remainder Theorem</strong>: For any \(g \in \mathcal C^{k+1}(\mathbb R)\), and any \(x\) and \(y\), there is a \(\xi\in [x, y]\) such that</p>

\[g(y) = \sum_{j=0}^k g^{(j)}(x)\frac{(y-x)^j}{j!} + g^{(k+1)}(\xi)\frac{(y-x)^{k+1}}{(k+1)!}.\]
</blockquote>

<h3 id="probability">Probability</h3>
<blockquote>
  <p><strong>KL-divergence</strong>: The KL-divergence of a density \(\rho\) with respect to another density \(\nu\) is defined as</p>

\[D_{KL}(\rho || \nu ) = \int \rho(x) \log \frac{\rho(x)}{\nu(x)}dx.\]
</blockquote>
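<p>As a concrete illustration (an assumed example), for densities supported on a finite set the integral above becomes a sum over the support.</p>

```python
import numpy as np

# Discrete analogue (assumed example): for densities rho and nu
# supported on a finite set, D_KL(rho || nu) = sum_i rho_i log(rho_i/nu_i).
rho = np.array([0.5, 0.3, 0.2])
nu = np.array([0.25, 0.25, 0.5])

kl = np.sum(rho * np.log(rho / nu))       # D_KL(rho || nu)

assert kl > 0.0                           # KL is nonnegative
assert np.isclose(np.sum(rho * np.log(rho / rho)), 0.0)  # D_KL(rho||rho) = 0
print(f"D_KL(rho || nu) = {kl:.4f}")
```

<p>Note that \(D_{KL}\) is not symmetric: \(D_{KL}(\rho\,||\,\nu) \ne D_{KL}(\nu\,||\,\rho)\) in general.</p>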

<blockquote>
  <p><strong>Ito’s lemma</strong>: For any process \(x_t \in \mathbb R^n\) satisfying \(dx_t = \mu(x_t)dt + \sigma(x_t)dW_t\), where \(\mu(x_t)\in\mathbb R^n\) and \(\sigma(x_t)\in\mathbb R^{n\times m}\), we have that</p>

\[df(x_t) = \nabla f(x_t)^T\mu(x_t)dt + \nabla f(x_t)^T\sigma(x_t)dW_t + \frac{1}{2}tr(\sigma(x_t)^T\nabla^2f(x_t)\sigma(x_t))dt.\]
</blockquote>

<h2 id="introductionbasic-properties-of-convex-functions">Introduction–Basic properties of convex functions</h2>
<p>In this section, we will introduce some basic properties of convex functions, and we will also give some examples.</p>

<h3 id="definitions">Definitions</h3>
<p>Here we briefly define convexity for sets and functions.</p>
<blockquote>
  <p><strong>Convex set</strong>: A set \(K\) in \(\mathbb R^n\) is convex if for every pair of points \(x,y\in K\), we have \([x,y] \subseteq K\).</p>
</blockquote>

<blockquote>
  <p><strong>Convex function</strong>: A function \(f:\mathbb R^n\to\mathbb R\) is</p>

  <ol>
    <li>Convex: if for any \(\lambda\in [0,1]\), we have \(f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)\).</li>
    <li>Concave: if for any \(\lambda\in [0,1]\), we have \(f(\lambda x + (1-\lambda) y) \ge \lambda f(x) + (1-\lambda) f(y)\).</li>
    <li>Logconcave: if \(f\) is nonnegative and \(\log f\) is concave.</li>
    <li>Quasiconvex: if the level sets of \(f\), defined by \(\{x:f(x)\le t\}\) are convex for all \(t\).</li>
  </ol>
</blockquote>

<h3 id="seperation-theorems">Separation theorems</h3>
<p>Why are convex functions so useful? Perhaps because convexity is general enough to model many kinds of problems, yet simple enough to be optimized efficiently. In this section, we introduce the separation theorem for convex sets, which allows us to do binary search to find a point in a convex set. This is the basis for polynomial-time algorithms for optimizing general convex functions.</p>

<blockquote>
  <p><strong>Separation of a point and a closed convex set</strong>: Let \(K\) be a closed convex set in \(\mathbb R^n\) and \(y\notin K\). There is a non-zero \(\theta\in\mathbb R^n\) such that</p>

\[\langle \theta, y \rangle &gt; \max_{x\in K}\langle \theta, x \rangle\]
</blockquote>

<p><em>Proof:</em> Let \(x^*\) be a point in \(K\) closest to \(y\), i.e.</p>

\[x^* \in \arg\min_{x\in K}||x-y||^2.\]

<p>(Such a minimizer always exists for closed convex sets; this is sometimes called Hilbert’s projection theorem.) Using convexity of \(K\), for any \(x \in K\) and any \(0 \le t \le 1\), we have that</p>

\[||(1 - t)x^* + tx - y||^2 \ge ||x^* - y||^2.\]

<p>Expand the LHS, we have</p>

<p>\begin{align}
||(1 - t)x^* + tx - y||^2 =&amp; ||x^* - y + t(x - x^*)||^2 \newline
=&amp; ||x^* - y||^2 + 2t\langle x^* - y, x - x^*\rangle + O(t^2).
\end{align}</p>

<p>Taking \(t\to 0^{+}\), we have \(\langle x^* - y, x - x^*\rangle \ge 0\) for all \(x\in K\).</p>

<p>Taking \(\theta = y - x^*\), we have</p>

\[\langle \theta, y \rangle &gt;\langle \theta, x^* \rangle \ge \langle \theta, x \rangle,\forall x\in K.\]

<p align="right">&#11036;</p>

<p>From the above theorem, we know that intersections of halfspaces are as general as closed convex sets. As a corollary, we have</p>

<blockquote>
  <p><strong>Intersection of halfspaces</strong>: Any closed convex set \(K\) can be written as the intersection of halfspaces as follows:</p>

\[K = \bigcap_{\theta\in\mathbb R^n}\left\{x:\langle \theta,x\rangle\le\max_{y\in K}\langle\theta,y\rangle\right\}.\]

  <p>In other words, any convex set is a limit of a sequence of polyhedra. Next, we will prove a `separation theorem’ for convex functions. Based on this theorem, one can design polynomial-time algorithms for optimizing convex functions.</p>
</blockquote>

<blockquote>
  <p><strong>First order definition</strong>: Let \(f\in\mathcal C^1(\mathbb R^n)\) be convex. Then for any \(x,y\in\mathbb R^n\), we have</p>

\[f(y) \ge f(x) + \nabla f(x)^T(y-x).\]
</blockquote>

<p><em>Proof:</em> Fix any \(x,y\in\mathbb R^n\). Let \(g(t) = f((1-t)x + ty)\). Since \(f\) is convex, so is \(g\) (any convex function is also convex when restricted to a line), and we have</p>

\[g(t) \le (1-t)g(0) + tg(1),\]

<p>which implies that</p>

\[g(1) \ge g(0) + \frac{g(t) - g(0)}{t}.\]

<p>Taking \(t\to 0^{+}\), we have \(g(1) \ge g(0) + g'(0)\). Applying the chain rule, we have</p>

\[f(y) \ge f(x) + \nabla f(x)^T(y-x).\]

<p align="right">&#11036;</p>

<p>Actually, the above theorem can be viewed as an equivalent definition of the convex functions. It is easy to show that, when</p>

\[f(y) \ge f(x) + \nabla f(x)^T(y-x),\]

<p>holds for all \(x,y\in K\) (where \(K = \text{dom}\,f\), the domain of \(f\), is convex), then \(f\) is convex.</p>

<p>Then from the previous theorem, we obtain the first-order optimality condition, stated as follows.</p>

<blockquote>
  <p><strong>First order condition for unconstrained problems</strong>: Let \(f\in\mathcal C^1(\mathbb R^n)\) be convex. Then \(x\) is a minimizer of \(\min_{x\in\mathbb R^n}f(x)\) if and only if \(\nabla f(x) = 0\).</p>
</blockquote>

<p><em>Proof:</em> If \(\nabla f(x) \ne 0\), then</p>

\[f(x - \epsilon \nabla f(x)) = f(x) - (\epsilon + o(\epsilon))||\nabla f(x)||^2 &lt; f(x),\]

<p>for small enough \(\epsilon\). Hence, such a point cannot be the minimizer.</p>

<p>On the other hand, if \(\nabla f(x) = 0\), the previous theorem shows that</p>

\[f(y) \ge f(x) + \nabla f(x)^T(y - x) = f(x), \text{for all } y.\]

<p align="right">&#11036;</p>]]></content><author><name>Haoyu Zhao</name><email>thomaszhao1998@gamil.com</email></author><category term="opttheory" /><category term="optimization" /><summary type="html"><![CDATA[Recently, I find an interesting course taught by Prof. Yin Tat Lee at UW. The course is called `Theory of Optimization and Continuous Algorithms’, and the lecture notes are available under the homepage of this courseuw-cse535-winter19. As a great fan of optimization theory and algorithm design, I think I will follow this course and write a bunch of blogs to record my study of this course. Most of the materials in this series of blogs will follow the lecture notes of the course, and and interesting optimization book Convex Optimization: Algorithms and Complexity by Sebastien Bubeck. Since this is the first blog about this course, I will present the preliminaries of the optimization theory, and some basic knowledge about convex optimization, including some basic properties of convex functions.]]></summary></entry></feed>