<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>DevMemo – machine learning</title><link>https://devmemo.gitlab.io/categories/machine-learning/</link><description>Recent content in machine learning on DevMemo</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Fri, 26 May 2023 23:09:35 -0700</lastBuildDate><atom:link href="https://devmemo.gitlab.io/categories/machine-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Machine_learning: Fundamentals</title><link>https://devmemo.gitlab.io/machine_learning/fundamentals/</link><pubDate>Fri, 26 May 2023 23:09:35 -0700</pubDate><guid>https://devmemo.gitlab.io/machine_learning/fundamentals/</guid><description>
&lt;h2 id="what-is-the-goal-of-a-machine-learning-model">What Is The Goal Of A Machine Learning Model?&lt;/h2>
&lt;p>Let&amp;rsquo;s assume we have a machine learning model $f$, where $x_i$ are its inputs, $w_i$ are its parameters, and $y$ is its prediction. We have&lt;/p>
&lt;p>$$y = f_{model}(x_1, x_2, x_3, &amp;hellip;, x_m, w_1, w_2, w_3, &amp;hellip;, w_k)$$&lt;/p>
&lt;p>The goal of machine learning is to tune $w_i$ so that the prediction $y$ is close enough to the real label $\tilde{y}$. To quantify the difference, we use a loss function.&lt;/p>
&lt;p>$$L = l(y - \tilde{y})$$&lt;/p>
&lt;p>So mathematically, we want to minimize $L$.&lt;/p>
&lt;h2 id="how-do-we-train-a-model">How Do We Train A Model?&lt;/h2>
&lt;p>In order to minimize the loss $L$, we&amp;rsquo;ll need to figure out how to tune all the parameters in the model, i.e., $w_i$.&lt;/p>
&lt;p>There are many ways we can use to tune the parameters.&lt;/p>
&lt;ul>
&lt;li>If our model function $f$ is simple enough, e.g., a linear function like $f(x) = w x + b$, we might be able to derive the values of $w_i$ that minimize the loss $L$ analytically. However, this closed-form approach only works for simple $f$.&lt;/li>
&lt;li>We can randomly assign initial values to $w_i$, and then repeatedly perturb each $w_i$ by a small delta to see if $L$ gets smaller. If it does, we keep the change, and we continue updating the parameters until no further improvement can be made. This works for a general model function $f$, but the parameter tuning is inefficient.&lt;/li>
&lt;/ul>
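&lt;p>The second approach can be sketched in a few lines of Python. This is an illustrative toy (the dataset, step size, and iteration count are arbitrary choices made for this sketch), not a practical training method:&lt;/p>

```python
import random

random.seed(0)

# Mean squared loss of a linear model f(x) = w * x + b over a dataset.
def loss(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Toy dataset generated from the target parameters w=2, b=1.
data = [(x, 2 * x + 1) for x in range(10)]

w, b = 0.0, 0.0
best = loss(w, b, data)
for _ in range(20000):
    # Randomly perturb each parameter by a small delta.
    dw = random.uniform(-0.01, 0.01)
    db = random.uniform(-0.01, 0.01)
    candidate = loss(w + dw, b + db, data)
    if candidate < best:  # Keep the change only if the loss improves.
        w, b, best = w + dw, b + db, candidate

print(w, b, best)  # w and b drift toward the targets 2 and 1
```

Note how many trials are wasted: most random perturbations are rejected, which is exactly the inefficiency mentioned above.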
&lt;p>However, these solutions do not work for neural networks, which are typically far too complicated and require more efficient parameter tuning strategies. In the world of neural networks, the most effective way to tune parameters today is &lt;strong>backward propagation with gradient descent&lt;/strong>.&lt;/p>
&lt;p>The idea behind it is simple: if we can calculate the derivative of the loss $L$ with respect to each $w_i$ (i.e., the gradient $g_i$), we can use $g_i$ to update $w_i$.&lt;/p>
&lt;p>$$w_i = w_i - \alpha \times \frac{\partial L}{\partial w_i}$$&lt;/p>
&lt;p>Here $\alpha$ is a small positive number close to $0$, typically called the &lt;strong>learning rate&lt;/strong>. We use it to make sure we don&amp;rsquo;t overshoot when updating the parameters.&lt;/p>
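&lt;p>As a minimal, self-contained illustration of this update rule (separate from any real model), we can minimize $L(w) = (w - 3)^2$, whose derivative is $\frac{\partial L}{\partial w} = 2(w - 3)$:&lt;/p>

```python
# Gradient descent on L(w) = (w - 3) ** 2, which is minimized at w = 3.
w = 0.0
alpha = 0.1  # learning rate

for step in range(100):
    grad = 2 * (w - 3)  # dL/dw
    w -= alpha * grad   # the update rule above

print(w)  # w converges toward 3
```

Each step moves $w$ a fraction of the way toward the minimum; a larger $\alpha$ would converge faster but risks overshooting.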
&lt;h2 id="a-linear-regression-model">A Linear Regression Model&lt;/h2>
&lt;p>Let&amp;rsquo;s use the above idea to build a linear regression model.&lt;/p>
&lt;p>$$y = w x + b$$
$$L = (y - \tilde{y})^2$$&lt;/p>
&lt;p>There are two parameters in this model, $w$ and $b$. First, let&amp;rsquo;s calculate the partial derivative of $L$ with respect to each of them.&lt;/p>
&lt;p>$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w} = 2 (y - \tilde{y}) x$$
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial b} = 2 (y - \tilde{y})$$&lt;/p>
&lt;p>Now, we can implement the linear regression model based on that.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> random
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>learning_rate &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0001&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>epochs &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Define our model.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>w, b &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">f&lt;/span>(x, w, b):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> x &lt;span style="color:#f92672">*&lt;/span> w &lt;span style="color:#f92672">+&lt;/span> b
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Generating training and eval datasets.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>target_w, target_b &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>training_data &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>training_data_size &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">10000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>eval_data &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>eval_data_size &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">in&lt;/span> range(training_data_size):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> x &lt;span style="color:#f92672">=&lt;/span> random&lt;span style="color:#f92672">.&lt;/span>randrange(&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">100&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> y &lt;span style="color:#f92672">=&lt;/span> f(x, target_w, target_b)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> training_data&lt;span style="color:#f92672">.&lt;/span>append((x, y))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">in&lt;/span> range(eval_data_size):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> x &lt;span style="color:#f92672">=&lt;/span> random&lt;span style="color:#f92672">.&lt;/span>randrange(&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">100&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> y &lt;span style="color:#f92672">=&lt;/span> f(x, target_w, target_b)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> eval_data&lt;span style="color:#f92672">.&lt;/span>append((x, y))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Train our model.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Training model...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> epoch &lt;span style="color:#f92672">in&lt;/span> range(epochs):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> losses &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> x, y &lt;span style="color:#f92672">in&lt;/span> training_data:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Run forward path.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> p &lt;span style="color:#f92672">=&lt;/span> f(x, w, b)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> loss &lt;span style="color:#f92672">=&lt;/span> (p &lt;span style="color:#f92672">-&lt;/span> y) &lt;span style="color:#f92672">**&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> losses&lt;span style="color:#f92672">.&lt;/span>append(loss)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Run backward path.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> w &lt;span style="color:#f92672">-=&lt;/span> learning_rate &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#f92672">*&lt;/span> (p &lt;span style="color:#f92672">-&lt;/span> y) &lt;span style="color:#f92672">*&lt;/span> x
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> b &lt;span style="color:#f92672">-=&lt;/span> learning_rate &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#f92672">*&lt;/span> (p &lt;span style="color:#f92672">-&lt;/span> y)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">&amp;#34;Epoch &lt;/span>&lt;span style="color:#e6db74">{}&lt;/span>&lt;span style="color:#e6db74">: loss=&lt;/span>&lt;span style="color:#e6db74">{:.6f}&lt;/span>&lt;span style="color:#e6db74"> (w=&lt;/span>&lt;span style="color:#e6db74">{:.3f}&lt;/span>&lt;span style="color:#e6db74">, b=&lt;/span>&lt;span style="color:#e6db74">{:.3f}&lt;/span>&lt;span style="color:#e6db74">)&amp;#34;&lt;/span>&lt;span style="color:#f92672">.&lt;/span>format(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch, sum(losses) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#f92672">/&lt;/span> len(losses), w, b
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Evaluate our model.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">Evaluating model...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>losses &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> x, y &lt;span style="color:#f92672">in&lt;/span> eval_data:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Run forward path.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> p &lt;span style="color:#f92672">=&lt;/span> f(x, w, b)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> loss &lt;span style="color:#f92672">=&lt;/span> (p &lt;span style="color:#f92672">-&lt;/span> y) &lt;span style="color:#f92672">**&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> losses&lt;span style="color:#f92672">.&lt;/span>append(loss)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Final loss is: &lt;/span>&lt;span style="color:#e6db74">{:.6f}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#f92672">.&lt;/span>format(sum(losses) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#f92672">/&lt;/span> len(losses)))
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Running the above code will give us:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>Training model...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 0: loss=5.941978 (w=2.011, b=0.402)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 1: loss=0.074746 (w=2.007, b=0.637)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 2: loss=0.027469 (w=2.004, b=0.780)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 3: loss=0.010095 (w=2.002, b=0.867)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 4: loss=0.003710 (w=2.001, b=0.919)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 5: loss=0.001363 (w=2.001, b=0.951)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 6: loss=0.000501 (w=2.001, b=0.970)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 7: loss=0.000184 (w=2.000, b=0.982)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 8: loss=0.000068 (w=2.000, b=0.989)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Epoch 9: loss=0.000025 (w=2.000, b=0.993)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Evaluating model...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Final loss is: 0.000014
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This training took 10 epochs. An epoch here means one pass over the full training dataset. As you can see, the loss becomes very small after a few epochs of training. The parameters $w$ and $b$ were tuned to $2.0$ and $0.993$, which is very close to the target values of $2.0$ and $1.0$.&lt;/p>
&lt;h2 id="what-is-missing">What Is Missing?&lt;/h2>
&lt;p>Although the above code implements the idea discussed in this note, the implementation is not practical for a decent-sized model. The gaps are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>No autograd&lt;/strong>: For large models, we typically want the gradients (e.g., $\frac{\partial L}{\partial w_i}$) to be calculated automatically rather than derived by hand. We don&amp;rsquo;t have a solution for that yet.&lt;/li>
&lt;li>&lt;strong>No distributed training&lt;/strong>: The whole training process above runs on a single CPU core on a single machine. Large models typically require far more training resources.&lt;/li>
&lt;li>&lt;strong>No GPU/TPU support&lt;/strong>: Modern ML frameworks leverage hardware accelerators, e.g., GPUs/TPUs, which make training a lot faster.&lt;/li>
&lt;/ul>
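&lt;p>As a taste of what the autograd gap involves, here is a minimal sketch of forward-mode automatic differentiation using dual numbers. This is an illustrative toy, not how production frameworks implement autograd (they typically use reverse-mode differentiation over a computation graph):&lt;/p>

```python
class Dual:
    """A dual number a + b*eps with eps**2 = 0; `deriv` carries the derivative."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._wrap(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

# Differentiate y = w * x + b with respect to w at w=2, x=5, b=1.
# Seeding w's deriv with 1.0 marks w as the variable we differentiate by.
w = Dual(2.0, 1.0)
y = w * 5 + 1
print(y.value, y.deriv)  # value 11.0, derivative dy/dw = 5.0
```

Every arithmetic operation propagates the derivative alongside the value, so the gradient comes out of the forward computation for free.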
&lt;p>These are going to be our topics next.&lt;/p></description></item></channel></rss>