<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>DevMemo – ai</title><link>https://devmemo.gitlab.io/tags/ai/</link><description>Recent content in ai on DevMemo</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 17 Jun 2023 09:15:30 -0700</lastBuildDate><atom:link href="https://devmemo.gitlab.io/tags/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Machine_learning: Autograd</title><link>https://devmemo.gitlab.io/machine_learning/autograd/</link><pubDate>Mon, 29 May 2023 00:14:02 -0700</pubDate><guid>https://devmemo.gitlab.io/machine_learning/autograd/</guid><description>
&lt;p>Recall from the note on &lt;a href="https://devmemo.gitlab.io/machine_learning/fundamentals">fundamentals&lt;/a> that the machine learning model is&lt;/p>
&lt;p>$$y = f_{model}(x_1, x_2, x_3, \ldots, x_m, w_1, w_2, w_3, \ldots, w_k)$$&lt;/p>
&lt;p>and we would like to minimize the loss&lt;/p>
&lt;p>$$L = loss(y, \tilde{y})$$&lt;/p>
&lt;p>In order to do that, we&amp;rsquo;ll need to calculate $\frac{\partial L}{\partial w_i}$ for each $i$ from $1$ to $k$. In that note we did it manually; in this note we discuss how to calculate the gradients automatically.&lt;/p>
&lt;h2 id="problem-definition">Problem Definition&lt;/h2>
&lt;p>Although we could build autograd to tackle the above problem directly, we would like to make it more general.&lt;/p>
&lt;p>More generally, the goal of autograd is, &lt;strong>given a function implemented in Python (or any other programming language), return its gradient function in the same programming language&lt;/strong>.&lt;/p>
&lt;p>To state this mathematically, suppose the given function is&lt;/p>
&lt;p>$$f(x_1, x_2, x_3, \ldots, x_m) = (y_1, y_2, y_3, \ldots, y_n)$$&lt;/p>
&lt;p>our goal is to calculate $\frac{\partial y_j}{\partial x_i}$ for each $(i, j)$ pair. In other words, the &lt;strong>Jacobian Matrix&lt;/strong>.&lt;/p>
&lt;p>$$
J_f =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} &amp;amp; \cdots &amp;amp; \frac{\partial y_1}{\partial x_m} \\
\vdots &amp;amp; \ddots &amp;amp; \vdots \\
\frac{\partial y_n}{\partial x_1} &amp;amp; \cdots &amp;amp; \frac{\partial y_n}{\partial x_m} \\
\end{pmatrix}
$$&lt;/p>
&lt;p>For example, if $f$ is&lt;/p>
&lt;p>$$f(x_1, x_2) = x_1 x_2 + \ln(x_1)$$&lt;/p>
&lt;p>autograd should return&lt;/p>
&lt;p>$$J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$$&lt;/p>
&lt;p>Note that in this general form, $x_i$ can either be an input (e.g., $x_i$ in $f_{model}$) or a weight (e.g., $w_i$ in $f_{model}$).&lt;/p>
&lt;h2 id="solutions">Solutions&lt;/h2>
&lt;p>There are many potential solutions for autograd.&lt;/p>
&lt;h3 id="numerical-differentiation">Numerical Differentiation&lt;/h3>
&lt;p>The simplest approach is numerical differentiation, i.e.,&lt;/p>
&lt;p>$$\frac{\partial y}{\partial x_j} \approx \frac{f(x_1, \ldots, x_j + \delta, \ldots, x_m) - f(x_1, \ldots, x_j, \ldots, x_m)}{\delta}$$&lt;/p>
&lt;p>However, this approach has a few technical challenges that make it unsuitable for training ML models:&lt;/p>
&lt;ul>
&lt;li>It&amp;rsquo;s hard to decide on the right value for $\delta$.
&lt;ul>
&lt;li>If it&amp;rsquo;s too large, the result might not be close to the gradient at all.&lt;/li>
&lt;li>If it&amp;rsquo;s too small, the result is not going to be accurate, since computers only have limited floating-point precision.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>It&amp;rsquo;s computationally heavy: estimating the full gradient takes one extra function evaluation per input dimension.&lt;/li>
&lt;/ul>
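&lt;p>As a quick illustration, here is a minimal sketch of numerical differentiation for the example function above (the function name and the choice of $\delta$ are just for demonstration):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def f(x1, x2):
    return x1 * x2 + math.log(x1)

def numerical_grad(func, x, delta=1e-6):
    # Forward-difference approximation of the gradient of func at point x.
    base = func(*x)
    grads = []
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += delta
        grads.append((func(*shifted) - base) / delta)
    return grads

# At (e, 1) the true gradient is (1 + 1/e, e).
print(numerical_grad(f, [math.e, 1.0]))
&lt;/code>&lt;/pre>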
&lt;h3 id="symbolic-differentiation">Symbolic Differentiation&lt;/h3>
&lt;p>Another solution would be to represent the model symbolically, and then derive the derivatives based on the &lt;a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule&lt;/a>.&lt;/p>
&lt;p>This is also not quite feasible, since&lt;/p>
&lt;ul>
&lt;li>The symbolic form of the derivatives quickly gets complicated (expression swell).&lt;/li>
&lt;li>It&amp;rsquo;s not straightforward to convert Python source code into a symbolic representation. This is even more challenging if the source code contains control flow logic.&lt;/li>
&lt;/ul>
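&lt;p>For intuition, here is a small sketch using the SymPy library (assuming it is installed) to differentiate the example function symbolically. This works for simple closed-form expressions, but not for arbitrary Python code with control flow:&lt;/p>
&lt;pre>&lt;code class="language-python">import sympy

x1, x2 = sympy.symbols("x1 x2")
f = x1 * x2 + sympy.log(x1)

# Differentiate symbolically; prints x2 + 1/x1 and x1.
print(sympy.diff(f, x1))
print(sympy.diff(f, x2))
&lt;/code>&lt;/pre>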
&lt;h3 id="graph-computation">Graph Computation&lt;/h3>
&lt;p>Graph computation means we represent the target function as a computation graph, and then derive the derivatives from the graph with the &lt;a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule&lt;/a>.&lt;/p>
&lt;p>For example, if $f(x_1, x_2) = x_1 x_2 + \ln(x_1)$, the computation graph looks like this:&lt;/p>
&lt;div id="anonymous">
&lt;div class="graphviz">
digraph {
rankdir="LR"
node [shape="circle"]
x_1 [label="x_1" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
x_2 [label="x_2" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
y [label="y" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_0 [label="v_0" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_1 [label="v_1" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_2 [label="v_2" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_3 [label="v_3" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_4 [label="v_4" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
op1 [label="mul" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op2 [label="ln" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op3 [label="add" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
x_1 -> v_0
x_2 -> v_1
v_0 -> op1
v_1 -> op1
op1 -> v_2
v_0 -> op2
op2 -> v_3
v_2 -> op3
v_3 -> op3
op3 -> v_4
v_4 -> y
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>In this graph, circles represent data (or tensors), and boxes represent ops.&lt;/p>
&lt;p>To calculate $J_f$, we just need to get $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.
We have two options for computing them: forward mode and reverse mode.&lt;/p>
&lt;h4 id="forward-mode">Forward Mode&lt;/h4>
&lt;p>In the forward mode, to calculate $\frac{\partial y}{\partial x_1}$, we first calculate $\frac{\partial v_0}{\partial x_1}$,
then $\frac{\partial v_1}{\partial x_1}$,
then $\frac{\partial v_2}{\partial x_1}$,
&amp;hellip;
and finally $\frac{\partial v_4}{\partial x_1}$, which is $\frac{\partial y}{\partial x_1}$.&lt;/p>
&lt;p>The following table shows the calculation at the evaluation point $x_1 = e, x_2 = 1$.&lt;/p>
&lt;div class="sw-table-container dark-scrollbar">
&lt;table class="sw-table" id="%!s(&lt;nil>)">
&lt;thead>
&lt;tr>
&lt;th>$i$&lt;/th>
&lt;th>$v_i$&lt;/th>
&lt;th>$\frac{\partial v_i}{\partial x_1}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$0$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1 (= \frac{\partial v_0}{\partial x_1}v_1 + v_0 \frac{\partial v_1}{\partial x_1})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$\frac{1}{e} (= \frac{1}{v_0} \frac{\partial v_0}{\partial x_1})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>$e + 1$&lt;/td>
&lt;td>$1 + \frac{1}{e}(= \frac{\partial v_2}{\partial x_1} + \frac{\partial v_3}{\partial x_1})$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>Similarly, we can also calculate $\frac{\partial y}{\partial x_2}$.&lt;/p>
&lt;div class="sw-table-container dark-scrollbar">
&lt;table class="sw-table" id="%!s(&lt;nil>)">
&lt;thead>
&lt;tr>
&lt;th>$i$&lt;/th>
&lt;th>$v_i$&lt;/th>
&lt;th>$\frac{\partial v_i}{\partial x_2}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$0$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$1$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$e (= \frac{\partial v_0}{\partial x_2} v_1 + v_0 \frac{\partial v_1}{\partial x_2})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$0 (= \frac{1}{v_0} 0)$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>$e + 1$&lt;/td>
&lt;td>$e (= \frac{\partial v_2}{\partial x_2} + \frac{\partial v_3}{\partial x_2})$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>Based on the above calculation, the $J_f$ at evaluation point $(e, 1)$ is:&lt;/p>
&lt;p>$$J_f(e, 1) = (1 + \frac{1}{e}, e)$$&lt;/p>
&lt;p>which matches our earlier analysis, i.e., $J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$.&lt;/p>
&lt;p>Note that in the forward mode, we walk through the graph twice: once for $x_1$ and once for $x_2$. Each full pass over the graph yields one column of $J_f$. So, for $f: R^m \rightarrow R^n$, forward mode is preferred if $n &amp;gt; m$.&lt;/p>
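&lt;p>A common way to implement the forward mode is with dual numbers: every value carries its derivative with respect to one chosen input, and each op propagates both. Below is a minimal sketch for our example (the Dual class and the ln helper are illustrative, not a full implementation):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

class Dual:
    # A value together with its derivative w.r.t. one chosen input.
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def ln(x):
    return Dual(math.log(x.value), x.deriv / x.value)

def f(x1, x2):
    return x1 * x2 + ln(x1)

# One forward pass per input: seed that input's derivative with 1.
e = math.e
print(f(Dual(e, 1.0), Dual(1.0, 0.0)).deriv)  # dy/dx1 = 1 + 1/e
print(f(Dual(e, 0.0), Dual(1.0, 1.0)).deriv)  # dy/dx2 = e
&lt;/code>&lt;/pre>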
&lt;h4 id="reverse-mode">Reverse Mode&lt;/h4>
&lt;p>In the reverse mode, we first calculate $\frac{\partial y}{\partial v_4}$, then $\frac{\partial y}{\partial v_3}$, then $\frac{\partial y}{\partial v_2}$, then $\frac{\partial y}{\partial v_1}$, then $\frac{\partial y}{\partial v_0}$.&lt;/p>
&lt;p>At evaluation point $(e, 1)$, we have&lt;/p>
&lt;div class="sw-table-container dark-scrollbar">
&lt;table class="sw-table" id="%!s(&lt;nil>)">
&lt;thead>
&lt;tr>
&lt;th>$i$&lt;/th>
&lt;th>$v_i$&lt;/th>
&lt;th>$\frac{\partial y}{\partial v_i}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1 + \frac{1}{e} (= \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_0} + \frac{\partial y}{\partial v_3} \frac{\partial v_3}{\partial v_0} = \frac{\partial y}{\partial v_2} + \frac{\partial y}{\partial v_3} \frac{1}{e})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$e (= \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_1} = \frac{\partial y}{\partial v_2} v_0 )$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1 (= \frac{\partial y}{\partial v_4} \frac{\partial v_4}{\partial v_2} = \frac{\partial y}{\partial v_4} ) $&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$1 (= \frac{\partial y}{\partial v_4} \frac{\partial v_4}{\partial v_3} = \frac{\partial y}{\partial v_4} ) $&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>$e + 1$&lt;/td>
&lt;td>$1 (= \frac{\partial y}{\partial v_4})$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>so our Jacobian Matrix is&lt;/p>
&lt;p>$$J_f(e, 1) = (1 + \frac{1}{e}, e)$$&lt;/p>
&lt;p>which also matches our earlier analysis, i.e., $J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$.&lt;/p>
&lt;p>In the reverse mode, we walk through the graph only once, and each full pass yields one row of $J_f$. So, for $f: R^m \rightarrow R^n$, reverse mode is preferred if $n &amp;lt; m$.&lt;/p>
&lt;p>In a real machine learning setting, $m$ is typically much bigger than $n$: $m$ is the number of inputs (and weights) we feed into the model, while $n$ is the dimension of the prediction. In most cases $n = 1$, which makes the reverse-mode (backward) pass far more suitable for machine learning.&lt;/p>
&lt;h2 id="implementation">Implementation&lt;/h2>
&lt;p>Based on the discussion in the &lt;a href="#solutions">Solutions&lt;/a> section, we know that reverse-mode graph computation is the most suitable approach for machine learning. Now, let&amp;rsquo;s discuss how we can implement that in Python.&lt;/p>
&lt;p>The overall idea of the autograd implementation is:&lt;/p>
&lt;ul>
&lt;li>When the grad function is called, it immediately returns a closure, which will calculate the gradients when invoked.&lt;/li>
&lt;li>The implementation of that closure consists of two steps:
&lt;ul>
&lt;li>Step 1: Trace the Python code into a computation graph.&lt;/li>
&lt;li>Step 2: Run the backward pass to calculate gradients.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Next, we&amp;rsquo;ll discuss the implementation details of this closure.&lt;/p>
&lt;h3 id="trace-the-python-code-into-computation-graph">Trace The Python Code Into Computation Graph&lt;/h3>
&lt;p>To trace the Python code into a computation graph, we run the user&amp;rsquo;s Python code through the interpreter directly. But instead of actually executing the math functions (e.g., matmul), we override them so that they return graph nodes instead. Upstream nodes&amp;rsquo; outputs become downstream nodes&amp;rsquo; inputs, so these nodes form a graph: our computation graph.&lt;/p>
&lt;p>The following diagram shows the computation graph traced from the example function.&lt;/p>
&lt;div id="tracing-graph-forward">
&lt;div class="graphviz">
digraph {
rankdir="LR"
node [shape="circle"]
x_1 [label="x_1" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
x_2 [label="x_2" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
y [label="y" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
op1 [label="mul" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op2 [label="ln" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op3 [label="add" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
x_1 -> op1
x_2 -> op1
op2 -> op3
op1 -> op3
x_1 -> op2
op3 -> y
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>Note that any control flow in the function is going to be evaluated eagerly during tracing, meaning the computation graph won&amp;rsquo;t contain the control flow at all.&lt;/p>
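&lt;p>Here is a minimal sketch of this tracing idea via operator overloading (the Node class, its fields, and the ln helper are illustrative, not a real framework API):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

class Node:
    # A traced value in the computation graph.
    def __init__(self, value, parents=(), grad_fn=None):
        self.value = value      # forward value
        self.parents = parents  # upstream nodes this value was computed from
        self.grad_fn = grad_fn  # maps the output gradient to per-parent gradients

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    lambda g: (g * other.value, g * self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other),
                    lambda g: (g, g))

def ln(x):
    return Node(math.log(x.value), (x,), lambda g: (g / x.value,))

def f(x1, x2):
    # Ordinary Python code; running it on Node inputs records the graph.
    return x1 * x2 + ln(x1)

x1, x2 = Node(math.e), Node(1.0)
y = f(x1, x2)     # y and its .parents chain form the computation graph
print(y.value)    # e + 1
&lt;/code>&lt;/pre>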
&lt;h3 id="run-the-backward-pass">Run The Backward Pass&lt;/h3>
&lt;p>To run the backward pass, we just need to walk the computation graph in reverse.&lt;/p>
&lt;div id="tracing-graph-backward">
&lt;div class="graphviz">
digraph {
rankdir="RL"
node [shape="circle"]
x_1 [label="x_1_grad" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
x_2 [label="x_2_grad" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
y [label="y_grad" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
op1 [label="mul_grad_fn" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op2 [label="ln_grad_fn" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op3 [label="add_grad_fn" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op1 -> x_1
op1 -> x_2
op3 -> op2
op3 -> op1
op2 -> x_1
y -> op3
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>Here each &lt;code>xxx_grad_fn&lt;/code> takes the gradient flowing in from downstream and returns the gradients with respect to its op&amp;rsquo;s inputs, i.e., it applies that op&amp;rsquo;s Jacobian. In other words, they are the gradient functions.&lt;/p>
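&lt;p>Continuing the Node sketch from the tracing subsection, a minimal backward pass first orders the nodes topologically and then propagates gradients from the output back to the inputs via each node&amp;rsquo;s recorded grad_fn (again a sketch, not a production implementation):&lt;/p>
&lt;pre>&lt;code class="language-python">def backward(output):
    # Collect the nodes in topological order (parents before consumers).
    order, seen = [], set()
    def toposort(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            toposort(parent)
        order.append(node)
    toposort(output)

    # Walk from the output back to the inputs, accumulating gradients.
    grads = {id(output): 1.0}
    for node in reversed(order):
        g = grads.get(id(node), 0.0)
        if node.grad_fn is None:
            continue
        for parent, parent_grad in zip(node.parents, node.grad_fn(g)):
            grads[id(parent)] = grads.get(id(parent), 0.0) + parent_grad
    return grads

grads = backward(y)
print(grads[id(x1)], grads[id(x2)])  # 1 + 1/e and e, matching the tables above
&lt;/code>&lt;/pre>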
&lt;p>The returned gradients are then used by the caller to update the weights.&lt;/p></description></item><item><title>Machine_learning: Parallelism</title><link>https://devmemo.gitlab.io/machine_learning/parallelism/</link><pubDate>Sat, 17 Jun 2023 09:15:30 -0700</pubDate><guid>https://devmemo.gitlab.io/machine_learning/parallelism/</guid><description>
&lt;p>There are many solutions for distributed training. They largely fall into two categories: synchronous training and asynchronous training.&lt;/p>
&lt;h2 id="synchronized-training">Synchronized Training&lt;/h2>
&lt;p>Synchronous training means all trainer nodes proceed in lockstep while training the model.&lt;/p>
&lt;h3 id="data-parallelism">Data Parallelism&lt;/h3>
&lt;p>Data parallelism is the simplest way to do distributed model training. The idea is straightforward: shard the data along the batch dimension, have each trainer run the forward and backward pass of the model on its own shard, and then do an all-reduce to average the gradients before updating the weights. The following diagram shows the idea.&lt;/p>
&lt;div id="data-parallelism">
&lt;div class="graphviz">
digraph {
data_shard1 [label="Data (Shard 1)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
data_shard2 [label="Data (Shard 2)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
data_shard3 [label="Data (Shard 3)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
data_shard4 [label="Data (Shard 4)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
subgraph cluster_node_1 {
color=lightgrey;
node1_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
subgraph cluster_node_2 {
color=lightgrey;
node2_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
subgraph cluster_node_3 {
color=lightgrey;
node3_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
subgraph cluster_node_4 {
color=lightgrey;
node4_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
data_shard1 -> node1_model
node1_model -> node1_loss
node1_loss -> node1_gradients
node1_gradients -> node1_avg_gradients
node1_gradients -> node2_avg_gradients [constraint=false]
node1_gradients -> node3_avg_gradients [constraint=false]
node1_gradients -> node4_avg_gradients [constraint=false]
node1_avg_gradients -> node1_weights
data_shard2 -> node2_model
node2_model -> node2_loss
node2_loss -> node2_gradients
node2_gradients -> node1_avg_gradients [constraint=false]
node2_gradients -> node2_avg_gradients
node2_gradients -> node3_avg_gradients [constraint=false]
node2_gradients -> node4_avg_gradients [constraint=false]
node2_avg_gradients -> node2_weights
data_shard3 -> node3_model
node3_model -> node3_loss
node3_loss -> node3_gradients
node3_gradients -> node1_avg_gradients [constraint=false]
node3_gradients -> node2_avg_gradients [constraint=false]
node3_gradients -> node3_avg_gradients
node3_gradients -> node4_avg_gradients [constraint=false]
node3_avg_gradients -> node3_weights
data_shard4 -> node4_model
node4_model -> node4_loss
node4_loss -> node4_gradients
node4_gradients -> node1_avg_gradients [constraint=false]
node4_gradients -> node2_avg_gradients [constraint=false]
node4_gradients -> node3_avg_gradients [constraint=false]
node4_gradients -> node4_avg_gradients
node4_avg_gradients -> node4_weights
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>The implementation requires us to deploy the same code onto multiple nodes; this programming model is called SPMD (Single Program, Multiple Data). The model itself is unchanged. The only thing that changes is the training loop, which gains a few extra steps, e.g., an all-reduce to compute the average gradients.&lt;/p>
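&lt;p>As a rough sketch, a data-parallel training step could look like the following, using PyTorch&amp;rsquo;s torch.distributed primitives for the all-reduce (model, optimizer, loss_fn, and the input shard are placeholders; in practice wrappers like DistributedDataParallel do this gradient averaging for you):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, labels):
    # Every node runs this same function on its own shard of the batch (SPMD).
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()

    # All-reduce the local gradients so every node ends up with the average,
    # then apply the identical update so the weights stay in sync.
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    optimizer.step()
    return loss
&lt;/code>&lt;/pre>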
&lt;h3 id="model-parallelism">Model Parallelism&lt;/h3>
&lt;p>Data parallelism works in most cases. However, if you have a really large model that doesn&amp;rsquo;t fit in one machine&amp;rsquo;s memory, you&amp;rsquo;ll have to split it across multiple nodes. In this case, we need model parallelism.&lt;/p>
&lt;p>There are two solutions to the model parallelism.&lt;/p>
&lt;h4 id="tensor-parallelism">Tensor Parallelism&lt;/h4>
&lt;p>Tensor parallelism is a very elegant solution: it doesn&amp;rsquo;t require you to change the code of your model. All you need to do is define the sharding strategy for your tensors.&lt;/p>
&lt;p>The idea was proposed in the Google research paper &lt;a href="https://arxiv.org/abs/2105.04663">GSPMD&lt;/a>. It runs your model assuming your tensors are sharded in a certain way, and the layouts of the tensors are propagated along your model&amp;rsquo;s computation graph.&lt;/p>
&lt;p>Data parallelism is basically a special case of tensor parallelism, i.e., sharding on the batch dimension.&lt;/p>
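&lt;p>To make the idea concrete, here is a toy NumPy sketch of sharding a single matmul column-wise across two workers (systems like GSPMD pick and propagate such layouts automatically across real devices; here the workers are just array slices on one machine):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # activations: (batch, d_in)
w = rng.normal(size=(16, 32))  # weights:     (d_in, d_out)

# Shard the weight matrix column-wise across two workers.
w_shards = np.split(w, 2, axis=1)

# Each worker computes its slice of the output using only its weight shard...
partial_outputs = [x @ w_shard for w_shard in w_shards]

# ...and the full output is recovered by concatenating along d_out
# (in a real system this is a collective communication step).
y_sharded = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_sharded, x @ w)
&lt;/code>&lt;/pre>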
&lt;h4 id="model-pipeline">Model Pipeline&lt;/h4>
&lt;p>Another way to do model parallelism is to place different parts of your model graph on different nodes. This requires model developers to change the code of the model, which is more complicated from the model engineers&amp;rsquo; perspective.&lt;/p>
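&lt;p>As a minimal illustration of the pipeline idea in plain Python, the sketch below splits a tiny two-layer model into two stages that would live on different nodes and feeds micro-batches through them (the names, shapes, and stage split are made up for the example; everything runs sequentially in one process):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 64))  # stage 1 weights, placed on node 1
w2 = rng.normal(size=(64, 4))   # stage 2 weights, placed on node 2

def stage1(x):
    # First half of the model, running on node 1.
    return np.maximum(x @ w1, 0.0)

def stage2(h):
    # Second half of the model, running on node 2.
    return h @ w2

def pipeline_forward(batch, num_micro_batches=4):
    # Split the batch into micro-batches; on real hardware node 1 can start
    # micro-batch i+1 while node 2 is still busy with micro-batch i.
    outputs = []
    for micro_batch in np.split(batch, num_micro_batches):
        hidden = stage1(micro_batch)       # computed on node 1
        outputs.append(stage2(hidden))     # sent to and computed on node 2
    return np.concatenate(outputs)

print(pipeline_forward(rng.normal(size=(8, 16))).shape)  # (8, 4)
&lt;/code>&lt;/pre>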
&lt;h2 id="asynchronized-training">ASynchronized Training&lt;/h2>
&lt;p>Asynchronous training is another way to train a model. In asynchronous training, each worker trains the model without synchronizing with the other worker nodes.&lt;/p>
&lt;h3 id="parameter-server">Parameter Server&lt;/h3>
&lt;p>Parameter server is an asynchronous training strategy. It was first introduced by &lt;a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf">this paper&lt;/a>. The idea is to place the variables (weights) on parameter servers; each worker pulls the variables, runs the forward and backward pass, and sends the resulting updates back to the parameter servers.&lt;/p></description></item></channel></rss>