<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>DevMemo – ai</title><link>https://devmemo.gitlab.io/tags/ai/</link><description>Recent content in ai on DevMemo</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 17 Jun 2023 09:15:30 -0700</lastBuildDate><atom:link href="https://devmemo.gitlab.io/tags/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Machine_learning: Autograd</title><link>https://devmemo.gitlab.io/machine_learning/autograd/</link><pubDate>Mon, 29 May 2023 00:14:02 -0700</pubDate><guid>https://devmemo.gitlab.io/machine_learning/autograd/</guid><description>
&lt;p>Recall from the note on &lt;a href="https://devmemo.gitlab.io/machine_learning/fundamentals">fundamentals&lt;/a> that the machine learning model is&lt;/p>
&lt;p>$$y = f_{model}(x_1, x_2, x_3, \ldots, x_m, w_1, w_2, w_3, \ldots, w_k)$$&lt;/p>
&lt;p>and we would like to minimize the loss&lt;/p>
&lt;p>$$L = loss(y, \tilde{y})$$&lt;/p>
&lt;p>In order to do that, we&amp;rsquo;ll need to calculate $\frac{\partial L}{\partial w_i}$ for each $i$ from $1$ to $k$. In that note we did it manually; in this note we discuss how to calculate the gradients automatically.&lt;/p>
&lt;h2 id="problem-definition">Problem Definition&lt;/h2>
&lt;p>Although we could build autograd to tackle the above problem directly, we would like to make it more general.&lt;/p>
&lt;p>More generally, the goal of autograd is, &lt;strong>given a function implemented in Python (or any other programming language), return its gradient function in the same programming language&lt;/strong>.&lt;/p>
&lt;p>To state this mathematically, suppose the given function is&lt;/p>
&lt;p>$$f(x_1, x_2, x_3, \ldots, x_m) = (y_1, y_2, y_3, \ldots, y_n)$$&lt;/p>
&lt;p>our goal is to calculate $\frac{\partial y_j}{\partial x_i}$ for each $(i, j)$ pair. In other words, the &lt;strong>Jacobian Matrix&lt;/strong>.&lt;/p>
&lt;p>$$
J_f =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} &amp;amp; \cdots &amp;amp; \frac{\partial y_1}{\partial x_m} \\
\vdots &amp;amp; \ddots &amp;amp; \vdots \\
\frac{\partial y_n}{\partial x_1} &amp;amp; \cdots &amp;amp; \frac{\partial y_n}{\partial x_m} \\
\end{pmatrix}
$$&lt;/p>
&lt;p>For example, if $f$ is&lt;/p>
&lt;p>$$f(x_1, x_2) = x_1 x_2 + \ln(x_1)$$&lt;/p>
&lt;p>autograd should return&lt;/p>
&lt;p>$$J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$$&lt;/p>
&lt;p>Note that in this general form, $x_i$ can either be an input (e.g., $x_i$ in $f_{model}$) or a weight (e.g., $w_i$ in $f_{model}$).&lt;/p>
&lt;h2 id="solutions">Solutions&lt;/h2>
&lt;p>There are many potential solutions for autograd.&lt;/p>
&lt;h3 id="numerical-differentiation">Numerical Differentiation&lt;/h3>
&lt;p>The simplest approach is numerical differentiation, i.e.,&lt;/p>
&lt;p>$$\frac{\partial y}{\partial x_j} \approx \frac{f(x_1, \ldots, x_j + \delta, \ldots, x_m) - f(x_1, \ldots, x_j, \ldots, x_m)}{\delta}$$&lt;/p>
&lt;p>However, this approach has a few technical challenges that make it unsuitable for training ML models:&lt;/p>
&lt;ul>
&lt;li>It&amp;rsquo;s hard to decide on the right value for $\delta$.
&lt;ul>
&lt;li>If it&amp;rsquo;s too large, the result might not be close to the gradient at all.&lt;/li>
&lt;li>If it&amp;rsquo;s too small, the result is not going to be accurate, since computers only have limited floating-point precision.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>It&amp;rsquo;s computationally heavy: estimating the full gradient takes one extra function evaluation per input dimension.&lt;/li>
&lt;/ul>
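&lt;p>As a quick illustration, here is a minimal sketch of numerical differentiation for the example function above (the function name and the choice of $\delta$ are just for demonstration):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def f(x1, x2):
    return x1 * x2 + math.log(x1)

def numerical_grad(func, x, delta=1e-6):
    # Forward-difference approximation of the gradient of func at point x.
    base = func(*x)
    grads = []
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += delta
        grads.append((func(*shifted) - base) / delta)
    return grads

# At (e, 1) the true gradient is (1 + 1/e, e).
print(numerical_grad(f, [math.e, 1.0]))
&lt;/code>&lt;/pre>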
&lt;h3 id="symbolic-differentiation">Symbolic Differentiation&lt;/h3>
&lt;p>Another solution would be to represent the model symbolically, and then derive the derivatives based on the &lt;a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule&lt;/a>.&lt;/p>
&lt;p>This is also not quite feasible, since&lt;/p>
&lt;ul>
&lt;li>The symbolic form of the derivatives quickly gets complicated (expression swell).&lt;/li>
&lt;li>It&amp;rsquo;s not straightforward to convert Python source code into a symbolic representation. This is even more challenging if the source code contains control flow logic.&lt;/li>
&lt;/ul>
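&lt;p>For intuition, here is a small sketch using the SymPy library (assuming it is installed) to differentiate the example function symbolically. This works for simple closed-form expressions, but not for arbitrary Python code with control flow:&lt;/p>
&lt;pre>&lt;code class="language-python">import sympy

x1, x2 = sympy.symbols("x1 x2")
f = x1 * x2 + sympy.log(x1)

# Differentiate symbolically; prints x2 + 1/x1 and x1.
print(sympy.diff(f, x1))
print(sympy.diff(f, x2))
&lt;/code>&lt;/pre>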
&lt;h3 id="graph-computation">Graph Computation&lt;/h3>
&lt;p>Graph computation means we represent the target function as a computation graph, and then derive the derivatives from the graph with the &lt;a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule&lt;/a>.&lt;/p>
&lt;p>For example, if $f(x_1, x_2) = x_1 x_2 + \ln(x_1)$, the computation graph looks like this:&lt;/p>
&lt;div id="anonymous">
&lt;div class="graphviz">
digraph {
rankdir="LR"
node [shape="circle"]
x_1 [label="x_1" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
x_2 [label="x_2" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
y [label="y" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_0 [label="v_0" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_1 [label="v_1" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_2 [label="v_2" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_3 [label="v_3" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
v_4 [label="v_4" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
op1 [label="mul" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op2 [label="ln" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op3 [label="add" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
x_1 -> v_0
x_2 -> v_1
v_0 -> op1
v_1 -> op1
op1 -> v_2
v_0 -> op2
op2 -> v_3
v_2 -> op3
v_3 -> op3
op3 -> v_4
v_4 -> y
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>In this graph, circles represent data (or tensors), and boxes represent ops.&lt;/p>
&lt;p>To calculate $J_f$, we just need to get $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.
We have two options for computing them: forward mode and reverse mode.&lt;/p>
&lt;h4 id="forward-mode">Forward Mode&lt;/h4>
&lt;p>In the forward mode, to calculate $\frac{\partial y}{\partial x_1}$, we first calculate $\frac{\partial v_0}{\partial x_1}$,
then $\frac{\partial v_1}{\partial x_1}$,
then $\frac{\partial v_2}{\partial x_1}$,
&amp;hellip;
and finally $\frac{\partial v_4}{\partial x_1}$, which is $\frac{\partial y}{\partial x_1}$.&lt;/p>
&lt;p>The following table shows the calculation at the evaluation point $x_1 = e, x_2 = 1$.&lt;/p>
&lt;div class="sw-table-container dark-scrollbar">
&lt;table class="sw-table" id="%!s(&lt;nil>)">
&lt;thead>
&lt;tr>
&lt;th>$i$&lt;/th>
&lt;th>$v_i$&lt;/th>
&lt;th>$\frac{\partial v_i}{\partial x_1}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$0$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1 (= \frac{\partial v_0}{\partial x_1}v_1 + v_0 \frac{\partial v_1}{\partial x_1})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$\frac{1}{e} (= \frac{1}{v_0} \frac{\partial v_0}{\partial x_1})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>$e + 1$&lt;/td>
&lt;td>$1 + \frac{1}{e}(= \frac{\partial v_2}{\partial x_1} + \frac{\partial v_3}{\partial x_1})$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>Similarly, we can also calculate $\frac{\partial y}{\partial x_2}$.&lt;/p>
&lt;div class="sw-table-container dark-scrollbar">
&lt;table class="sw-table" id="%!s(&lt;nil>)">
&lt;thead>
&lt;tr>
&lt;th>$i$&lt;/th>
&lt;th>$v_i$&lt;/th>
&lt;th>$\frac{\partial v_i}{\partial x_2}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$0$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$1$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$e (= \frac{\partial v_0}{\partial x_2} v_1 + v_0 \frac{\partial v_1}{\partial x_2})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$0 (= \frac{1}{v_0} 0)$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>$e + 1$&lt;/td>
&lt;td>$e (= \frac{\partial v_2}{\partial x_2} + \frac{\partial v_3}{\partial x_2})$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>Based on the above calculation, the $J_f$ at evaluation point $(e, 1)$ is:&lt;/p>
&lt;p>$$J_f(e, 1) = (1 + \frac{1}{e}, e)$$&lt;/p>
&lt;p>which matches our earlier analysis, i.e., $J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$.&lt;/p>
&lt;p>Note that in the forward mode, we walk through the graph twice: once for $x_1$ and once for $x_2$. Each full pass over the graph yields one column of $J_f$. So, for $f: R^m \rightarrow R^n$, forward mode is preferred if $n &amp;gt; m$.&lt;/p>
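&lt;p>A common way to implement the forward mode is with dual numbers: every value carries its derivative with respect to one chosen input, and each op propagates both. Below is a minimal sketch for our example (the Dual class and the ln helper are illustrative, not a full implementation):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

class Dual:
    # A value together with its derivative w.r.t. one chosen input.
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def ln(x):
    return Dual(math.log(x.value), x.deriv / x.value)

def f(x1, x2):
    return x1 * x2 + ln(x1)

# One forward pass per input: seed that input's derivative with 1.
e = math.e
print(f(Dual(e, 1.0), Dual(1.0, 0.0)).deriv)  # dy/dx1 = 1 + 1/e
print(f(Dual(e, 0.0), Dual(1.0, 1.0)).deriv)  # dy/dx2 = e
&lt;/code>&lt;/pre>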
&lt;h4 id="reverse-mode">Reverse Mode&lt;/h4>
&lt;p>In the reverse mode, we first calculate $\frac{\partial y}{\partial v_4}$, then $\frac{\partial y}{\partial v_3}$, then $\frac{\partial y}{\partial v_2}$, then $\frac{\partial y}{\partial v_1}$, then $\frac{\partial y}{\partial v_0}$.&lt;/p>
&lt;p>At evaluation point $(e, 1)$, we have&lt;/p>
&lt;div class="sw-table-container dark-scrollbar">
&lt;table class="sw-table" id="%!s(&lt;nil>)">
&lt;thead>
&lt;tr>
&lt;th>$i$&lt;/th>
&lt;th>$v_i$&lt;/th>
&lt;th>$\frac{\partial y}{\partial v_i}$&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1 + \frac{1}{e} (= \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_0} + \frac{\partial y}{\partial v_3} \frac{\partial v_3}{\partial v_0} = \frac{\partial y}{\partial v_2} + \frac{\partial y}{\partial v_3} \frac{1}{e})$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$e (= \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_1} = \frac{\partial y}{\partial v_2} v_0 )$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>$e$&lt;/td>
&lt;td>$1 (= \frac{\partial y}{\partial v_4} \frac{\partial v_4}{\partial v_2} = \frac{\partial y}{\partial v_4} ) $&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>$1$&lt;/td>
&lt;td>$1 (= \frac{\partial y}{\partial v_4} \frac{\partial v_4}{\partial v_3} = \frac{\partial y}{\partial v_4} ) $&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>$e + 1$&lt;/td>
&lt;td>$1 (= \frac{\partial y}{\partial v_4})$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;p>so our Jacobian Matrix is&lt;/p>
&lt;p>$$J_f(e, 1) = (1 + \frac{1}{e}, e)$$&lt;/p>
&lt;p>which also matches our earlier analysis, i.e., $J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$.&lt;/p>
&lt;p>In the reverse mode, we walk through the graph only once, and each full pass yields one row of $J_f$. So, for $f: R^m \rightarrow R^n$, reverse mode is preferred if $n &amp;lt; m$.&lt;/p>
&lt;p>In a real machine learning setting, $m$ is typically much bigger than $n$: $m$ is the number of inputs (and weights) we feed into the model, while $n$ is the dimension of the prediction. In most cases $n = 1$, which makes the reverse-mode (backward) pass far more suitable for machine learning.&lt;/p>
&lt;h2 id="implementation">Implementation&lt;/h2>
&lt;p>Based on the discussion in the &lt;a href="#solutions">Solutions&lt;/a> section, we know that reverse-mode graph computation is the most suitable approach for machine learning. Now, let&amp;rsquo;s discuss how we can implement that in Python.&lt;/p>
&lt;p>The overall idea of the autograd implementation is:&lt;/p>
&lt;ul>
&lt;li>When the grad function is called, it immediately returns a closure, which will calculate the gradients when invoked.&lt;/li>
&lt;li>The implementation of that closure consists of two steps:
&lt;ul>
&lt;li>Step 1: Trace the Python code into a computation graph.&lt;/li>
&lt;li>Step 2: Run the backward pass to calculate gradients.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Next, we&amp;rsquo;ll discuss the implementation details of this closure.&lt;/p>
&lt;h3 id="trace-the-python-code-into-computation-graph">Trace The Python Code Into Computation Graph&lt;/h3>
&lt;p>To trace the Python code into a computation graph, we run the user&amp;rsquo;s Python code through the interpreter directly. But instead of actually executing the math functions (e.g., matmul), we override them so that they return graph nodes instead. Upstream nodes&amp;rsquo; outputs become downstream nodes&amp;rsquo; inputs, so these nodes form a graph: our computation graph.&lt;/p>
&lt;p>The following diagram shows the computation graph traced from the example function.&lt;/p>
&lt;div id="tracing-graph-forward">
&lt;div class="graphviz">
digraph {
rankdir="LR"
node [shape="circle"]
x_1 [label="x_1" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
x_2 [label="x_2" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
y [label="y" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
op1 [label="mul" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op2 [label="ln" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op3 [label="add" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
x_1 -> op1
x_2 -> op1
op2 -> op3
op1 -> op3
x_1 -> op2
op3 -> y
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>Note that any control flow in the function is going to be evaluated eagerly during tracing, meaning the computation graph won&amp;rsquo;t contain the control flow at all.&lt;/p>
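&lt;p>Here is a minimal sketch of this tracing idea via operator overloading (the Node class, its fields, and the ln helper are illustrative, not a real framework API):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

class Node:
    # A traced value in the computation graph.
    def __init__(self, value, parents=(), grad_fn=None):
        self.value = value      # forward value
        self.parents = parents  # upstream nodes this value was computed from
        self.grad_fn = grad_fn  # maps the output gradient to per-parent gradients

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    lambda g: (g * other.value, g * self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other),
                    lambda g: (g, g))

def ln(x):
    return Node(math.log(x.value), (x,), lambda g: (g / x.value,))

def f(x1, x2):
    # Ordinary Python code; running it on Node inputs records the graph.
    return x1 * x2 + ln(x1)

x1, x2 = Node(math.e), Node(1.0)
y = f(x1, x2)     # y and its .parents chain form the computation graph
print(y.value)    # e + 1
&lt;/code>&lt;/pre>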
&lt;h3 id="run-the-backward-pass">Run The Backward Pass&lt;/h3>
&lt;p>To run the backward pass, we just need to walk the computation graph in reverse.&lt;/p>
&lt;div id="tracing-graph-backward">
&lt;div class="graphviz">
digraph {
rankdir="RL"
node [shape="circle"]
x_1 [label="x_1_grad" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
x_2 [label="x_2_grad" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
y [label="y_grad" shape="circle" style="filled" fillcolor="0.3 0.5 0.9"]
op1 [label="mul_grad_fn" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op2 [label="ln_grad_fn" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op3 [label="add_grad_fn" shape="box" style="filled" fillcolor="0.1 0.5 0.9"]
op1 -> x_1
op1 -> x_2
op3 -> op2
op3 -> op1
op2 -> x_1
y -> op3
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>Here each &lt;code>xxx_grad_fn&lt;/code> takes the gradient flowing in from downstream and returns the gradients with respect to its op&amp;rsquo;s inputs, i.e., it applies that op&amp;rsquo;s Jacobian. In other words, they are the gradient functions.&lt;/p>
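&lt;p>Continuing the Node sketch from the tracing subsection, a minimal backward pass first orders the nodes topologically and then propagates gradients from the output back to the inputs via each node&amp;rsquo;s recorded grad_fn (again a sketch, not a production implementation):&lt;/p>
&lt;pre>&lt;code class="language-python">def backward(output):
    # Collect the nodes in topological order (parents before consumers).
    order, seen = [], set()
    def toposort(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            toposort(parent)
        order.append(node)
    toposort(output)

    # Walk from the output back to the inputs, accumulating gradients.
    grads = {id(output): 1.0}
    for node in reversed(order):
        g = grads.get(id(node), 0.0)
        if node.grad_fn is None:
            continue
        for parent, parent_grad in zip(node.parents, node.grad_fn(g)):
            grads[id(parent)] = grads.get(id(parent), 0.0) + parent_grad
    return grads

grads = backward(y)
print(grads[id(x1)], grads[id(x2)])  # 1 + 1/e and e, matching the tables above
&lt;/code>&lt;/pre>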
&lt;p>The returned gradients are then used by the caller to update the weights.&lt;/p></description></item><item><title>Machine_learning: Parallelism</title><link>https://devmemo.gitlab.io/machine_learning/parallelism/</link><pubDate>Sat, 17 Jun 2023 09:15:30 -0700</pubDate><guid>https://devmemo.gitlab.io/machine_learning/parallelism/</guid><description>
&lt;p>There are many solutions for distributed training. They largely fall into two categories: synchronous training and asynchronous training.&lt;/p>
&lt;h2 id="synchronized-training">Synchronized Training&lt;/h2>
&lt;p>Synchronous training means all trainer nodes proceed in lockstep while training the model.&lt;/p>
&lt;h3 id="data-parallelism">Data Parallelism&lt;/h3>
&lt;p>Data parallelism is the simplest way to do distributed model training. The idea is straightforward: shard the data along the batch dimension, have each trainer run the forward and backward pass of the model on its own shard, and then do an all-reduce to average the gradients before updating the weights. The following diagram shows the idea.&lt;/p>
&lt;div id="data-parallelism">
&lt;div class="graphviz">
digraph {
data_shard1 [label="Data (Shard 1)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
data_shard2 [label="Data (Shard 2)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
data_shard3 [label="Data (Shard 3)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
data_shard4 [label="Data (Shard 4)" shape=box style="filled" fillcolor="0.1 0.5 0.9"]
subgraph cluster_node_1 {
color=lightgrey;
node1_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node1_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
subgraph cluster_node_2 {
color=lightgrey;
node2_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node2_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
subgraph cluster_node_3 {
color=lightgrey;
node3_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node3_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
subgraph cluster_node_4 {
color=lightgrey;
node4_model [label="Model" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_loss [label="Loss" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_gradients [label="Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_avg_gradients [label="Avg Gradients" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
node4_weights [label="Weights" shape=box style="filled" fillcolor="0.3 0.5 0.9"]
}
data_shard1 -> node1_model
node1_model -> node1_loss
node1_loss -> node1_gradients
node1_gradients -> node1_avg_gradients
node1_gradients -> node2_avg_gradients [constraint=false]
node1_gradients -> node3_avg_gradients [constraint=false]
node1_gradients -> node4_avg_gradients [constraint=false]
node1_avg_gradients -> node1_weights
data_shard2 -> node2_model
node2_model -> node2_loss
node2_loss -> node2_gradients
node2_gradients -> node1_avg_gradients [constraint=false]
node2_gradients -> node2_avg_gradients
node2_gradients -> node3_avg_gradients [constraint=false]
node2_gradients -> node4_avg_gradients [constraint=false]
node2_avg_gradients -> node2_weights
data_shard3 -> node3_model
node3_model -> node3_loss
node3_loss -> node3_gradients
node3_gradients -> node1_avg_gradients [constraint=false]
node3_gradients -> node2_avg_gradients [constraint=false]
node3_gradients -> node3_avg_gradients
node3_gradients -> node4_avg_gradients [constraint=false]
node3_avg_gradients -> node3_weights
data_shard4 -> node4_model
node4_model -> node4_loss
node4_loss -> node4_gradients
node4_gradients -> node1_avg_gradients [constraint=false]
node4_gradients -> node2_avg_gradients [constraint=false]
node4_gradients -> node3_avg_gradients [constraint=false]
node4_gradients -> node4_avg_gradients
node4_avg_gradients -> node4_weights
}
&lt;/div>
&lt;/div>
&lt;script type="module">
import { Graphviz } from "https://cdn.jsdelivr.net/npm/@hpcc-js/wasm/dist/index.js";
const hpccWasm = await Graphviz.load();
document.querySelectorAll('.graphviz').forEach(div => {
const dot = div.textContent.trim();
div.innerHTML = hpccWasm.layout(dot, "svg", "dot");
});
&lt;/script>
&lt;p>The implementation requires us to deploy the same code onto multiple nodes; this programming model is called SPMD (Single Program, Multiple Data). The model itself is unchanged. The only thing that changes is the training loop, which gains a few extra steps, e.g., an all-reduce to compute the average gradients.&lt;/p>
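&lt;p>As a rough sketch, a data-parallel training step could look like the following, using PyTorch&amp;rsquo;s torch.distributed primitives for the all-reduce (model, optimizer, loss_fn, and the input shard are placeholders; in practice wrappers like DistributedDataParallel do this gradient averaging for you):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, labels):
    # Every node runs this same function on its own shard of the batch (SPMD).
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()

    # All-reduce the local gradients so every node ends up with the average,
    # then apply the identical update so the weights stay in sync.
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    optimizer.step()
    return loss
&lt;/code>&lt;/pre>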
&lt;h3 id="model-parallelism">Model Parallelism&lt;/h3>
&lt;p>Data parallelism works in most cases. However, if you have a really large model that doesn&amp;rsquo;t fit in one machine&amp;rsquo;s memory, you&amp;rsquo;ll have to split it across multiple nodes. In this case, we need model parallelism.&lt;/p>
&lt;p>There are two solutions to the model parallelism.&lt;/p>
&lt;h4 id="tensor-parallelism">Tensor Parallelism&lt;/h4>
&lt;p>Tensor parallelism is a very elegant solution: it doesn&amp;rsquo;t require you to change the code of your model. All you need to do is define the sharding strategy for your tensors.&lt;/p>
&lt;p>The idea was proposed in the Google research paper &lt;a href="https://arxiv.org/abs/2105.04663">GSPMD&lt;/a>. It runs your model assuming your tensors are sharded in a certain way, and the layouts of the tensors are propagated along your model&amp;rsquo;s computation graph.&lt;/p>
&lt;p>Data parallelism is basically a special case of tensor parallelism, i.e., sharding on the batch dimension.&lt;/p>
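&lt;p>To make the idea concrete, here is a toy NumPy sketch of sharding a single matmul column-wise across two workers (systems like GSPMD pick and propagate such layouts automatically across real devices; here the workers are just array slices on one machine):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # activations: (batch, d_in)
w = rng.normal(size=(16, 32))  # weights:     (d_in, d_out)

# Shard the weight matrix column-wise across two workers.
w_shards = np.split(w, 2, axis=1)

# Each worker computes its slice of the output using only its weight shard...
partial_outputs = [x @ w_shard for w_shard in w_shards]

# ...and the full output is recovered by concatenating along d_out
# (in a real system this is a collective communication step).
y_sharded = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_sharded, x @ w)
&lt;/code>&lt;/pre>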
&lt;h4 id="model-pipeline">Model Pipeline&lt;/h4>
&lt;p>Another way to do model parallelism is to place different parts of your model graph on different nodes. This requires model developers to change the code of the model, which is more complicated from the model engineers&amp;rsquo; perspective.&lt;/p>
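&lt;p>As a minimal illustration of the pipeline idea in plain Python, the sketch below splits a tiny two-layer model into two stages that would live on different nodes and feeds micro-batches through them (the names, shapes, and stage split are made up for the example; everything runs sequentially in one process):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 64))  # stage 1 weights, placed on node 1
w2 = rng.normal(size=(64, 4))   # stage 2 weights, placed on node 2

def stage1(x):
    # First half of the model, running on node 1.
    return np.maximum(x @ w1, 0.0)

def stage2(h):
    # Second half of the model, running on node 2.
    return h @ w2

def pipeline_forward(batch, num_micro_batches=4):
    # Split the batch into micro-batches; on real hardware node 1 can start
    # micro-batch i+1 while node 2 is still busy with micro-batch i.
    outputs = []
    for micro_batch in np.split(batch, num_micro_batches):
        hidden = stage1(micro_batch)       # computed on node 1
        outputs.append(stage2(hidden))     # sent to and computed on node 2
    return np.concatenate(outputs)

print(pipeline_forward(rng.normal(size=(8, 16))).shape)  # (8, 4)
&lt;/code>&lt;/pre>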
&lt;h2 id="asynchronized-training">ASynchronized Training&lt;/h2>
&lt;p>Asynchronous training is another way to train a model. In asynchronous training, each worker trains the model without synchronizing with the other worker nodes.&lt;/p>
&lt;h3 id="parameter-server">Parameter Server&lt;/h3>
&lt;p>Parameter server is an asynchronous training strategy. It was first introduced by &lt;a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf">this paper&lt;/a>. The idea is to place the variables (weights) on parameter servers; each worker pulls the variables, runs the forward and backward pass, and sends the resulting updates back to the parameter servers.&lt;/p></description></item></channel></rss>