Below are some notes I wrote down while developing the tiny_pytorch library.
- DL frameworks:
- Caffe (Only C++):
- Define a computation in terms of layers
- Each layer implements forward and backward methods
- The changes are made in place
- The backward-pass implementation follows naturally and intuitively from Hinton's paper
- TensorFlow:
- Constructs a static graph that has to be built before it can be executed
- Makes it easy to apply optimizations such as fusing operations, reusing memory, and executing only what is needed at run time
- Hard to debug and experiment with the output of each step
- Not intuitive; the graph-building API is effectively its own programming language
- PyTorch:
- Constructs a dynamic computation graph ("define-by-run")
- Very easy to debug and experiment with the computation graph
- Can mix Python control flow and computations (see the sketch after this list)
- Typically slower; newer advancements allow JIT compilation to speed up execution
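A minimal define-by-run sketch (assuming PyTorch is installed): the graph is built as the Python code runs, so ordinary control flow such as loops and data-dependent branches can drive the computation.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x
for _ in range(2):          # a plain Python loop becomes part of the graph
    if y.sum() > 0:         # data-dependent branching is allowed
        y = y * 2
    else:
        y = y + 1
y.sum().backward()          # gradients flow through whichever path actually executed
print(x.grad)
```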
- Initialization:
- The effect of weights initialization persists throughout training
- It affects the relative norms of the activations and the gradients over time
- Weights don’t change much from their initial values after training
- The weights may change more in some directions/dimensions, but overall they don't change much relative to their initial values, especially for deep NNs
- Proper weight initialization speeds up training and converges to much lower error rates
- We can judge the effect of initialization by computing the norm of both the weights and the gradients at every layer over all iterations
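A small sketch of how one might track per-layer weight and gradient norms at every iteration (the helper name and usage are mine, assuming a PyTorch model):

```python
import torch

def log_norms(model, history):
    # record (weight norm, gradient norm) for every parameter at this iteration
    for name, p in model.named_parameters():
        grad_norm = p.grad.norm().item() if p.grad is not None else 0.0
        history.setdefault(name, []).append((p.detach().norm().item(), grad_norm))
```

Call it right after loss.backward() each iteration and plot the recorded curves per layer.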
- Layer Normalization:
- It fixes the problem of varying norms of layer activations and also solves the single-sample-per-batch problem that batch normalization has
- The main drawback is that it is hard to drive the network to a low loss, especially for fully connected networks: layer norm forces each example's activations to mean 0 and standard deviation 1, so the differences between activations of different examples are lost, even though those relative norms are an important feature for discriminating between classes
- Batch Normalization:
- Helps the network train much faster by normalizing each feature across all examples in the batch while maintaining different relative norms between activations within a layer (a small numpy sketch of both normalizations follows)
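A minimal numpy sketch contrasting the two (only to show the normalization axes): batch norm normalizes each feature across the batch, while layer norm normalizes each example across its features, so it still works with a single sample in the batch.

```python
import numpy as np

def batch_norm(x, eps=1e-5):               # x: (batch, features)
    mean = x.mean(axis=0, keepdims=True)   # statistics across the batch, per feature
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=1, keepdims=True)   # statistics across features, per example
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```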
- Regularization:
- The premise is that the magnitude of the parameters is a good proxy for complexity. Smoother functions don't change dramatically for small changes in their inputs and thus have smaller weights. As a result, we can limit the magnitude of the parameters by adding a penalty to the loss function, such as L2 regularization, so the weights stay small.
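A minimal sketch of an L2 penalty (the regularization strength lam is an arbitrary example value):

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    # penalize the squared magnitude of every weight array
    return lam * sum((w ** 2).sum() for w in weights)

# total_loss = data_loss + l2_penalty(list_of_weight_arrays)
```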
- Hardware acceleration:
- We need to pay attention to memory alignment: hardware loads data into caches one cache line (typically 64 bytes) at a time, so aligned data can be fetched in a single load; otherwise more than one load is needed.
- Most BLAS libraries are implemented in Fortran, which uses column-major format to store ND-arrays.
- Stride format is more general and we can use it to get any format including row-major or column-major format.
- Example: With 2d array A:
- row-major: stride[0] = A.shape[1], stride[1] = 1
- column-major: stride[0] = 1, stride[1] = A.shape[0]
- Advantages:
- We can have slices that use the same underlying memory but with a different offset to begin the slice, as well as a different shape and strides. The same idea works for transpose by swapping the strides and changing the shape. Finally, broadcast can be done by inserting a stride equal to 0 (see the numpy sketch below).
- Disadvantages:
- Memory access may no longer be contiguous, which makes vectorization hard. Some linear algebra libraries require arrays to be compact.
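A small numpy sketch of these stride tricks (note that numpy strides are counted in bytes, whereas the strides above are counted in elements):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

A = np.arange(6, dtype=np.float64).reshape(2, 3)   # row-major, strides = (24, 8) bytes
# transpose: swap the strides (and the shape); no data is copied
transposed = as_strided(A, shape=(3, 2), strides=A.strides[::-1])
# broadcast: a stride of 0 repeats the same row without copying it
row = np.arange(3, dtype=np.float64)
broadcast = as_strided(row, shape=(4, 3), strides=(0, row.strides[0]))
assert np.array_equal(transposed, A.T)
assert np.array_equal(broadcast, np.tile(row, (4, 1)))
```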
- Parallelization: We can use OpenMP to parallelize an operation by executing it with multiple threads on different cores in parallel. Example: put #pragma omp parallel for before the loop.
- Hardware acceleration implementation:
- A CPU/GPU (or any other device) lets us allocate a flat array, which is a consecutive block of memory. Therefore, we need a shape, strides, and an offset to create ND-array views of the flat array
- All operations create a new output array to store the results, regardless of the backend device
- Changing the device of an array involves copying the data from one device to the other. In needle, we use numpy as a bridge: we first create a numpy array from the input array and then create the final array on the target device
- With shape, stride, and offset: we can create different views of the same array that all share the same underlying memory. The output array is just a different NDArray data structure with the same underlying array.
- Slice
- Transpose
- Broadcast
- When we do slicing, the resulting view may not be contiguous (compact). This means we can't pass the raw array handle to the backend, because the operation would be applied to all elements of the original array. Therefore, we first need to check whether the input array is compact before doing the operation.
- The array is compact if offset = 0 and the strides correspond to row-major order for its shape. Otherwise, we need to make it compact by creating a new array with the sliced data (a copy in new memory slots). A small check is sketched below.
- Some frameworks such as NumPy and PyTorch allow users to do operations on non-contiguous arrays and handle that in their implementations without creating a new contiguous array. However, in some cases they may still require a contiguous array, e.g. for matmul operations, to allow faster access to the elements and speed up the operation.
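A sketch of the compactness check described above (a hypothetical helper, not the actual needle/tiny_pytorch API):

```python
def is_compact(shape, strides, offset):
    # compact = offset 0 and strides matching row-major order for this shape
    if offset != 0:
        return False
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if stride != expected:
            return False
        expected *= dim
    return True

print(is_compact((2, 3), (3, 1), 0))  # True: row-major layout
print(is_compact((2, 3), (1, 2), 0))  # False: e.g. a column-major/transposed view
```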
- Training larger models: large datasets require large models with large capacity. This puts pressure on compute devices and pushes them to their limits.
- Shared memory on each core is typically 64KB.
- The GPU's global memory becomes the bottleneck for large models because most devices have only around 10 GB, and most larger models can't fit in that.
- Techniques to save memory:
- Inference: Use a few buffers that hold the activations of the layers. Every few layers, we reuse the same buffers used in earlier computations, so instead of allocating N buffers we can reuse 2 or 3 buffers throughout the forward computation.
- Training: Because each activation is needed in the backward pass to compute the gradients, we can't just deallocate the buffers. As a workaround we can use checkpointing. This technique stores only K checkpointed layers/activations, where K is typically sqrt(N), instead of all N activations. During the backward pass, we recompute the values we need by re-running the forward pass over small segments at a time. This gives sublinear memory use and allows bigger models, at the expense of extra computation from the per-segment forward passes during gradient computation. We should prefer activations that don't require heavy computation, such as ReLU, as the non-stored ones, as opposed to something computation-heavy like convolution.
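A sketch of checkpointing using PyTorch's built-in helper (the layer count and sizes are arbitrary examples): only the segment boundaries are stored, and the activations inside each segment are recomputed during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# N = 16 blocks, split into 4 segments (~sqrt(N)); inner activations are recomputed
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(16)])
x = torch.randn(8, 256, requires_grad=True)
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```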
- Parallel and Distributed Training: We can leverage multiple GPU devices that are possibly distributed across several worker nodes to train the model. Parallelism can either be by model partitioning or data partitioning:
- Model partitioning: break up the computation graph into multiple parts and pipeline the computations such that different worker nodes will be performing different parts of the computation at the same time and do the communication through send/recv at the boundaries of the graph.
- Data partitioning: every worker runs the same replica of the model but using different micro batches and then the gradients can be summed from all workers since they are independent.
- We can use a parameter server that receives the gradients from all workers, sums them up, and performs the updates on the weights, which are then sent back to all workers to be used for the next micro-batches. Each worker can start sending gradients as soon as they are ready, without waiting for all gradient computation to finish; this increases parallelism and lets workers send gradients while other gradient computations are still running.
- Or we can use allreduce(), which sums up the gradients from all workers and sends the result back so that each worker can then perform the update itself
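A sketch of the allreduce approach for data parallelism (assuming the torch.distributed process group has already been initialized, e.g. via torchrun):

```python
import torch.distributed as dist

def allreduce_gradients(model):
    # sum each parameter's gradient across all workers, then average it locally
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```

Each worker calls this after its backward pass and then runs the same optimizer step, so all replicas stay in sync.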
- GANs: The main goal is for the generator model to produce images that look real and are hard to tell apart from real ones, i.e., to make the distribution of the generated images as close as possible to the distribution of the real images. This is done by training a discriminator and a generator through an iterative process:
- The generator's input is a random vector, and it tries to generate an image that looks real by maximizing the discriminator's negative log-likelihood loss. In other words, it makes it harder for the discriminator to predict that the image is fake and tries to push the discriminator's "real" probability toward 1.
- The discriminator's input is both fake images generated by the generator (label = 0) and real images (label = 1), and it tries to minimize the negative log-likelihood. Its role is to guide the generator to generate better images and make it focus on what it takes to produce fakes that can't be distinguished from real images (a loss sketch follows this list).
- It is called adversarial because the generator learns subtle corner cases that look the same to the human eye but are actually different between the two distributions
- Deconvolution, or Conv2dTranspose, is the opposite of convolution: we take a small vector and map it to a much bigger space such as an image
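A minimal sketch of the two loss terms (generator and discriminator are assumed to be existing modules, with the discriminator outputting raw logits):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, real, fake):
    real_logits = discriminator(real)
    fake_logits = discriminator(fake.detach())   # don't backprop into the generator here
    # real images should be classified as 1, fakes as 0
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(discriminator, fake):
    fake_logits = discriminator(fake)
    # push the discriminator's prediction on fakes toward "real" (label 1)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```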
- Sequence Modeling:
- RNNs: the whole past is compacted into the last hidden state, which makes learning harder over long histories and may lead to exploding/vanishing gradients. Also, the far past (e.g. x1) gets little weight, if any, compared to xt, even with an LSTM.
- Direct prediction: We use all the past inputs to predict the next output; there is no hidden state. The drawback is that we don't have a compact representation (latent state) of the past.
- Temporal CNN: It works well for tasks such as speech generation but suffers from a relatively small receptive field. This means the hidden state at any time step t can only capture a small part of the past history, which is limiting. We can increase the filter size or use dilation, but it is still not optimal.
- We can use attention, which takes the whole history and weights it (via an attention matrix), so we are sure we're incorporating all of the past
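A minimal numpy sketch of (unmasked) scaled dot-product attention, where each output position is a weighted mix of all positions via the attention matrix:

```python
import numpy as np

def attention(Q, K, V):                              # Q, K, V: (t, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (t, t) attention matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the history
    return weights @ V
```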
- Model Deployment:
- Considerations:
- Application environment restrictions: model size, no Python runtime
- Leverage existing HW accelerators such as mobile GPUs, NPUs, accelerated CPU instructions
- Integrate with applications