General
Check if your model can overfit 1 example, then a few examples. The model should easily be able to overfit a few examples; if not, something is likely wrong with the optimization or other parts of the training code.
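A minimal sketch of this sanity check, assuming a hypothetical `model`, `criterion`, and `train_loader` already exist:

```python
import torch

# Grab one fixed batch and train on it repeatedly.
x, y = next(iter(train_loader))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# If the loss does not approach 0, suspect the loss function,
# the optimizer setup, or a data/label bug.
print(loss.item())
```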
Check the worst mistakes the model makes. In binary classification, look at positive examples that were mispredicted with probabilities close to 0, and negative examples that were mispredicted with probabilities close to 1. Also check the examples the model is least confident about, such as those with probabilities close to 0.5. This highlights issues with data preprocessing and labeling mistakes.

Look at the output of all transformations before you plug them into the pipeline, because you may lose a lot of the characteristics of the input in some transformations.
32x32 images have very different characteristics than large images; once you get below 96x96, they behave differently. This means that what works well on, for example, CIFAR-10 most likely will not work well on ImageNet. 128x128 images, however, do generalize well to bigger images: conclusions made on them hold for large images, and they are much faster to train on.

A big part of getting good at using deep learning in your domain is knowing how to create small, workable, useful datasets… Try to come up with a toy problem or two that will give you insight into your full problem.
Augmentation can be done on all kinds of data as long as the label is (almost) unchanged by the augmentation. Therefore, we need to make sure that the augmented output first makes sense and second doesn't change the label.
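A minimal sketch of eyeballing augmentations before adding them to the pipeline; the `augment` transform here is a hypothetical example:

```python
import torchvision.transforms as T
from torchvision.utils import save_image

augment = T.Compose([T.RandomHorizontalFlip(), T.ColorJitter(0.4, 0.4)])

x, y = next(iter(train_loader))          # a batch of float image tensors in [0, 1]
save_image(x[:8], "before.png")          # originals
save_image(augment(x[:8]), "after.png")  # augmented versions
# Inspect after.png: would a human still assign label y to each image?
```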
Monitor histograms of activations and gradients at each layer. This sheds light on whether some activations are saturated, with gradients close to 0, and therefore not learning anything.

Monitor the magnitude of the parameter updates relative to the parameters. Ideally, we want the ratio to be close to \(1\%\), not \(0.001\%\) (no learning) or \(50\%\) (parameters change super fast and may overshoot).
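A minimal sketch of this check, assuming `model` and a known learning rate, and assuming plain SGD (where the update is simply `lr * grad`; other optimizers such as Adam rescale gradients internally):

```python
import torch

def update_ratios(model, lr):
    """Ratio of update magnitude to parameter magnitude, per parameter tensor."""
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            update_norm = (lr * p.grad).norm()
            ratios[name] = (update_norm / (p.data.norm() + 1e-12)).item()
    return ratios

# usage, after loss.backward() in the training loop:
# print(update_ratios(model, lr=1e-3))  # aim for values around 1e-2
```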
Gradient accumulation: to avoid out-of-memory (OOM) errors while training on GPUs, break larger batches into smaller batches and accumulate gradients by back-propagating after every small batch, then run the optimization step once the gradients are gathered. It should give identical results to training with the larger batch, unless we have layers whose forward pass depends on the batch size, such as `BatchNorm`.
- For example, in PyTorch, instead of calling `optimizer.step()` for every batch, we call it every few batches, as in the sketch below.
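A minimal sketch of this pattern, assuming `model`, `criterion`, `optimizer`, and `train_loader` are defined elsewhere:

```python
accum_steps = 4  # effective batch size = loader batch size * accum_steps

optimizer.zero_grad()
for i, (x, y) in enumerate(train_loader):
    # Divide so the accumulated gradient matches the large-batch average.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate in each parameter's .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```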
When using mixed precision, it is better to increase the batch size to utilize the GPU, since every batch occupies much less memory. As a result, there will be fewer updates per epoch, because the model sees fewer batches in each epoch. This means we need to 1) increase the number of epochs and 2) increase the learning rate.
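A minimal sketch of mixed-precision training with `torch.cuda.amp`; the surrounding names are illustrative, and the loader would be built with a larger batch size (and the learning rate increased) per the note above:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for x, y in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # half-precision forward pass
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()    # scale loss to avoid fp16 underflow
    scaler.step(optimizer)           # unscales gradients, then steps
    scaler.update()
```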
Mixup: take a linear combination of 2 random examples from the training dataset, using a lambda parameter drawn from a \(\mathrm{Beta}(\alpha, \alpha)\) distribution. The target is the same linear combination of the two examples' labels. This forces the model to be more robust and to learn linear combinations of examples instead of memorizing them; as a result, the model becomes less sensitive to corrupted labels and noise in the training data. This method can also be applied to tabular data. However, because it is harder for the model to learn to differentiate between the two examples and the weight of each, we need to train for much longer to get good results.
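A minimal sketch of mixup for a classification batch, following the description above:

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix a batch with a shuffled copy of itself; lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    return x_mixed, y, y[perm], lam

# usage inside the training loop:
# x_mixed, y_a, y_b, lam = mixup_batch(x, y)
# pred = model(x_mixed)
# loss = lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
```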
NLP
With LLMs, generally, the more compute the better the results. We can define compute, roughly, as the number of parameters × the number of tokens. Therefore, we can make the model bigger and keep the number of tokens fixed, OR keep the model size the same and increase the number of tokens, which means training for longer. There is a trade-off that depends on the task.
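For intuition, a common approximation from the scaling-law literature (an addition here, not from these notes) is

\[C \approx 6ND\]

where \(C\) is training compute in FLOPs, \(N\) is the number of parameters, and \(D\) is the number of training tokens; doubling either \(N\) or \(D\) roughly doubles the required compute.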
Improving log loss for LLMs is correlated with improved performance on downstream tasks.
Even though loss scales smoothly with compute, individual downstream tasks may scale in an emergent fashion: some tasks' loss may be flat, others may be inversely scaled, etc.
Tabular
- Rule of thumb to pick the embedding size for a categorical feature (fastai): \[\min(600,\ \mathrm{round}(1.6 \times n_{\text{categories}}^{0.56}))\]
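The same rule as a small Python helper:

```python
def embedding_size(n_categories: int) -> int:
    """fastai rule of thumb for categorical embedding width."""
    return min(600, round(1.6 * n_categories ** 0.56))

# e.g. embedding_size(10) -> 6, embedding_size(1000) -> 77
```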