Deep Learning Tips & Tricks

Practical heuristics for training deep learning models — gathered from research papers, experiments, and debugging sessions.
Modified: March 15, 2026

A living collection, organized by theme so you can find what you need mid-experiment.

General

Training

  • Learning rate: A good default for Adam/AdamW with reasonable model size is 3e-4. As the model’s complexity decreases (smaller models), you can get away with higher learning rates.
  • Overfit a tiny dataset first: Check that your model can overfit 1 example, then a few examples; it should do so easily. If it can't, something is wrong with the optimization or another part of the training code. Always validate this before scaling to the full dataset.
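
To make the check concrete, here is a minimal sketch of the idea using plain gradient descent on a toy linear model (the setup and values are my own, purely illustrative): on a handful of clean examples, a working training loop should drive the loss essentially to zero.

```python
# Sanity check: a tiny model should drive training loss to ~0 on a few examples.
# Toy setup (illustrative): fit y = w*x + b on two points with gradient descent.
data = [(1.0, 3.0), (2.0, 5.0)]  # exactly representable by w=2, b=1
w, b, lr = 0.0, 0.0, 0.1

for step in range(2000):
    gw = gb = 0.0
    for x, y in data:
        err = (w * x + b) - y
        gw += 2 * err * x / len(data)  # d(mean squared error)/dw
        gb += 2 * err / len(data)      # d(mean squared error)/db
    w -= lr * gw
    b -= lr * gb

loss = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
print(f"final loss: {loss:.2e}")  # should be ~0; if not, the loop is broken
```

If even this kind of check stalls on your real model, suspect the loss wiring, the optimizer setup, or data/label alignment before touching the architecture.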

  • Mixup: Linear combination of 2 random training examples using a lambda drawn from a Beta(α, α) distribution. The output vector is also a linear combination of the two examples’ labels. This forces the model to be more robust and learn linear combinations of examples instead of memorizing them, making it less sensitive to corrupted labels and noise. Can also be applied to tabular data, though you need to train for much longer since the model must learn to differentiate between examples and their relative weights.
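
A minimal mixup sketch in numpy (function name and defaults are mine, not a specific library's API): one lambda is drawn per batch from Beta(α, α) and applied identically to inputs and one-hot labels.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Mix each example with a randomly paired partner from the same batch."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)            # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))          # random partner for each example
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

x = np.arange(8, dtype=float).reshape(4, 2)
y = np.eye(2)[[0, 1, 0, 1]]                 # one-hot labels for 4 examples
xm, ym, lam = mixup_batch(x, y)
print(lam)                                  # mixing weight in [0, 1]
```

Small α concentrates lambda near 0 or 1 (mild mixing); α = 1 makes it uniform.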

  • Mixed precision: When training in mixed precision, increase the batch size to keep the GPU utilized, since each batch occupies much less memory. Larger batches mean fewer optimizer steps per epoch, so compensate by (1) increasing the number of epochs and (2) increasing the learning rate.

  • Overfit, then regularize: First verify your model has enough capacity to overfit the full training data, then add regularization (dropout, weight decay, data augmentation, early stopping) to close the generalization gap. If the model can’t even overfit, the architecture or optimization is the problem — no amount of regularization will help.

  • Batch size and learning rate: When changing batch size, scale the learning rate proportionally. Double the batch size → double the LR (linear scaling), or use square-root scaling for more stability. Smaller batches have noisier gradients and need a lower LR; larger batches are more stable and tolerate a higher LR.
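
The two scaling rules as a small helper (name and signature are mine): linear scaling multiplies the LR by the batch-size ratio, square-root scaling by its square root.

```python
import math

def scaled_lr(base_lr, base_bs, new_bs, rule="linear"):
    """Scale a reference learning rate when the batch size changes."""
    k = new_bs / base_bs
    return base_lr * (k if rule == "linear" else math.sqrt(k))

# Doubling the batch: linear scaling doubles the LR; sqrt scaling is gentler.
print(scaled_lr(3e-4, 256, 512, "linear"))  # 6e-4
print(scaled_lr(3e-4, 256, 512, "sqrt"))    # ~4.24e-4
```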

  • LR schedule — 1cycle with cosine annealing: Start with a quick warmup from a small LR, then train with a high LR for a long time (explore the loss landscape), then a low LR for a long time at the end (converge into a good minimum). Pair with cyclical momentum: high LR + low momentum (explore without overshooting), low LR + high momentum (converge steadily within a basin).
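
A minimal sketch of such a schedule (my own simplified version, not PyTorch's `OneCycleLR`; all default values are illustrative): the LR warms up on a cosine segment while momentum falls, then the LR anneals back down while momentum recovers.

```python
import math

def one_cycle(step, total, lr_max=1e-3, lr_start=1e-5, lr_end=1e-6,
              mom_max=0.95, mom_min=0.85, warmup_frac=0.3):
    """Return (lr, momentum) for the given step of a one-cycle schedule."""
    warmup = int(total * warmup_frac)
    if step < warmup:
        t = step / warmup
        cos = (1 - math.cos(math.pi * t)) / 2          # rises 0 -> 1
        lr = lr_start + (lr_max - lr_start) * cos
        mom = mom_max + (mom_min - mom_max) * cos      # momentum falls
    else:
        t = (step - warmup) / (total - warmup)
        cos = (1 + math.cos(math.pi * t)) / 2          # falls 1 -> 0
        lr = lr_end + (lr_max - lr_end) * cos
        mom = mom_min + (mom_max - mom_min) * (1 - cos)  # momentum recovers
    return lr, mom

# At the peak (end of warmup): highest LR, lowest momentum.
print(one_cycle(30, 100))
```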

  • Initialization matters deeply: Weights don’t change much from their initial values, so proper initialization is critical. The initial loss should match what you’d expect from the task. For language modeling, the first-iteration probability of each token should be roughly uniform (\(1/\text{vocab\_sz}\)), since there is no reason to assign higher probability to some tokens. The loss on each token should then be \(\approx -\log(1/\text{vocab\_sz}) = \log(\text{vocab\_sz})\), because the probability distribution is diffuse.
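
Worked out for GPT-2's vocabulary size, the expected first-iteration cross-entropy is:

```python
import math

# At init the next-token distribution should be ~uniform over the vocabulary,
# so the expected cross-entropy is -log(1/vocab_sz) = log(vocab_sz).
vocab_sz = 50257  # GPT-2's vocabulary size
expected_initial_loss = math.log(vocab_sz)
print(f"{expected_initial_loss:.2f}")  # ~10.8; a much higher first loss hints at bad init
```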

Data

  • Inspect worst predictions: Check the positive examples mispredicted with probabilities close to 0 and negative examples mispredicted with probabilities close to 1. Also check examples the model is least confident about (probabilities close to 0.5). This highlights issues with data preprocessing and labeling mistakes.
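
A quick sketch of how to surface these examples for a binary classifier (array values and names are mine, purely illustrative): rank by |predicted probability − label| for confidently wrong examples, and by distance from 0.5 for uncertain ones.

```python
import numpy as np

probs = np.array([0.02, 0.97, 0.51, 0.88, 0.10])   # model's P(positive)
labels = np.array([1,    0,    1,    1,    0])      # ground truth

confidence_error = np.abs(probs - labels)           # 1.0 = confidently wrong
worst = np.argsort(-confidence_error)               # inspect these for label mistakes
least_confident = np.argsort(np.abs(probs - 0.5))   # ~0.5 = model can't decide

print(worst[:2])            # most confidently wrong examples
print(least_confident[0])   # least confident example
```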

  • Verify transformations: Look at the output of all transformations before plugging them into the pipeline — you may lose important characteristics of the input in some transformations.

  • Image resolution matters: 32x32 images have very different characteristics than larger images. Below 96x96, behavior changes significantly — what works on CIFAR-10 will most likely not work on ImageNet. 128x128 pixel images generalize well to bigger images, and conclusions made on them hold well for larger images while being much faster to train.

  • Create toy problems: A big part of getting good at applied deep learning is knowing how to create small, workable, useful datasets. Try to come up with a toy problem or two that gives you insight into your full problem.

  • Augmentation: Can be applied to all kinds of data as long as the label remains (almost) unchanged after the augmentation. Make sure the augmented output both makes sense and preserves the label.

  • Augmentation encodes invariances, not information: Data augmentation doesn’t add new information — it shows different aspects of the same distribution to make known invariances explicit (e.g., a horizontally flipped cat is still a cat), making it easier for the model to learn them rather than having to discover them from data alone.

  • Compose augmentation transforms: Successive image transforms (rotate, resize, crop) compound interpolation errors since each operation resamples the image. Fix: resize the image larger first, compose all geometric transforms into a single operation including the final resize — all on GPU.
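
The composition idea in matrix form (helper functions are my own, not a library API): each geometric transform is a 3×3 affine matrix, so multiplying the matrices first and resampling once avoids repeated interpolation.

```python
import numpy as np

def rotation(deg):
    """3x3 affine matrix for a 2D rotation about the origin."""
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def scale(s):
    """3x3 affine matrix for a uniform 2D scale (e.g. the final resize)."""
    return np.diag([s, s, 1.0])

# One combined matrix == rotate then scale, applied in a single resampling pass.
combined = scale(0.5) @ rotation(30)
point = np.array([2.0, 0.0, 1.0])          # a pixel location in homogeneous coords
print(combined @ point)                    # same point as applying the two in sequence
```

The combined matrix gives the same geometry as applying the transforms one by one, but the image is only interpolated once.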

  • Distribution mismatch rule: It’s fine for the training distribution to differ from validation/test — what matters is that validation and test distributions match, since the validation set is your proxy for estimating test performance.

Monitoring & Debugging

  • Activation and gradient distributions: Monitor histograms of activations and gradients at each layer, including the percentage of saturated neurons. This reveals whether activations are saturated and gradients are close to 0 — meaning those layers aren’t learning.

  • Update-to-parameter ratio: Monitor the magnitude of parameter updates relative to the parameters themselves (the update/param ratio). Ideally, the ratio should be close to 1% — not 0.001% (no learning) or 50% (changing too fast, may overshoot). Use this to determine whether the learning rate needs adjustment.
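
The quantity being monitored, sketched for plain SGD (the function and random tensors are illustrative; for other optimizers, use the actual update instead of lr × grad):

```python
import numpy as np

def update_ratio(param, grad, lr):
    """||lr * grad|| / ||param|| for one parameter tensor."""
    return lr * np.linalg.norm(grad) / np.linalg.norm(param)

rng = np.random.default_rng(0)
param = rng.standard_normal(1000)   # stand-ins for one layer's weights
grad = rng.standard_normal(1000)    # and its gradient
r = update_ratio(param, grad, lr=1e-2)
print(f"update/param ratio: {r:.4f}")  # tune lr until this sits near the target
```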

  • Three error gaps: Training error alone reveals high bias (underfitting). The gap between training and validation error reveals high variance (overfitting). The gap between validation and test error reveals overfitting to the validation set — often caused by excessive hyperparameter tuning.

  • Metrics diverge during training: Early in training, most metrics improve together. As training progresses, they typically diverge — optimizing RMSE may worsen MAPE, improving precision may hurt recall. Pick the metric that matches your actual objective and monitor others as guardrails.

Efficiency

  • Gradient accumulation: To avoid out-of-memory errors while training on GPUs, break larger batches into smaller micro-batches and accumulate gradients by back-propagating after every micro-batch before running the optimization step (scale each micro-batch loss by the number of accumulation steps so the accumulated gradient matches the large-batch mean). This gives identical results to training with larger batches, unless you have layers that depend on batch size in the forward pass, such as BatchNorm. In PyTorch, instead of calling optimizer.step() for every batch, call it every few batches.
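
A numerical check of the equivalence on a toy linear model (setup is mine, purely illustrative): summing micro-batch gradients, each scaled by 1/num_accum, reproduces the full-batch mean gradient exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)
w = np.zeros(3)

def grad(Xb, yb, w):
    """Mean-MSE gradient of a linear model on one (micro-)batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

full = grad(X, y, w)                      # one big batch of 8

accum, num_accum = np.zeros(3), 4
for Xb, yb in zip(np.split(X, num_accum), np.split(y, num_accum)):
    accum += grad(Xb, yb, w) / num_accum  # like scaling each loss by 1/num_accum

print(np.allclose(full, accum))           # identical update (no BatchNorm here)
```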

NLP

  • Compute scaling: With LLMs, generally the more compute the better. Compute ≈ parameters × tokens, so you can make the model bigger and keep the number of tokens fixed, or keep the model size the same and increase the number of tokens (training for longer). The optimal trade-off depends on the task.

  • Logloss predicts downstream performance: Improving logloss for LLMs is correlated with improved performance on downstream tasks.

  • Emergent task scaling: Even though loss scales smoothly with compute, individual downstream tasks may scale in unpredictable ways — some plateau, others may scale inversely, etc.

  • Weight sharing: Tie the token embedding and the final linear layer (classifier / LM head) — semantically similar tokens should have similar probability when predicting the next token. This also provides huge efficiency gains: in GPT-2, each matrix has \(50257 \times 768 \approx 38.6M\) parameters, roughly \(1/3\) of the model. Gradient updates receive contributions from both the classifier and embedding branches.
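
Quick arithmetic check of that parameter count (assuming GPT-2 small's 50257-token vocabulary, 768-dim embeddings, and ~124M total parameters):

```python
# Parameter count of GPT-2's tied embedding / LM head matrix.
vocab_sz, d_model = 50257, 768
tied_params = vocab_sz * d_model
print(f"{tied_params:,}")            # 38,597,376 ~ 38.6M
print(f"{tied_params / 124e6:.0%}")  # roughly a third of GPT-2 small's ~124M params
```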

  • OOV token handling: For tokens that don’t appear in the training data, their predicted probabilities should be very close to zero.

Tabular

  • Embedding size for categoricals (fastai rule of thumb): \[\min(600,\; \text{round}(1.6 \times \text{n\_categories}^{0.56}))\]
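
The rule as a one-line function (the function name is mine):

```python
def emb_sz(n_categories: int) -> int:
    """fastai rule of thumb: min(600, round(1.6 * n_categories ** 0.56))."""
    return min(600, round(1.6 * n_categories ** 0.56))

print(emb_sz(10))      # small cardinality -> small embedding
print(emb_sz(100000))  # capped at 600 for very high cardinality
```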

  • Encode nominal categoricals: Always encode nominal categorical features — handles typos and abbreviations that inflate cardinality, groups morphologically or semantically related categories, reduces cardinality, and avoids unknown categories at inference. Word embeddings often don’t help because category names lack sufficient context.
