```mermaid
flowchart LR
    subgraph bad ["❌ Double Application"]
        direction LR
        z1["Logits z"] --> s1["nn.Softmax"] --> p1["Probs p"] --> ce1["CrossEntropyLoss\nlog-softmax(p)"]
    end
    subgraph good ["✅ Correct"]
        direction LR
        z2["Logits z"] --> ce2["CrossEntropyLoss\nlog-softmax(z) — fused, stable"]
    end
```
## A Common Mistake That’s Hard to See
If you’ve built a classifier in PyTorch, you’ve probably seen nn.Softmax and nn.CrossEntropyLoss in the same codebase. You may have even used them together — softmax at the end of the model, cross-entropy as the loss. The code runs, the loss decreases, the model converges. Everything looks fine.
But something is wrong. nn.CrossEntropyLoss already applies softmax internally. Applying it again in the model’s final layer means softmax is computed twice — and the gradients computed during backpropagation are the gradients of the wrong function. The model still learns, just more slowly, less stably, and to a worse optimum.
This post unpacks why — starting with what softmax actually does, then working through the numerical stability mechanism that motivates keeping raw logits, and finishing with a clear picture of when softmax belongs and when it doesn’t.
## What Softmax Does
The softmax function takes a vector of raw scores — logits — and squashes them into a probability distribution:
\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\]
The outputs are in \([0, 1]\) and sum to 1. For a ten-class classifier, softmax turns a vector like \([2.1,\ -0.3,\ 0.8,\ \ldots]\) into a proper probability distribution over the ten classes. This seems like exactly the right thing to do before computing a loss that expects probabilities.
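As a quick check of the definition, a minimal PyTorch sketch (the logit values are arbitrary):

```python
import torch

# Raw logits for a 3-class example
z = torch.tensor([2.1, -0.3, 0.8])
p = torch.softmax(z, dim=0)

print(p)        # each value in [0, 1]
print(p.sum())  # tensor(1.) — a proper probability distribution
```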
The problem isn’t what softmax does. It’s where you do it — and whether the operation downstream already does it better.
## The Log-Sum-Exp Trick
To understand why CrossEntropyLoss wants raw logits, we need to look at what it computes. Cross-entropy loss for the true class \(y\) is:
\[\mathcal{L} = -\log\left(\frac{e^{z_y}}{\sum_j e^{z_j}}\right) = -z_y + \log\sum_j e^{z_j}\]
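This algebraic split is easy to verify numerically; a small sketch with arbitrary logits and true-class index:

```python
import torch

z = torch.tensor([2.0, -1.0, 0.5])
y = 0  # index of the true class

# Direct form: negative log of the softmax probability of the true class
direct = -torch.log(torch.softmax(z, dim=0)[y])

# Algebraic form: -z_y plus the log-sum-exp of all logits
algebraic = -z[y] + torch.logsumexp(z, dim=0)

print(torch.allclose(direct, algebraic))  # True
```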
The second term — \(\log\sum_j e^{z_j}\) — is the log-sum-exp (LSE), and it’s numerically dangerous. If any logit is large, the exponent overflows to inf before the log can bring it back down:
```python
import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])

# The naive formula overflows: exp(1000) is inf in float32, and inf / inf is nan.
# (torch.softmax itself is internally stabilized; the naive formula is what breaks.)
naive = torch.exp(z) / torch.exp(z).sum()
print(naive)  # tensor([nan, nan, nan])
```

The standard fix is the log-sum-exp trick: subtract the maximum logit before exponentiating.
\[\log\sum_j e^{z_j} = c + \log\sum_j e^{z_j - c}, \quad c = \max_j z_j\]
Subtracting \(c = \max_j z_j\) keeps all terms in \([e^{-\infty},\ 1]\) — never overflowing, never underflowing. The mathematical result is identical; the numerical result is stable.
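A sketch of the trick in isolation, using the same overflow-prone logits:

```python
import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive LSE overflows: exp(1000) is inf in float32
naive = torch.log(torch.exp(z).sum())
print(naive)  # inf

# Shifted by c = max(z): every exponent is in (0, 1], no overflow
c = z.max()
stable = c + torch.log(torch.exp(z - c).sum())
print(stable)  # matches torch.logsumexp(z, dim=0)
```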
This is exactly what nn.CrossEntropyLoss does internally. It doesn’t apply softmax and then compute cross-entropy — it fuses both operations into one numerically stable pass using the LSE trick. Passing raw logits is what makes this possible.
If you apply softmax first, the loss function receives \(p_i = e^{z_i}/\sum e^{z_j}\) instead of raw logits and then applies its own log-softmax to those values — effectively computing \(\log(\text{softmax}(\text{softmax}(z)))\). The numbers are wrong and the gradients are wrong.
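The damage is easy to see by feeding nn.CrossEntropyLoss both raw logits and pre-softmaxed values (the numbers here are arbitrary):

```python
import torch
import torch.nn as nn

z = torch.tensor([[2.0, -1.0, 0.5]])
target = torch.tensor([0])
loss_fn = nn.CrossEntropyLoss()

correct = loss_fn(z, target)                         # log-softmax applied once
doubled = loss_fn(torch.softmax(z, dim=1), target)   # softmax, then log-softmax again

print(correct.item(), doubled.item())  # the two losses disagree
```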
nn.CrossEntropyLoss (PyTorch) and tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) (TensorFlow) both expect raw logits. The loss handles the stable, fused computation internally. Don’t apply softmax to the final layer of a classifier.
## Multi-Label Classification: The Wrong Prior
Softmax enforces competition between classes: increasing one class’s probability necessarily decreases the others. This is the correct structure for single-label tasks — exactly one class is true — and entirely the wrong structure for multi-label tasks, where multiple classes can be true simultaneously.
Consider a document classifier that assigns topics like “machine learning,” “software engineering,” and “career advice.” A document can belong to all three. Softmax forces these to compete: pushing “machine learning” up automatically pushes the others down. The model is fighting its own output structure.
There’s a deeper problem. Because softmax outputs always sum to 1, the model is structurally forced to predict high confidence for exactly one class — regardless of the input. If an image contains no objects from the training categories, softmax still redistributes its probability mass across the classes and picks a winner. If an image contains three objects, softmax still collapses to one. It has no way to say “multiple things are present” or “nothing relevant is here” — the sum-to-one constraint makes both answers impossible.
For multi-label classification, the correct output is sigmoid, applied independently per class:
\[\sigma(z_i) = \frac{1}{1 + e^{-z_i}}\]
Each output is an independent probability in \([0, 1]\) with no constraint that they sum to 1. Use nn.BCEWithLogitsLoss — which applies sigmoid internally with the same kind of numerical stability fusion — rather than sigmoid in the model followed by nn.BCELoss.
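A minimal multi-label sketch; the logits and multi-hot targets are made up for illustration:

```python
import torch
import torch.nn as nn

# 2 documents, 3 independent topic labels (multi-hot targets)
logits = torch.tensor([[1.2, 0.3, -0.8],
                       [-0.5, 2.0, 1.1]])
targets = torch.tensor([[1.0, 1.0, 0.0],   # doc 0 belongs to two topics
                        [0.0, 1.0, 1.0]])

loss_fn = nn.BCEWithLogitsLoss()  # fuses sigmoid + BCE, expects raw logits
loss = loss_fn(logits, targets)

# At inference: independent per-class probabilities, thresholded per class
probs = torch.sigmoid(logits)
predicted = probs > 0.5
```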
| Task | Loss function | Notes |
|---|---|---|
| Single-label classification | nn.CrossEntropyLoss | Expects raw logits; applies log-softmax internally |
| Multi-label classification | nn.BCEWithLogitsLoss | Expects raw logits; applies sigmoid internally |
| Binary classification | nn.BCEWithLogitsLoss | Same as above |
| Probabilities at inference | Apply softmax after training | Not during training |
## Softmax and Overconfidence
Softmax is sensitive to the scale of the logits, not just their relative ordering. Logits \([3,\ 1,\ 0]\) and \([300,\ 100,\ 0]\) produce the same ranking but very different softmax outputs — the scaled version concentrates nearly all probability mass on the top class. As training progresses, logit magnitudes tend to grow, and softmax increasingly exaggerates these differences.
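The scale sensitivity is easy to demonstrate with the two logit vectors above:

```python
import torch

a = torch.tensor([3.0, 1.0, 0.0])
b = torch.tensor([300.0, 100.0, 0.0])  # same ranking, 100x the scale

pa = torch.softmax(a, dim=0)
pb = torch.softmax(b, dim=0)
print(pa)  # moderate concentration on the top class
print(pb)  # essentially all probability mass on the top class
```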
The result is systematic overconfidence: a model that outputs near-100% probability on examples it gets wrong. The Guo et al. 2017 calibration paper showed this is a consistent property of modern neural networks, not a training artifact.
The standard fix is temperature scaling: divide the logits by a learned scalar \(T\) before applying softmax at inference time.
\[p_i = \text{softmax}(z_i / T)\]
\(T > 1\) flattens the distribution (less confident); \(T < 1\) sharpens it. \(T\) is fit on a held-out validation set after training finishes. Crucially, this only works if the model was trained on raw logits — the scale information that temperature scaling adjusts is preserved through training and only consumed at inference.
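A sketch of the flattening and sharpening effect; the logits are illustrative, and in practice \(T\) is fit on held-out validation data:

```python
import torch

z = torch.tensor([3.0, 1.0, 0.0])

top = {}
for T in (0.5, 1.0, 2.0):
    # Top-class probability after temperature-scaled softmax
    top[T] = torch.softmax(z / T, dim=0).max().item()

print(top)  # T > 1 lowers the top probability; T < 1 raises it
```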
Post-hoc calibration methods (temperature scaling, Platt scaling, isotonic regression) all operate on the raw logit magnitudes that accumulate through training. If your output layer applies softmax during training, the scale information is destroyed before calibration is attempted — the calibration methods have nothing useful to fit.
## When Softmax Belongs
Removing softmax from the final classification layer doesn’t mean it’s always wrong — it means the structure it imposes (mutual exclusivity, sum-to-one) has to match what the computation actually needs.
**Attention mechanisms.** The scaled dot-product attention in Transformers applies softmax to produce a distribution over positions. This is exactly right: each query should distribute its weight across keys, and the competition structure is intentional. There’s no fused loss downstream computing log-softmax again.
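A minimal sketch of scaled dot-product attention, where the softmax over key positions is the point rather than a mistake (shapes are illustrative):

```python
import torch

def attention(q, k, v):
    # Softmax over the key dimension: each query's weights over keys sum to 1,
    # and that competition between positions is intentional here.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = torch.randn(2, 4, 8)  # (batch, queries, dim)
k = torch.randn(2, 6, 8)  # (batch, keys, dim)
v = torch.randn(2, 6, 8)  # (batch, keys, dim)
out = attention(q, k, v)  # (batch, queries, dim)
```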
**Contrastive learning.** Methods like CLIP apply softmax across the batch as part of the contrastive loss. The within-batch competition is the learning signal.
**Inference-time probabilities.** If downstream code requires calibrated probabilities — confidence thresholds, ensemble averaging, displaying to users — apply softmax to the final logits after the forward pass, outside the model:
```python
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=-1)
```

The pattern: softmax belongs when the distribution semantics genuinely fit the computation, and when nothing downstream is already computing a fused version of it.
## Key Takeaways
- **Don’t apply softmax in your model’s final layer for classification.** nn.CrossEntropyLoss expects raw logits and applies a fused, numerically stable log-softmax internally using the log-sum-exp trick. Pre-applying softmax computes gradients of the wrong function.
- **The numerical instability is real and silent.** Large logits overflow naive softmax — you get nan losses and corrupted gradients, often without a clear error. The fused implementation avoids this entirely.
- **Multi-label tasks need sigmoid, not softmax.** Softmax enforces mutual exclusivity. For tasks where multiple labels are simultaneously valid, use nn.BCEWithLogitsLoss with raw logits.
- **Overconfidence is a logit scale problem.** Softmax exaggerates differences as magnitudes grow through training. Temperature scaling is the standard fix — but only if raw logit scale is preserved through training.
- **Softmax has legitimate uses.** Attention weights, contrastive losses, and inference-time probability outputs are correct applications. The question is always whether competition semantics fit the problem, and whether a fused stable implementation already handles the math downstream.
## Resources
- PyTorch Documentation — CrossEntropyLoss — Documents why raw logits are expected and how log-softmax is fused internally.
- On Calibration of Modern Neural Networks — Guo et al. on systematic softmax overconfidence and temperature scaling as the practical fix.
- Deep Learning Book — Chapter 6 — Goodfellow et al. on output units and loss function design for classification.