
This post shows that pairing softmax with cross-entropy loss keeps gradients significant: the gradient with respect to the logits reduces to the difference between the predicted probabilities and the true labels, which supports effective learning.
Cross-Entropy Loss
The cross-entropy loss for a single example, given the true label $\mathbf{y}$ (which is typically one-hot encoded), is:

$$
\mathcal{L} = -\sum_{k=1}^{C} y_k \log(p_k)
$$

where $p_k$ is the predicted probability of class $k$ and $C$ is the number of classes.
Softmax
Note that since the true label $\mathbf{y}$ is one-hot encoded, only the term for the true class $c$ survives the sum, so the loss simplifies to $\mathcal{L} = -\log(p_c)$.
By the definition of softmax:

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
$$
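As a concrete reference for these two definitions, here is a minimal NumPy sketch (the function names and the example logits are illustrative, not from the original post):

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, y):
    # y is one-hot, so only the true class contributes to the sum.
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1])   # example logits
y = np.array([0.0, 1.0, 0.0])   # one-hot true label
p = softmax(z)
print(p)                        # ~[0.659, 0.242, 0.099]
print(cross_entropy(p, y))      # ~1.417, i.e. -log(0.242)
```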
Calculate $\frac{\partial p_k}{\partial z_i}$

The term $\frac{\partial p_k}{\partial z_i}$ takes a different form depending on whether $i = k$ or $i \neq k$.

For $i = k$:

$$
\frac{\partial p_k}{\partial z_k} = \frac{e^{z_k} \sum_j e^{z_j} - e^{z_k} e^{z_k}}{\left(\sum_j e^{z_j}\right)^2} = p_k (1 - p_k)
$$

For $i \neq k$:

$$
\frac{\partial p_k}{\partial z_i} = \frac{0 \cdot \sum_j e^{z_j} - e^{z_k} e^{z_i}}{\left(\sum_j e^{z_j}\right)^2} = -p_k p_i
$$
The derivation applies the quotient rule with no additional tricks.
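As a sanity check on these two cases, which combine into $\frac{\partial p_k}{\partial z_i} = p_k(\delta_{ik} - p_i)$, the closed-form Jacobian can be compared against a finite-difference estimate; the sketch below (with made-up logits) is only a numerical verification, not part of the original derivation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)

# Analytic Jacobian: J[k, i] = p_k * (delta_ik - p_i)
J_analytic = np.diag(p) - np.outer(p, p)

# Central-difference Jacobian, one column (logit) at a time
eps = 1e-6
J_numeric = np.zeros((3, 3))
for i in range(3):
    dz = np.zeros(3)
    dz[i] = eps
    J_numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))  # ~1e-10: the two agree
```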
Calculate $\frac{\partial \mathcal{L}}{\partial z_i}$

Applying the chain rule:

$$
\frac{\partial \mathcal{L}}{\partial z_i} = -\sum_{k} y_k \frac{\partial \log(p_k)}{\partial z_i} = -\sum_{k} \frac{y_k}{p_k} \frac{\partial p_k}{\partial z_i}
$$

For the term with $k = i$, the contribution is:

$$
-\frac{y_i}{p_i} \, p_i (1 - p_i) = -y_i (1 - p_i) = y_i p_i - y_i
$$

For the terms with $k \neq i$, the contribution is:

$$
-\sum_{k \neq i} \frac{y_k}{p_k} (-p_k p_i) = \sum_{k \neq i} y_k p_i
$$

Adding the two contributions and using $\sum_k y_k = 1$:

$$
\frac{\partial \mathcal{L}}{\partial z_i} = y_i p_i - y_i + p_i \sum_{k \neq i} y_k = p_i \sum_{k} y_k - y_i = p_i - y_i
$$
To put them into vectorized form:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{y}
$$
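A quick numerical check of this vectorized result (again with illustrative logits and labels, not from the post):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])

grad_analytic = softmax(z) - y   # p - y

# Central-difference gradient of the loss, one logit at a time
eps = 1e-6
grad_numeric = np.array([
    (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(grad_analytic)   # ~[ 0.659, -0.758,  0.099]
print(grad_numeric)    # matches the analytic gradient to ~1e-9
```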
The Key Takeaway
The gradient of the cross-entropy loss with respect to the logits is simply the difference between the predicted probability and its label. Whenever the prediction disagrees with the label, this difference is large, so the gradient stays significant and learning proceeds effectively; it only shrinks toward zero as the prediction approaches the true label.
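To make the takeaway concrete, compare the gradient for a confidently wrong prediction with one for a nearly correct prediction (illustrative numbers only):

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])                # true class is index 1

p_wrong   = np.array([0.98, 0.01, 0.01])     # confident but wrong
p_correct = np.array([0.01, 0.98, 0.01])     # confident and correct

print(p_wrong - y)     # [ 0.98 -0.99  0.01] -> large gradient, strong learning signal
print(p_correct - y)   # [ 0.01 -0.02  0.01] -> gradient shrinks only as p approaches y
```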
- Post title: Softmax + Cross-Entropy Loss = Significant Gradient
- Post author: Lutao Dai
- Create time: 2024-08-10 22:42:00
- Post link: https://lutaodai.github.io/2024-08-11-softmax-cross-entropy-loss/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.