Softmax + Cross-Entropy Loss = Significant Gradient
Lutao Dai

This post shows that combining softmax with cross-entropy loss yields significant gradients: the gradient with respect to the logits is simply the difference between the predicted probabilities and the true labels, which aids effective learning.



Cross-Entropy Loss

The cross-entropy loss for a single example, given the true label $\mathbf{y}$ (which is typically one-hot encoded), is:

$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

where $K$ is the number of classes; $c$ is the index of the true class; $y_k$, assuming one-hot encoding, equals 1 if $k = c$ and 0 otherwise; $\hat{y}_k$ is the model's output, the predicted probability for class $k$.
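As a quick numerical illustration (my own addition, not part of the original post), here is a minimal NumPy sketch that evaluates this loss for a hypothetical 3-class prediction; the names `probs` and `label` are placeholders.

```python
import numpy as np

# Hypothetical predicted probabilities for K = 3 classes (already softmax-ed)
probs = np.array([0.7, 0.2, 0.1])

# One-hot encoded true label: the true class is c = 0
label = np.array([1.0, 0.0, 0.0])

# Cross-entropy: only the true-class term survives because the label is one-hot
loss = -np.sum(label * np.log(probs))
print(loss)  # ~0.357, i.e. -log(0.7)
```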

Softmax

Note that since the true label is one-hot encoded, the summation over classes disappears and only the term with $k = c$ is left in the calculation:

$$L = -\log \hat{y}_c$$

By the definition of softmax:

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z_k$ is the logit for class $k$.
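For concreteness, here is a minimal NumPy sketch of softmax (my own addition, not the author's code). It subtracts the maximum logit before exponentiating, a standard trick that avoids overflow without changing the result.

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits z into probabilities."""
    # Subtracting max(z) prevents overflow in exp() and leaves the output unchanged
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical logits
print(softmax(logits))               # ~[0.659, 0.242, 0.099], sums to 1
```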

Calculate $\frac{\partial \hat{y}_c}{\partial z_k}$

The term $e^{z_c}$ is special since it appears in both the numerator and denominator. Therefore, $\frac{\partial \hat{y}_c}{\partial z_k}$ yields different results when $k = c$ and $k \neq c$.

For $k = c$:

$$\frac{\partial \hat{y}_c}{\partial z_c} = \frac{e^{z_c}\sum_{j} e^{z_j} - e^{z_c} \cdot e^{z_c}}{\left(\sum_{j} e^{z_j}\right)^2} = \hat{y}_c\left(1 - \hat{y}_c\right)$$

For $k \neq c$:

$$\frac{\partial \hat{y}_c}{\partial z_k} = \frac{0 \cdot \sum_{j} e^{z_j} - e^{z_c} \cdot e^{z_k}}{\left(\sum_{j} e^{z_j}\right)^2} = -\hat{y}_c\,\hat{y}_k$$

The derivation applies the quotient rule with no additional tricks.
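To sanity-check the two cases, here is a small finite-difference test (my own sketch, not part of the original derivation); the logits and the true-class index are hypothetical.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])    # hypothetical logits
c = 0                            # index of the true class
y_hat = softmax(z)
eps = 1e-6

for k in range(len(z)):
    # Finite-difference approximation of d y_hat_c / d z_k
    z_plus = z.copy()
    z_plus[k] += eps
    numeric = (softmax(z_plus)[c] - y_hat[c]) / eps
    # Analytic result: y_c * (1 - y_c) if k == c, else -y_c * y_k
    analytic = y_hat[c] * (1 - y_hat[c]) if k == c else -y_hat[c] * y_hat[k]
    print(k, numeric, analytic)  # the two columns should match closely
```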

Calculate $\frac{\partial L}{\partial z_k}$

Applying the chain rule:

$$\frac{\partial L}{\partial z_k} = \frac{\partial L}{\partial \hat{y}_c} \cdot \frac{\partial \hat{y}_c}{\partial z_k} = -\frac{1}{\hat{y}_c} \cdot \frac{\partial \hat{y}_c}{\partial z_k}$$

For $k = c$:

$$\frac{\partial L}{\partial z_c} = -\frac{1}{\hat{y}_c} \cdot \hat{y}_c\left(1 - \hat{y}_c\right) = \hat{y}_c - 1$$

For $k \neq c$:

$$\frac{\partial L}{\partial z_k} = -\frac{1}{\hat{y}_c} \cdot \left(-\hat{y}_c\,\hat{y}_k\right) = \hat{y}_k$$

Since $y_c = 1$ and $y_k = 0$ for $k \neq c$, both cases can be written as $\hat{y}_k - y_k$. To put them into vectorized form:

$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$$
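The vectorized result can also be checked numerically. The sketch below (my own, with hypothetical values) compares a finite-difference gradient of the loss against $\hat{\mathbf{y}} - \mathbf{y}$.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, c):
    # Cross-entropy with a one-hot label reduces to -log of the true-class probability
    return -np.log(softmax(z)[c])

z = np.array([2.0, 1.0, 0.1])    # hypothetical logits
c = 0                            # true class index
y = np.eye(len(z))[c]            # one-hot label
eps = 1e-6

numeric = np.array([
    (loss(z + eps * np.eye(len(z))[k], c) - loss(z, c)) / eps
    for k in range(len(z))
])
analytic = softmax(z) - y        # the closed-form gradient derived above
print(numeric)
print(analytic)                  # the two vectors should agree closely
```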

The Key Takeaway

The gradient of the cross-entropy loss with respect to the logits is simply the difference between the predicted probability and its label. It shrinks toward zero only as the prediction approaches the true label; whenever the model is wrong, the gradient stays significant, so learning does not stall.
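As a concrete (hypothetical) illustration: suppose the true class is the second of three and the model confidently predicts the first.

$$\hat{\mathbf{y}} = (0.98,\ 0.01,\ 0.01), \quad \mathbf{y} = (0,\ 1,\ 0) \quad\Rightarrow\quad \frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y} = (0.98,\ -0.99,\ 0.01)$$

The entries for the wrongly favored class and the true class are close to $\pm 1$, so the update signal stays strong even though the prediction is highly confident.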
