Softmax + Cross-Entropy Loss = Significant Gradient
Lutao Dai

This post shows that combining softmax with cross-entropy loss yields significant gradients: the gradient with respect to the logits is simply the difference between the predicted probabilities and the true labels, which aids effective learning.



Cross-Entropy Loss

The cross-entropy loss for a single example, given the true label $\mathbf{y}$ (which is typically one-hot encoded), is:

$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

where $K$ is the number of classes; $c$ is the index of the true class; $y_k$, assuming one-hot encoding, equals 1 if $k = c$ and 0 otherwise; $\hat{y}_k$ is the model's output, the predicted probability for class $k$.
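As a quick numerical illustration (my own addition, not part of the original post), here is a minimal NumPy sketch that evaluates this loss for a hypothetical 3-class prediction; the names `probs` and `label` are placeholders.

```python
import numpy as np

# Hypothetical predicted probabilities for K = 3 classes (already softmax-ed)
probs = np.array([0.7, 0.2, 0.1])

# One-hot encoded true label: the true class is c = 0
label = np.array([1.0, 0.0, 0.0])

# Cross-entropy: only the true-class term survives because the label is one-hot
loss = -np.sum(label * np.log(probs))
print(loss)  # ~0.357, i.e. -log(0.7)
```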

Softmax

Note that since the true label is one-hot encoded, the summation over classes disappears and only the term with $k = c$ is left in the calculation:

$$L = -\log \hat{y}_c$$

By the definition of softmax:

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z_k$ is the logit for class $k$.
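For concreteness, here is a minimal NumPy sketch of softmax (my own addition, not the author's code). It subtracts the maximum logit before exponentiating, a standard trick that avoids overflow without changing the result.

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits z into probabilities."""
    # Subtracting max(z) prevents overflow in exp() and leaves the output unchanged
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical logits
print(softmax(logits))               # ~[0.659, 0.242, 0.099], sums to 1
```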

Calculate $\frac{\partial \hat{y}_c}{\partial z_k}$

The term $e^{z_c}$ is special since it appears in both the numerator and denominator. Therefore, $\frac{\partial \hat{y}_c}{\partial z_k}$ yields different results when $k = c$ and $k \neq c$.

For $k = c$:

$$\frac{\partial \hat{y}_c}{\partial z_c} = \frac{e^{z_c}\sum_{j} e^{z_j} - e^{z_c} \cdot e^{z_c}}{\left(\sum_{j} e^{z_j}\right)^2} = \hat{y}_c\left(1 - \hat{y}_c\right)$$

For $k \neq c$:

$$\frac{\partial \hat{y}_c}{\partial z_k} = \frac{0 \cdot \sum_{j} e^{z_j} - e^{z_c} \cdot e^{z_k}}{\left(\sum_{j} e^{z_j}\right)^2} = -\hat{y}_c\,\hat{y}_k$$

The derivation applies the quotient rule with no additional tricks.
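To sanity-check the two cases, here is a small finite-difference test (my own sketch, not part of the original derivation); the logits and the true-class index are hypothetical.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])    # hypothetical logits
c = 0                            # index of the true class
y_hat = softmax(z)
eps = 1e-6

for k in range(len(z)):
    # Finite-difference approximation of d y_hat_c / d z_k
    z_plus = z.copy()
    z_plus[k] += eps
    numeric = (softmax(z_plus)[c] - y_hat[c]) / eps
    # Analytic result: y_c * (1 - y_c) if k == c, else -y_c * y_k
    analytic = y_hat[c] * (1 - y_hat[c]) if k == c else -y_hat[c] * y_hat[k]
    print(k, numeric, analytic)  # the two columns should match closely
```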

Calculate $\frac{\partial L}{\partial z_k}$

Applying the chain rule:

$$\frac{\partial L}{\partial z_k} = \frac{\partial L}{\partial \hat{y}_c} \cdot \frac{\partial \hat{y}_c}{\partial z_k} = -\frac{1}{\hat{y}_c} \cdot \frac{\partial \hat{y}_c}{\partial z_k}$$

For $k = c$:

$$\frac{\partial L}{\partial z_c} = -\frac{1}{\hat{y}_c} \cdot \hat{y}_c\left(1 - \hat{y}_c\right) = \hat{y}_c - 1$$

For $k \neq c$:

$$\frac{\partial L}{\partial z_k} = -\frac{1}{\hat{y}_c} \cdot \left(-\hat{y}_c\,\hat{y}_k\right) = \hat{y}_k$$

Since $y_c = 1$ and $y_k = 0$ for $k \neq c$, both cases can be written as $\hat{y}_k - y_k$. To put them into vectorized form:

$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$$
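The vectorized result can also be checked numerically. The sketch below (my own, with hypothetical values) compares a finite-difference gradient of the loss against $\hat{\mathbf{y}} - \mathbf{y}$.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, c):
    # Cross-entropy with a one-hot label reduces to -log of the true-class probability
    return -np.log(softmax(z)[c])

z = np.array([2.0, 1.0, 0.1])    # hypothetical logits
c = 0                            # true class index
y = np.eye(len(z))[c]            # one-hot label
eps = 1e-6

numeric = np.array([
    (loss(z + eps * np.eye(len(z))[k], c) - loss(z, c)) / eps
    for k in range(len(z))
])
analytic = softmax(z) - y        # the closed-form gradient derived above
print(numeric)
print(analytic)                  # the two vectors should agree closely
```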

The Key Takeaway

The gradient of the cross-entropy loss with respect to the logits is simply the difference between the predicted probability and its label. It shrinks toward zero only as the prediction approaches the true label; whenever the model is wrong, the gradient stays significant, so learning does not stall.
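As a concrete (hypothetical) illustration: suppose the true class is the second of three and the model confidently predicts the first.

$$\hat{\mathbf{y}} = (0.98,\ 0.01,\ 0.01), \quad \mathbf{y} = (0,\ 1,\ 0) \quad\Rightarrow\quad \frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y} = (0.98,\ -0.99,\ 0.01)$$

The entries for the wrongly favored class and the true class are close to $\pm 1$, so the update signal stays strong even though the prediction is highly confident.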
