One of the concepts you meet early on in deep learning is softmax. Softmax is a "soft approximation" of argmax. Argmax tells you which item in your data is the largest, and its one-hot notation spits out a vector with a "1" for the largest item and a "0" everywhere else. The softmax function maps real numbers onto the range (0, 1) and ensures that they add up to 1, which allows a probabilistic interpretation. A lot depends on the β you pick.
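To make the contrast concrete, here is a minimal sketch in plain Python. The function names and the sample scores are made up for illustration, and this version uses the plain exponential with no β.

```python
import math

def one_hot_argmax(z):
    """Return a one-hot list with a 1 at the position of the largest value."""
    best = max(range(len(z)), key=lambda i: z[i])
    return [1 if i == best else 0 for i in range(len(z))]

def softmax(z):
    """Map real numbers to values in (0, 1) that add up to 1."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]       # hypothetical scores
hard = one_hot_argmax(z)  # all-or-nothing answer
soft = softmax(z)         # graded answer that keeps the same ordering
```

The largest score still gets the largest share under softmax, but the other items keep a non-zero share instead of being zeroed out.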

For instance, you might select a β and compare the softmax output with argmax(Zi) in one-hot notation.

[Interactive tables: Zi values alongside argmax(Zi) in one-hot notation, with the sum of the exponents and the max reported for the selected β. In one setting the sum of the exponents is a number more than 100 digits long.]

Changing the Zi values, or β, changes the spread. A larger β accentuates small differences and makes you overconfident.
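The effect of β can be sketched in a few lines. This assumes β is a multiplier applied to each Zi before exponentiation, which is one common convention and may not be the exact definition used here. The function name and sample values are hypothetical. Subtracting the largest scaled value before calling `exp` is an addition of mine: it leaves the result unchanged but keeps the sum of the exponents from overflowing.

```python
import math

def softmax_beta(z, beta=1.0):
    """Softmax of beta * z (assumed convention), computed stably."""
    scaled = [beta * v for v in z]
    m = max(scaled)
    # After the shift the largest term is exp(0) = 1, so nothing overflows.
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 1.1, 1.2]  # hypothetical scores that are close together
for beta in (0.1, 1.0, 10.0, 100.0):
    probs = softmax_beta(z, beta)
    print(beta, [round(p, 4) for p in probs])
```

With a small β the three outputs stay close to uniform. With a large β nearly all of the mass moves to the largest score, so the output looks more and more like the one-hot argmax.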

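The softmax output feeds into the loss calculation, where the loss is defined as the negative logarithm of the softmax value for the correct answer. To understand the spread of loss, see the chart below:

[Chart: loss, the negative logarithm of softmax, plotted against softmax.]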

As you can see above, the loss tends to infinity when the softmax value for the correct answer tends to zero. This is basically a big ding for being absolutely sure about the wrong answer. Conversely, if you are absolutely sure about the right answer, the softmax value is 1 and the loss is -log(1) = 0. Libraries like PyTorch give you easy access to softmax, but it is just as important to understand the math and the intuition behind this approach.
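The behaviour of the loss can be checked with a small sketch in plain Python. The function names, scores, and target indices are made up for illustration, and the loss is taken to be the negative log of the softmax value at the correct index.

```python
import math

def softmax(z):
    """Stable softmax: shift by the max before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(z, target):
    """Negative log of the softmax value assigned to the target index."""
    return -math.log(softmax(z)[target])

scores = [10.0, 0.0, 0.0]                     # very sure about item 0
confident_right = nll_loss(scores, target=0)  # item 0 is the right answer
confident_wrong = nll_loss(scores, target=1)  # item 1 is the right answer
unsure = nll_loss([0.0, 0.0, 0.0], target=0)  # uniform guess over 3 items
```

Being sure and right costs almost nothing, an even three-way guess costs log(3), and being sure and wrong costs far more. In PyTorch, `torch.nn.CrossEntropyLoss` combines the log-softmax and the negative log-likelihood in one step and expects raw scores rather than softmax outputs.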