Backpropagation is different from feedforward: it is dynamic programming with optimisation inside, while feedforward only calculates the next layer from the previous layer, without optimisation.
The Sample Network
The same network is used in describing this backpropagation process.
(Diagram of the sample network)
Backprop Start
The final loss value is e = fe(u1,u2,y1,y2), where fe is the loss function and te is the derivative of the loss function.
Gradient of Loss Function
The loss gradient function is:
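As a hedged illustration (the exact loss function is an assumption here, not stated in the text), take a squared-error loss e = ((u1-y1)^2 + (u2-y2)^2)/2. Its derivative te with respect to each output is then de/du1 = u1-y1 and de/du2 = u2-y2, which is consistent with the u-y delta used later in the weight update section.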
Gradient of Loss Function wrt Each Weight
The activation function of each neuron is f and its derivative is t. When taking the gradient with respect to a particular weight, the variables of neurons other than the one that weight belongs to are removed from the calculation; what gets removed are unrelated variables or constants whose derivative is zero. The 4 neurons are called N1, N2, N3, N4.
The weights we1 and we2 at the loss node, on the connections coming in from the output nodes, are always 1.
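As a small hedged example of that removal (assuming, for illustration, that output neuron N3 computes d3 = w5*h1 + w6*h2 + b3 from the hidden outputs h1 and h2): when differentiating d3 with respect to w5, the terms w6*h2 and b3 do not depend on w5, so their derivative is zero and they drop out, leaving d(d3)/d(w5) = h1.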
Gradient for Weight w5
Gradient for Weight w6
(Fix: This w6 goes with u1-y1, not u2-y2)
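Written out with the chain rule, under the hedged assumptions above (w5 and w6 both feed output neuron N3, which produces u1 = f(d3), and the loss is squared error): de/dw5 = de/du1 * du1/dd3 * dd3/dw5 = (u1-y1) * t(d3) * h1. The formula for w6 is identical with h2 in place of h1, which is why it also carries the u1-y1 delta and not u2-y2.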
Backprop Formula
The backprop algorithm goes thru' a dynamic programming process with a dynamic programming formula.
Start from the gradient calculation above and add the weight at the loss node for the connection from N3: as in the auto-differentiation concept, every node in the graph should behave the same way and always have weights on its incoming edges. The value of this kind of weight is always the constant 1, because multiplying thru' by 1 doesn't change anything.
The above formula for the gradient of the mentioned weight can then be re-written as:
After doing further manual derivations to see the similarity between the gradient formulas of different weights, the left part of the right-hand side turns out to form the dynamic programming intermediate value v:
The v value at the loss node is different for each output neuron. Only the loss node has multiple v values; every other neuron has only 1 v value:
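Continuing the hedged example (we1 is the loss-node weight on the connection from N3, and N4 is assumed to produce u2): de/dw5 = [we1 * (u1-y1) * t(d3)] * h1 with we1 = 1, and the bracketed part is exactly the intermediate value v of N3, so de/dw5 = v3 * h1 and de/dw6 = v3 * h2. At the loss node itself, the v value on the N3 side is u1-y1 and the v value on the N4 side is u2-y2, which is why only the loss node carries more than one v value.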
Further manual derivations for the other weights lead to the dynamic programming formula for backprop below. The feed direction is from left to right; the backpropagation process runs back from right to left.
To get the gradient for a weight, multiply the v of the neuron containing that weight by the input of that weight (x or h depending on the layer, all called x in the gw formula below):
A bias always has the constant 1 as its input, thus the gradient for a bias is just:
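Putting the pieces together, below is a minimal sketch of the whole dynamic programming pass in the article's notation (d, h, u, v), with gw = v * x for each weight and gb = v for each bias. The 2-2-2 layout, the sigmoid activation f, the squared-error loss and all numbers are assumptions for illustration only; the text does not specify them.

    import numpy as np

    def f(d):
        # activation of each neuron (a sigmoid is assumed here)
        return 1.0 / (1.0 + np.exp(-d))

    def t(d):
        # derivative of the assumed activation
        s = f(d)
        return s * (1.0 - s)

    # Hedged sample network: 2 inputs -> 2 hidden neurons (N1, N2) -> 2 output neurons (N3, N4)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input  -> hidden weights and biases
    W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden -> output weights and biases

    x = np.array([0.05, 0.10])      # inputs (made-up values)
    y = np.array([0.01, 0.99])      # true values y1, y2 (made-up values)

    # Feedforward, left to right: only d = sum(xw) + b and f(d), no optimisation
    d1 = W1 @ x + b1
    h = f(d1)                       # hidden outputs h1, h2
    d2 = W2 @ h + b2
    u = f(d2)                       # outputs u1, u2
    e = 0.5 * np.sum((u - y) ** 2)  # assumed squared-error loss

    # Backprop, right to left: v = sum(vw) * t(d)
    v2 = (u - y) * t(d2)            # v of N3, N4; the loss-node v is u - y and we1 = we2 = 1
    v1 = (W2.T @ v2) * t(d1)        # v of N1, N2

    # Gradients: gw = v * x (the input of that weight), gb = v (bias input is the constant 1)
    gW2 = np.outer(v2, h)
    gW1 = np.outer(v1, x)
    gb2, gb1 = v2, v1

    print("loss:", e)
    print("gW2:", gW2, "gb2:", gb2)
    print("gW1:", gW1, "gb1:", gb1)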
Update Weights and Biases
Use the gradients calculated above to update the weights and biases; in the formulas below, g is the gradient of the weight or bias being updated and r is the learning rate.
w = w - r*g
b = b - r*g
Due to the convention of using a minus sign in the weight update formula, the delta (the subtraction between u and y) must be u-y and NOT y-u: u is expected to overshoot (be higher than) the true value y, so during the weight update the correction should be subtracted.
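As an illustrative example with made-up numbers: if u1 = 0.8 overshoots the true value y1 = 0.5, the delta u1-y1 = 0.3 is positive; with a positive t(d) and a positive input x, the gradient g = v*x is also positive, so w = w - r*g decreases the weight and pulls u1 back down towards y1. Using y-u instead would flip the sign, and the same minus-sign update would push u1 further away.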
Feedforward & Backprop Similarities
Dynamic programming
Feedforward is the preparation part (no optimisation) of the backprop dynamic programming (which has the optimisation).
Dynamic programming formulas
The two formulas are very similar: both have a sum of values times weights. The bias has no use during backprop, as it is removed for being unrelated to the weight whose gradient is being calculated.
Dot-product: d = sum(xw) + b
Backprop: v = sum(vw) * t(d)
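To make the similarity concrete, here is a small self-contained sketch of the two recurrences for a single neuron; the layer sizes, weights and values are made up for illustration, and a sigmoid activation is assumed for t.

    import math

    def t(d):
        # derivative of an assumed sigmoid activation
        s = 1.0 / (1.0 + math.exp(-d))
        return s * (1.0 - s)

    x = [0.2, 0.4, 0.6]        # inputs feeding this neuron
    w_in = [0.1, -0.3, 0.5]    # weights into this neuron
    b = 0.05                   # its bias
    v_next = [0.7, -0.2]       # v values of the neurons it feeds
    w_out = [0.8, 0.3]         # weights from this neuron into those neurons

    # Feedforward dot-product: d = sum(xw) + b
    d = sum(xi * wi for xi, wi in zip(x, w_in)) + b
    # Backprop recurrence:     v = sum(vw) * t(d)  (the bias plays no part here)
    v = sum(vk * wk for vk, wk in zip(v_next, w_out)) * t(d)

    print("d =", d, " v =", v)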