Backpropagation is different from feedforward: it is dynamic programming with optimisation inside, while feedforward only calculates the next layer from the previous layer, without optimisation.
The Sample Network
The same network is used in describing this backpropagation process.
(Diagram of the sample network)
Backprop Start
The final loss value is e = fe(u1,u2,y1,y2), where fe is the loss function and te is the derivative of the loss function.
Gradient of Loss Function
The loss gradient function is:
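As a hedged illustration (the exact loss function is an assumption here, not stated in the text), take a squared-error loss e = ((u1-y1)^2 + (u2-y2)^2)/2. Its derivative te with respect to each output is then de/du1 = u1-y1 and de/du2 = u2-y2, which is consistent with the u-y delta used later in the weight update section.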
Gradient of Loss Function wrt Each Weight
The activation function of each neuron is f and its derivative is t. When taking the gradient with respect to a particular weight, the variables of neurons other than the one that weight belongs to are removed from the calculation; what gets removed are unrelated variables or constants whose derivative is zero. The 4 neurons are called N1, N2, N3, N4.
The weights we1 and we2 at the loss node, on the connections coming in from the output nodes, are always 1.
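As a small hedged example of that removal (assuming, for illustration, that output neuron N3 computes d3 = w5*h1 + w6*h2 + b3 from the hidden outputs h1 and h2): when differentiating d3 with respect to w5, the terms w6*h2 and b3 do not depend on w5, so their derivative is zero and they drop out, leaving d(d3)/d(w5) = h1.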
Gradient for Weight w5
Gradient for Weight w6
(Fix: This w6 goes with u1-y1, not u2-y2)
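Written out with the chain rule, under the hedged assumptions above (w5 and w6 both feed output neuron N3, which produces u1 = f(d3), and the loss is squared error): de/dw5 = de/du1 * du1/dd3 * dd3/dw5 = (u1-y1) * t(d3) * h1. The formula for w6 is identical with h2 in place of h1, which is why it also carries the u1-y1 delta and not u2-y2.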
Backprop Formula
The backprop algorithm goes thru' a dynamic programming process with a dynamic programming formula.
Start from the gradient calculation above and add the weight at the loss node for the connection from N3: as in the auto-differentiation concept, every node in the graph should behave the same way and always have weights on its incoming edges. The value of this kind of weight is always the constant 1, because multiplying thru' by 1 doesn't change anything.
The above formula for the gradient of the mentioned weight can then be re-written as:
After doing further manual derivations to see the similarity between the gradient formulas of different weights, the left part of the right-hand side turns out to form the dynamic programming intermediate value v:
The v value at the loss node is different for each output neuron. Only the loss node has multiple v values; every other neuron has only 1 v value:
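Continuing the hedged example (we1 is the loss-node weight on the connection from N3, and N4 is assumed to produce u2): de/dw5 = [we1 * (u1-y1) * t(d3)] * h1 with we1 = 1, and the bracketed part is exactly the intermediate value v of N3, so de/dw5 = v3 * h1 and de/dw6 = v3 * h2. At the loss node itself, the v value on the N3 side is u1-y1 and the v value on the N4 side is u2-y2, which is why only the loss node carries more than one v value.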
Further manual derivations for the other weights lead to the dynamic programming formula for backprop below. The feed direction is from left to right; the backpropagation process runs back from right to left.
To get the gradient for a weight, multiply the v of the neuron containing that weight by the input of that weight (x or h depending on the layer, all called x in the gw formula below):
A bias always has the constant 1 as its input, thus the gradient for a bias is just:
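Putting the pieces together, below is a minimal sketch of the whole dynamic programming pass in the article's notation (d, h, u, v), with gw = v * x for each weight and gb = v for each bias. The 2-2-2 layout, the sigmoid activation f, the squared-error loss and all numbers are assumptions for illustration only; the text does not specify them.

    import numpy as np

    def f(d):
        # activation of each neuron (a sigmoid is assumed here)
        return 1.0 / (1.0 + np.exp(-d))

    def t(d):
        # derivative of the assumed activation
        s = f(d)
        return s * (1.0 - s)

    # Hedged sample network: 2 inputs -> 2 hidden neurons (N1, N2) -> 2 output neurons (N3, N4)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input  -> hidden weights and biases
    W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden -> output weights and biases

    x = np.array([0.05, 0.10])      # inputs (made-up values)
    y = np.array([0.01, 0.99])      # true values y1, y2 (made-up values)

    # Feedforward, left to right: only d = sum(xw) + b and f(d), no optimisation
    d1 = W1 @ x + b1
    h = f(d1)                       # hidden outputs h1, h2
    d2 = W2 @ h + b2
    u = f(d2)                       # outputs u1, u2
    e = 0.5 * np.sum((u - y) ** 2)  # assumed squared-error loss

    # Backprop, right to left: v = sum(vw) * t(d)
    v2 = (u - y) * t(d2)            # v of N3, N4; the loss-node v is u - y and we1 = we2 = 1
    v1 = (W2.T @ v2) * t(d1)        # v of N1, N2

    # Gradients: gw = v * x (the input of that weight), gb = v (bias input is the constant 1)
    gW2 = np.outer(v2, h)
    gW1 = np.outer(v1, x)
    gb2, gb1 = v2, v1

    print("loss:", e)
    print("gW2:", gW2, "gb2:", gb2)
    print("gW1:", gW1, "gb1:", gb1)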
Update Weights and Biases
Use the gradients calculated above to update the weights and biases; in the formulas below, g is the gradient of the weight or bias being updated and r is the learning rate.
w = w - r*g
b = b - r*g
Due to the convention of using a minus sign in the weight update formula, the delta (the subtraction between u and y) must be u-y and NOT y-u: u is expected to overshoot (be higher than) the true value y, so during the weight update the correction should be subtracted.
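As an illustrative example with made-up numbers: if u1 = 0.8 overshoots the true value y1 = 0.5, the delta u1-y1 = 0.3 is positive; with a positive t(d) and a positive input x, the gradient g = v*x is also positive, so w = w - r*g decreases the weight and pulls u1 back down towards y1. Using y-u instead would flip the sign, and the same minus-sign update would push u1 further away.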
Feedforward & Backprop Similarities
Dynamic programming
Feedforward is the preparation part (no optimisation) of the backprop dynamic programming (which has the optimisation).
Dynamic programming formulas
The two formulas are very similar: both have a sum of values times weights. The bias has no use during backprop, as it is removed for being unrelated to the weight whose gradient is being calculated.
Dot-product: d = sum(xw) + b
Backprop: v = sum(vw) * t(d)
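To make the similarity concrete, here is a small self-contained sketch of the two recurrences for a single neuron; the layer sizes, weights and values are made up for illustration, and a sigmoid activation is assumed for t.

    import math

    def t(d):
        # derivative of an assumed sigmoid activation
        s = 1.0 / (1.0 + math.exp(-d))
        return s * (1.0 - s)

    x = [0.2, 0.4, 0.6]        # inputs feeding this neuron
    w_in = [0.1, -0.3, 0.5]    # weights into this neuron
    b = 0.05                   # its bias
    v_next = [0.7, -0.2]       # v values of the neurons it feeds
    w_out = [0.8, 0.3]         # weights from this neuron into those neurons

    # Feedforward dot-product: d = sum(xw) + b
    d = sum(xi * wi for xi, wi in zip(x, w_in)) + b
    # Backprop recurrence:     v = sum(vw) * t(d)  (the bias plays no part here)
    v = sum(vk * wk for vk, wk in zip(v_next, w_out)) * t(d)

    print("d =", d, " v =", v)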