Concise and Practical AI/ML
Reinforcement Learning

Bellman Equation

The Bellman equation is the core of reinforcement learning: it is what the model is optimised against. It is as fundamental here as the weight update formula is in supervised learning.

Bellman Equation

The equation is (it is a chain equation: V(Snext) in turn contains Snextnext, and so on):
V(S) = max over A of Q(S,A)
Q(S,A) = R(S,A) + D * V(Snext)
where V is the value of a state, S is a state, Q is the q-function, A is an action, and D is the discount.
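As a minimal sketch of this chain (the tiny states, actions, rewards, and transitions below are made up for illustration, not from the text), the equation can be written directly in Python:

# A tiny made-up deterministic problem: each (state, action) pair has a reward
# and a next state. All numbers here are illustrative only.
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}
NEXT = {("s0", "left"): "s1", ("s0", "right"): "s1",
        ("s1", "left"): "s1", ("s1", "right"): "s1"}
ACTIONS = ["left", "right"]
D = 0.9  # discount

def Q(S, A, depth=5):
    # Q(S,A) = R(S,A) + D * V(Snext); the chain is cut off at a fixed depth
    if depth == 0:
        return R[(S, A)]
    return R[(S, A)] + D * V(NEXT[(S, A)], depth - 1)

def V(S, depth=5):
    # V(S) = max over A of Q(S,A)
    return max(Q(S, A, depth) for A in ACTIONS)

print(V("s0"))  # value of state s0 after expanding the chain a few steps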

Q-learning

Q-learning is reinforcement learning based on the q-function. It can run on a Q-table or a Q-network (a deep Q-network is the same idea). The process is driven by the optimisation assignment below, which is the core of Q-learning just as the weight update assignment is the core of supervised learning.

Important Notes ⚠️

The Q in this book has two different presentations: when it goes with parentheses Q(...), it is a q-function or q-network call; when it goes with square brackets Q[...], it is a q-table cell. R(S,A) is the reward resulting from taking action A in state S.
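As a minimal sketch of the two presentations (the array sizes and the toy linear network below are made up for illustration), a q-table cell is a stored number looked up with brackets, while a q-network call computes the value with a function:

import numpy as np

n_states, n_actions = 4, 2

# Q[...] presentation: a q-table cell, indexed by state and action
Q_table = np.zeros((n_states, n_actions))
print(Q_table[0, 1])              # Q[S,A] is a stored number

# Q(...) presentation: a q-network (here just a toy linear model) called as a function
W = np.random.randn(n_states, n_actions) * 0.01
def Q_network(S, A):
    one_hot = np.eye(n_states)[S]     # encode the state
    return one_hot @ W[:, A]          # Q(S,A) is computed, not stored per cell

print(Q_network(0, 1))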

The Assignment in Wikipedia

The assignment formula is:
Q[S,A] = (1 - B)*Q[S,A] + B*(R(S,A) + D*max over Anext of Q[Snext,Anext])
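A minimal sketch of this assignment applied to one observed transition (the table size, learning rate B, discount D, and reward value are made up for illustration):

import numpy as np

Q = np.zeros((4, 2))                  # q-table: 4 states, 2 actions
B, D = 0.1, 0.9                       # learning rate and discount
S, A, Snext, reward = 0, 1, 2, 1.0    # one observed transition (made up)

# Q[S,A] = (1 - B)*Q[S,A] + B*(R(S,A) + D*max over Anext of Q[Snext,Anext])
Q[S, A] = (1 - B) * Q[S, A] + B * (reward + D * np.max(Q[Snext]))
print(Q[S, A])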

The Assignment as Self-update

Expanding (1 - B)*Q[S,A] and folding the leftover -B*Q[S,A] into the parentheses on the right-hand side turns the assignment into a self-update:
Q[S,A] = Q[S,A] + B*(R(S,A) + D*max over Anext of Q[Snext,Anext] - Q[S,A])
It is easy to remember that this self-update is += rather than -= because reinforcement learning is about maximising rewards while supervised learning is about minimising loss. B is the learning rate, playing the same role as r in supervised learning.
The part which takes the place of the gradient in supervised learning is:
R(S,A) + D*max over Anext of Q[Snext,Anext] - Q[S,A]
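As a quick numerical check (all values are made up), the Wikipedia form and the self-update form give the same new q-value:

B, D = 0.1, 0.9
q_old, reward, max_next = 2.0, 1.0, 3.0   # made-up current cell, reward, best future value

target = reward + D * max_next
wikipedia_form = (1 - B) * q_old + B * target
self_update_form = q_old + B * (target - q_old)
print(wikipedia_form, self_update_form)    # both print the same number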

Temporal Difference

The part of the self-update assignment above that plays the role of the supervised learning gradient is called the Temporal Difference:
R(S,A) + D*max over Anext of Q[Snext,Anext] - Q[S,A]
where the left part of the minus sign is:
Reward of the current action + max possible future rewards (discounted by D)
and the right part of the minus sign is:
Current value of the q-table cell or q-network output.
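A minimal sketch that computes the temporal difference for one transition and then applies the self-update (the table size and transition values are made up for illustration):

import numpy as np

Q = np.zeros((4, 2))
B, D = 0.1, 0.9
S, A, Snext, reward = 0, 1, 2, 1.0    # made-up transition

# left part of the minus: reward of the current action + max possible future rewards
target = reward + D * np.max(Q[Snext])
# right part of the minus: current value of the q-table cell
current = Q[S, A]
M = target - current          # temporal difference
Q[S, A] = Q[S, A] + B * M     # self-update: Q += B*M
print(M, Q[S, A])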

Final Self-update Assignment Formula

Recall the supervised learning weight update, where w is the weight, r is the learning rate, and g is the gradient:
w = w - r*g
The reinforcement learning q-value update, where Q is the q-value, B is the learning rate, and M is the temporal difference:
Q = Q + B*M
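Putting the self-update to work, here is a minimal Q-learning loop on a made-up 1-D corridor environment (the environment, the reward of 1 at the right end, and all hyperparameters are illustrative assumptions, not from the text):

import numpy as np

n_states, n_actions = 5, 2        # corridor of 5 cells; actions: 0 = left, 1 = right
GOAL = 4                          # reaching the right end gives reward 1
B, D, EPS = 0.5, 0.9, 0.2         # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))

for episode in range(200):
    S = 0
    while S != GOAL:
        # epsilon-greedy choice between exploring and taking the best known action
        A = rng.integers(n_actions) if rng.random() < EPS else int(np.argmax(Q[S]))
        Snext = max(0, S - 1) if A == 0 else min(n_states - 1, S + 1)
        reward = 1.0 if Snext == GOAL else 0.0
        M = reward + D * np.max(Q[Snext]) - Q[S, A]   # temporal difference
        Q[S, A] = Q[S, A] + B * M                     # Q = Q + B*M
        S = Snext

print(np.argmax(Q, axis=1))   # learned policy: non-goal states should prefer "right" (1)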

Choices for Q-learning

Q-learning can run on a Q-table or a Q-network (aka Deep Q-network).
A Q-table is the classic solution, but it needs a huge amount of memory for large problems, which makes it impractical in real use cases.
A Q-network doesn't need that much RAM: it learns through combinations of neurons and layers, and because the number of state-action mappings those combinations can represent is exponentially large, a Q-network stays effective even for large problems.
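A minimal sketch of the two choices (the feature count, layer sizes, and the tiny network below are illustrative assumptions): the table's memory grows with the number of distinct states, while the network's memory is fixed by its weights:

import numpy as np

n_actions = 4
n_features = 32        # a state described by 32 features (illustrative)
n_hidden = 64

# Q-table choice: one cell per (state, action); with 32 binary features there would be
# 2**32 distinct states, so a full table is only feasible for tiny problems.
tiny_n_states = 10
Q_table = np.zeros((tiny_n_states, n_actions))

# Q-network choice: a small two-layer network; memory is fixed by the weights,
# not by the number of distinct states it can respond to.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
W2 = rng.standard_normal((n_hidden, n_actions)) * 0.1

def Q_network(state_features):
    hidden = np.maximum(0.0, state_features @ W1)   # ReLU hidden layer
    return hidden @ W2                              # one q-value per action

state = rng.standard_normal(n_features)
print(Q_network(state))                                  # q-values for this state
print(W1.size + W2.size, "weights vs", 2**32 * n_actions, "table cells")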
