Code Files
The Network
A Q-network returns a q-value, just like a q-function does. It returns the q-value, not the action to take. Another name for a Q-network is DQN (Deep Q-Network), but it is still just a Q-network; "deep" simply means it has multiple layers.
Q-network
A Q-network is a type of q-function, just as a q-table is a type of q-function. The Q-network is not the learning magic of RL by itself; it is just a replacement for the q-table, with much higher storage efficiency: a tiny model that still covers all cases.
The learning process is still explore-and-exploit plus Bellman's q-target formula. A larger q-network is really for storing a larger space of possible inputs, e.g. a 10x10 black-and-white input has 2^100 possible environment states. A larger q-network does not mean better learning or more rewards.
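As a concrete illustration, here is a minimal q-network sketch in Python with tf.keras. The framework, layer sizes, and the choice to feed the state concatenated with a one-hot action are assumptions made for the sketch, not requirements from this text:

```python
# A q-network playing the role of q(state, action) -> q-value.
import numpy as np
import tensorflow as tf

STATE_SIZE = 100   # e.g. a 10x10 black-and-white grid, flattened
N_ACTIONS = 4      # assumed number of actions

q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(STATE_SIZE + N_ACTIONS,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),                         # a single q-value
])
q_network.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

def q_value(state, action):
    """Query the network where a q-table would be indexed."""
    x = np.concatenate([state, np.eye(N_ACTIONS)[action]])[np.newaxis, :]
    return float(q_network.predict(x, verbose=0)[0, 0])
```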
Q-learning on Q-network
Based on the same q-value update formula as for the q-table. For each update:
• Feed the current state (and action) to the q-network to get the current q-value.
• Train the q-network toward the new q-value, as sketched below.
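A rough sketch of the surrounding explore/exploit loop, reusing `q_value` and `N_ACTIONS` from the snippet above. The epsilon-greedy exploration and the `env` object with `reset()`/`step()` are assumptions, not taken from this text; `update_q_network()` is sketched under "Q-network Update" below:

```python
# One episode: pick actions (explore/exploit), step the environment, update the q-network.
import random

def run_episode(env, epsilon=0.1):
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:                 # explore
            action = random.randrange(N_ACTIONS)
        else:                                         # exploit the current q-network
            action = max(range(N_ACTIONS), key=lambda a: q_value(state, a))
        nextState, reward, done = env.step(action)
        update_q_network(state, action, reward, nextState, done)
        state = nextState
```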
Q-network Init
Don't initialize the parameters (weights, biases) to all zeros, as was done for the q-table. All-zero weights make every hidden unit compute, and learn, exactly the same thing; use the framework's standard random initialization instead.
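For example (an assumption about the framework, not something this text prescribes), tf.keras Dense layers already default to a random glorot_uniform weight init:

```python
# Keras Dense defaults to kernel_initializer="glorot_uniform" (random), which is fine.
ok  = tf.keras.layers.Dense(64, activation="relu")
# Forcing all-zero weights makes every unit identical and stalls learning.
bad = tf.keras.layers.Dense(64, activation="relu", kernel_initializer="zeros")
```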
Q-network Update
The update formulas for the q-table are:
Formula 1 (temporal difference):
td = qTarget - qNow = (rewardNow + discount*qNextMax) - qNow
Formula 2 (q-update, the Bellman update):
q += rr * td
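As a quick reference, the same two formulas written as q-table code (a minimal sketch; `q` is assumed to be a 2-D array or dict of per-state action values, and the names mirror the formulas):

```python
# q-table version of the update, for comparison with the q-network version later
def q_table_update(q, state, action, rewardNow, nextState, discount=0.99, rr=0.1):
    qNow = q[state][action]
    qNextMax = max(q[nextState])                      # best q-value in the next state
    td = (rewardNow + discount * qNextMax) - qNow     # formula 1: temporal difference
    q[state][action] = qNow + rr * td                 # formula 2: Bellman update
```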
Intuitively, one may think the q-network replaces the q-table exactly in its role as the q-function, so the input and target q (expected value) for training the q-network would be:
Input (StatePair & Action) → Target (qNow + rr*td, i.e. the Bellman-updated q)
But it is not that way, because the q-network has an optimizer, and the optimizer already knows qNow indirectly (it is the network's own output). Supplying qTarget as the training label is enough, and the optimizer has a learning rate of its own:
qTarget = rewardNow + discount*qNextMax
rr = learningRateOfOptimizer
Applying the Bellman-updated q in place of qTarget would break the q-network, because both of these would then be applied:
• Bellman update (algorithmic optimization)
• Optimizer update (gradient-based)
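A sketch of the q-network update under these rules, reusing `q_network`, `q_value`, and `N_ACTIONS` from the earlier snippets: qTarget is used directly as the training label, with no extra Bellman step. The terminal-state (`done`) handling is a standard detail added here as an assumption, not something stated above.

```python
# Correct q-network update: one fit toward qTarget; the optimizer's gradient
# step plays the role that rr*td played for the q-table.
def update_q_network(state, action, rewardNow, nextState, done, discount=0.99):
    if done:
        qTarget = rewardNow                           # no next state to bootstrap from
    else:
        qNextMax = max(q_value(nextState, a) for a in range(N_ACTIONS))
        qTarget = rewardNow + discount * qNextMax
    x = np.concatenate([state, np.eye(N_ACTIONS)[action]])[np.newaxis, :]
    y = np.array([[qTarget]])
    q_network.fit(x, y, epochs=1, verbose=0)          # do NOT pre-apply q += rr*td here
```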
When to Train
Unlike the q-table, where the q-value is updated after every action, a q-network takes far less RAM, but getting an output from it is slow compared to the constant-time lookup of a q-value in a table, and fitting (training) it is extremely slow compared to setting a q-value in a q-table.
There are 2 options for when to train the q-network:
• Train after an action
• Train after a run through the whole episode
Training after every action, as with a q-table, usually shouldn't be the choice: it makes training very slow, unless there are enough hardware resources. Training after a whole single run through the episode is better and faster, as sketched below.
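A sketch of the second option, buffering the episode's transitions and doing one batched fit at the end (a batched variant of `update_q_network` above, reusing the earlier names; the buffering scheme itself is an assumption):

```python
# Option 2: collect (state, action, reward, nextState, done) during the episode,
# then call fit once on the whole batch instead of once per action.
def run_episode_then_train(env, epsilon=0.1, discount=0.99):
    transitions, state, done = [], env.reset(), False
    while not done:
        if random.random() < epsilon:                 # explore
            action = random.randrange(N_ACTIONS)
        else:                                         # exploit
            action = max(range(N_ACTIONS), key=lambda a: q_value(state, a))
        nextState, reward, done = env.step(action)
        transitions.append((state, action, reward, nextState, done))
        state = nextState

    xs, ys = [], []
    for s, a, r, s2, d in transitions:
        qTarget = r if d else r + discount * max(q_value(s2, a2) for a2 in range(N_ACTIONS))
        xs.append(np.concatenate([s, np.eye(N_ACTIONS)[a]]))
        ys.append([qTarget])
    q_network.fit(np.array(xs), np.array(ys), epochs=1, verbose=0)   # one fit per episode
```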