This book writes its reinforcement learning notation in all capital letters to avoid confusion with the notation used in core supervised learning; for the notation used in supervised learning, see . Reinforcement learning also uses the Neuralnet Alphabet when a component contains a neural network, e.g. the Q-network.
Environment
Environment (Env) is the set of data that represents the world the actor is in. The environment's data is called the environment state.
Actor
Actor (Act) is a virtual entity inside the environment. The actor has its own state properties, called the actor state.
State
State (S) is the data describing the form of the environment at a given moment of time, including the present, together with the data of the actor: S = Se + Sa.
Timepoint
Timepoint (T) is a point in time in the action sequence. A state is usually written with T to specify which timepoint the state belongs to.
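A minimal Python sketch of how state and timepoint fit together (the variable names and values here are illustrative, not from the book): the full state at a timepoint T is the environment state plus the actor state.

    # Illustrative only: the full state S is the environment state Se
    # plus the actor state Sa, and states are indexed by timepoint T.
    env_state = {"grid": [[0, 0], [0, 1]]}   # Se: data of the world
    actor_state = {"position": (0, 0)}       # Sa: data of the actor
    state = {**env_state, **actor_state}     # S = Se + Sa

    states = [state]                         # one entry per timepoint
    T = 0
    print(states[T])                         # the state S at timepoint T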
Action
Action (A) is a choice to do something in the environment according to the current state.
Reward
Reward (R) is the amount of value obtained by taking a certain action in a certain state.
Discount
Discount (D) is the rate at which future rewards are reduced: the discount weight gets smaller and smaller the further a step lies in the future. The nearer to the present, the larger the weight (like a shop discount that encourages buying quickly). D cannot be larger than or equal to 1, so the further in the future a predicted reward is, the smaller its impact.
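A small sketch of how the discount works (the reward values are assumptions for illustration): a reward K steps in the future is weighted by D to the power K, so later rewards count less.

    D = 0.9                          # discount, must be smaller than 1
    rewards = [1.0, 1.0, 1.0, 1.0]   # assumed rewards at future steps 0..3

    # Each future reward is weighted by D**K: the further the step,
    # the smaller its discounted contribution.
    for K, R in enumerate(rewards):
        print(K, round(D**K * R, 3))  # 1.0, 0.9, 0.81, 0.729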
Learning-rate
Learning rate (B) is the ratio that decides how much the parameters change in each update. The natural notation would be R, but that is already used for reward; the common notation is alpha (A), but that is already used for action.
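A minimal sketch of an update using B (the parameter and target values are made up): the parameter moves only a fraction B of the way towards the target on each update.

    B = 0.1          # learning rate
    param = 0.0      # current parameter, e.g. a Q-value
    target = 1.0     # value the update should move towards

    # Move a fraction B of the way towards the target.
    param = param + B * (target - param)
    print(param)     # 0.1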
Temporal Difference
Temporal Difference (M, for minus/subtract) is the difference between the current reward plus the discounted maximum future Q-value, and the current Q-value: M = R + D * (max future Q-value) - (current Q-value).
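A sketch of computing M with a small table of Q-values (all numbers are assumptions for illustration):

    D = 0.9                     # discount
    R = 1.0                     # reward just received
    Q_current = 0.5             # Q-value of the state-action just taken
    Q_next = [0.2, 0.7, 0.4]    # Q-values of the actions in the next state

    # Temporal difference: current reward plus discounted best future
    # Q-value, minus the current Q-value.
    M = R + D * max(Q_next) - Q_current
    print(round(M, 2))          # 1.13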
Value
Value (V) is the value of a certain state: the total reward that can be accumulated in the future starting from that state. Only future rewards need to be maximised; rewards accumulated in the past do not matter. The value of a state is what makes it possible to choose the best action.
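A sketch of the value of a state as the discounted sum of future rewards (the reward sequence is made up):

    D = 0.9
    future_rewards = [1.0, 0.0, 2.0]   # assumed rewards after this state

    # V = R0 + D*R1 + D**2*R2 + ...: only future rewards count,
    # rewards accumulated in the past play no role.
    V = sum(D**K * R for K, R in enumerate(future_rewards))
    print(V)                           # approximately 2.62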
Q-function
Q-function (Q), or Q-network when implemented as a neural network, also called the action-value function, returns the expected reward of taking action A while in state S (the combined state of environment and actor).
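A minimal tabular sketch (the states and actions are hypothetical): Q maps a state-action pair to an expected reward, and it is updated using the learning rate B and the temporal difference M.

    B, D = 0.1, 0.9                     # learning rate, discount
    Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
         ("s1", "left"): 0.0, ("s1", "right"): 0.0}

    # One Q-learning style update: Q(S, A) moves towards R + D * max Q(S', .)
    S, A, R, S_next = "s0", "right", 1.0, "s1"
    best_next = max(Q[(S_next, a)] for a in ("left", "right"))
    M = R + D * best_next - Q[(S, A)]   # temporal difference
    Q[(S, A)] = Q[(S, A)] + B * M       # scaled by the learning rate B
    print(Q[(S, A)])                    # 0.1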
Policy
Policy (P) is the guideline chosen to perform an action. A policy can be a function or a network; when it is a network, it is called a policy network.
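A sketch of one common way to turn a Q-table into a policy, epsilon-greedy action selection (a standard technique; the book may use a different one):

    import random

    def policy(Q, S, actions, epsilon=0.1):
        # With probability epsilon, explore a random action;
        # otherwise exploit the action with the highest Q-value in S.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(S, a)])

    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
    print(policy(Q, "s0", ["left", "right"]))  # usually "right"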
Episode
Episode (E) is a sequence of actions from time 0 to the last timepoint, from which the model learns.
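A sketch of one episode as a loop over timepoints (the step function here is a hypothetical stub standing in for a real environment):

    def step(S, A):
        # Hypothetical stub environment: next state, reward, done flag.
        return S + 1, 1.0, S + 1 >= 3

    S, done, T = 0, False, 0
    while not done:            # one episode: T = 0 .. last timepoint
        A = "forward"          # in practice chosen by the policy P
        S, R, done = step(S, A)
        T += 1
    print("episode finished after", T, "timepoints")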