This book writes its reinforcement learning notation in all capital letters to avoid confusion with the notation used in core supervised learning; for the notation used in supervised learning, see . Reinforcement learning also uses the Neuralnet Alphabet when a component contains a neural network, e.g. the Q-network.
Environment
Environment (Env) is the set of data that represents the world the actor is in. The environment's data is called the environment state.
Actor
Actor (Act) is a virtual entity inside the environment. The actor has its own state properties, called the actor state.
State
State (S) is the data describing the form of the environment at a given moment of time, including the present, together with the data of the actor: S = Se + Sa.
Timepoint
Timepoint (T) is a point in time in the action sequence. A state is usually written with T to specify which timepoint the state belongs to.
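A minimal Python sketch of how state and timepoint fit together (the variable names and values here are illustrative, not from the book): the full state at a timepoint T is the environment state plus the actor state.

    # Illustrative only: the full state S is the environment state Se
    # plus the actor state Sa, and states are indexed by timepoint T.
    env_state = {"grid": [[0, 0], [0, 1]]}   # Se: data of the world
    actor_state = {"position": (0, 0)}       # Sa: data of the actor
    state = {**env_state, **actor_state}     # S = Se + Sa

    states = [state]                         # one entry per timepoint
    T = 0
    print(states[T])                         # the state S at timepoint T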
Action
Action (A) is a choice to do something in the environment according to the current state.
Reward
Reward (R) is the amount of value obtained by taking a certain action in a certain state.
Discount
Discount (D) is the rate at which future rewards are reduced: the discount weight gets smaller and smaller the further a step lies in the future. The nearer to the present, the larger the weight (like a shop discount that encourages buying quickly). D cannot be larger than or equal to 1, so the further in the future a predicted reward is, the smaller its impact.
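A small sketch of how the discount works (the reward values are assumptions for illustration): a reward K steps in the future is weighted by D to the power K, so later rewards count less.

    D = 0.9                          # discount, must be smaller than 1
    rewards = [1.0, 1.0, 1.0, 1.0]   # assumed rewards at future steps 0..3

    # Each future reward is weighted by D**K: the further the step,
    # the smaller its discounted contribution.
    for K, R in enumerate(rewards):
        print(K, round(D**K * R, 3))  # 1.0, 0.9, 0.81, 0.729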
Learning-rate
Learning rate (B) is the ratio that decides how much the parameters change in each update. The natural notation would be R, but that is already used for reward; the common notation is alpha (A), but that is already used for action.
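A minimal sketch of an update using B (the parameter and target values are made up): the parameter moves only a fraction B of the way towards the target on each update.

    B = 0.1          # learning rate
    param = 0.0      # current parameter, e.g. a Q-value
    target = 1.0     # value the update should move towards

    # Move a fraction B of the way towards the target.
    param = param + B * (target - param)
    print(param)     # 0.1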
Temporal Difference
Temporal Difference (M, for minus/subtract) is the difference between the current reward plus the discounted maximum future Q-value, and the current Q-value: M = R + D * (max future Q-value) - (current Q-value).
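A sketch of computing M with a small table of Q-values (all numbers are assumptions for illustration):

    D = 0.9                     # discount
    R = 1.0                     # reward just received
    Q_current = 0.5             # Q-value of the state-action just taken
    Q_next = [0.2, 0.7, 0.4]    # Q-values of the actions in the next state

    # Temporal difference: current reward plus discounted best future
    # Q-value, minus the current Q-value.
    M = R + D * max(Q_next) - Q_current
    print(round(M, 2))          # 1.13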
Value
Value (V) is the value of a certain state: the total reward that can be accumulated in the future starting from that state. Only future rewards need to be maximised; rewards accumulated in the past do not matter. The value of a state is what makes it possible to choose the best action.
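A sketch of the value of a state as the discounted sum of future rewards (the reward sequence is made up):

    D = 0.9
    future_rewards = [1.0, 0.0, 2.0]   # assumed rewards after this state

    # V = R0 + D*R1 + D**2*R2 + ...: only future rewards count,
    # rewards accumulated in the past play no role.
    V = sum(D**K * R for K, R in enumerate(future_rewards))
    print(V)                           # approximately 2.62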
Q-function
Q-function (Q), or Q-network when implemented as a neural network, also called the action-value function, returns the expected reward of taking action A while in state S (the combined state of environment and actor).
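A minimal tabular sketch (the states and actions are hypothetical): Q maps a state-action pair to an expected reward, and it is updated using the learning rate B and the temporal difference M.

    B, D = 0.1, 0.9                     # learning rate, discount
    Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
         ("s1", "left"): 0.0, ("s1", "right"): 0.0}

    # One Q-learning style update: Q(S, A) moves towards R + D * max Q(S', .)
    S, A, R, S_next = "s0", "right", 1.0, "s1"
    best_next = max(Q[(S_next, a)] for a in ("left", "right"))
    M = R + D * best_next - Q[(S, A)]   # temporal difference
    Q[(S, A)] = Q[(S, A)] + B * M       # scaled by the learning rate B
    print(Q[(S, A)])                    # 0.1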
Policy
Policy (P) is the guideline chosen to perform an action. A policy can be a function or a network; when it is a network, it is called a policy network.
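A sketch of one common way to turn a Q-table into a policy, epsilon-greedy action selection (a standard technique; the book may use a different one):

    import random

    def policy(Q, S, actions, epsilon=0.1):
        # With probability epsilon, explore a random action;
        # otherwise exploit the action with the highest Q-value in S.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(S, a)])

    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
    print(policy(Q, "s0", ["left", "right"]))  # usually "right"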
Episode
Episode (E) is a sequence of actions from time 0 to the last timepoint, from which the model learns.
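A sketch of one episode as a loop over timepoints (the step function here is a hypothetical stub standing in for a real environment):

    def step(S, A):
        # Hypothetical stub environment: next state, reward, done flag.
        return S + 1, 1.0, S + 1 >= 3

    S, done, T = 0, False, 0
    while not done:            # one episode: T = 0 .. last timepoint
        A = "forward"          # in practice chosen by the policy P
        S, R, done = step(S, A)
        T += 1
    print("episode finished after", T, "timepoints")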