Monte Carlo Search
Sample at random and keep the good samples. Here Monte Carlo is used to search for the best value, rather than to estimate a ratio (the number of points that land in a region divided by the number of all points).
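A minimal sketch of this idea, assuming a one-dimensional objective and uniform sampling over a fixed interval. The names `monte_carlo_search` and `objective`, the interval, and the sample count are illustrative assumptions, not part of the notes.

```python
import random


def monte_carlo_search(objective, low, high, n_samples=10_000, seed=0):
    """Draw uniform random samples and keep the one with the highest value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best_x, best_value = None, float("-inf")
    for _ in range(n_samples):
        x = rng.uniform(low, high)
        value = objective(x)
        if value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


def objective(x):
    # Hypothetical objective with a single peak of height 3.0 at x = 2.0.
    return -(x - 2.0) ** 2 + 3.0


best_x, best_value = monte_carlo_search(objective, -10.0, 10.0)
print(best_x, best_value)
```

The result is only as good as the best sample drawn, so more samples give a closer approximation of the true maximum.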
Explore and Exploit
Balancing exploration and exploitation is the mainstream strategy in reinforcement learning.
Explore
Out in the wild of unknown cases, there may be better actions than the ones already learnt. To find them, take a random action.
Explore Rate
The explore rate decreases over time (over the maximum time or maximum number of epochs intended for training) so that the model converges. In incremental learning, however, the explore rate is usually never reduced to zero, which leaves a little exploration in place.
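One possible schedule, sketched under the assumption of a linear decay. The function name `explore_rate`, the starting value of 1.0, and the floor of 0.05 are illustrative choices; other schedules, such as exponential decay, would fit the description equally well.

```python
def explore_rate(epoch, max_epochs, start=1.0, floor=0.05):
    """Linearly decay the explore rate from `start` down to `floor`.

    A floor above zero keeps a little exploration, as in incremental
    learning; floor=0.0 lets the rate reach zero at max_epochs.
    """
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(epoch / max_epochs, 0.0), 1.0)
    return max(floor, start - (start - floor) * progress)


for epoch in (0, 50, 100, 200):
    print(epoch, explore_rate(epoch, max_epochs=100))
```

Past `max_epochs` the rate simply stays at the floor, so the agent keeps exploring occasionally for as long as it runs.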
Exploit
Use the known Q-function (a Q-table or a Q-network) to take the action with the maximum value among the learnt cases.
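A small sketch of how exploring and exploiting can be combined when picking an action, commonly called epsilon-greedy selection. It assumes a Q-table stored as a dict that maps each state to a list of action values; the table contents and the names `exploit` and `choose_action` are hypothetical.

```python
import random


def exploit(q_table, state):
    """Return the index of the action with the highest learnt value."""
    action_values = q_table[state]
    return max(range(len(action_values)), key=action_values.__getitem__)


def choose_action(q_table, state, explore_rate, rng=random):
    """With probability `explore_rate` explore, otherwise exploit."""
    if rng.random() < explore_rate:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_table[state]))
    # Exploit: pick the best known action.
    return exploit(q_table, state)


# Hypothetical Q-table: two states, three actions each.
q_table = {"s0": [0.1, 0.7, 0.2], "s1": [0.5, 0.4, 0.0]}

print(exploit(q_table, "s0"))
print(choose_action(q_table, "s1", explore_rate=0.1))
```

With a Q-network, the lookup `q_table[state]` would be replaced by a forward pass that returns the action values for the state; the selection logic stays the same.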
Single Agent System
Multi-agent System