Below is a formal representation of MDPs. Notice how it is simple yet profound. The Agent is an abstract entity interacting with the environment, much like us humans! Of course, our reality is far more sophisticated, colorful, and nuanced.
You'll notice that the variables along the arrows carry a small subscript t, which marks each moment in time: a single step, or increment, in time. The change of step at the dotted line is much like taking a turn in a board game.
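As a rough illustration, here is a minimal sketch of that interaction loop in Python. The env and agent objects and their methods (reset, step, choose_action, observe) are hypothetical placeholders, not part of any code shown here:

```python
# Minimal sketch of the agent-environment loop from the MDP diagram.
# `env` and `agent` are hypothetical placeholders: env.step(action) is
# assumed to return the next state and the reward for that step.

def run_episode(env, agent, num_steps=100):
    state = env.reset()                         # S_0: the starting state
    for t in range(num_steps):                  # each pass is one time step t
        action = agent.choose_action(state)     # A_t: the agent acts on state S_t
        next_state, reward = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)
        state = next_state                      # the "turn" passes back to the agent
```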
Q-Learning is a reinforcement learning method known for its ability to learn optimal policies in stochastic environments, which makes it a powerful tool for a wide range of applications, especially where modeling the environment is complex or impractical.
The setup is just like the Multi-Armed Bandits scenario, except that now the state changes after each action. Actions have consequences!
Some background topics:
Q-Learning is a model-free reinforcement learning algorithm that seeks to find the best action to take given the current state. It's known for its ability to compare the expected utility of the available actions without requiring a model of the environment. Essentially, Q-learning focuses on learning a policy that tells an agent what action to take under what circumstances. It works by learning an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.
Here's how it operates:
By continually updating and exploring the values of state-action pairs (Q-values), Q-learning can find the optimal action-selection policy for any finite Markov decision process (MDP).
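To make the update concrete, here is a small sketch of one Q-value update, assuming the Q-values are stored as a nested dictionary and that `actions` is a hypothetical list of the actions available in every state:

```python
from collections import defaultdict

# Q-values default to 0.0 for state-action pairs we have not seen yet.
Q = defaultdict(lambda: defaultdict(float))

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for an observed transition."""
    best_next = max(Q[next_state][a] for a in actions)   # max over a' of Q(s', a')
    target = reward + gamma * best_next                  # one-step lookahead estimate
    Q[state][action] += alpha * (target - Q[state][action])
```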
In our crawling robot example, Q-learning helps the robot decide the most effective way to move forward. The robot finds itself in various positions (states) and must choose actions that move it toward its goal. The Q-learning algorithm enables the robot to learn, through trial and error, which movements (actions) yield the highest rewards.
Here's a simplified breakdown of the mechanics behind the robot's movement:
The Q-learning agent is not included in the provided code, but it would typically function as follows:
1. The agent selects an action and carries it out through the robot's doAction method, which in turn provides the new state and the resulting reward.
2. The Q-value for the previous state and the chosen action is then updated with Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a)), where s is the previous state, a is the action taken, s' is the new state, alpha is the learning rate, and gamma is the discount factor.

Using special tricks with the parameters, randomness, and speeding up the timescale...
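Since the agent itself is not in the provided code, the sketch below is just one possible shape it could take. It assumes a robot object exposing the doAction(action) method described above (returning the new state and reward), plus a hypothetical list of actions and a starting state; epsilon supplies the randomness, while alpha and gamma are the learning rate and discount factor from the update rule:

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Tabular Q-learning agent; the robot interface is assumed, not given."""

    def __init__(self, actions, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.actions = actions          # e.g. the crawler's arm/hand movements
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # exploration probability (the "randomness")
        self.Q = defaultdict(float)     # Q[(state, action)] -> expected utility

    def choose_action(self, state):
        # Epsilon-greedy: usually exploit the best known action, sometimes explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def update(self, state, action, reward, new_state):
        best_next = max(self.Q[(new_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])

def train(robot, agent, initial_state, steps=10_000):
    state = initial_state
    for _ in range(steps):
        action = agent.choose_action(state)
        new_state, reward = robot.doAction(action)   # doAction as described above
        agent.update(state, action, reward, new_state)
        state = new_state
```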