
Markov Decision Processes (MDPs)

Learn about Markov Decision Processes (MDPs)

Primer for Encyclopedic Knowledge

Below is a formal representation of an MDP. Notice how it is simple yet profound: the agent is an abstract entity interacting with its environment, much like we humans do, although our reality is, of course, far more sophisticated, colorful, and nuanced.

You'll notice that the variables along the arrows carry a small subscript t, which indexes each moment in time, i.e. one discrete step. The step change at the dotted line is much like taking a turn in a board game.

[Figure: MDP agent-environment interaction diagram]
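
In symbols, the diagram boils down to the standard MDP definition and interaction loop (textbook notation, used here for reference rather than taken from the page):

$$
\text{MDP} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad P(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s,\ A_t = a\}
$$

At each step $t$ the agent observes the state $S_t$ and selects an action $A_t$; the environment responds with a reward $R_{t+1}$ and a next state $S_{t+1}$. The dotted line in the diagram marks exactly this hand-off from step $t$ to step $t+1$.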

Q-Learning

Q-Learning is a reinforcement learning algorithm whose particular strength is learning optimal policies in stochastic environments. That makes it a powerful tool for a wide range of applications, especially where modeling the environment explicitly is complex or impractical.

The setup is just like the Multi-Armed Bandits scenario, except that now the state changes after each action. Actions have consequences!

Some background:

Q-Learning is a model-free reinforcement learning algorithm that seeks to find the best action to take given the current state. It's known for its ability to compare the expected utility of the available actions without requiring a model of the environment. Essentially, Q-learning focuses on learning a policy that tells an agent what action to take under what circumstances. It works by learning an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.
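
Written out in standard notation (assumed here, not defined elsewhere on this page), that action-value function is the optimal Q-function, which satisfies the Bellman optimality equation:

$$
Q^*(s, a) = \mathbb{E}\big[\, R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\big|\, S_t = s,\ A_t = a \,\big]
$$

Q-learning estimates $Q^*$ directly from sampled transitions, which is exactly why it needs no model of the environment.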

By continually updating its Q-values, Q-learning can find the optimal action-selection policy for any finite Markov decision process (MDP). A step-by-step account of how it operates is given in the crawling-robot example below.
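
As a purely illustrative picture (not taken from the course code) of what those Q-values look like in practice: for small, discrete problems they are just a lookup table holding one number per state-action pair.

```python
from collections import defaultdict

# Q-values as a table: one number per (state, action) pair,
# defaulting to 0.0 for pairs that have never been visited.
q_values = defaultdict(float)

def greedy_action(state, actions):
    """Pick the action with the highest current Q-value in this state."""
    return max(actions, key=lambda a: q_values[(state, a)])
```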

Example: Crawling Robot

[Figure: crawler]

Breaking down the Crawling Robot

In our crawling robot example, Q-learning helps the robot decide the most effective way to move forward. The robot finds itself in various positions (states) and must choose actions that move it toward its goal. Through trial and error, the Q-learning algorithm lets the robot learn which movements (actions) yield the highest rewards.

Here's a simplified breakdown of the mechanics behind the robot's movement:
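
The actual simulator is not reproduced on this page, so the sketch below is an assumed, simplified stand-in rather than the real interface: a state is a pair of discretized joint angles (arm, hand), each action nudges one joint by one bucket, and the reward would be the forward displacement of the body caused by that move (here it is just a placeholder). The real crawler code exposes a doAction method; the snake_case name do_action below is this sketch's own convention.

```python
class CrawlerEnv:
    """Hypothetical, simplified stand-in for the crawling robot environment."""

    ACTIONS = ("arm-up", "arm-down", "hand-up", "hand-down")

    def __init__(self, n_buckets=9):
        self.n_buckets = n_buckets
        # State: (arm_bucket, hand_bucket) -- discretized joint angles.
        self.state = (n_buckets // 2, n_buckets // 2)

    def do_action(self, action):
        arm, hand = self.state
        if action == "arm-up":
            arm = min(arm + 1, self.n_buckets - 1)
        elif action == "arm-down":
            arm = max(arm - 1, 0)
        elif action == "hand-up":
            hand = min(hand + 1, self.n_buckets - 1)
        elif action == "hand-down":
            hand = max(hand - 1, 0)
        self.state = (arm, hand)
        # In the real simulator the reward comes from the physics of the
        # crawl; here a placeholder displacement of 0.0 is returned.
        reward = 0.0
        return self.state, reward
```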

Learning Q-values with Q-learning

The Q-learning agent is not included in the provided code, but it would typically function as follows (a minimal code sketch is given after the list):

  1. Initialization: A Q-learning agent begins by initializing a table of Q-values for each state-action pair, typically starting at zero.
  2. Exploration/Exploitation: The agent must decide whether to explore new actions randomly or exploit its current knowledge to select the best action based on existing Q-values.
  3. Action Execution: Upon selecting an action, the agent performs it using the doAction method, which in turn provides the new state and resulting reward.
  4. Q-value Update: The agent updates the Q-value for the prior state-action pair using the Q-learning update rule (a sample-based form of the Bellman equation): Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a)), where s is the previous state, a is the action taken, s' is the new state, alpha is the learning rate, and gamma is the discount factor.
  5. Iteration: This process is repeated over many iterations, allowing the agent to gradually improve its Q-values and, by extension, its policy.
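
Putting the five steps together, a minimal tabular Q-learning agent might look like the sketch below. It assumes an environment object with a state attribute and a do_action(action) method returning (next_state, reward), mirroring the doAction method mentioned in step 3; the interface of the actual crawler code may differ.

```python
import random
from collections import defaultdict

def q_learning(env, actions, steps=50_000, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning following steps 1-5 above."""
    q = defaultdict(float)            # 1. Initialization: every Q-value starts at 0
    state = env.state

    for _ in range(steps):
        # 2. Exploration/Exploitation: epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])

        # 3. Action Execution: the environment returns the new state and reward.
        next_state, reward = env.do_action(action)

        # 4. Q-value Update: the rule from step 4 above.
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

        # 5. Iteration: continue from the new state.
        state = next_state

    return q
```

With the hypothetical CrawlerEnv sketched earlier, q_learning(CrawlerEnv(), list(CrawlerEnv.ACTIONS)) runs end to end, although its placeholder reward of 0.0 means nothing interesting is learned until real physics supplies the forward displacement.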

Using a few special tricks with the parameters, adding some randomness, and speeding up the timescale...
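
Which tricks exactly is left unsaid here; one typical example (offered as an illustration, not as what the demo necessarily does) is decaying the exploration rate so that early episodes explore heavily and later ones mostly exploit what has already been learned.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Exploration rate that shrinks geometrically with the episode index."""
    return max(end, start * decay ** episode)
```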

[Figure: crawler]