
Multi-Armed Bandit Agents

Explore the concepts of reinforcement learning through multi-armed bandits, Thompson sampling, and upper confidence bound strategies.

Introduction to Multi-Armed Bandits

In reinforcement learning, a goal-directed agent learns by interacting with its environment, guided by rewards. In this simulation, agents are equipped with various action-selection rules (ASRs), modeled as a mapping called a policy. The idea is to try the available actions, observe and record which ones work best (via the quantified reward each action produces), and ultimately balance exploration with exploitation. What follows is a (hopefully) intuitive approach to understanding the exploration-vs-exploitation dilemma in reinforcement learning.

Summary of Agent Action-Selection Rules (ASRs)

The multi-armed bandit (MAB) is a sequential decision problem: each arm/choice has an initially unknown reward distribution. Remember, we treat each arm like a unique slot machine, and we'd like to find the best machine to play with our finite cash and time.
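To make the slot-machine picture concrete, here is a minimal sketch of a Bernoulli (win/lose) bandit environment. The class name BernoulliBandit and the example payout rates are illustrative assumptions, not taken from the simulation itself.

```python
import numpy as np

class BernoulliBandit:
    """A k-armed bandit where each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, reward_probs, seed=None):
        self.reward_probs = np.asarray(reward_probs, dtype=float)  # hidden "true" rates
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.reward_probs)

    def pull(self, arm):
        """Play one arm; return 1 (win) or 0 (loss)."""
        return int(self.rng.random() < self.reward_probs[arm])

# Example: three slot machines with payout rates the agent does not know.
bandit = BernoulliBandit([0.2, 0.5, 0.7], seed=0)
print([bandit.pull(2) for _ in range(5)])
```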

The idea is that we need some way to model our uncertainty about the "true" reward rate of each slot machine arm. Thanks to the Law of Large Numbers, we know that as we take more and more samples, that uncertainty shrinks.
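The snippet below makes that Law-of-Large-Numbers point concrete: it estimates one arm's payout rate from 10, 100, and 10,000 simulated pulls and shows the spread of the estimate shrinking. The true rate of 0.7 is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.7  # hidden payout probability of one arm (illustrative)

for n in (10, 100, 10_000):
    # Repeat the n-pull experiment 1000 times to see how much the estimate wobbles.
    estimates = rng.binomial(n, true_rate, size=1000) / n
    print(f"n={n:>6}: mean estimate {estimates.mean():.3f}, std {estimates.std():.3f}")
```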

Thompson Sampling Explained


We know there is some unknown but fixed probability of reward for each action/arm. Using probability density functions (PDFs), we place a distribution over the possible reward probabilities, one whose variance shrinks as more samples are collected. We represent our belief about the best action by fitting a PDF that estimates this true reward rate, and we can sample from each PDF to obtain a plausible probability. Therefore, the more an arm is chosen/sampled, the closer its PDF P(R|a) gets to the true probability, where R ∈ {0,1} for every arm a.

In other words, the curve for each option is an estimate of its reward. Each random sample refines our confidence in the expected reward/outcome probability.
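Here is a minimal sketch of Thompson sampling for Bernoulli rewards, assuming a Beta(1, 1) prior on each arm's payout rate (the Beta family is the usual conjugate choice for win/lose rewards). The true_probs values and the number of rounds are illustrative, not taken from the simulation.

```python
import numpy as np

rng = np.random.default_rng(42)
true_probs = np.array([0.2, 0.5, 0.7])   # hidden payout rates (illustrative)
n_arms = len(true_probs)

# Beta posterior parameters per arm: start from a uniform Beta(1, 1) prior.
alpha = np.ones(n_arms)   # 1 + number of observed wins
beta = np.ones(n_arms)    # 1 + number of observed losses

pulls = np.zeros(n_arms, dtype=int)
for t in range(2000):
    # Sample one plausible payout rate per arm from its posterior ...
    theta = rng.beta(alpha, beta)
    # ... and play the arm whose sampled rate is highest.
    arm = int(np.argmax(theta))
    reward = int(rng.random() < true_probs[arm])

    # Update that arm's posterior with the observed win/loss.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)                     # most pulls go to the best arm
print("posterior means:", alpha / (alpha + beta))  # estimates of the true rates
```

The key design choice is that exploration falls out of the posterior itself: arms we are unsure about occasionally produce high samples and get tried, while arms we are confident are poor almost never do.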

Understanding Upper Confidence Bound

Despite this elegant representation of the options, there will still be uncertainty about the accuracy of the action-value estimates. Moreover, the environment may be dynamic, changing as we interact with it; online learning of this kind is considerably harder.

A greedy policy always takes the action that looks best at the present moment, so we need exploration. Epsilon-greedy action selection forces non-greedy actions to be tried, but indiscriminately, with no preference for the nearly greedy or for the particularly uncertain actions. It would behoove us to take these aspects into account, selecting among the non-greedy actions according to their potential for actually being the optimal choice.
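For contrast, here is a small epsilon-greedy sketch: with probability epsilon the agent explores an arm uniformly at random, otherwise it exploits its current best estimate. The payout rates and epsilon = 0.1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
true_probs = np.array([0.2, 0.5, 0.7])  # hidden payout rates (illustrative)
n_arms = len(true_probs)
epsilon = 0.1                           # fraction of steps spent exploring

q = np.zeros(n_arms)                    # running estimate of each arm's value
n = np.zeros(n_arms, dtype=int)         # times each arm has been pulled

for t in range(2000):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))  # explore: any arm, indiscriminately
    else:
        arm = int(np.argmax(q))          # exploit: current best estimate
    reward = int(rng.random() < true_probs[arm])
    n[arm] += 1
    q[arm] += (reward - q[arm]) / n[arm]  # incremental sample-average update

print("pulls per arm:", n)
print("value estimates:", np.round(q, 3))
```

Note that the explore branch picks any arm with equal probability, including ones already known to be poor; that indiscriminate choice is exactly the weakness UCB addresses below.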


Here, the Upper Confidence Bound (UCB) rule selects the action

At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]

that is, the argmax (the value of a at which the bracketed expression takes its maximal value). Qt(a) is the current estimate of action a's value, ln t denotes the natural logarithm of t (the power to which e ≈ 2.71828 must be raised to equal t), Nt(a) denotes the number of times that action a has been selected prior to time t (the denominator), and the number c > 0 controls the degree of exploration. If Nt(a) = 0, then a is considered to be a maximizing action.

The idea behind upper confidence bound action selection is that the square-root term measures the uncertainty in the estimate of a's value, with c determining the confidence level; the whole bracketed expression acts as an upper bound on what a's true value could plausibly be. Each time a is selected, Nt(a) increments, and since it sits in the denominator, the uncertainty term shrinks. Using the natural logarithm means the numerator's growth slows over time but is unbounded, so all actions will eventually be selected; however, actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
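A sketch of UCB action selection following the formula above: untried arms are forced first (Nt(a) = 0 is treated as maximizing), then each step picks the arm with the largest estimate-plus-uncertainty bonus. The value c = 2.0 and the payout rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
true_probs = np.array([0.2, 0.5, 0.7])  # hidden payout rates (illustrative)
n_arms = len(true_probs)
c = 2.0                                  # exploration strength

q = np.zeros(n_arms)                     # value estimates Qt(a)
n = np.zeros(n_arms, dtype=int)          # selection counts Nt(a)

for t in range(1, 2001):
    if np.any(n == 0):
        arm = int(np.argmin(n))          # Nt(a) = 0: treat a as maximizing
    else:
        ucb = q + c * np.sqrt(np.log(t) / n)
        arm = int(np.argmax(ucb))
    reward = int(rng.random() < true_probs[arm])
    n[arm] += 1
    q[arm] += (reward - q[arm]) / n[arm]  # incremental sample-average update

print("pulls per arm:", n)
print("value estimates:", np.round(q, 3))
```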

Interactive Simulations

Engage with an interactive simulation that demonstrates these concepts in a dynamic and educational way.