There are two important parts for the implementation: on one hand, we have to implement an environment that simulates the reward of the arms. The skeleton of this class is given below:

class Arm(object):
    def __init__(self, params):
        ## passes the required parameters
        ## this could be the reward probability, or other parameter (mean, sigma) of a lottery
        ## that gives the rewards.

    def step(self):
        ## gives a reward

From the point of view of the agent, we need to implement the decision making mechanism. The skeleton of this class would be:

class Bandit(object):
    def __init__(self, *args, **kwargs):

    def reset(self):

    def choose_action(self):
        ## Chooses a pure action, i.e. an arm

    def update(self, action, reward):
        #### Updates the values of N and Q

As for the $$\epsilon$$ greedy part:

import random
import numpy as np

def argmax(array):
    return array.index(max(array))

class EpsilonGreedy:
    def __init__(self, epsilon, n_actions):
        self.Q = np.zeros(n_actions)
        self.N = np.zeros(n_actions)
        self.epsilon = epsilon     
        self.n_actions = n_actions

    def choose_action(self):
        if random.random()>self.epsilon:
            a = argmax(self.Q)
            a = random.randint(0,self.n_actions-1)
        return a
    def update(self, action, reward):
        n = self.N[action]
        self.Q[action] = self.Q[action]+1/(self.n_actions)*(reward-self.Q[action])

In reality, we can also change the value of $$\epsilon$$ at every iteration, that is $$\epsilon = \epsilon_t$$, to reduce the exploration rate. As the value of $$\epsilon$$ goes to zero, we would approach the greedy policy. Under a good $$\epsilon$$ reduction schedule, we would stop the exploration after identifying the good actions. As examples of cooling schedules we have $$\epsilon_t = 1/t$$ or $$\epsilon_t = \epsilon^t$$.

Other action selection methods

Besides $$\epsilon$$-greedy, other action selection methods can be used. We will briefly describe some of them.

Softmax action selection

It turns out that $$\epsilon$$-greedy has the (sometimes unpleasant) property of choosing a really bad action with the same probability as choosing some not-too-bad action. To avoid this, we can choose our action via a mixed strategy $$\pi_t$$ (that is, randomizing), using the softmax function:

$$\pi_t(a_i):= \frac{e^{Q_t(a_i)/\tau}}{\sum^k_{m=1}e^{Q_t(a_m)/\tau}}.$$

Here $$\tau$$ is a parameter, called the temperature. In the limit, when $$\tau \to 0$$, softmax action selection behaves like $$\epsilon$$-greedy.

For both softmax and $$\epsilon$$-greedy action selection methods, we can decrease the parameters to zero as the number of steps of the episode decreases. This process is called annealing and it causes the bandit
algorithm to explore less over time.

Upper Confidence Bound action selection

In this method, we want to be sure of the performance of an action before choosing it. Formally, regret-based estimates are used to give us upper bounds on the true action values. The action selection mechanism is:

$$A_t := \mathrm{argmax}_{i = 1, \ldots ,k} \left[ Q_t(a_i) + 2 \cdot C\sqrt{\frac{\log t}{N_t(a_i)}} \right].$$

where $$C$$ is an upper bound for the reward. This gives us a way to control randomness, but it comes at a price. UCB suffers from back-pedaling, which roughly means “I know this is bad, but I’m not completely sure about how bad it is”. So you end up choosing bad actions because you are not completely sure of their true value. It also happens that UCB algorithms may not become strictly greedy.

Reinforcement Comparison

Last but not least, a method closely related to actor-critic algorithms in reinforcement learning. The motivation behind is simple: How much reward is a good reward? Once we decide on this, we can choose our actions accordingly. The algorithm works as follows:

  • Set $$p_t(a_i)$$ a learned preference for taking action $$a_i$$.
  • The policy
    $$\pi_t(a_i):= \frac{e^{p_t(a_i)}}{\sum^k_{m=1}e^{p_t(a_m)}}$$ is a
    mixed strategy.
  • The preference is updated as:
    $$p_t(a_i) = p_{t-1}(a) + \alpha(R_{t-1}-\bar{R}_{t-1})(1_{A_{t-1}=a_i}-\pi_{t-1}(a_i))$$


$$\bar{R}_{t-1} := \frac{1}{t-1}\sum_{j=1}^{t-1}R_j.$$ Note that this reduces to softmax selection for a particular choice of the $$p$$ function.

Diagnosing a bandit algorithm

There are different metrics to see how a bandit algorithm is performing.

  • Method 1: Measure how often is the best arm chosen
    • That sounds ok, but what if the arms have similar rewards? This metric does not really make sense.
  • Method 2: Average reward at each point in time (is it really
    getting better?)

    • However, this would yield bad average reward for algorithms that explore a lot. So we should be aware of this when we tweak the hyperparameters of our algorith ($$\epsilon$$ in this case).
  • Method 3: Cumulative rewards.
    • This is a “Big-picture” metric: we can see if the cost of exploration was worth.


Let’s illustrate a couple of application domains.

Clinical trials

In clinical trials, the goal is to evaluate $$k$$ possible treatments for a disease. Patients are allocated into $$k$$ groups and the reward is 0 or 1, depending on success of the treatment. This is a typical example where one would rather use softmax than $$\epsilon$$-greedy, because it’s not the same to choose a random treatment than a second most likely to work. Also, annealing should come in, because in later stages of the trial, a greater fraction of the subjects should be assigned to better treatments.

Internet advertising

Each time you visit a website the publisher must choose to display one of $$k$$ ads, and the reward for the publisher is 0/1 (no click/click). However, not all users are alike, and they are not alike to themselves in time. Ads should be in context of users’ activity. For this, a different class of algorithms, called contextual bandits is used. A contextual bandit receives an external signal to know in which context you are, and find, for each context, the best actions for that context. This is almost a reinforcement learning problem, except that the context has no influence in the dynamics to the next state.

A major issue in real life deployments is that reward functions are messy and ill-specified. An amusing example of this can be seen in, where an agent learns to maximize its reward function as the score on a videogame. However, the agent learns the periodicity in which some bonus-giving tokens are regenerated on screen, and turns over and over in circles instead of finishing the episode.


Interested? Check out Part 1! Or the book in Leanpub.