Multi-armed bandit
In probability theory, the multi-armed bandit problem is a problem in which a fixed limited set of resources must be allocated between competing choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines, who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.
In the problem, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in machine learning. In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization like a science foundation or a pharmaceutical company. In early versions of the problem, the gambler begins with no initial knowledge about the machines.
Herbert Robbins in 1952, realizing the importance of the problem, constructed convergent population selection strategies in "Some Aspects of the Sequential Design of Experiments". A theorem, the Gittins index, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward.
Empirical motivation
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge and to optimize its decisions based on existing knowledge. The agent attempts to balance these competing tasks in order to maximize its total value over the period of time considered. There are many practical applications of the bandit model, for example:
- clinical trials investigating the effects of different experimental treatments while minimizing patient losses,
- adaptive routing efforts for minimizing delays in a network,
- financial portfolio design.
The model has also been used to control dynamic allocation of resources to different projects, answering the question of which project to work on, given uncertainty about the difficulty and payoff of each possibility.
Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it.
The version of the problem now commonly analyzed was formulated by Herbert Robbins in 1952.
The multi-armed bandit model
The multi-armed bandit can be seen as a set of real distributions $B = \{R_1, \dots, R_K\}$, each distribution being associated with the rewards delivered by one of the $K$ levers. Let $\mu_1, \dots, \mu_K$ be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon $H$ is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret $\rho$ after $T$ rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards: $\rho = T\mu^* - \sum_{t=1}^{T}\widehat{r}_t$, where $\mu^* = \max_k\{\mu_k\}$ is the maximal reward mean and $\widehat{r}_t$ is the reward in round $t$.
A zero-regret strategy is a strategy whose average regret per round tends to zero with probability 1 when the number of played rounds tends to infinity. Intuitively, zero-regret strategies are guaranteed to converge to an optimal strategy if enough rounds are played.
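To make the regret definition concrete, the following sketch (the Gaussian reward model and the uniformly random policy are illustrative assumptions, not part of the formal definition) simulates one bandit instance and reports the empirical regret, which grows roughly linearly for a policy that never learns; a zero-regret strategy would instead make this quantity grow sublinearly in the number of rounds.

```python
import random

def simulate_regret(means, rounds, choose_arm):
    """Play a bandit with Gaussian arms and return the empirical regret."""
    best_mean = max(means)
    total_reward = 0.0
    for t in range(rounds):
        arm = choose_arm(t)
        total_reward += random.gauss(means[arm], 1.0)  # noisy reward from the chosen arm
    # Regret: expected reward of always playing the best arm minus what was actually earned.
    return rounds * best_mean - total_reward

means = [0.2, 0.5, 0.9]                                  # hypothetical arm means
uniform_policy = lambda t: random.randrange(len(means))  # never learns anything
print(simulate_regret(means, 10_000, uniform_policy))
```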
Variations
A common formulation is the binary multi-armed bandit or Bernoulli multi-armed bandit, which issues a reward of one with probability $p$, and otherwise a reward of zero.

Another formulation of the multi-armed bandit has each arm representing an independent Markov machine. Each time a particular arm is played, the state of that machine advances to a new one, chosen according to the Markov state evolution probabilities. There is a reward depending on the current state of the machine. In a generalisation called the "restless bandit problem", the states of non-played arms can also evolve over time. There has also been discussion of systems where the number of choices increases over time.
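As a point of reference, the Bernoulli formulation is simple enough to state directly in code; the sketch below (the class name and success probabilities are chosen purely for illustration) returns a reward of one with each arm's success probability and zero otherwise.

```python
import random

class BernoulliBandit:
    """Each arm i pays 1 with probability probs[i] and 0 otherwise."""
    def __init__(self, probs):
        self.probs = probs

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

bandit = BernoulliBandit([0.1, 0.45, 0.6])      # hypothetical success probabilities
print([bandit.pull(2) for _ in range(10)])      # ten pulls of the best arm
```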
Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining algorithms to minimize regret in both finite and infinite time horizons for both stochastic and non-stochastic arm payoffs.
Bandit strategies
A major breakthrough was the construction of optimal population selection strategies, or policies, in the work described below.

Optimal solutions
- UCBC: The algorithm adapts UCB to a new setting such that it can incorporate both clustering and historical information. It incorporates the historical observations by using them both in the computation of the observed mean rewards and in the uncertainty term. It incorporates the clustering information by playing at two levels: first picking a cluster using a UCB-like strategy at each time step, and subsequently picking an arm within the cluster, again using a UCB-like strategy. (The plain UCB index that such strategies build on is sketched below.)
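UCBC itself is not reproduced here, but the upper-confidence index it layers clustering and history on top of can be sketched as standard UCB1 (a minimal sketch; the Bernoulli reward model and the bonus constant are assumptions made for illustration):

```python
import math, random

def ucb1(probs, rounds):
    """Standard UCB1 on Bernoulli arms: play the arm with the highest
    empirical mean plus an uncertainty bonus that shrinks as the arm is pulled."""
    k = len(probs)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, rounds + 1):
        if t <= k:                        # play every arm once to initialise
            arm = t - 1
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1 if random.random() < probs[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
    return counts

print(ucb1([0.2, 0.5, 0.8], 5000))        # most pulls should go to the last arm
```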
Later in "Optimal adaptive policies for Markov decision processes" Burnetas and Katehakis studied the much larger model of Markov Decision Processes under partial information, where the transition law and/or the expected one period rewards may depend on unknown parameters. In this work the explicit form for a class of adaptive policies that possess uniformly maximum convergence rate properties for the total expected finite horizon reward, were constructed under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of these policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations. These inflations have recently been called the optimistic approach in the work of Tewari and Bartlett, Ortner Filippi, Cappé, and Garivier, and Honda and Takemura.
When optimal solutions to multi-arm bandit tasks are used to derive the value of animals' choices, the activity of neurons in the amygdala and ventral striatum encodes the values derived from these policies, and can be used to decode when the animals make exploratory versus exploitative choices. Moreover, optimal policies better predict animals' choice behavior than alternative strategies. This suggests that the optimal solutions to multi-arm bandit problems are biologically plausible, despite being computationally demanding.
Approximate solutions
Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the four broad categories detailed below.

Semi-uniform strategies
Semi-uniform strategies were the earliest strategies discovered to approximately solve the bandit problem. All those strategies have in common a greedy behaviour whereby the best lever is always pulled except when a random action is taken.
- Epsilon-greedy strategy: The best lever is selected for a proportion $1-\epsilon$ of the trials, and a lever is selected at random (with uniform probability) for a proportion $\epsilon$. A typical parameter value might be $\epsilon = 0.1$, but this can vary widely depending on circumstances and predilections. (A code sketch of this family of strategies appears after this list.)
- Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. For $N$ trials in total, the exploration phase occupies $\epsilon N$ trials and the exploitation phase $(1-\epsilon)N$ trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
- Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of $\epsilon$ decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
- Adaptive epsilon-greedy strategy based on value differences (VDBE): Similar to the epsilon-decreasing strategy, except that epsilon is reduced on the basis of the learning progress instead of manual tuning. High fluctuations in the value estimates lead to a high epsilon; low fluctuations lead to a low epsilon. Further improvements can be achieved by a softmax-weighted action selection in the case of exploratory actions.
- Adaptive epsilon-greedy strategy based on Bayesian ensembles: An adaptive epsilon adaptation strategy for reinforcement learning similar to VDBE, with monotone convergence guarantees. In this framework, the epsilon parameter is viewed as the expectation of a posterior distribution weighting a greedy agent and a uniform learning agent. This posterior is approximated using a suitable Beta distribution under the assumption of normality of observed rewards. In order to address the possible risk of decreasing epsilon too quickly, uncertainty in the variance of the learned reward is also modeled and updated using a normal-gamma model.
- Contextual-Epsilon-greedy strategy: Similar to the epsilon-greedy strategy, except that the value of $\epsilon$ is computed based on the situation in the experiment process, which lets the algorithm be context-aware. It is based on dynamic exploration/exploitation and can adaptively balance the two aspects by deciding which situation is most relevant for exploration or exploitation, resulting in highly explorative behaviour when the situation is not critical and highly exploitative behaviour in critical situations.
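A minimal sketch of the epsilon-greedy and epsilon-decreasing members of this family is given below (the Bernoulli reward model, the default $\epsilon$, and the decay schedule are assumptions chosen for illustration):

```python
import math, random

def epsilon_greedy(probs, rounds, eps=0.1, decay=False):
    """Pull a random arm with probability eps, otherwise the best arm so far.
    With decay=True, eps shrinks over time (epsilon-decreasing strategy)."""
    k = len(probs)
    counts = [0] * k
    values = [0.0] * k                            # running mean reward per arm
    for t in range(1, rounds + 1):
        e = min(1.0, 1.0 / math.sqrt(t)) if decay else eps   # assumed decay schedule
        if random.random() < e:
            arm = random.randrange(k)             # explore
        else:
            arm = max(range(k), key=lambda i: values[i])     # exploit
        reward = 1 if random.random() < probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
    return values

print(epsilon_greedy([0.2, 0.5, 0.8], 5000))
print(epsilon_greedy([0.2, 0.5, 0.8], 5000, decay=True))
```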
Probability matching strategies
Probability matching strategies reflect the idea that the number of pulls of a given lever should match its probability of being the optimal lever; Thompson sampling, also known as Bayesian bandits, is the best-known example. Probability matching strategies also admit solutions to so-called contextual bandit problems.
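Thompson sampling keeps a posterior over each arm's mean and, by sampling from these posteriors, plays each arm with the probability that it is optimal. A minimal Beta-Bernoulli sketch follows (the uniform priors and reward probabilities are illustrative assumptions):

```python
import random

def thompson_sampling(probs, rounds):
    """Beta-Bernoulli Thompson sampling: sample a plausible mean for each arm
    from its posterior and play the arm whose sample is largest."""
    k = len(probs)
    successes = [1] * k                   # Beta(1, 1) uniform priors
    failures = [1] * k
    for _ in range(rounds):
        samples = [random.betavariate(successes[i], failures[i]) for i in range(k)]
        arm = samples.index(max(samples))
        if random.random() < probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s / (s + f) for s, f in zip(successes, failures)]

print(thompson_sampling([0.2, 0.5, 0.8], 5000))   # posterior mean estimate per arm
```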
Pricing strategies
Pricing strategies establish a price for each lever. For example, as illustrated with the POKER algorithm, the price can be the sum of the expected reward plus an estimate of the extra future rewards that will be gained through the additional knowledge. The lever with the highest price is always pulled.

Strategies with ethical constraints
- Behavior Constrained Thompson Sampling: The authors detail a novel online agent that learns a set of behavioral constraints by observation and uses these learned constraints as a guide when making decisions in an online setting, while still being reactive to reward feedback. They extend the classical contextual multi-armed bandit setting and provide a new algorithm, Behavior Constrained Thompson Sampling, that allows online learning while obeying exogenous constraints. The agent learns a constrained policy that implements the observed behavioral constraints demonstrated by a teacher agent, and then uses this constrained policy to guide reward-based online exploration and exploitation.
Contextual bandit
A particularly useful version of the multi-armed bandit is the contextual multi-armed bandit problem. In this problem, in each iteration the agent has to choose between $K$ arms. Before making the choice, the agent sees a $d$-dimensional feature vector (the context vector) associated with the current iteration. The learner uses these context vectors, along with the rewards of the arms played in the past, to choose the arm to play in the current iteration. Over time, the learner's aim is to collect enough information about how the context vectors and rewards relate to each other, so that it can predict the next best arm to play by looking at the feature vectors.
Approximate solutions for contextual bandit
Many strategies exist that provide an approximate solution to the contextual bandit problem, and can be put into the two broad categories detailed below.

Online linear bandits
- LinUCB algorithm: the authors assume a linear dependency between the expected reward of an action and its context, and model the representation space using a set of linear predictors (see the sketch after this list).
- LinRel algorithm: Similar to LinUCB, but utilizes Singular-value decomposition rather than Ridge regression to obtain an estimate of confidence.
- HLINUCBC: proposed in the paper "Optimal Exploitation of Clustering and History Information in Multi-Armed Bandit", it extends the LinUCB idea with both historical and clustering information.
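A minimal sketch of the LinUCB idea referenced above is given below (the exploration parameter, the synthetic contexts, and the true weight vectors are assumptions made for illustration); each arm keeps a ridge-regression estimate of its reward weights and is scored by predicted reward plus a confidence width. This is the "disjoint" form with an independent linear model per arm; shared-feature variants instead maintain a single model over joint arm-context features.

```python
import numpy as np

def linucb(contexts, true_thetas, alpha=1.0, noise=0.1):
    """LinUCB with disjoint linear models: one ridge-regression estimate per arm,
    scored by predicted reward plus an upper-confidence width."""
    d = contexts.shape[1]
    k = len(true_thetas)
    A = [np.eye(d) for _ in range(k)]        # per-arm regularised Gram matrices
    b = [np.zeros(d) for _ in range(k)]      # per-arm reward-weighted context sums
    total_reward = 0.0
    for x in contexts:
        scores = []
        for i in range(k):
            A_inv = np.linalg.inv(A[i])
            theta_hat = A_inv @ b[i]                    # ridge estimate of arm i's weights
            width = alpha * np.sqrt(x @ A_inv @ x)      # confidence width for this context
            scores.append(theta_hat @ x + width)
        arm = int(np.argmax(scores))
        reward = true_thetas[arm] @ x + np.random.normal(0.0, noise)  # simulated linear reward
        A[arm] += np.outer(x, x)
        b[arm] += reward * x
        total_reward += reward
    return total_reward

rng = np.random.default_rng(0)
contexts = rng.normal(size=(2000, 5))                   # hypothetical context vectors
true_thetas = [rng.normal(size=5) for _ in range(3)]    # hypothetical per-arm weights
print(linucb(contexts, true_thetas))
```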
Online non-linear bandits
- UCBogram algorithm: The nonlinear reward functions are estimated using a piecewise constant estimator called a regressogram in Nonparametric regression. Then, UCB is employed on each constant piece. Successive refinements of the partition of the context space are scheduled or chosen adaptively.
- Generalized linear algorithms: The reward distribution follows a generalized linear model, an extension to linear bandits.
- NeuralBandit algorithm: In this algorithm several neural networks are trained to model the value of rewards given the context, and a multi-experts approach is used to choose online the parameters of the multi-layer perceptrons.
- KernelUCB algorithm: a kernelized non-linear version of linearUCB, with efficient implementation and finite-time analysis.
- Bandit Forest algorithm: a random forest is built and analyzed with respect to the random forest that would be built knowing the joint distribution of contexts and rewards.
- Oracle-based algorithm: The algorithm reduces the contextual bandit problem into a series of supervised learning problems, and does not rely on the typical realizability assumption on the reward function.
Constrained contextual bandit
- Context Attentive Bandits or Contextual Bandit with Restricted Context: The authors consider a novel formulation of the multi-armed bandit model, called contextual bandit with restricted context, where only a limited number of features can be accessed by the learner at every iteration. This formulation is motivated by different online problems arising in clinical trials, recommender systems and attention modeling. The authors adapt the standard Thompson Sampling algorithm to take advantage of the restricted-context setting, and propose two novel algorithms, Thompson Sampling with Restricted Context and Windows Thompson Sampling with Restricted Context, for handling stationary and nonstationary environments, respectively.
A. Badanidiyuru et al. first studied contextual bandits with budget constraints, also referred to as Resourceful Contextual Bandits, and showed that sublinear regret is achievable. However, their work focuses on a finite set of policies, and the algorithm is computationally inefficient.
A simple algorithm with logarithmic regret has been proposed:
- UCB-ALP algorithm: UCB-ALP is a simple algorithm that combines the UCB method with an Adaptive Linear Programming (ALP) algorithm, and can be easily deployed in practical systems. It is the first work to show how to achieve logarithmic regret in constrained contextual bandits. Although it is devoted to a special case with a single budget constraint and fixed cost, the results shed light on the design and analysis of algorithms for more general CCB problems.
Adversarial bandit
Example: Iterated prisoner's dilemma
An example often considered for adversarial bandits is the iterated prisoner's dilemma. In this example, each adversary has two arms to pull: they can either deny or confess. Standard stochastic bandit algorithms do not work very well in such iterated settings. For example, if the opponent cooperates in the first 100 rounds, defects for the next 200, then cooperates in the following 300, and so on, algorithms such as UCB cannot react quickly to these changes. This is because, after a certain point, sub-optimal arms are rarely pulled, in order to limit exploration and focus on exploitation; when the environment changes, the algorithm is unable to adapt, or may not even detect the change.

Approximate solutions
Exp3 (Seldin, Y., Szepesvári, C., Auer, P. and Abbasi-Yadkori, Y., 2012. Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments. In EWRL, pp. 103–116)
Algorithm
Parameters: real $\gamma \in (0, 1]$
Initialisation: $\omega_i(1) = 1$ for $i = 1, \dots, K$
For each $t = 1, 2, \dots, T$:
1. Set $p_i(t) = (1-\gamma)\frac{\omega_i(t)}{\sum_{j=1}^{K}\omega_j(t)} + \frac{\gamma}{K}$ for $i = 1, \dots, K$
2. Draw $i_t$ randomly according to the probabilities $p_1(t), \dots, p_K(t)$
3. Receive reward $x_{i_t}(t) \in [0, 1]$
4. For $j = 1, \dots, K$ set:
$\hat{x}_j(t) = x_j(t)/p_j(t)$ if $j = i_t$, and $\hat{x}_j(t) = 0$ otherwise
$\omega_j(t+1) = \omega_j(t)\exp\left(\gamma\hat{x}_j(t)/K\right)$
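A direct transcription of these steps might look as follows (the Bernoulli-style reward function is only an assumption used to drive the loop; Exp3 itself requires nothing beyond bounded rewards):

```python
import math, random

def exp3(reward_fn, k, rounds, gamma=0.1):
    """Exp3: exponential weighting of arms with uniform exploration mixed in."""
    weights = [1.0] * k
    for t in range(rounds):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        reward = reward_fn(arm, t)                    # reward assumed to lie in [0, 1]
        estimated = reward / probs[arm]               # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * estimated / k)
    return weights

# Hypothetical adversarial-style rewards: arm 1 pays off only in even rounds.
reward_fn = lambda arm, t: 1.0 if (arm == 1 and t % 2 == 0) else 0.0
print(exp3(reward_fn, k=3, rounds=1000))
```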
Explanation
Exp3 chooses an arm at random: with probability $(1-\gamma)$ it prefers arms with higher weights (exploitation), and with probability $\gamma$ it explores uniformly at random. After receiving the rewards, the weights are updated. The exponential growth significantly increases the weight of good arms.

Regret analysis
The regret of the Exp3 algorithm is at most of order $O\!\left(\sqrt{KT\log K}\right)$.

Follow the perturbed leader (FPL) algorithm
Algorithm
Parameters: real $\eta$
Initialisation: $R_i(1) = 0$ for each arm $i$
For each $t = 1, 2, \dots, T$:
1. For each arm $i$, generate a random noise $Z_i(t)$ from an exponential distribution with parameter $\eta$
2. Pull arm $I(t) = \arg\max_i \{R_i(t) + Z_i(t)\}$ (add the noise to each arm's cumulative value and pull the one with the highest value)
3. Update value: $R_{I(t)}(t+1) = R_{I(t)}(t) + x_{I(t)}(t)$; the values of the other arms remain the same
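The same loop written out in code (the noise rate and the fixed reward function are illustrative assumptions):

```python
import random

def fpl(reward_fn, k, rounds, eta=1.0):
    """Follow the Perturbed Leader: perturb each arm's cumulative reward with
    fresh exponential noise and play the arm that then looks best."""
    cumulative = [0.0] * k
    for t in range(rounds):
        noisy = [cumulative[i] + random.expovariate(eta) for i in range(k)]
        arm = noisy.index(max(noisy))
        cumulative[arm] += reward_fn(arm, t)   # only the pulled arm's total changes
    return cumulative

reward_fn = lambda arm, t: 1.0 if arm == 2 else 0.2   # hypothetical rewards
print(fpl(reward_fn, k=3, rounds=1000))
```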
Explanation
We follow the arm that we think has the best performance so far, adding exponential noise to it to provide exploration.

Exp3 vs FPL

Exp3 maintains explicit weights and sampling probabilities for every arm and enjoys strong regret guarantees, whereas FPL does not need to compute per-arm pulling probabilities and is computationally cheaper, but the standard FPL does not come with equally good theoretical guarantees.
Infinite-armed bandit
In the original specification and in the above variants, the bandit problem is specified with a discrete and finite number of arms, often indicated by the variable $K$. In the infinite-armed case, introduced by Agarwal, the "arms" are a continuous variable in $K$ dimensions.

Non-stationary bandit
Garivier and Moulines derive some of the first results with respect to bandit problems where the underlying model can change during play. A number of algorithms were presented to deal with this case, including Discounted UCB and Sliding-Window UCB (a simplified sketch of the sliding-window idea follows below).

Another work by Burtini et al. introduces a weighted least squares Thompson sampling approach, which proves beneficial in both the known and unknown non-stationary cases. In the known non-stationary case, the authors produce an alternative solution, a variant of UCB named Adjusted Upper Confidence Bound, which assumes a stochastic model and provides upper bounds on the regret.
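The sliding-window idea can be sketched as follows (a simplified illustration rather than the exact algorithm from the paper; the window length, bonus constant, and the abruptly changing Bernoulli rewards are assumptions):

```python
import math, random
from collections import deque

def sliding_window_ucb(prob_fn, k, rounds, window=200, c=2.0):
    """UCB computed only over the last `window` observations, so the index
    forgets old data and can track a changing reward distribution."""
    history = deque(maxlen=window)        # (arm, reward) pairs inside the window
    choices = []
    for t in range(rounds):
        counts = [0] * k
        sums = [0.0] * k
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        if any(n == 0 for n in counts):
            arm = counts.index(0)         # make sure every arm appears in the window
        else:
            n_w = len(history)
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(c * math.log(n_w) / counts[i]))
        reward = 1 if random.random() < prob_fn(arm, t) else 0
        history.append((arm, reward))
        choices.append(arm)
    return choices

# Hypothetical abrupt change: arm 0 is best early on, arm 1 becomes best later.
prob_fn = lambda arm, t: (0.8 if arm == 0 else 0.2) if t < 2500 else (0.2 if arm == 0 else 0.8)
print(sliding_window_ucb(prob_fn, k=2, rounds=5000)[-20:])   # recent choices track the change
```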
Other variants
Many variants of the problem have been proposed in recent years.

Dueling bandit
The dueling bandit variant was introduced by Yue et al. to model the exploration-versus-exploitation tradeoff for relative feedback. In this variant the gambler is allowed to pull two levers at the same time, but only receives binary feedback telling which lever provided the better reward. The difficulty of this problem stems from the fact that the gambler has no way of directly observing the reward of their actions.
The earliest algorithms for this problem are Interleaved Filtering and Beat-The-Mean.
The relative feedback of dueling bandits can also lead to voting paradoxes. A solution is to take the Condorcet winner as a reference.
More recently, researchers have generalized algorithms from traditional MAB to dueling bandits: Relative Upper Confidence Bounds, Relative EXponential weighing, Copeland Confidence Bounds, Relative Minimum Empirical Divergence, and Double Thompson Sampling.