Bias–variance tradeoff

In statistics and machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:

The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs.

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms: the squared bias, the variance, and a quantity called the irreducible error, which results from noise in the problem itself.
The tradeoff arises in all forms of supervised learning (classification, regression, and structured output learning), although not every learning algorithm exhibits it. It has also been invoked to explain the effectiveness of heuristics in human learning.
The bias–variance tradeoff is therefore not universal. For example, both bias and variance have been observed to decrease as the width of a neural network increases, so it is not necessary to limit the size of a neural network in order to control its variance. This does not contradict the bias–variance decomposition, because the decomposition by itself does not imply a tradeoff.

Motivation

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that tend not to overfit but may underfit their training data, failing to capture important regularities.
Models with high variance are usually more complex, enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate – despite their added complexity. In contrast, models with higher bias tend to be relatively simple but may produce lower variance predictions when applied beyond the training set.

Bias–variance decomposition of squared error

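Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real values $y_i$ associated with each point $x_i$. We assume that there is a function with noise $y = f(x) + \varepsilon$, where the noise, $\varepsilon$, has zero mean and variance $\sigma^2$.
We want to find a function $\hat{f}(x; D)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We make "as well as possible" precise by measuring the mean squared error between $y$ and $\hat{f}(x; D)$: we want $(y - \hat{f}(x; D))^2$ to be minimal, both for $x_1, \dots, x_n$ and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\varepsilon$; this means we must be prepared to accept an irreducible error in any function we come up with.
Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ as follows:
$$\operatorname{E}_{D,\varepsilon}\left[\left(y - \hat{f}(x; D)\right)^2\right] = \left(\operatorname{Bias}_D\left[\hat{f}(x; D)\right]\right)^2 + \operatorname{Var}_D\left[\hat{f}(x; D)\right] + \sigma^2$$
where
$$\operatorname{Bias}_D\left[\hat{f}(x; D)\right] = \operatorname{E}_D\left[\hat{f}(x; D)\right] - f(x)$$
and
$$\operatorname{Var}_D\left[\hat{f}(x; D)\right] = \operatorname{E}_D\left[\left(\operatorname{E}_D\left[\hat{f}(x; D)\right] - \hat{f}(x; D)\right)^2\right].$$
The expectation ranges over different choices of the training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, all sampled from the same joint distribution $P(x, y)$. The three terms represent: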

the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function using a learning method for linear models, there will be error in the estimates due to this assumption;
the variance of the learning method, or, intuitively, how much the learning method will move around its mean;
the irreducible error.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.
The more complex the model is, the more closely it can fit the data points, and the lower its bias will be. However, complexity will also make the model "move" more to capture the data points, and hence its variance will be larger.
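The contrast between a rigid and a flexible method can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a true function sin(2πx), Gaussian noise with standard deviation 0.3, training sets of 30 points drawn uniformly from [0, 1], and a single test point x0 = 0.25. All of these choices, and all function names, are illustrative assumptions. It estimates the squared bias and the variance of two predictors over many resampled training sets: the mean of the training targets (very simple) and the 1-nearest-neighbor rule (very flexible).

```python
import math
import random

# Illustrative assumptions: the true function, noise level, sample size and
# test point are arbitrary choices made for this sketch.
def true_f(x):
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n, sigma):
    """Draw n points x ~ Uniform(0, 1) with noisy targets y = f(x) + noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Rigid model: ignore x0 and predict the mean training target."""
    return sum(ys) / len(ys)

def predict_1nn(xs, ys, x0):
    """Flexible model: copy the target of the nearest training point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def bias_variance(predict, x0, n=30, sigma=0.3, trials=2000, seed=0):
    """Monte Carlo estimate of squared bias and variance at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng, n, sigma)
        preds.append(predict(xs, ys, x0))
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    # Mean squared distance to the noise-free target; equals bias_sq + variance.
    mse_to_truth = sum((p - true_f(x0)) ** 2 for p in preds) / trials
    return bias_sq, variance, mse_to_truth

for name, predict in [("mean", predict_mean), ("1-NN", predict_1nn)]:
    b2, var, mse = bias_variance(predict, x0=0.25)
    print(f"{name:5s} bias^2={b2:.4f}  variance={var:.4f}  "
          f"bias^2+variance={b2 + var:.4f}  mse_to_truth={mse:.4f}")
```

Under these assumptions the mean predictor should show a large squared bias with a small variance, and the 1-nearest-neighbor predictor the reverse. Adding the noise variance (0.09 here) to `bias_sq + variance` gives the estimated expected error on a fresh noisy observation at x0.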

Derivation

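The derivation of the bias–variance decomposition for squared error proceeds as follows. For notational convenience, we abbreviate $f = f(x)$ and $\hat{f} = \hat{f}(x; D)$, and we drop the $D$ subscript on our expectation operators. First, recall that, by definition, for any random variable $X$, we have
$$\operatorname{Var}[X] = \operatorname{E}[X^2] - (\operatorname{E}[X])^2.$$
Rearranging, we get:
$$\operatorname{E}[X^2] = \operatorname{Var}[X] + (\operatorname{E}[X])^2.$$
Since $f$ is deterministic, i.e. independent of $D$,
$$\operatorname{E}[f] = f.$$
Thus, given $y = f + \varepsilon$ and $\operatorname{E}[\varepsilon] = 0$, it follows that
$$\operatorname{E}[y] = \operatorname{E}[f + \varepsilon] = \operatorname{E}[f] = f.$$
Also, since $\operatorname{Var}[\varepsilon] = \sigma^2$,
$$\operatorname{Var}[y] = \operatorname{E}[(y - \operatorname{E}[y])^2] = \operatorname{E}[(f + \varepsilon - f)^2] = \operatorname{E}[\varepsilon^2] = \operatorname{Var}[\varepsilon] + (\operatorname{E}[\varepsilon])^2 = \sigma^2.$$
Thus, since $\varepsilon$ and $\hat{f}$ are independent, we can write
$$\begin{aligned}
\operatorname{E}\left[(y - \hat{f})^2\right] &= \operatorname{E}\left[\left(f + \varepsilon - \hat{f} + \operatorname{E}[\hat{f}] - \operatorname{E}[\hat{f}]\right)^2\right] \\
&= \operatorname{E}\left[(f - \operatorname{E}[\hat{f}])^2\right] + \operatorname{E}[\varepsilon^2] + \operatorname{E}\left[(\operatorname{E}[\hat{f}] - \hat{f})^2\right] \\
&\quad + 2\operatorname{E}\left[(f - \operatorname{E}[\hat{f}])\,\varepsilon\right] + 2\operatorname{E}\left[\varepsilon\,(\operatorname{E}[\hat{f}] - \hat{f})\right] + 2\operatorname{E}\left[(\operatorname{E}[\hat{f}] - \hat{f})(f - \operatorname{E}[\hat{f}])\right] \\
&= (f - \operatorname{E}[\hat{f}])^2 + \operatorname{E}[\varepsilon^2] + \operatorname{E}\left[(\operatorname{E}[\hat{f}] - \hat{f})^2\right] \\
&\quad + 2(f - \operatorname{E}[\hat{f}])\operatorname{E}[\varepsilon] + 2\operatorname{E}[\varepsilon]\operatorname{E}\left[\operatorname{E}[\hat{f}] - \hat{f}\right] + 2\operatorname{E}\left[\operatorname{E}[\hat{f}] - \hat{f}\right](f - \operatorname{E}[\hat{f}]) \\
&= (f - \operatorname{E}[\hat{f}])^2 + \operatorname{E}[\varepsilon^2] + \operatorname{E}\left[(\operatorname{E}[\hat{f}] - \hat{f})^2\right] \\
&= (f - \operatorname{E}[\hat{f}])^2 + \operatorname{Var}[\varepsilon] + \operatorname{Var}[\hat{f}] \\
&= \operatorname{Bias}[\hat{f}]^2 + \sigma^2 + \operatorname{Var}[\hat{f}].
\end{aligned}$$
The three cross terms vanish because $\operatorname{E}[\varepsilon] = 0$ and $\operatorname{E}\left[\operatorname{E}[\hat{f}] - \hat{f}\right] = 0$.
Finally, the MSE loss function is obtained by taking the expectation over $x \sim P$:
$$\operatorname{MSE} = \operatorname{E}_x\left[\operatorname{Bias}_D\left[\hat{f}(x; D)\right]^2 + \operatorname{Var}_D\left[\hat{f}(x; D)\right]\right] + \sigma^2.$$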

Application to regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.
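The effect of regularization on bias and variance can be illustrated with a deliberately small simulation. This is a sketch under stated assumptions rather than a general result: a single-coefficient model without intercept, a fixed design of five points, a true slope of 0.5, unit noise variance, and a penalty of 4 (all hypothetical values chosen for illustration). The estimator `fit_slope` is the closed-form least-squares slope with an L2 penalty, which reduces to OLS when the penalty is zero. The quantities reported are the squared bias and variance of the coefficient estimate.

```python
import random

def fit_slope(xs, ys, lam=0.0):
    """Slope of a no-intercept least-squares fit with L2 penalty lam.

    lam = 0 gives ordinary least squares; lam > 0 gives ridge regression.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def simulate(lam, beta=0.5, sigma=1.0, trials=5000, seed=1):
    """Estimate squared bias and variance of the slope over many noisy samples."""
    rng = random.Random(seed)
    xs = [-1.0, -0.5, 0.0, 0.5, 1.0]  # fixed design (illustrative assumption)
    estimates = []
    for _ in range(trials):
        ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
        estimates.append(fit_slope(xs, ys, lam))
    mean_est = sum(estimates) / trials
    bias_sq = (mean_est - beta) ** 2
    variance = sum((b - mean_est) ** 2 for b in estimates) / trials
    return bias_sq, variance

for lam in (0.0, 4.0):
    b2, var = simulate(lam)
    print(f"lambda={lam:3.1f}  bias^2={b2:.4f}  variance={var:.4f}  "
          f"mse={b2 + var:.4f}")
```

In this setting the penalized estimate shrinks toward zero, so its squared bias is larger than that of OLS while its variance is much smaller, and the sum of the two is lower. Whether regularization helps in a given problem depends on the penalty strength, the signal size, and the noise level.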

Application to classification

The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss, it is possible to find a similar decomposition. Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before.

Application to reinforcement learning

Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm while the overfitting term comes from the fact that the amount of data is limited.

Approaches

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; for example:

Linear and generalized linear models (GLMs) can be regularized to decrease their variance at the cost of increasing their bias.
In artificial neural networks, the variance increases and the bias decreases as the number of hidden units increases, although this classical assumption has been the subject of recent debate. As in GLMs, regularization is typically applied.
In k-nearest neighbor models, a high value of k leads to high bias and low variance.
In instance-based learning, regularization can be achieved by varying the mixture of prototypes and exemplars.
In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.

One way of resolving the trade-off is to use mixture models and ensemble learning. For example, boosting combines many "weak" models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
Model validation methods such as cross-validation can be used to tune models so as to optimize the trade-off.
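As an illustration of tuning by cross-validation, the sketch below selects the number of neighbors for a one-dimensional k-nearest-neighbor regressor using 5-fold cross-validation. The synthetic data (sin(2πx) plus Gaussian noise with standard deviation 0.3), the candidate values of k, and the helper names `knn_predict` and `cv_mse` are assumptions made for this example, not part of any particular library.

```python
import math
import random

def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def cv_mse(data, k, folds=5):
    """Mean squared validation error of k-NN under simple k-fold splitting."""
    errors = []
    for i in range(folds):
        held_out = data[i::folds]  # every folds-th point, starting at i
        train = [p for j, p in enumerate(data) if j % folds != i]
        errors.extend((y - knn_predict(train, x, k)) ** 2 for x, y in held_out)
    return sum(errors) / len(errors)

# Synthetic data set (illustrative assumption): 200 noisy samples of a sine.
rng = random.Random(2)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3)))

# Small k: low bias, high variance. Large k: high bias, low variance.
scores = {k: cv_mse(data, k) for k in (1, 5, 20, 100)}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k:3d}  cross-validated MSE={score:.4f}")
print("selected k:", best_k)
```

The cross-validated error estimates the expected error on unseen data, so choosing the k with the smallest value balances the bias of heavy averaging against the variance of light averaging. The points are generated in random order here, so the folds need no additional shuffling.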

k-nearest neighbors

In the case of $k$-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter $k$:
$$\operatorname{E}\left[(y - \hat{f}(x))^2 \mid X = x\right] = \left(f(x) - \frac{1}{k}\sum_{i=1}^{k} f(N_i(x))\right)^2 + \frac{\sigma^2}{k} + \sigma^2$$
where $N_1(x), \dots, N_k(x)$ are the $k$ nearest neighbors of $x$ in the training set. The bias (first term) is a monotone rising function of $k$, while the variance (second term) drops off as $k$ is increased. In fact, under "reasonable assumptions" the bias of the first-nearest neighbor estimator vanishes entirely as the size of the training set approaches infinity.
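The three terms of the k-nearest-neighbor expression can be evaluated directly when the true function is known. These terms are the squared difference between f(x) and the average of f over the k nearest neighbors, the variance term σ²/k, and the irreducible error σ². The sketch below does this for an assumed true function f(x) = x², ten equally spaced training points, a query point x0 = 0, and σ = 0.5. These values, and the function name `knn_error_terms`, are hypothetical choices made for illustration.

```python
def knn_error_terms(train_xs, f, x0, k, sigma):
    """Return (squared bias, variance, irreducible error) of k-NN at x0.

    Assumes a fixed set of training inputs, a known true function f, and
    independent label noise with variance sigma**2.
    """
    neighbors = sorted(train_xs, key=lambda x: abs(x - x0))[:k]
    bias_sq = (f(x0) - sum(f(x) for x in neighbors) / k) ** 2
    variance = sigma ** 2 / k
    irreducible = sigma ** 2
    return bias_sq, variance, irreducible

# Illustrative setup: f(x) = x^2 on the grid 0.1, 0.2, ..., 1.0, queried at 0.
train_xs = [0.1 * i for i in range(1, 11)]

def f(x):
    return x * x

for k in (1, 3, 10):
    b2, var, noise = knn_error_terms(train_xs, f, x0=0.0, k=k, sigma=0.5)
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}  "
          f"total={b2 + var + noise:.4f}")
```

In this setup the squared bias grows with k because more distant neighbors, whose function values differ more from f(x0), enter the average, while the variance term shrinks as 1/k.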

Application to human learning

While widely discussed in the context of machine learning, the bias–variance dilemma has also been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued that the human brain resolves the dilemma in the case of the typically sparse, poorly characterised training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalisability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.
Geman et al. argue that the bias-variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of “hard wiring” that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.
