Law of large numbers


In probability theory, the law of large numbers is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
The LLN is important because it guarantees stable long-term results for the averages of some random events. For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. It is important to remember that the law only applies when a large number of observations is considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the others.

Examples

For example, a single roll of a fair, six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability. Therefore, the expected value of the average of the rolls is:

  (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
According to the law of large numbers, if a large number of six-sided dice are rolled, the average of their values is likely to be close to 3.5, with the precision increasing as more dice are rolled.
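As a quick illustration (a sketch, not part of any standard treatment), the dice-rolling experiment can be simulated in a few lines of Python using only the standard library:

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same output

def average_roll(n):
    """Average of n rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# The expected value of one roll is (1 + 2 + ... + 6) / 6 = 3.5;
# the sample average drifts toward it as n grows.
for n in (10, 1_000, 100_000):
    print(n, average_roll(n))
```

With 100,000 rolls the average typically lands within a few hundredths of 3.5.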
It follows from the law of large numbers that the empirical probability of success in a series of Bernoulli trials will converge to the theoretical probability. For a Bernoulli random variable, the expected value is the theoretical probability of success, and the average of n such variables is precisely the relative frequency.
For example, a fair coin toss is a Bernoulli trial. When a fair coin is flipped once, the theoretical probability that the outcome will be heads is equal to 1/2. Therefore, according to the law of large numbers, the proportion of heads in a "large" number of coin flips "should be" roughly 1/2. In particular, the proportion of heads after n flips will almost surely converge to 1/2 as n approaches infinity.
Although the proportion of heads approaches 1/2, almost surely the absolute difference in the number of heads and tails will become large as the number of flips becomes large. That is, the probability that the absolute difference is a small number approaches zero as the number of flips becomes large. Also, almost surely the ratio of the absolute difference to the number of flips will approach zero. Intuitively, the expected number of heads grows, but at a slower rate than the number of flips, as the number of flips grows.
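This behaviour can be checked empirically; the following Python sketch (illustrative, with an arbitrary seed) flips a simulated fair coin and reports the proportion of heads, the absolute head-tail difference, and their ratio:

```python
import random

random.seed(1)  # arbitrary fixed seed for reproducibility

n = 100_000
heads = sum(random.random() < 0.5 for _ in range(n))
tails = n - heads
diff = abs(heads - tails)

print(heads / n)   # close to 1/2
print(diff)        # typically on the order of sqrt(n), not small
print(diff / n)    # close to 0
```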
Another good example of the LLN is the Monte Carlo method. These methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The larger the number of repetitions, the better the approximation tends to be. The reason that this method is important is mainly that, sometimes, it is difficult or impossible to use other approaches.
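A classic Monte Carlo example is estimating π by sampling random points in the unit square and counting how many fall inside the quarter disc; the sketch below (illustrative only) uses Python's standard library:

```python
import random

random.seed(42)  # arbitrary fixed seed

def estimate_pi(n):
    """Monte Carlo estimate of pi: 4 times the fraction of random
    points in the unit square that land inside the quarter disc."""
    inside = sum(random.random() ** 2 + random.random() ** 2 <= 1.0
                 for _ in range(n))
    return 4 * inside / n

print(estimate_pi(200_000))  # close to 3.14159...
```

By the LLN, the fraction of points inside converges to the area of the quarter disc, π/4, so the estimate improves as n grows.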

Limitation

The average of the results obtained from a large number of trials may fail to converge in some cases. For instance, the average of the results from the Cauchy distribution or some Pareto distributions will not converge as x becomes larger; the reason is heavy tails. The Cauchy distribution and the Pareto distribution represent two cases: the Cauchy distribution does not have an expectation, whereas the expectation of the Pareto distribution is infinite. Another example is where the random numbers equal the tangent of an angle uniformly distributed between −90° and +90°. The median is zero, but the expected value does not exist, and indeed the average of n such variables has the same distribution as one such variable. It does not converge in probability toward zero as n goes to infinity.
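The failure of averaging for the Cauchy distribution can be seen numerically. In the sketch below (illustrative; a standard Cauchy variable is generated as the tangent of a uniform random angle, as described above), the spread of batch averages does not shrink as the batch size grows, because the average of n standard Cauchy variables is again standard Cauchy:

```python
import math
import random

random.seed(0)  # arbitrary fixed seed

def cauchy():
    # tangent of an angle uniform on (-90°, +90°) is standard Cauchy
    return math.tan(math.pi * (random.random() - 0.5))

def batch_means(batch_size, batches):
    return sorted(sum(cauchy() for _ in range(batch_size)) / batch_size
                  for _ in range(batches))

# Interquartile range of 200 batch averages: for a finite-variance
# distribution this would shrink like 1/sqrt(batch_size); for the
# Cauchy it stays near 2, the IQR of a single standard Cauchy draw.
means = batch_means(1_000, 200)
iqr = means[149] - means[49]
print(iqr)
```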

History

The Italian mathematician Gerolamo Cardano stated without proof that the accuracies of empirical statistics tend to improve with the number of trials. This was then formalized as a law of large numbers. A special form of the LLN was first proved by Jacob Bernoulli. It took him over 20 years to develop a sufficiently rigorous mathematical proof which was published in his Ars Conjectandi in 1713. He named this his "Golden Theorem" but it became generally known as "Bernoulli's Theorem". This should not be confused with Bernoulli's principle, named after Jacob Bernoulli's nephew Daniel Bernoulli. In 1837, S.D. Poisson further described it under the name "la loi des grands nombres". Thereafter, it was known under both names, but the "law of large numbers" is most frequently used.
After Bernoulli and Poisson published their efforts, other mathematicians also contributed to refinement of the law, including Chebyshev, Markov, Borel, Cantelli, Kolmogorov and Khinchin. Markov showed that the law can apply to a random variable that does not have a finite variance under some other, weaker assumption, and Khinchin showed in 1929 that if the series consists of independent identically distributed random variables, it suffices that the expected value exists for the weak law of large numbers to be true. These further studies have given rise to two prominent forms of the LLN. One is called the "weak" law and the other the "strong" law, in reference to two different modes of convergence of the cumulative sample means to the expected value; in particular, as explained below, the strong form implies the weak.

Forms

There are two different versions of the law of large numbers that are described below. They are called the strong law of large numbers and the weak law of large numbers. Stated for the case where X1, X2, ... is an infinite sequence of i.i.d. Lebesgue integrable random variables with expected value E(X1) = E(X2) = ... = µ, both versions of the law state that, with virtual certainty, the sample average

  X̄n = (1/n)(X1 + X2 + ... + Xn)

converges to the expected value µ.
Based on the assumption of finite variance Var(Xi) = σ² and no correlation between the random variables, the variance of the average of n random variables is

  Var(X̄n) = (1/n²) Var(X1 + ... + Xn) = nσ²/n² = σ²/n.
Sometimes an assumption of finite variance is not necessary. Large or infinite variance will make the convergence slower, but the LLN holds anyway. This assumption is often used because it makes the proofs easier and shorter.
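The σ²/n behaviour of the variance of the average is easy to verify numerically; the sketch below (illustrative, using uniform(0,1) variables, whose variance is 1/12) estimates the variance of the sample mean over many repetitions:

```python
import random
import statistics

random.seed(0)  # arbitrary fixed seed

def sample_mean(n):
    """Mean of n uniform(0,1) draws; one draw has variance 1/12."""
    return sum(random.random() for _ in range(n)) / n

def empirical_var_of_mean(n, reps=20_000):
    return statistics.variance(sample_mean(n) for _ in range(reps))

# Empirical variance of the average versus the sigma^2 / n prediction.
for n in (10, 100):
    print(n, empirical_var_of_mean(n), (1 / 12) / n)
```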
Mutual independence of the random variables can be replaced by pairwise independence in both versions of the law.
The difference between the strong and the weak version is concerned with the mode of convergence being asserted. For interpretation of these modes, see Convergence of random variables.

Weak law

The weak law of large numbers states that the sample average converges in probability towards the expected value:

  X̄n → µ in probability as n → ∞.

That is, for any positive number ε,

  lim_{n→∞} P(|X̄n − µ| > ε) = 0.
Interpreting this result, the weak law states that for any nonzero margin specified, no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin.
As mentioned earlier, the weak law applies in the case of i.i.d. random variables, but it also applies in some other cases. For example, the variance may be different for each random variable in the series, keeping the expected value constant. If the variances are bounded, then the law applies, as shown by Chebyshev as early as 1867. In fact, Chebyshev's proof works so long as the variance of the average of the first n values goes to zero as n goes to infinity. As an example, assume that each random variable in the series follows a Gaussian distribution with mean zero, but with variance equal to 2n/log(n+1), which is not bounded. At each stage, the average will be normally distributed (as the average of a set of normally distributed variables). The variance of the sum is equal to the sum of the variances, which is asymptotic to n²/log n. The variance of the average is therefore asymptotic to 1/log n and goes to zero.
There are also examples of the weak law applying even though the expected value does not exist.

Strong law

The strong law of large numbers states that the sample average converges almost surely to the expected value:

  X̄n → µ almost surely as n → ∞.

That is,

  P( lim_{n→∞} X̄n = µ ) = 1.

What this means is that the probability that, as the number of trials n goes to infinity, the average of the observations converges to the expected value, is equal to one.
The proof is more complex than that of the weak law. This law justifies the intuitive interpretation of the expected value of a random variable when sampled repeatedly as the "long-term average".
Almost sure convergence is also called strong convergence of random variables. This version is called the strong law because random variables which converge strongly are guaranteed to converge weakly. However, the weak law is known to hold in certain conditions where the strong law does not hold, and then the convergence is only weak. See the differences between the weak law and the strong law, below.
The strong law of large numbers can itself be seen as a special case of the pointwise ergodic theorem.
The strong law applies to independent identically distributed random variables having an expected value. This was proved by Kolmogorov in 1930. It can also apply in other cases. Kolmogorov also showed, in 1933, that if the variables are independent and identically distributed, then for the average to converge almost surely on something, it is necessary that they have an expected value.
If the summands are independent but not identically distributed, then

  X̄n − E[X̄n] → 0 almost surely,

provided that each Xk has a finite second moment and

  ∑_{k=1}^{∞} Var(Xk)/k² < ∞.

This statement is known as Kolmogorov's strong law.
An example of a series where the weak law applies but not the strong law is when Xk is plus or minus √(k / log log log k) (taking a suitable value for small k so the expression is defined) with probability 1/2 for each k. The variance of Xk is then k / log log log k. Kolmogorov's strong law does not apply because the partial sum in his criterion up to k = n is asymptotic to log n / log log log n, and this is unbounded.
If we replace the random variables with Gaussian variables having the same variances, then the average at any point will also be normally distributed. The width of the distribution of the average will tend toward zero, but for a given ε, there is a probability, which does not go to zero with n, that the average sometime after the nth trial will come back up to ε. Since the width of the distribution of the average is not zero, it must have a positive lower bound p, which means there is a probability of at least p that the average will attain ε after n trials. It will happen with probability p/2 before some m which depends on n. But even after m, there is still a probability of at least p that it will happen.

Differences between the weak law and the strong law

The weak law states that for a specified large n, the average X̄n is likely to be near μ. Thus, it leaves open the possibility that |X̄n − μ| > ε happens an infinite number of times, although at infrequent intervals.
The strong law shows that this almost surely will not occur. In particular, it implies that with probability 1, we have that for any ε > 0 the inequality |X̄n − μ| < ε holds for all large enough n.
The strong law does not hold in the following cases, but the weak law does.
1. Let X be an exponentially distributed random variable with parameter 1. The random variable sin(X) e^X / X has no expected value according to Lebesgue integration, but using conditional convergence and interpreting the integral as a Dirichlet integral, which is an improper Riemann integral, we can say:

  E[ sin(X) e^X / X ] = ∫₀^∞ (sin(x)/x) dx = π/2.
2. Let X be a geometrically distributed random variable with probability of success 1/2. The random variable 2^X (−1)^X / X does not have an expected value in the conventional sense because the infinite series is not absolutely convergent, but using conditional convergence, we can say:

  E[ 2^X (−1)^X / X ] = ∑_{k=1}^{∞} (−1)^k / k = −ln 2.
3. If the cumulative distribution function of a random variable is

  1 − F(x) = e/(2x ln x) for x ≥ e, and F(x) = e/(2|x| ln|x|) for x ≤ −e,

then it has no expected value, but the weak law holds.

Uniform law of large numbers

Suppose f(x, θ) is some function defined for θ ∈ Θ, and continuous in θ. Then for any fixed θ, the sequence {f(X1, θ), f(X2, θ), ...} will be a sequence of independent and identically distributed random variables, such that the sample mean of this sequence converges in probability to E[f(X, θ)]. This is the pointwise (in θ) convergence.
The uniform law of large numbers states the conditions under which the convergence happens uniformly in θ. If
  1. Θ is compact,
  2. f(x, θ) is continuous at each θ ∈ Θ for almost all x, and a measurable function of x at each θ, and
  3. there exists a dominating function d(x) such that E[d(X)] < ∞ and |f(x, θ)| ≤ d(x) for all θ ∈ Θ,
then E[f(X, θ)] is continuous in θ, and

  sup_{θ ∈ Θ} | (1/n) ∑_{i=1}^{n} f(Xi, θ) − E[f(X, θ)] | → 0 in probability.
This result is useful to derive consistency of a large class of estimators.

Borel's law of large numbers

Borel's law of large numbers, named after Émile Borel, states that if an experiment is repeated a large number of times, independently under identical conditions, then the proportion of times that any specified event occurs approximately equals the probability of the event's occurrence on any particular trial; the larger the number of repetitions, the better the approximation tends to be. More precisely, if E denotes the event in question, p its probability of occurrence, and Nn the number of times E occurs in the first n trials, then with probability one,

  Nn/n → p as n → ∞.
This theorem makes rigorous the intuitive notion of probability as the long-run relative frequency of an event's occurrence. It is a special case of any of several more general laws of large numbers in probability theory.
Chebyshev's inequality. Let X be a random variable with finite expected value μ and finite non-zero variance σ². Then for any real number ε > 0,

  P(|X − μ| ≥ ε) ≤ σ²/ε².
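Since a die has a finite distribution, Chebyshev's bound P(|X − μ| ≥ ε) ≤ σ²/ε² can be checked exactly rather than by simulation; the following sketch computes both sides for a fair six-sided die (μ = 3.5, σ² = 35/12):

```python
# Exact check of Chebyshev's inequality for a single roll of a fair
# six-sided die: compare P(|X - mu| >= eps) with var / eps^2.
outcomes = [1, 2, 3, 4, 5, 6]
mu = sum(outcomes) / 6                           # 3.5
var = sum((x - mu) ** 2 for x in outcomes) / 6   # 35/12

checks = []
for eps in (1.0, 1.5, 2.0, 2.5):
    p = sum(1 for x in outcomes if abs(x - mu) >= eps) / 6
    bound = var / eps ** 2
    checks.append(p <= bound)
    print(eps, p, bound)
```

The bound is loose (for small ε it exceeds 1 and is trivially true), but it always holds, which is all the weak-law proof below needs.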

Proof of the weak law

Given X1, X2, ... an infinite sequence of i.i.d. random variables with finite expected value E(X1) = E(X2) = ... = µ < ∞, we are interested in the convergence of the sample average

  X̄n = (1/n)(X1 + X2 + ... + Xn).

The weak law of large numbers states:

  X̄n → µ in probability as n → ∞.

Proof using Chebyshev's inequality assuming finite variance

This proof uses the assumption of finite variance Var(Xi) = σ² (for all i). The independence of the random variables implies no correlation between them, and we have that

  Var(X̄n) = (1/n²) Var(X1 + ... + Xn) = nσ²/n² = σ²/n.

The common mean μ of the sequence is the mean of the sample average:

  E(X̄n) = μ.

Using Chebyshev's inequality on X̄n results in

  P(|X̄n − μ| ≥ ε) ≤ σ²/(nε²).

This may be used to obtain the following:

  P(|X̄n − μ| < ε) = 1 − P(|X̄n − μ| ≥ ε) ≥ 1 − σ²/(nε²).

As n approaches infinity, the expression approaches 1. And by the definition of convergence in probability, we have obtained

  X̄n → μ in probability as n → ∞.

Proof using convergence of characteristic functions

By Taylor's theorem for complex functions, the characteristic function of any random variable, X, with finite mean μ, can be written as

  φX(t) = E[e^{itX}] = 1 + itμ + o(t) as t → 0.
All X1, X2,... have the same characteristic function, so we will simply denote this φX.
Among the basic properties of characteristic functions there are

  φ_{X/n}(t) = φX(t/n)  and  φ_{X+Y}(t) = φX(t) φY(t)  for X and Y independent.

These rules can be used to calculate the characteristic function of X̄n in terms of φX:

  φ_{X̄n}(t) = [φX(t/n)]^n = [1 + itμ/n + o(t/n)]^n → e^{itμ} as n → ∞.

The limit e^{itμ} is the characteristic function of the constant random variable μ, and hence by the Lévy continuity theorem, X̄n converges in distribution to μ:

  X̄n → μ in distribution as n → ∞.

μ is a constant, which implies that convergence in distribution to μ and convergence in probability to μ are equivalent. Therefore,

  X̄n → μ in probability as n → ∞.
This shows that the sample mean converges in probability to the derivative of the characteristic function at the origin, as long as the latter exists.

Consequences

The law of large numbers provides an expectation of an unknown distribution from a realization of the sequence, but also any other feature of the probability distribution. By applying Borel's law of large numbers, one could easily obtain the probability mass function: for each event in the objective probability mass function, one approximates the probability of the event's occurrence by the proportion of times that the event occurs. The larger the number of repetitions, the better the approximation. As for the continuous case, with density f and small positive h,

  P(X ∈ [x − h, x + h]) ≈ 2h f(x).

Thus, for large n,

  (number of X1, ..., Xn falling in [x − h, x + h]) / n ≈ 2h f(x).
With this method, one can cover the whole x-axis with a grid and obtain a bar graph which is called a histogram.
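The grid-and-count idea can be sketched as follows (illustrative; the target distribution and cell width are arbitrary choices):

```python
import math
import random

random.seed(0)  # arbitrary fixed seed

# Cover the axis with cells of width h and estimate the density in each
# cell as the fraction of samples landing there, divided by h.  The
# target here is a standard exponential, with true density e^{-x}.
n, h = 200_000, 0.25
samples = [random.expovariate(1.0) for _ in range(n)]

counts = {}
for s in samples:
    cell = int(s / h)                 # index of the cell containing s
    counts[cell] = counts.get(cell, 0) + 1

for k in range(4):                    # first few cells of the grid
    x = k * h
    estimate = counts.get(k, 0) / (n * h)
    print(x, estimate, math.exp(-x))  # estimate roughly tracks e^{-x}
```

Shrinking h while growing n sharpens the histogram toward the true density, exactly as the LLN argument above suggests.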