Information content


In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.
The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.
The Shannon information is closely related to information theoretic entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average." This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.
The information content can be expressed in various units of information, of which the most common is the "bit", as explained below.

Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:
  1. An event with probability 100% is perfectly unsurprising and yields no information.
  2. The less probable an event is, the more surprising it is and the more information it yields.
  3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.
The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given an event $x$ with probability $P$, the information content is defined as follows:

$$\operatorname I(x) := -\log[\Pr(x)] = -\log(P).$$
The base of the log is left unspecified, which corresponds to the scaling factor above. Different choices of base correspond to different units of information: if the logarithmic base is 2, the unit is named the "bit" or "shannon"; if the logarithm is the natural logarithm, the unit is called the "nat", short for "natural"; and if the base is 10, the units are called "hartleys", decimal "digits", or occasionally "dits".
Formally, given a random variable $X$ with probability mass function $p_X(x)$, the self-information of measuring $X$ as outcome $x$ is defined as

$$\operatorname I_X(x) := -\log\left[p_X(x)\right] = \log\!\left(\frac{1}{p_X(x)}\right).$$
The Shannon entropy of the random variable $X$ above is defined as

$$\mathrm{H}(X) = \sum_x -p_X(x)\log\left[p_X(x)\right] = \sum_x p_X(x)\operatorname I_X(x) = \operatorname{E}\!\left[\operatorname I_X(X)\right],$$

by definition equal to the expected information content of measurement of $X$.
The use of the notation $I_X(x)$ for self-information above is not universal. Since the notation $I(X;Y)$ is also often used for the related quantity of mutual information, many authors use a lowercase $h_X(x)$ for self-entropy instead, mirroring the use of the capital $H(X)$ for the entropy.
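As a rough illustration, the definitions above can be computed directly. The following Python sketch (the helper names self_information and entropy are illustrative, not standard library functions) evaluates the self-information of an outcome and the entropy of a discrete distribution given as a mapping from outcomes to probabilities:

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Self-information -log_b(p) of an outcome with probability p.

    base=2 gives bits/shannons, base=math.e gives nats, base=10 gives hartleys.
    """
    if p < 0 or p > 1:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0:
        return math.inf          # an impossible event is "infinitely surprising"
    return -math.log(p, base)

def entropy(pmf: dict, base: float = 2.0) -> float:
    """Shannon entropy: expected self-information over the pmf."""
    return sum(p * self_information(p, base) for p in pmf.values() if p > 0)

# Example: each outcome of a fair coin carries 1 bit of self-information,
# and the distribution has 1 bit of entropy.
fair_coin = {"heads": 0.5, "tails": 0.5}
print(self_information(0.5))     # 1.0
print(entropy(fair_coin))        # 1.0
```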

Properties

Monotonically decreasing function of probability

For a given probability space, the measurement of rarer events is intuitively more "surprising", and yields more information content, than the measurement of more common values. Thus, self-information is a strictly decreasing monotonic function of probability, sometimes called an "antitonic" function.
While standard probabilities are represented by real numbers in the interval $[0, 1]$, self-informations are represented by extended real numbers in the interval $[0, \infty]$. In particular, we have the following, for any choice of logarithmic base: an event with probability 100% yields self-information $-\log(1) = 0$ (its occurrence is perfectly unsurprising), while an event with probability 0% yields self-information $-\log(0) = +\infty$ (its occurrence would be "infinitely surprising").
From this, we can get a few general properties: the more unexpected an event is, the more information its observation yields, and an event that is certain yields no information at all.
The Shannon information is closely related to the log-odds. In particular, given some event $x$, suppose that $p(x)$ is the probability of $x$ occurring, and that $p(\lnot x) = 1 - p(x)$ is the probability of $x$ not occurring. Then we have the following definition of the log-odds:

$$\text{log-odds}(x) = \log\!\left(\frac{p(x)}{p(\lnot x)}\right).$$

This can be expressed as a difference of two Shannon informations:

$$\text{log-odds}(x) = \operatorname I(\lnot x) - \operatorname I(x).$$
In other words, the log-odds can be interpreted as the level of surprise if the event 'doesn't' happen, minus the level of surprise if the event 'does' happen.
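As a numeric sanity check (a sketch using an arbitrary illustrative probability, not a value from the source), the identity above can be verified directly in Python:

```python
import math

p = 0.8                      # probability that the event occurs (illustrative value)
q = 1 - p                    # probability that it does not occur

log_odds = math.log2(p / q)                    # definition of the log-odds (base 2)
surprisal_not_x = -math.log2(q)                # I(not x): surprise if the event doesn't happen
surprisal_x = -math.log2(p)                    # I(x): surprise if the event does happen

# The log-odds equals the difference of the two self-informations.
assert math.isclose(log_odds, surprisal_not_x - surprisal_x)
print(log_odds)              # 2.0  (since 0.8/0.2 = 4 and log2(4) = 2)
```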

Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables $X$ and $Y$ with probability mass functions $p_X(x)$ and $p_Y(y)$ respectively. The joint probability mass function is

$$p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\,p_Y(y)$$

because $X$ and $Y$ are independent. The information content of the outcome $(X, Y) = (x, y)$ is

$$\operatorname I_{X,Y}(x, y) = -\log\left[p_{X,Y}(x, y)\right] = -\log\left[p_X(x)\,p_Y(y)\right] = \operatorname I_X(x) + \operatorname I_Y(y).$$

See the two independent, identically distributed dice example below.
The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal, this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.
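To make the additivity property concrete, here is a small Python sketch (the choice of a biased coin and a fair die is purely illustrative) checking that the surprisal of a joint outcome of two independent variables equals the sum of the individual surprisals:

```python
import math

# Two independent variables (illustrative): a biased coin and a fair die.
p_coin = {"H": 0.25, "T": 0.75}
p_die = {k: 1 / 6 for k in range(1, 7)}

outcome = ("H", 3)                                  # a particular joint outcome
joint_p = p_coin[outcome[0]] * p_die[outcome[1]]    # independence: p(x, y) = p(x) * p(y)

info_joint = -math.log2(joint_p)
info_sum = -math.log2(p_coin[outcome[0]]) - math.log2(p_die[outcome[1]])

# The surprisal of the joint outcome is the sum of the individual surprisals.
assert math.isclose(info_joint, info_sum)
print(info_joint)     # 2 + log2(6), about 4.585 shannons
```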

Examples

Fair coin toss

Consider the Bernoulli trial of tossing a fair coin $X$. The probabilities of the events of the coin landing as heads $\text{H}$ and tails $\text{T}$ are one half each, $p_X(\text{H}) = p_X(\text{T}) = \tfrac{1}{2}$. Upon measuring the variable as heads, the associated information gain is

$$\operatorname I_X(\text{H}) = -\log_2 p_X(\text{H}) = -\log_2\tfrac{1}{2} = 1,$$

so the information gain of a fair coin landing as heads is 1 shannon. Likewise, the information gain of measuring tails is

$$\operatorname I_X(\text{T}) = -\log_2 p_X(\text{T}) = -\log_2\tfrac{1}{2} = 1 \text{ Sh}.$$

Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable $X \sim \mathrm{DU}[1, 6]$ with probability mass function

$$p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise.} \end{cases}$$

The probability of rolling a 4 is $p_X(4) = \tfrac{1}{6}$, as for any other valid roll. The information content of rolling a 4 is thus

$$\operatorname I_X(4) = -\log_2 p_X(4) = -\log_2\tfrac{1}{6} \approx 2.585 \text{ Sh}$$

of information.
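A one-line check of this value in Python (illustrative only):

```python
import math

p_roll = 1 / 6                    # probability of rolling a 4 on a fair six-sided die
print(-math.log2(p_roll))         # about 2.585 bits (shannons)
```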

Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables $X, Y \sim \mathrm{DU}[1, 6]$, each corresponding to an independent fair 6-sided die roll. The joint distribution of $X$ and $Y$ is

$$p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\,p_Y(y) = \begin{cases} \frac{1}{36}, & x, y \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise.} \end{cases}$$

The information content of any particular random variate $(X, Y) = (x, y)$ with $x, y \in \{1, \ldots, 6\}$ is

$$\operatorname I_{X,Y}(x, y) = -\log_2\tfrac{1}{36} = \log_2 36 = 2\log_2 6 \approx 5.170 \text{ Sh},$$

just as

$$\operatorname I_{X,Y}(x, y) = \operatorname I_X(x) + \operatorname I_Y(y) = 2\left(-\log_2\tfrac{1}{6}\right) = 2\log_2 6 \approx 5.170 \text{ Sh},$$

as explained in the additivity of independent events section above.
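The same two-dice calculation, sketched in Python (values follow the uniform pmf above):

```python
import math

p_single = 1 / 6                            # probability of any given face on one fair die
p_joint = p_single * p_single               # independence: Pr(X = x, Y = y) = 1/36

info_joint = -math.log2(p_joint)            # log2(36) = 2 * log2(6)
info_additive = 2 * (-math.log2(p_single))  # sum of the two single-roll surprisals

assert math.isclose(info_joint, info_additive)
print(info_joint)                           # about 5.17 shannons
```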

Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables

$$C_k := \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg(X = k \vee Y = k) \\ 1, & X = k \,\veebar\, Y = k \\ 2, & X = k \wedge Y = k \end{cases}$$

for $k \in \{1, 2, 3, 4, 5, 6\}$; then $\sum_k C_k = 2$ and the counts have the multinomial distribution

$$f(c_1, \ldots, c_6) = \Pr(C_1 = c_1, \ldots, C_6 = c_6) = \begin{cases} \dfrac{2!}{c_1!\,c_2!\,c_3!\,c_4!\,c_5!\,c_6!} \cdot \dfrac{1}{36}, & \sum_{k=1}^{6} c_k = 2 \\ 0, & \text{otherwise.} \end{cases}$$

To verify this, the 6 doubles $(X, Y) \in \{(k, k)\}_{k=1}^{6}$ each correspond to an event $C_k = 2$ of probability $\tfrac{1}{36}$, for a total probability of $\tfrac{1}{6}$. These are the only outcomes for which the identity of which die rolled which value is faithfully preserved, because the two values are the same. Without knowledge to distinguish the dice, the other $\binom{6}{2} = 15$ combinations correspond to one die rolling one number and the other die rolling a different number, each having probability $\tfrac{2}{36} = \tfrac{1}{18}$. Indeed, $6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1$, as required.
Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for examples the events $A_k = \{(X, Y) = (k, k)\}$ and $B_{j,k} = \{C_j = 1\} \cap \{C_k = 1\}$ for $j \ne k$, $1 \le j, k \le 6$. For example, $A_2 = \{X = 2 \text{ and } Y = 2\}$ and $B_{3,4} = \{(3, 4), (4, 3)\}$.
The information contents are

$$\operatorname I(A_2) = -\log_2\tfrac{1}{36} \approx 5.170 \text{ Sh},$$
$$\operatorname I(B_{3,4}) = -\log_2\tfrac{1}{18} \approx 4.170 \text{ Sh}.$$

Let $\text{Same} = \bigcup_{i=1}^{6} A_i$ be the event that both dice rolled the same value and $\text{Diff} = \overline{\text{Same}}$ be the event that the dice differed. Then $\Pr(\text{Same}) = \tfrac{1}{6}$ and $\Pr(\text{Diff}) = \tfrac{5}{6}$. The information contents of the events are

$$\operatorname I(\text{Same}) = -\log_2\tfrac{1}{6} \approx 2.585 \text{ Sh},$$
$$\operatorname I(\text{Diff}) = -\log_2\tfrac{5}{6} \approx 0.263 \text{ Sh}.$$
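These probabilities and information contents can be checked by brute-force enumeration of the 36 equally likely ordered outcomes; a Python sketch (the helper info is ours):

```python
import math
from itertools import product

rolls = list(product(range(1, 7), repeat=2))      # 36 equally likely ordered outcomes

def info(event):
    """Self-information (in shannons) of an event given as a predicate over (x, y)."""
    p = sum(1 for xy in rolls if event(xy)) / len(rolls)
    return -math.log2(p)

print(info(lambda xy: xy == (2, 2)))              # A_2: both dice show 2    -> about 5.17 Sh
print(info(lambda xy: set(xy) == {3, 4}))         # B_{3,4}: one 3 and one 4 -> about 4.17 Sh
print(info(lambda xy: xy[0] == xy[1]))            # Same: doubles            -> about 2.58 Sh
print(info(lambda xy: xy[0] != xy[1]))            # Diff: not doubles        -> about 0.26 Sh
```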

Information from sum of dice

The probability mass or density function of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided dice rolls, the random variable $Z = X + Y$ has probability mass function $p_Z(z) = p_X(x) * p_Y(y)$, where $*$ represents the discrete convolution. The outcome $Z = 5$ has probability $p_Z(5) = \tfrac{4}{36} = \tfrac{1}{9}$. Therefore, the information asserted is

$$\operatorname I_Z(5) = -\log_2\tfrac{1}{9} = \log_2 9 \approx 3.170 \text{ Sh}.$$
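A brief Python sketch of the same computation, building the pmf of the sum by enumerating the 36 ordered outcomes (equivalent to the discrete convolution of the two uniform pmfs):

```python
import math
from collections import Counter
from itertools import product

# pmf of Z = X + Y for two independent fair dice, obtained by enumerating
# the 36 equally likely ordered pairs (the discrete convolution in effect).
counts_z = Counter(x + y for x, y in product(range(1, 7), repeat=2))
p_z5 = counts_z[5] / 36                   # 4/36 = 1/9

print(p_z5)                               # 0.111...
print(-math.log2(p_z5))                   # log2(9), about 3.17 shannons
```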

General discrete uniform distribution

Generalizing the example above, consider a general discrete uniform random variable (DURV) $X \sim \mathrm{DU}[a, b]$ with $a, b \in \mathbb{Z}$ and $b \ge a$. For convenience, define $N := b - a + 1$. The probability mass function is

$$p_X(k) = \begin{cases} \frac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases}$$

In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable. The information gain of any observation $X = k$ is

$$\operatorname I_X(k) = -\log_2\tfrac{1}{N} = \log_2 N \text{ Sh}.$$

Special case: constant random variable

If $b = a$ above, $X$ degenerates to a constant random variable with probability distribution deterministically given by $X = b$ and probability measure the Dirac measure $\delta_b$. The only value $X$ can take is deterministically $b$, so the information content of any measurement of $X$ is

$$\operatorname I_X(b) = -\log_2 1 = 0.$$

In general, there is no information gained from measuring a known value.

Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support $\mathcal{S} = \{s_i\}_{i=1}^{N}$ and probability mass function given by

$$p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise.} \end{cases}$$

For the purposes of information theory, the values $s_i$ do not even have to be numbers at all; they can just be mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure. Without loss of generality, we can assume the categorical distribution is supported on the set $[N] = \{1, 2, \ldots, N\}$; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well.
The information of the outcome $X = x$ is given by

$$\operatorname I_X(x) = -\log_2\!\left[p_X(x)\right].$$
From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
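As a sketch of the categorical case in Python (the labels and probabilities are illustrative, chosen to give round numbers of bits):

```python
import math

# An illustrative categorical distribution over arbitrary (non-numeric) labels.
pmf = {"red": 0.5, "green": 0.25, "blue": 0.125, "yellow": 0.125}

for outcome, p in pmf.items():
    # Self-information of each outcome in shannons: red 1.0, green 2.0, blue 3.0, yellow 3.0.
    print(outcome, -math.log2(p))
```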

Relationship to entropy

The entropy is the expected value of the information content of the discrete random variable, with expectation taken over the discrete values it takes. Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies $\mathrm{H}(X) = \operatorname I(X; X)$, where $\operatorname I(X; X)$ is the mutual information of $X$ with itself.
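A minimal Python sketch of this relationship, computing the entropy of an illustrative distribution as the probability-weighted average of each outcome's self-information:

```python
import math

pmf = {"a": 0.5, "b": 0.25, "c": 0.25}          # an illustrative distribution

# Entropy as the pmf-weighted average of the self-information of each outcome.
entropy = sum(p * -math.log2(p) for p in pmf.values() if p > 0)
print(entropy)                                   # 1.5 bits
```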

Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero.
For example, quoting a character of comedian George Carlin: "Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning." Assuming one does not reside near the Earth's poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.
When the content of a message is known a priori with certainty, with probability of 1, there is no actual information conveyed in the message. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.
Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event, $\omega_n$, depends only on the probability of that event:

$$\operatorname I(\omega_n) = f(\Pr(\omega_n))$$

for some function $f(\cdot)$ to be determined below. If $\Pr(\omega_n) = 1$, then $\operatorname I(\omega_n) = 0$. If $\Pr(\omega_n) < 1$, then $\operatorname I(\omega_n) > 0$.
Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event $C$ is the intersection of two independent events $A$ and $B$, then the information of event $C$ occurring is that of the compound message of both independent events $A$ and $B$ occurring. The quantity of information of the compound message $C$ would be expected to equal the sum of the amounts of information of the individual component messages $A$ and $B$ respectively:

$$\operatorname I(C) = \operatorname I(A \cap B) = \operatorname I(A) + \operatorname I(B).$$
Because of the independence of events $A$ and $B$, the probability of event $C$ is

$$\Pr(C) = \Pr(A \cap B) = \Pr(A) \cdot \Pr(B).$$
However, applying function $f(\cdot)$ results in

$$f\big(\Pr(A) \cdot \Pr(B)\big) = f(\Pr(A)) + f(\Pr(B)).$$
The class of function $f(\cdot)$ having the property such that

$$f(x \cdot y) = f(x) + f(y)$$

is the logarithm function of any base. The only operational difference between logarithms of different bases is that of different scaling constants, so we may write $f(p) = K\log(p)$ for some constant $K$. Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, this requires that $K < 0$.
Taking into account these properties, the self-information $\operatorname I(\omega_n)$ associated with outcome $\omega_n$ with probability $\Pr(\omega_n)$ is defined as:

$$\operatorname I(\omega_n) = -\log(\Pr(\omega_n)) = \log\!\left(\frac{1}{\Pr(\omega_n)}\right).$$
The smaller the probability of event $\omega_n$, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of $\operatorname I(\omega_n)$ is the bit (shannon); this is the most common practice. When using the natural logarithm of base $e$, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.
As a quick illustration, the information content associated with an outcome of 4 heads in 4 consecutive tosses of a fair coin would be 4 bits (probability $\tfrac{1}{16}$), and the information content associated with getting a result other than the one specified would be approximately 0.09 bits (probability $\tfrac{15}{16}$, giving $\log_2\tfrac{16}{15} \approx 0.093$). See the examples above for detailed calculations.
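For completeness, these two figures can be reproduced with a short Python sketch (a fair coin is assumed):

```python
import math

p_specified = (1 / 2) ** 4          # probability of one particular sequence of 4 tosses: 1/16
p_other = 1 - p_specified           # probability of any other result: 15/16

print(-math.log2(p_specified))      # 4.0 bits
print(-math.log2(p_other))          # about 0.093 bits
```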