Scoring rule


In decision theory, a score function, or scoring rule, measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one. A score can be thought of as either a measure of the "calibration" of a set of probabilistic predictions, or as a "cost function" or "loss function".
If a cost is levied in proportion to a proper scoring rule, the minimal expected cost corresponds to reporting the true set of probabilities. Proper scoring rules are used in meteorology, finance, and pattern classification where a forecaster or algorithm will attempt to minimize the average score to yield refined, calibrated probabilities.

Definition

Suppose $X$ and $Y$ are two random variables defined on a sample space $\Omega$, with $f_X(x)$ and $f_Y(y)$ as their corresponding density (or mass) functions, where $X$ is the forecast target variable and $Y$ is the random variable generated by a forecasting scheme. Also assume that $x_i$, for $i = 1, \ldots, n$, are the realized values. A scoring rule is a function $S(Y, x_i)$ that calculates the distance between the forecast $Y$ and the realization $x_i$.

Orientation

A scoring rule $S(Y, x)$ is positively oriented if, for two different probabilistic forecasts $Y$ and $Y'$, $S(Y, x_i) > S(Y', x_i)$ means that $Y$ is a better probabilistic forecast than $Y'$.

Expected score

The expected score is the expected value of the scoring rule over all possible values of the target variable. For example, for a continuous random variable $X$ with density $f_X$ and a forecast $Y$ we have
$$E[S(Y, X)] = \int f_X(x)\, S(Y, x)\, dx$$

Expected loss

The expected score loss is the difference between the expected score for the target variable $X$ and the expected score for the forecast $Y$:
$$L(Y, X) = E[S(X, X)] - E[S(Y, X)]$$

Propriety

Assuming positive orientation, a scoring rule is considered to be strictly proper if the expected score loss is positive for every forecast that differs from the target variable, and zero when the forecast equals it. In other words, under a strictly proper scoring rule a forecasting scheme scores best if, and only if, it suggests the target variable as the forecast.
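As a rough numerical illustration of propriety, the sketch below computes the expected score loss for discrete outcomes under the logarithmic rule. The distributions and the helper names `expected_log_score` and `expected_loss` are hypothetical, chosen only for this example.

```python
import math

def expected_log_score(target, forecast):
    """Expected logarithmic score of `forecast` when outcomes follow `target`.

    Assumes every outcome with positive target probability also has a
    positive forecast probability (log(0) is undefined).
    """
    return sum(t * math.log(f) for t, f in zip(target, forecast) if t > 0)

def expected_loss(target, forecast):
    """Expected score loss: score of the target itself minus score of the forecast."""
    return expected_log_score(target, target) - expected_log_score(target, forecast)

target = [0.5, 0.3, 0.2]            # hypothetical true distribution
honest = [0.5, 0.3, 0.2]            # forecast equal to the target
hedged = [0.4, 0.4, 0.2]            # slightly wrong forecast
overconfident = [0.9, 0.05, 0.05]   # badly wrong forecast

loss_honest = expected_loss(target, honest)
loss_hedged = expected_loss(target, hedged)
loss_overconfident = expected_loss(target, overconfident)

print(loss_honest, loss_hedged, loss_overconfident)
```

Under these assumptions the loss is zero only for the honest forecast and grows as the forecast moves away from the target, which is the behaviour a strictly proper rule requires.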

Non-probabilistic forecast accuracy measures

Although scoring rules are introduced in the probabilistic forecasting literature, the definition is general enough to consider non-probabilistic measures such as mean absolute error or mean square error as specific scoring rules. The main characteristic of such scoring rules is that $S(Y, x)$ is just a function of the expected value of $Y$, i.e. $E(Y)$.

Example application of scoring rules

An example of probabilistic forecasting is in meteorology, where a weather forecaster may give the probability of rain on the next day. One could note the number of times that a 25% probability was quoted over a long period and compare this with the actual proportion of times that rain fell. If the actual percentage was substantially different from the stated probability, we say that the forecaster is poorly calibrated. A poorly calibrated forecaster might be encouraged to do better by a bonus system. A bonus system designed around a proper scoring rule will incentivize the forecaster to report probabilities equal to their personal beliefs.
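The calibration check described above can be sketched as follows. The forecast record is invented for illustration, and `calibration_table` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts by stated probability and report observed frequencies.

    Returns {stated_probability: (number_of_forecasts, observed_frequency)}.
    """
    groups = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        groups[p].append(rained)
    return {p: (len(obs), sum(obs) / len(obs)) for p, obs in sorted(groups.items())}

# Hypothetical record: stated probability of rain, and whether it rained (1) or not (0).
forecasts = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
outcomes = [0, 1, 0, 0, 1, 1, 0, 1]

table = calibration_table(forecasts, outcomes)
for p, (n, freq) in table.items():
    print(f"stated {p:.2f}: {n} forecasts, rain observed {freq:.2f} of the time")
```

In this made-up record the observed frequencies match the stated probabilities, so the forecaster would be considered well calibrated; a real check would need a much longer record.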
In addition to the simple case of a binary decision, such as assigning probabilities to 'rain' or 'no rain', scoring rules may be used for multiple classes, such as 'rain', 'snow', or 'clear'.
The image to the right shows an example of a scoring rule, the logarithmic scoring rule, as a function of the probability reported for the event that actually occurred. One way to use this rule is to record the probabilities that a forecaster or algorithm assigns, observe which event actually occurs, and then charge a cost based on the probability that was assigned to that event.

Proper scoring rules

A probabilistic forecaster or algorithm will return a probability vector $\mathbf{r}$ with a probability $r_i$ for each of the possible outcomes $i$. One use of a scoring function $S$ is to give a reward of $S(\mathbf{r}, i)$ if the $i$th event occurs. If a proper scoring rule is used, then the highest expected reward is obtained by reporting the true probability distribution. The use of a proper scoring rule thus encourages the forecaster to be honest in order to maximize the expected reward.
A scoring rule is strictly proper if it is uniquely optimized by the true probabilities. Optimized in this case will correspond to maximization for the quadratic, spherical, and logarithmic rules but minimization for the Brier Score. This can be seen in the image at right for the logarithmic rule. Here, Event 1 is expected to occur with probability of 0.8, and the expected score is shown as a function of the reported probability. The way to maximize the expected reward is to report the actual probability of 0.8 as all other reported probabilities will yield a lower expected score. This property holds because the logarithmic score is proper.
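The 0.8 example above can be reproduced numerically. This is a minimal sketch that scans a grid of reported probabilities; the grid resolution and the function name `expected_log_score` are arbitrary choices for illustration.

```python
import math

def expected_log_score(true_p, reported_q):
    """Expected logarithmic score for a binary event with true probability true_p."""
    return true_p * math.log(reported_q) + (1 - true_p) * math.log(1 - reported_q)

true_p = 0.8
grid = [k / 100 for k in range(1, 100)]   # reported probabilities 0.01 .. 0.99

# Pick the reported probability with the highest expected score.
best_q = max(grid, key=lambda q: expected_log_score(true_p, q))
print(best_q)
```

On this grid the maximizer coincides with the true probability, consistent with the logarithmic rule being strictly proper.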

Examples of proper scoring rules

There are infinitely many scoring rules, including entire parameterized families of proper scoring rules. The ones shown below are simply popular examples.

Logarithmic scoring rule

The logarithmic scoring rule is a local strictly proper scoring rule. This is also the negative of surprisal, which is commonly used as a scoring criterion in Bayesian inference; the goal is to minimize expected surprise. This scoring rule has strong foundations in information theory.
Here, the score is calculated as the logarithm of the probability estimate for the actual outcome. That is, a prediction of 80% that correctly proved true would receive a score of $\ln(0.8) \approx -0.22$. This same prediction also assigns 20% likelihood to the opposite case, and so if the prediction proves false, it would receive a score based on the 20%: $\ln(0.2) \approx -1.61$. The goal of a forecaster is to make the score as large as possible, and −0.22 is indeed larger than −1.61.
If one treats the truth or falsity of the prediction as a variable $x$ with value 1 or 0 respectively, and the expressed probability as $p$, then one can write the logarithmic scoring rule as $x \ln(p) + (1 - x) \ln(1 - p)$. Note that any logarithmic base may be used, since strictly proper scoring rules remain strictly proper under positive linear transformation. That is, writing $r_i$ for the probability assigned to the outcome $i$ that occurred,
$$L(\mathbf{r}, i) = \log_b(r_i)$$
is strictly proper for all $b > 1$.
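The arithmetic of the 80% example can be checked with a few lines of code. This is only a sketch of the binary logarithmic rule using the natural logarithm; `log_score` is a name chosen here for illustration.

```python
import math

def log_score(p, x):
    """Binary logarithmic score x*ln(p) + (1 - x)*ln(1 - p) for outcome x in {0, 1}."""
    return math.log(p) if x == 1 else math.log(1 - p)

score_true = log_score(0.8, 1)    # the 80% prediction proved true
score_false = log_score(0.8, 0)   # the same prediction proved false

print(round(score_true, 2), round(score_false, 2))   # -0.22 -1.61
```

Dividing either score by `math.log(b)` converts it to base `b`, which rescales the values but does not change which forecast scores best.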

Brier/quadratic scoring rule

The quadratic scoring rule is a strictly proper scoring rule
$$Q(\mathbf{r}, i) = 2 r_i - \mathbf{r} \cdot \mathbf{r} = 2 r_i - \sum_{j=1}^{C} r_j^2$$
where $r_i$ is the probability assigned to the correct answer and $C$ is the number of classes.
The Brier score, originally proposed by Glenn W. Brier in 1950, can be obtained by an affine transform from the quadratic scoring rule:
$$B(\mathbf{r}, i) = \sum_{j=1}^{C} (y_j - r_j)^2$$
where $y_j = 1$ when the $j$th event is correct and $y_j = 0$ otherwise, and $C$ is the number of classes.
An important difference between these two rules is that a forecaster should strive to maximize the quadratic score yet minimize the Brier score. This is due to the negative sign in the affine transformation between them: because exactly one $y_j$ equals 1, expanding the square gives $B(\mathbf{r}, i) = 1 - Q(\mathbf{r}, i)$.
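A small sketch of both rules, assuming the usual multiclass forms $Q(\mathbf{r}, i) = 2 r_i - \sum_j r_j^2$ and $B(\mathbf{r}, i) = \sum_j (y_j - r_j)^2$ with $y$ one-hot at the correct class. The forecast vector is hypothetical.

```python
def quadratic_score(r, i):
    """Quadratic score 2*r[i] - sum_j r[j]**2 (higher is better)."""
    return 2 * r[i] - sum(rj ** 2 for rj in r)

def brier_score(r, i):
    """Brier score sum_j (y[j] - r[j])**2 with y one-hot at index i (lower is better)."""
    return sum(((1.0 if j == i else 0.0) - rj) ** 2 for j, rj in enumerate(r))

r = [0.7, 0.2, 0.1]   # hypothetical forecast over three classes; class 0 occurs

q = quadratic_score(r, 0)
b = brier_score(r, 0)
print(q, b, 1 - q)
```

Up to floating-point rounding, `b` equals `1 - q`, which illustrates the sign flip between the two rules: a larger quadratic score corresponds to a smaller Brier score.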

Spherical scoring rule

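The spherical scoring rule is also a strictly proper scoring rule:
$$S(\mathbf{r}, i) = \frac{r_i}{\lVert \mathbf{r} \rVert} = \frac{r_i}{\sqrt{r_1^2 + \cdots + r_C^2}}$$
where $r_i$ is the probability assigned to the correct answer and $C$ is the number of classes.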
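As an illustration, the spherical rule in its usual form (the probability of the realized class divided by the Euclidean norm of the forecast vector) might be computed as in this sketch. The example forecasts are made up.

```python
import math

def spherical_score(r, i):
    """Spherical score: r[i] divided by the Euclidean norm of r (higher is better)."""
    return r[i] / math.sqrt(sum(rj ** 2 for rj in r))

uniform = spherical_score([0.5, 0.5], 0)    # uninformative binary forecast
confident = spherical_score([0.8, 0.2], 0)  # 80% on the event that occurred
certain = spherical_score([1.0, 0.0], 0)    # all probability on the event that occurred

print(uniform, confident, certain)
```

The score rises from about 0.71 for the uniform forecast to exactly 1 when all probability is placed on the realized class.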

Interpretation of proper scoring rules

All proper scoring rules are equal to weighted sums of the losses in a set of simple two-alternative decision problems that use the probabilistic prediction. Each such decision problem has a particular combination of associated cost parameters for false positive and false negative decisions. A strictly proper scoring rule corresponds to having a nonzero weighting for all possible decision thresholds. Any given proper scoring rule is equal to the expected losses with respect to a particular probability distribution over the decision thresholds. The choice of a scoring rule therefore corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed. For example, the quadratic loss scoring rule corresponds to a uniform probability of the decision threshold being anywhere between zero and one. The accuracy score, which is zero or one depending on whether the predicted probability is on the appropriate side of 0.5, is a proper scoring rule but not a strictly proper scoring rule.
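The uniform-threshold reading of the quadratic loss can be checked numerically. The sketch below assumes one particular cost parameterization that the text does not spell out: at threshold $c$ a false negative costs $1 - c$ and a false positive costs $c$, so that acting when the forecast exceeds $c$ is the optimal decision. All function names are hypothetical.

```python
def threshold_loss(p, y, c):
    """Loss of acting on forecast p at decision threshold c (assumed cost model).

    The event is predicted when p > c. A false negative costs (1 - c),
    a false positive costs c, and correct decisions cost nothing.
    """
    predicted = p > c
    if y == 1 and not predicted:
        return 1 - c
    if y == 0 and predicted:
        return c
    return 0.0

def integrated_loss(p, y, steps=10_000):
    """Twice the average threshold loss for c uniform on (0, 1), by the midpoint rule."""
    total = sum(threshold_loss(p, y, (k + 0.5) / steps) for k in range(steps))
    return 2 * total / steps

p = 0.7
for y in (0, 1):
    quadratic_loss = (y - p) ** 2
    print(y, quadratic_loss, integrated_loss(p, y))
```

Under this cost model the integrated threshold loss matches the binary quadratic loss $(y - p)^2$ up to numerical error; other weightings over thresholds would give other proper scoring rules.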

Comparison of proper scoring rules

Shown below on the left is a graphical comparison of the logarithmic, quadratic, and spherical scoring rules for a binary classification problem. The x-axis indicates the reported probability for the event that actually occurred.
It is important to note that each of the scores has a different magnitude and location. The magnitude differences are not relevant, however, as scores remain proper under positive affine transformation. Therefore, to compare different scores it is necessary to move them to a common scale. A reasonable choice of normalization is shown in the picture on the right, where all scores intersect the points (0.5, 0) and (1, 1). This ensures that they yield 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution. All normalized scores below also yield 1 when the true class is assigned a probability of 1.
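The normalization described above amounts to a positive affine rescaling of each binary rule. The sketch below assumes the standard binary forms of the three rules, written as functions of the probability $p$ reported for the event that occurred; `normalize` is a hypothetical helper.

```python
import math

def log_rule(p):
    return math.log(p)

def quadratic_rule(p):
    return 2 * p - (p ** 2 + (1 - p) ** 2)

def spherical_rule(p):
    return p / math.sqrt(p ** 2 + (1 - p) ** 2)

def normalize(rule):
    """Return an affine rescaling of `rule` that maps p = 0.5 to 0 and p = 1 to 1."""
    low, high = rule(0.5), rule(1.0)
    return lambda p: (rule(p) - low) / (high - low)

for rule in (log_rule, quadratic_rule, spherical_rule):
    scaled = normalize(rule)
    print(rule.__name__, scaled(0.5), scaled(0.8), scaled(1.0))
```

Because each rule scores higher at $p = 1$ than at $p = 0.5$, the scale factor is positive, so the rescaled rules stay strictly proper while becoming directly comparable.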

Characteristics

Positive-affine transformation

A strictly proper scoring rule, whether binary or multiclass, remains a strictly proper scoring rule after a positive-affine transformation. That is, if $S(\mathbf{r}, i)$ is a strictly proper scoring rule then $a + b\,S(\mathbf{r}, i)$ with $b > 0$ is also a strictly proper scoring rule.
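A short derivation makes the reason explicit. Here $\mathbf{p}$ denotes the true distribution and $S'$ the transformed rule; both symbols are introduced only for this sketch.

```latex
\begin{align*}
S'(\mathbf{r}, i) &= a + b\,S(\mathbf{r}, i), \qquad b > 0, \\
E_{\mathbf{p}}\!\left[S'(\mathbf{r}, \cdot)\right]
  &= \sum_i p_i \bigl(a + b\,S(\mathbf{r}, i)\bigr)
   = a + b\,E_{\mathbf{p}}\!\left[S(\mathbf{r}, \cdot)\right].
\end{align*}
```

Since $b > 0$, the transformed expected score is an increasing function of the original one, so both are maximized by the same report; if $\mathbf{r} = \mathbf{p}$ is the unique maximizer for $S$, it is also the unique maximizer for $S'$.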

Locality

A proper scoring rule is said to be local if the score it assigns when a specific event occurs depends only on the probability reported for that event. This statement is vague in most descriptions, but in most cases it can be read as follows: the optimal solution of the scoring problem "at a specific event" is invariant to all changes in the observation distribution that leave the probability of that event unchanged. All binary scores are local because the probability assigned to the event that did not occur is determined by the probability assigned to the event that did, so there is no remaining degree of freedom to vary.
Affine functions of the logarithmic scoring rule are the only strictly proper local scoring rules on a finite set that is not binary.
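Locality can be illustrated by comparing two three-class forecasts that agree on the probability of the realized class but differ elsewhere. The forecasts are hypothetical, and the quadratic rule is used here only as a contrasting non-local example.

```python
import math

def log_score(r, i):
    """Logarithmic score: depends only on r[i]."""
    return math.log(r[i])

def quadratic_score(r, i):
    """Quadratic score: depends on the whole vector r."""
    return 2 * r[i] - sum(rj ** 2 for rj in r)

# Both forecasts give probability 0.5 to class 0, which is the class that occurs.
r1 = [0.5, 0.25, 0.25]
r2 = [0.5, 0.4, 0.1]

print(log_score(r1, 0), log_score(r2, 0))              # identical
print(quadratic_score(r1, 0), quadratic_score(r2, 0))  # different
```

The logarithmic score ignores how the remaining probability is split between the other classes, whereas the quadratic score does not.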

Decomposition

The expectation value of a proper scoring rule $S$ can be decomposed into the sum of three components, called uncertainty, reliability, and resolution, which characterize different attributes of probabilistic forecasts:
$$E(S) = \mathrm{UNC} + \mathrm{REL} - \mathrm{RES}$$
If a score is proper and negatively oriented (such as the Brier score), all three terms are non-negative.
The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency.
The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies.
The equations for the individual components depend on the particular scoring rule.
For the Brier score, they are given by
$$\mathrm{UNC} = \bar{x}(1 - \bar{x})$$
$$\mathrm{REL} = E\left[(p - \pi(p))^2\right]$$
$$\mathrm{RES} = E\left[(\pi(p) - \bar{x})^2\right]$$
where $\bar{x}$ is the average probability of occurrence of the binary event $x$, and $\pi(p)$ is the conditional event probability given the forecast $p$, i.e. $\pi(p) = P(x = 1 \mid p)$.
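The decomposition can be checked on a small data set. This sketch uses the binary Brier score $(p - x)^2$ and estimates the conditional event probability by grouping forecasts with identical values, which assumes the forecaster only issues a few distinct probabilities. The data and the name `brier_decomposition` are invented for illustration.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Return (uncertainty, reliability, resolution) for binary outcomes."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n            # average event frequency
    groups = defaultdict(list)
    for p, x in zip(forecasts, outcomes):
        groups[p].append(x)                  # outcomes observed for each stated p
    unc = base_rate * (1 - base_rate)
    rel = sum(len(xs) * (p - sum(xs) / len(xs)) ** 2 for p, xs in groups.items()) / n
    res = sum(len(xs) * (sum(xs) / len(xs) - base_rate) ** 2 for xs in groups.values()) / n
    return unc, rel, res

forecasts = [0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

unc, rel, res = brier_decomposition(forecasts, outcomes)
mean_brier = sum((p - x) ** 2 for p, x in zip(forecasts, outcomes)) / len(outcomes)
print(mean_brier, unc + rel - res)
```

For this record the mean Brier score and the sum of uncertainty plus reliability minus resolution agree up to floating-point rounding. With continuous-valued forecasts the grouping step would need binning, which makes the identity only approximate.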