Inter-rater reliability

In statistics, inter-rater reliability is the degree of agreement among raters. It is a score of how much homogeneity or consensus exists in the ratings given by various judges.
In contrast, intra-rater reliability is a score of the consistency in ratings given by the same person across multiple instances. Inter-rater and intra-rater reliability are aspects of test validity. Assessments of them are useful in refining the tools given to human judges, for example, by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.
There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are joint-probability of agreement, Cohen's kappa, Scott's pi and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.

Concept

There are several operational definitions of "inter-rater reliability", reflecting different viewpoints about what constitutes reliable agreement between raters. Three operational definitions of agreement are:

Reliable raters agree with the "official" rating of a performance.
Reliable raters agree with each other about the exact ratings to be awarded.
Reliable raters agree about which performance is better and which is worse.

These combine with two operational definitions of behavior:

Statistics

Joint probability of agreement

The joint-probability of agreement is the simplest and the least robust measure. It is estimated as the percentage of the time the raters agree in a nominal or categorical rating system. It does not take into account the fact that agreement may happen solely based on chance. There is some question whether or not there is a need to 'correct' for chance agreement; some suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.
When the number of categories being used is small, the likelihood that two raters agree purely by chance increases dramatically. This is because both raters must confine themselves to the limited number of options available, which inflates the overall agreement rate without necessarily reflecting their propensity for "intrinsic" agreement.
Therefore, the joint probability of agreement will remain high even in the absence of any "intrinsic" agreement among raters. A useful inter-rater reliability coefficient is expected to be close to 0, when there is no "intrinsic" agreement, and to increase as the "intrinsic" agreement rate improves. Most chance-corrected agreement coefficients achieve the first objective. However, the second objective is not achieved by many known chance-corrected measures.
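The effect of chance on raw agreement can be illustrated with a minimal Python sketch. The function names and the ratings are hypothetical, and the chance model (each rater assigning categories independently according to that rater's own observed proportions) is only one possible assumption about how chance operates.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters chose the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def chance_agreement(rater_a, rater_b):
    """Agreement expected if the raters chose categories independently,
    each following their own observed category proportions (an assumption)."""
    n = len(rater_a)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    return sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)


# Hypothetical ratings of eight items on a two-category scale
a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no"]
b = ["yes", "no", "no", "yes", "yes", "yes", "yes", "no"]

print(percent_agreement(a, b))  # 0.75
print(chance_agreement(a, b))   # 0.53125
```

With only two categories, more than half of the observed 75% agreement in this example would be expected under the independence assumption alone.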

Kappa statistics

Kappa is a way of measuring agreement or reliability, correcting for how often ratings might agree by chance. Cohen's kappa, which works for two raters, and Fleiss' kappa, an adaptation that works for any fixed number of raters, improve upon the joint probability in that they take into account the amount of agreement that could be expected to occur through chance. The original versions suffer from the same problem as the joint probability in that they treat the data as nominal and assume the ratings have no natural ordering; if the data actually have a rank, that information is not fully used.
Later extensions of the approach include versions that can handle "partial credit" and ordinal scales. These extensions converge with the family of intra-class correlations, so there is a conceptually related way of estimating reliability for each level of measurement: nominal, ordinal, interval, and ratio. There are also variants that can look at agreement by raters across a set of items as well as raters × cases.
Kappa is similar to a correlation coefficient in that it cannot go above +1.0 or below -1.0. Because it is used as a measure of agreement, only positive values would be expected in most situations; negative values would indicate systematic disagreement. Kappa can only achieve very high values when both agreement is good and the rate of the target condition is near 50%. Several authorities have offered "rules of thumb" for interpreting the level of agreement, many of which agree in the gist even though the words are not identical.
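The dependence of kappa on the rate of the target condition can be sketched in Python for the two-rater case. This is a minimal illustration of Cohen's kappa, computed as (p_o - p_e) / (1 - p_e) with p_o the observed agreement and p_e the agreement expected from the raters' marginal proportions; the helper `expand` and both data sets are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two equally long, non-empty rating lists")
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


def expand(pair_counts):
    """Turn {(rating_a, rating_b): count} into two parallel rating lists."""
    rater_a, rater_b = [], []
    for (a, b), count in pair_counts.items():
        rater_a += [a] * count
        rater_b += [b] * count
    return rater_a, rater_b


# Both hypothetical data sets have 90% raw agreement on 100 cases.
balanced = expand({("yes", "yes"): 45, ("no", "no"): 45,
                   ("yes", "no"): 5, ("no", "yes"): 5})
skewed = expand({("yes", "yes"): 85, ("no", "no"): 5,
                 ("yes", "no"): 5, ("no", "yes"): 5})

print(round(cohens_kappa(*balanced), 3))  # 0.8
print(round(cohens_kappa(*skewed), 3))    # 0.444
```

The raw agreement is identical in both cases, but kappa is much lower when 90% of the ratings fall into one category, because most of that agreement is then attributed to chance.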

Correlation coefficients

Pearson's r, Kendall's τ, or Spearman's ρ can be used to measure pairwise correlation among raters using a scale that is ordered. Pearson's r assumes the rating scale is continuous; the Kendall and Spearman statistics assume only that it is ordinal. If more than two raters are observed, an average level of agreement for the group can be calculated as the mean of the r, τ, or ρ values from each possible pair of raters.
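Averaging a coefficient over every pair of raters can be sketched as follows. This is a minimal standard-library illustration, not a reference implementation: the function names and ratings are hypothetical, Spearman's coefficient is computed as Pearson's coefficient on average ranks, and Kendall's τ is omitted for brevity.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def mean_pairwise(ratings_by_rater, coefficient):
    """Mean of a two-rater coefficient over every possible pair of raters."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(coefficient(x, y) for x, y in pairs) / len(pairs)


# Hypothetical scores from three raters on the same five items
rater_1 = [1, 2, 3, 4, 5]
rater_2 = [2, 3, 4, 5, 6]  # always one point higher than rater_1
rater_3 = [1, 3, 2, 5, 4]

print(pearson_r(rater_1, rater_2))  # 1.0
print(round(mean_pairwise([rater_1, rater_2, rater_3], spearman_rho), 3))  # 0.867
```

Note that the first two raters correlate perfectly even though they never give the same score, which is why correlation alone does not establish exact agreement.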

Intra-class correlation coefficient

Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC). There are several types of ICC; one is defined as "the proportion of variance of an observation due to between-subject variability in the true scores". The range of the ICC may be between 0.0 and 1.0. The ICC will be high when there is little variation between the scores given to each item by the raters, e.g. if all raters give the same or similar scores to each of the items. The ICC is an improvement over Pearson's r and Spearman's ρ, as it takes into account the differences in ratings for individual segments, along with the correlation between raters.
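As a rough illustration, one common form of the ICC can be computed from a one-way analysis of variance. The sketch below assumes the one-way random-effects, single-rating form, (MSB - MSW) / (MSB + (k - 1) MSW), which is only one of the several types mentioned above; the function name and data are hypothetical. In finite samples this particular estimator can also fall below zero.

```python
def icc_one_way(ratings):
    """One-way random-effects, single-rating ICC (an assumed choice of form).

    ratings: one row per rated item, each row holding the k raters' scores.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items, each scored by the same k >= 2 raters")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Mean square between items and mean square within items
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical data: the second rater is always one point higher than the first.
offset = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
identical = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]

print(round(icc_one_way(offset), 3))  # 0.818
print(icc_one_way(identical))         # 1.0
```

The two raters in `offset` would have a Pearson correlation of exactly 1, yet the ICC is below 1 because the consistent one-point difference counts as disagreement.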

Limits of agreement

Another approach to agreement is to calculate the differences between each pair of the two raters' observations. The mean of these differences is termed bias and the reference interval (mean ± 1.96 × standard deviation) is termed limits of agreement. The limits of agreement provide insight into how much random variation may be influencing the ratings.
If the raters tend to agree, the differences between the raters' observations will be near zero. If one rater is usually higher or lower than the other by a consistent amount, the bias will be different from zero. If the raters tend to disagree, but without a consistent pattern of one rating higher than the other, the mean will be near zero. Confidence limits can be calculated for both the bias and each of the limits of agreement.
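There are several formulae that can be used to calculate limits of agreement. Writing $\bar{d}$ for the mean difference (the bias), $s$ for the standard deviation of the differences, and $n$ for the number of paired observations, the simple formula, which works well for sample sizes greater than 60, is
$$\bar{d} \pm 1.96\, s$$
For smaller sample sizes, another common simplification is
$$\bar{d} \pm 2\, s$$
However, the most accurate formula is
$$\bar{d} \pm t_{0.05,\,n-1}\, s \sqrt{1 + \frac{1}{n}}$$
where $t_{0.05,\,n-1}$ is the two-sided 5% critical value of Student's t distribution with $n - 1$ degrees of freedom.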
Bland and Altman have expanded on this idea by graphing the difference of each point, the mean difference, and the limits of agreement on the vertical against the average of the two ratings on the horizontal. The resulting Bland–Altman plot demonstrates not only the overall degree of agreement, but also whether the agreement is related to the underlying value of the item. For instance, two raters might agree closely in estimating the size of small items, but disagree about larger items.
When comparing two methods of measurement, it is not only of interest to estimate both bias and limits of agreement between the two methods, but also to assess these characteristics for each method within itself. It might very well be that the agreement between two methods is poor simply because one of the methods has wide limits of agreement while the other has narrow. In this case, the method with the narrow limits of agreement would be superior from a statistical point of view, while practical or other considerations might change this appreciation. What constitutes narrow or wide limits of agreement or large or small bias is a matter of a practical assessment in each case.
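The bias, the limits of agreement, and the coordinates of a Bland–Altman plot can be computed with a short Python sketch. The function names and measurements are hypothetical; the default multiplier of 1.96 corresponds to the simple large-sample form and is exposed as a parameter so that another multiplier can be substituted for small samples.

```python
from statistics import mean, stdev


def limits_of_agreement(rater_a, rater_b, multiplier=1.96):
    """Return (bias, lower limit, upper limit) for paired observations."""
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)                  # mean of the differences
    spread = multiplier * stdev(diffs)  # sample standard deviation of differences
    return bias, bias - spread, bias + spread


def bland_altman_points(rater_a, rater_b):
    """(average of the two ratings, difference) pairs: the x and y of the plot."""
    return [((a + b) / 2, a - b) for a, b in zip(rater_a, rater_b)]


# Hypothetical paired measurements of five items
a = [10.0, 12.0, 15.0, 11.0, 14.0]
b = [9.0, 13.0, 13.0, 10.0, 12.0]

bias, lower, upper = limits_of_agreement(a, b)
print(bias)                              # 1.0
print(round(lower, 2), round(upper, 2))  # -1.4 3.4
print(bland_altman_points(a, b)[0])      # (9.5, 1.0)
```

Five pairs are used here only to keep the arithmetic easy to follow; a sample this small would normally call for one of the small-sample formulae rather than the 1.96 multiplier.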

Krippendorff’s alpha

Krippendorff's alpha is a versatile statistic that assesses the agreement achieved among observers who categorize, evaluate, or measure a given set of objects in terms of the values of a variable. It generalizes several specialized agreement coefficients by accepting any number of observers, being applicable to nominal, ordinal, interval, and ratio levels of measurement, being able to handle missing data, and being corrected for small sample sizes.
Alpha emerged in content analysis where textual units are categorized by trained coders and is used in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychometrics where individual attributes are tested by multiple methods, in observational studies where unstructured happenings are recorded for subsequent analysis, and in computational linguistics where texts are annotated for various syntactic and semantic qualities.
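For nominal data, alpha can be sketched in a few lines of Python. The version below is a minimal illustration, assuming the usual coincidence-matrix formulation in which alpha is one minus the ratio of observed to expected disagreement; the function name and the data are hypothetical, missing values are written as `None`, and the ordinal, interval, and ratio variants (which use other difference functions) are not covered.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per rated unit with the values assigned by the
    observers; None marks a missing value.
    """
    coincidences = Counter()
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit with fewer than two values cannot be paired
        # every ordered pair of values from different observers, weighted 1/(m-1)
        for c, k in permutations(present, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())  # number of pairable values
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1 - (n - 1) * observed / expected


# Illustrative data: three observers, fifteen units, many values missing
obs_a = [None, None, None, None, None, 3, 4, 1, 2, 1, 1, 3, 3, None, 3]
obs_b = [1, None, 2, 1, 3, 3, 4, 3, None, None, None, None, None, None, None]
obs_c = [None, None, 2, 1, 3, 4, 4, None, 2, 1, 1, 3, 3, None, 4]
units = [list(values) for values in zip(obs_a, obs_b, obs_c)]

print(round(krippendorff_alpha_nominal(units), 3))  # 0.691
```

Units with fewer than two recorded values contribute nothing, which is how this formulation accommodates missing data without discarding the remaining ratings.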

Disagreement

For any task in which multiple raters are useful, raters are expected to disagree about the observed target. By contrast, situations involving unambiguous measurement, such as simple counting tasks, often do not require more than one person performing the measurement.
Measurements involving ambiguity in the characteristics of interest in the rating target are generally improved by using multiple trained raters. Such measurement tasks often involve subjective judgment of quality. Examples include ratings of a physician's "bedside manner", evaluation of witness credibility by a jury, and the presentation skill of a speaker.
Variation across raters in the measurement procedures and variability in interpretation of measurement results are two examples of sources of error variance in rating measurements. Clearly stated guidelines for rendering ratings are necessary for reliability in ambiguous or challenging measurement scenarios.
Without scoring guidelines, ratings are increasingly affected by experimenter's bias, that is, a tendency of rating values to drift towards what is expected by the rater. During processes involving repeated measurements, correction of rater drift can be addressed through periodic retraining to ensure that raters understand guidelines and measurement goals.
