Peirce's criterion

In robust statistics, Peirce's criterion is a rule for eliminating outliers from data sets, which was devised by Benjamin Peirce.

Figure: Outliers removed by Peirce's criterion.

The problem of outliers

In data sets containing real-numbered measurements, the suspected outliers are the measured values that appear to lie outside the cluster of most of the other data values. The outliers would greatly change the estimate of location if the arithmetic average were to be used as a summary statistic of location. The problem is that the arithmetic mean is very sensitive to the inclusion of any outliers; in statistical terminology, the arithmetic mean is not robust.
In the presence of outliers, the statistician has two options. First, the statistician may remove the suspected outliers from the data set and then use the arithmetic mean to estimate the location parameter. Second, the statistician may use a robust statistic, such as the median.
Peirce's criterion is a statistical procedure for eliminating outliers.
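
The sensitivity of the arithmetic mean described above can be illustrated with a minimal sketch. The readings and the helper name location_estimates are hypothetical, chosen only for illustration; they are not taken from any of the cited works.

```python
import statistics
from typing import Sequence, Tuple


def location_estimates(values: Sequence[float]) -> Tuple[float, float]:
    """Return (arithmetic mean, median) of the given measurements."""
    return statistics.mean(values), statistics.median(values)


# Hypothetical readings clustered near 10, plus one suspect value.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
suspect = clean + [25.0]

mean_clean, median_clean = location_estimates(clean)
mean_suspect, median_suspect = location_estimates(suspect)

# The single suspect value pulls the mean well away from the cluster,
# while the median barely moves.
print("mean:  ", mean_clean, "->", mean_suspect)
print("median:", median_clean, "->", median_suspect)
```

In this sketch the mean shifts by about 2.5 units when the suspect value is included, whereas the median shifts by only 0.05.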

Uses of Peirce's criterion

The statistician and historian of statistics Stephen M. Stigler wrote the following about Benjamin Peirce:

"In 1852 he published the first significance test designed to tell an investigator whether an outlier should be rejected. The test, based on a likelihood ratio type of argument, had the distinction of producing an international debate on the wisdom of such actions."

Peirce's criterion is derived from a statistical analysis of the Gaussian distribution. Unlike some other criteria for removing outliers, Peirce's method can be applied to identify two or more outliers. Peirce stated the problem and his guiding principle as follows:

"It is proposed to determine in a series of observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations."

Hawkins provides a formula for the criterion.
Peirce's criterion was used for decades at the United States Coast Survey.

"From 1852 to 1867 he served as the director of the longitude determinations of the U. S. Coast Survey and from 1867 to 1874 as superintendent of the Survey. During these years his test was consistently employed by all the clerks of this, the most active and mathematically inclined statistical organization of the era."

Peirce's criterion was discussed in William Chauvenet's book.

Applications

One application of Peirce's criterion is removing poor data points from observation pairs before performing a regression between the two observations. The criterion does not depend on the observed values themselves, only on characteristics of the data set such as the number of observations, which makes it a highly repeatable process that can be calculated independently of other processes. This feature makes Peirce's criterion well suited to computer applications, because it can be written as a callable function.

Previous attempts

In 1855, B. A. Gould attempted to make Peirce's criterion easier to apply by tabulating values computed from Peirce's equations. A disconnect still exists between Gould's algorithm and the practical application of Peirce's criterion.
In 2003, S. M. Ross re-presented Gould's algorithm with a new example data set and a worked example of the algorithm. This methodology still relies on look-up tables, which were updated in that work.
In 2008, the Danish geologist K. Thomsen attempted to write pseudo-code for the criterion. While this code provided some framework for Gould's algorithm, users were unable to reproduce the values reported by either Peirce or Gould.
In 2012, C. Dardis released the R package "Peirce", which includes several methodologies along with comparisons of outlier removals. Dardis and fellow contributor Simon Muller successfully implemented Thomsen's pseudo-code in a function called "findx". The code is presented in the R implementation section below. References for the R package are available online, as is an unpublished review of the R package results.
In 2013, a re-examination of Gould's algorithm, together with the use of Python's numerical programming modules, made it possible to calculate the squared-error threshold values for identifying outliers.

Python implementation

In order to use Peirce's criterion, one must first understand the input and return values. Regression analysis (or the fitting of curves to data) results in residual errors, the differences between the fitted curve and the observed points. Each observation point therefore has a residual error associated with the fitted curve. Squaring the residual errors expresses them as positive values. If a squared error is too large (that is, due to a poor observation), it can distort the regression parameters (for example slope and intercept for a linear curve) retrieved from the curve fitting.
Peirce's idea was to identify statistically when an error is "too large", so that the corresponding observation can be classed as an "outlier" and removed to improve the fit between the observations and the curve. K. Thomsen identified three parameters needed to perform the calculation: the number of observation pairs (N), the number of outliers to be removed (n), and the number of regression parameters (for example coefficients) used in the curve fitting to obtain the residuals (m). The end result of this process is a threshold value of squared error: observations with a squared error smaller than the threshold should be kept, and observations with a squared error larger than the threshold should be removed as outliers.
Because Peirce's criterion does not take observations, fitting parameters, or residual errors as input, its output must be re-associated with the data. Multiplying the average of all the squared errors (the mean-squared error) by the threshold squared error returned by the criterion gives the data-specific threshold value used to identify outliers.
The following Python code returns x-squared values for a given N and n, as listed in Table 1 and Table 2 of Gould 1855. Because the values are computed iteratively, look-up tables such as N versus log Q and x versus log R are no longer necessary.
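
The re-association step described above can be sketched as follows. This is a minimal illustration, not part of any cited implementation: the function name flag_outliers, the sample residuals, and the value x2 = 3.0 are all hypothetical stand-ins (the last is not a tabulated threshold).

```python
from typing import List, Sequence


def flag_outliers(residuals: Sequence[float], x2: float) -> List[bool]:
    """Flag residuals whose squared error exceeds the data-specific threshold.

    The threshold is the mean-squared error multiplied by x2, where x2 is
    the squared threshold error supplied by the criterion.
    """
    if not residuals:
        return []
    squared = [r * r for r in residuals]
    mean_sq = sum(squared) / len(squared)
    delta2 = mean_sq * x2  # data-specific squared-error threshold
    return [s > delta2 for s in squared]


# Hypothetical residuals from a curve fit; x2 = 3.0 is an arbitrary
# stand-in for a value returned by the criterion.
residuals = [0.1, -0.2, 0.15, -0.1, 2.0]
print(flag_outliers(residuals, x2=3.0))
```

Only the last residual is flagged here, because its squared error (4.0) is the only one exceeding the threshold of roughly 2.45.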

Python code

#!/usr/bin/env python3

import numpy
import scipy.special


def peirce_dev(N: int, n: int, m: int) -> float:
    """Peirce's criterion

    Returns the squared threshold error deviation for outlier identification
    using Peirce's criterion based on Gould's methodology.

    Arguments:
        - int, total number of observations (N)
        - int, number of outliers to be removed (n)
        - int, number of model unknowns (m)
    Returns:
        float, squared error threshold (x2)
    """
    # Assign floats to input variables:
    N = float(N)
    n = float(n)
    m = float(m)

    # Check number of observations:
    if N > 1:
        # Calculate Q (Nth root of n^n * (N-n)^(N-n) / N^N):
        Q = (n ** (n / N) * (N - n) ** ((N - n) / N)) / N
        #
        # Initialize R values (as floats)
        r_new = 1.0
        r_old = 0.0  # <- Necessary to prompt while loop
        #
        # Start iteration to converge on R:
        while abs(r_new - r_old) > (N * 2.0e-16):
            # Calculate Lamda
            # (1/(N-n)th root of Q^N / R^n):
            ldiv = r_new ** n
            if ldiv == 0:
                ldiv = 1.0e-6
            Lamda = ((Q ** N) / ldiv) ** (1.0 / (N - n))
            # Calculate x-squared from Lamda:
            x2 = 1.0 + (N - m - n) / n * (1.0 - Lamda ** 2.0)
            # If x2 goes negative, return 0:
            if x2 < 0:
                x2 = 0.0
                r_old = r_new
            else:
                # Use x-squared to update R:
                r_old = r_new
                r_new = numpy.exp((x2 - 1) / 2.0) * scipy.special.erfc(
                    numpy.sqrt(x2) / numpy.sqrt(2.0)
                )
    else:
        x2 = 0.0
    return x2
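
The relations that such an iteration solves can be written compactly. The block below is a sketch assuming the standard form of the equations used in Gould's method, with N the number of observations, n the number of assumed outliers, m the number of model unknowns, and erfc the complementary error function; the symbols Q, λ and R are intermediate quantities.

```latex
\begin{align}
Q^{N} &= \frac{n^{n}\,(N-n)^{N-n}}{N^{N}} \\
\lambda &= \left(\frac{Q^{N}}{R^{n}}\right)^{1/(N-n)} \\
x^{2} &= 1 + \frac{N-m-n}{n}\,\bigl(1-\lambda^{2}\bigr) \\
R &= e^{(x^{2}-1)/2}\,\operatorname{erfc}\!\left(\frac{x}{\sqrt{2}}\right)
\end{align}
```

Starting from a trial value of R, the second and third relations give x², the fourth gives an updated R, and the cycle is repeated until R stops changing.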

R implementation

Thomsen's code was successfully written into the following function, "findx", by C. Dardis and S. Muller in 2012; it returns the maximum error deviation, x. To complement the Python code presented in the previous section, the R equivalent of "peirce_dev" is also presented here; it returns the squared maximum error deviation, x². The two functions give equivalent values, either by squaring the value returned by "findx" or by taking the square root of the value returned by "peirce_dev". Differences occur in error handling. For example, the "findx" function returns NaNs for invalid data while "peirce_dev" returns 0. Also, the "findx" function does not support any error handling when the number of potential outliers increases towards the number of observations.
Just as with the Python version, the squared error returned by the "peirce_dev" function must be multiplied by the mean-squared error of the model fit to get the squared-delta value, Δ². Compare Δ² with the squared-error values of the model fit: any observation pairs with a squared error greater than Δ² are considered outliers and can be removed from the model. An iterator should be written to test increasing values of n until the number of outliers identified is less than the number assumed.
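
The iterator described above might be sketched as follows, written in Python for consistency with the earlier listing. The names find_outliers and threshold_fn are hypothetical; threshold_fn stands for any function that takes (N, n, m) and returns a squared threshold such as the one from "peirce_dev". The stand-in threshold used in the example is arbitrary and is not Peirce's criterion.

```python
from typing import Callable, List, Sequence


def find_outliers(
    squared_errors: Sequence[float],
    m: int,
    threshold_fn: Callable[[int, int, int], float],
) -> List[bool]:
    """Flag outliers by testing n = 1, 2, ... assumed outliers in turn.

    Stops as soon as fewer outliers are identified than were assumed,
    or when N - m - n would no longer be positive.
    """
    N = len(squared_errors)
    flags = [False] * N
    if N == 0:
        return flags
    mean_sq = sum(squared_errors) / N
    n = 1
    while N - m - n > 0:
        # Data-specific threshold for the current assumed outlier count.
        delta2 = mean_sq * threshold_fn(N, n, m)
        flags = [s > delta2 for s in squared_errors]
        if sum(flags) < n:
            break
        n += 1
    return flags


# Hypothetical squared errors and an arbitrary, decreasing stand-in
# threshold; a real run would pass a function such as peirce_dev.
squared_errors = [0.01, 0.04, 0.0225, 0.01, 4.0]
print(find_outliers(squared_errors, m=1, threshold_fn=lambda N, n, m: 4.0 / n))
```

Passing the threshold function as an argument keeps the loop independent of how the threshold is computed, so the same iterator could be used with either implementation.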

R code

findx <- function

peirce_dev <- function
