Errors-in-variables models

In statistics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.
Figure: Regression dilution (attenuation bias) illustrated by a range of regression estimates in errors-in-variables models. Two regression lines bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable is on the abscissa (x-axis). The steeper slope is obtained when the independent variable is on the ordinate (y-axis). By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.
In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias. In non-linear models the direction of the bias is likely to be more complicated.

Motivational example

Consider a simple linear regression model of the form
$$y_t = \alpha + \beta x_t^* + \varepsilon_t, \qquad t = 1, \ldots, T,$$
where x*_t denotes the true but unobserved regressor. Instead we observe this value with an error:
$$x_t = x_t^* + \eta_t,$$
where the measurement error η_t is assumed to be independent of the true value x*_t.
If the y_t's are simply regressed on the x_t's, then the estimator for the slope coefficient is
$$\hat\beta_x = \frac{\tfrac{1}{T}\sum_{t=1}^{T}(x_t - \bar{x})(y_t - \bar{y})}{\tfrac{1}{T}\sum_{t=1}^{T}(x_t - \bar{x})^2},$$
which converges as the sample size T increases without bound:
$$\hat\beta_x \xrightarrow{p} \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]} = \frac{\beta\sigma_*^2}{\sigma_*^2 + \sigma_\eta^2} = \frac{\beta}{1 + \sigma_\eta^2/\sigma_*^2},$$
where σ²_∗ and σ²_η denote the variances of x*_t and η_t.
Variances are non-negative, so that in the limit the estimate β̂_x is smaller in magnitude than the true value of β, an effect which statisticians call attenuation or regression dilution. Thus the ‘naïve’ least squares estimator β̂_x is inconsistent for β in this setting. However, β̂_x is a consistent estimator of the parameter required for a best linear predictor of y given the observed x: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient β, although that would assume that the variance of the errors in observing x* remains fixed. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the y_t's to the actually observed x_t's, in a simple linear regression, is given by
$$\beta_x = \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]}.$$
It is this coefficient, rather than β, that would be required for constructing a predictor of y based on an observed x which is subject to noise.
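The attenuation effect can be checked numerically. The following is a minimal simulation sketch rather than part of the derivation: it assumes normally distributed x*, η and ε, and the function names and default parameter values (`simulate_naive_slope`, `attenuated_limit`, β = 2) are illustrative choices. It compares the naïve least-squares slope of y on the noisy x with the limit β/(1 + σ²_η/σ²_∗).

```python
import random


def simulate_naive_slope(beta=2.0, alpha=1.0, sd_latent=1.0, sd_eta=1.0,
                         sd_eps=0.5, n=100_000, seed=0):
    """Simulate y = alpha + beta*x_star + eps and x = x_star + eta,
    then return the ordinary least-squares slope of y on the observed x."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x_star = rng.gauss(0.0, sd_latent)        # true, unobserved regressor
        xs.append(x_star + rng.gauss(0.0, sd_eta))  # observed with classical error
        ys.append(alpha + beta * x_star + rng.gauss(0.0, sd_eps))
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    return s_xy / s_xx


def attenuated_limit(beta, sd_latent, sd_eta):
    """Probability limit of the naive slope: beta / (1 + var_eta / var_latent)."""
    return beta / (1.0 + sd_eta ** 2 / sd_latent ** 2)


if __name__ == "__main__":
    print("naive slope:", simulate_naive_slope())
    print("predicted limit:", attenuated_limit(2.0, 1.0, 1.0))
```

With equal latent and error variances, as in the defaults above, the predicted limit is half of the true coefficient, so the fitted slope should lie near 1 rather than near 2.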
It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent. Jerry Hausman sees this as an iron law of econometrics: "The magnitude of the estimate is usually smaller than expected."

Specification

Usually measurement error models are described using the latent variables approach. If y is the response variable and x are observed values of the regressors, then it is assumed there exist some latent variables y* and x* which follow the model's “true” functional relationship g(·), and such that the observed quantities are their noisy observations:
$$\begin{cases} x = x^* + \eta, \\ y = y^* + \varepsilon, \\ y^* = g(x^*, w \mid \theta), \end{cases}$$
where θ is the model's parameter and w are those regressors which are assumed to be error-free. Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that corresponding entries in the variance matrix of η's are zero.
The variables y, x, w are all observed, meaning that the statistician possesses a data set of n statistical units {y_i, x_i, w_i}, i = 1, ..., n, which follow the data generating process described above; the latent variables x*, y*, ε, and η are not observed, however.
This specification does not encompass all the existing errors-in-variables models. For example, in some of them the function g(·) may be non-parametric or semi-parametric. Other approaches model the relationship between y* and x* as distributional instead of functional, that is, they assume that y* conditionally on x* follows a certain distribution.

Terminology and assumptions

The observed variable x may be called the manifest, indicator, or proxy variable.
The unobserved variable x* may be called the latent or true variable. It may be regarded either as an unknown constant, or as a random variable.
The relationship between the measurement error η and the latent variable x* can be modeled in different ways:
* Classical errors: η ⊥ x*, the errors are independent of the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and their magnitude does not depend on the value being measured.
* Mean-independence: E[η | x*] = 0, the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one, as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
* Berkson's errors: η ⊥ x, the errors are independent of the observed regressor x. This assumption has very limited applicability. One example is round-off errors: if a person's age* is a continuous random variable, whereas the observed age is truncated to the next smallest integer, then the truncation error is approximately independent of the observed age. Another possibility is the fixed design experiment: if a scientist decides to make a measurement at a certain predetermined moment of time x, then the real measurement may occur at some other value x*, and such measurement error will generally be independent of the "observed" value of the regressor.
* Misclassification errors: a special case used for dummy regressors. If x* is an indicator of a certain event or condition, then the measurement error in such a regressor corresponds to incorrect classification, similar to type I and type II errors in statistical testing. In this case the error η may take only 3 possible values (−1, 0 or 1), and its distribution conditional on x* is modeled with two parameters: α = Pr[η = −1 | x* = 1] and β = Pr[η = 1 | x* = 0]. The necessary condition for identification is that α + β < 1, that is, misclassification should not happen "too often".

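The practical difference between classical and Berkson errors can be illustrated with a small simulation. This is a sketch under stated assumptions, not a general result: the model is linear, all variables are normal, and the names `ols_slope` and `compare_error_types` are hypothetical. In this linear setting the least-squares slope is attenuated under classical errors but remains close to the true β under Berkson errors, because the Berkson error ends up in the regression disturbance and is uncorrelated with the observed x.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def compare_error_types(beta=2.0, sd_eta=1.0, sd_eps=0.5, n=100_000, seed=0):
    """Return (slope under classical errors, slope under Berkson errors)."""
    rng = random.Random(seed)
    x_cl, y_cl, x_bk, y_bk = [], [], [], []
    for _ in range(n):
        # Classical: eta is independent of the latent x_star, x = x_star + eta.
        x_star = rng.gauss(0.0, 1.0)
        x_cl.append(x_star + rng.gauss(0.0, sd_eta))
        y_cl.append(beta * x_star + rng.gauss(0.0, sd_eps))

        # Berkson: eta is independent of the observed (assigned) x,
        # so the latent value is x_star = x - eta.
        x = rng.gauss(0.0, 1.0)
        x_star = x - rng.gauss(0.0, sd_eta)
        x_bk.append(x)
        y_bk.append(beta * x_star + rng.gauss(0.0, sd_eps))
    return ols_slope(x_cl, y_cl), ols_slope(x_bk, y_bk)


if __name__ == "__main__":
    classical, berkson = compare_error_types()
    print("classical:", classical, "Berkson:", berkson)
```

The conclusion for Berkson errors is specific to linear models; in non-linear models Berkson errors can also bias the estimates.
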
Linear model

Linear errors-in-variables models were studied first, probably because linear models were so widely used and are easier to handle than non-linear ones. Unlike standard least squares regression, extending errors-in-variables regression from the simple to the multivariable case is not straightforward.

Simple linear model

The simple linear errors-in-variables model was already presented in the "Motivational example" section:
$$\begin{cases} y_t = \alpha + \beta x_t^* + \varepsilon_t, \\ x_t = x_t^* + \eta_t, \end{cases}$$
where all variables are scalar. Here α and β are the parameters of interest, whereas σ_ε and σ_η, the standard deviations of the error terms, are the nuisance parameters. The "true" regressor x* is treated as a random variable, independent of the measurement error η.
This model is identifiable in two cases: either the latent regressor x* is not normally distributed, or x* is normally distributed but neither ε_t nor η_t is divisible by a normal distribution (that is, neither can be written as the sum of two independent variables, one of which is normal). In particular, the parameters α, β can be consistently estimated from the data set without any additional information, provided the latent regressor is not Gaussian.
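One way to see why non-normality helps is through higher-order moments. The sketch below is an illustration under additional assumptions that the text above does not require: x* is skewed (here exponentially distributed), and x*, η and ε are mutually independent with mean-zero errors. Under these assumptions the ratio of third-order sample moments Σ(x − x̄)(y − ȳ)² / Σ(x − x̄)²(y − ȳ) converges to β, since both sums are proportional to the third central moment of x*. The function names are hypothetical, and the estimator breaks down when x* is symmetric, since the denominator then tends to zero.

```python
import random


def third_moment_slope(xs, ys):
    """Slope estimate from third-order moments; needs a skewed latent regressor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) ** 2 for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 * (y - y_bar) for x, y in zip(xs, ys))
    return num / den


def simulate_skewed(beta=2.0, alpha=1.0, sd_eta=0.5, sd_eps=0.5,
                    n=200_000, seed=0):
    """Draw (x, y) with an exponential (non-Gaussian) latent regressor."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x_star = rng.expovariate(1.0)                # assumed skewed latent regressor
        xs.append(x_star + rng.gauss(0.0, sd_eta))   # classical measurement error
        ys.append(alpha + beta * x_star + rng.gauss(0.0, sd_eps))
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate_skewed()
    print("third-moment slope:", third_moment_slope(xs, ys))
```

No error variance or reliability ratio is supplied to `third_moment_slope`; only the observed x and y enter the estimate.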
Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include:

* Deming regression: assumes that the ratio δ = σ²_ε/σ²_η is known. This could be appropriate, for example, when errors in y and x are both caused by measurements, and the accuracy of the measuring devices or procedures is known. The case when δ = 1 is also known as orthogonal regression.
* Regression with known reliability ratio λ = σ²_∗ / (σ²_η + σ²_∗), where σ²_∗ is the variance of the latent regressor. Such an approach may be applicable, for example, when repeated measurements of the same unit are available, or when the reliability ratio is known from an independent study. In this case the consistent estimate of the slope is equal to the least-squares estimate divided by λ.
* Regression with known σ²_η: may occur when the source of the errors in the x's is known and their variance can be calculated. This could include rounding errors, or errors introduced by the measuring device. When σ²_η is known we can compute the reliability ratio as λ = (σ²_x − σ²_η) / σ²_x and reduce the problem to the previous case.
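The corrections in the list above can be sketched in a few lines. This is an illustrative implementation, not a reference one: the function names are hypothetical, the Deming slope uses the standard closed-form expression in terms of the sample moments s_xx, s_yy, s_xy and the known ratio δ, and the simulated data assume normal variables with σ_η = 0.5 and σ_ε = 1 (so δ = 4).

```python
import math
import random


def sample_moments(xs, ys):
    """Return the sample variances s_xx, s_yy and the covariance s_xy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    s_yy = sum((y - y_bar) ** 2 for y in ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    return s_xx, s_yy, s_xy


def deming_slope(xs, ys, delta):
    """Deming regression slope for known delta = var_eps / var_eta."""
    s_xx, s_yy, s_xy = sample_moments(xs, ys)
    d = s_yy - delta * s_xx
    return (d + math.sqrt(d * d + 4.0 * delta * s_xy ** 2)) / (2.0 * s_xy)


def slope_known_error_variance(xs, ys, var_eta):
    """Least-squares slope divided by the reliability ratio (s_xx - var_eta) / s_xx."""
    s_xx, _, s_xy = sample_moments(xs, ys)
    reliability = (s_xx - var_eta) / s_xx
    return (s_xy / s_xx) / reliability


def simulate(beta=2.0, alpha=1.0, sd_eta=0.5, sd_eps=1.0, n=100_000, seed=0):
    """Draw (x, y) from the simple linear errors-in-variables model."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x_star = rng.gauss(0.0, 1.0)                 # latent regressor, variance 1
        xs.append(x_star + rng.gauss(0.0, sd_eta))
        ys.append(alpha + beta * x_star + rng.gauss(0.0, sd_eps))
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate()
    print("Deming:", deming_slope(xs, ys, delta=4.0))
    print("known var_eta:", slope_known_error_variance(xs, ys, var_eta=0.25))
```

Both estimators rely on the side information being correct; supplying a wrong δ or σ²_η shifts the estimate away from β.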

Newer estimation methods do not require that some of the parameters of the model be known.

Multivariable linear model

The multivariable model looks exactly like the simple linear model, only this time β, η_t, x_t and x*_t are k×1 vectors:
$$\begin{cases} y_t = \alpha + \beta' x_t^* + \varepsilon_t, \\ x_t = x_t^* + \eta_t. \end{cases}$$
In the case when (ε_t, η_t) is jointly normal, the parameter β is not identified if and only if there is a non-singular k×k block matrix [a A], where a is a k×1 vector, such that a′x* is distributed normally and independently of A′x*. In the case when ε_t, η_t1, ..., η_tk are mutually independent, the parameter β is not identified if and only if, in addition to the conditions above, some of the errors can be written as the sum of two independent variables, one of which is normal.
Some of the estimation methods for multivariable linear models are

Non-linear models

A generic non-linear measurement error model takes the form
$$\begin{cases} y_t = g(x_t^*) + \varepsilon_t, \\ x_t = x_t^* + \eta_t. \end{cases}$$
Here the function g can be either parametric or non-parametric. When g is parametric it will be written as g(x*, β).
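For a general vector-valued regressor x* the conditions for model identifiability are not known. However, in the case of scalar x* the model is identified unless the function g is of the "log-exponential" form
$$g(x^*) = a + b \ln\big(e^{c x^*} + d\big)$$
and the latent regressor x* has density
$$f_{x^*}(x) = \begin{cases} A e^{-B e^{Cx} + CDx} \big(e^{Cx} + E\big)^{-F}, & \text{if } d > 0, \\ A e^{-Bx^2 + Cx}, & \text{if } d = 0, \end{cases}$$
where constants A, B, C, D, E, F may depend on a, b, c, d.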
Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However, there are several techniques which make use of some additional data: either instrumental variables or repeated observations.

Instrumental variables methods

Repeated observations

In this approach two repeated observations of the regressor x* are available. Both observations contain their own measurement errors; however, those errors are required to be independent:
$$\begin{cases} x_{1t} = x_t^* + \eta_{1t}, \\ x_{2t} = x_t^* + \eta_{2t}, \end{cases}$$
where x* ⊥ η₁ ⊥ η₂. Variables η₁, η₂ need not be identically distributed. With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski's deconvolution technique.
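Kotlarski's technique recovers the whole density of x* and is not implemented here. As a much simpler illustration of what two independent-error measurements identify, the sketch below uses only second moments: since the errors are independent of x* and of each other, Cov[x₁, x₂] = Var[x*], and each error variance follows as Var[x_j] − Cov[x₁, x₂]. The function names, the normal distributions and the unequal error standard deviations (0.5 and 1.0) are assumptions made for the example.

```python
import random


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b)) / n


def simulate_repeats(sd_latent=1.0, sd_eta1=0.5, sd_eta2=1.0,
                     n=200_000, seed=0):
    """Two noisy measurements of the same latent x_star with independent errors."""
    rng = random.Random(seed)
    x1, x2 = [], []
    for _ in range(n):
        x_star = rng.gauss(0.0, sd_latent)
        x1.append(x_star + rng.gauss(0.0, sd_eta1))  # first measurement
        x2.append(x_star + rng.gauss(0.0, sd_eta2))  # second, noisier measurement
    return x1, x2


def variance_components(x1, x2):
    """Return estimates of (Var[x_star], Var[eta1], Var[eta2])."""
    var_latent = covariance(x1, x2)          # errors drop out of the covariance
    var_eta1 = covariance(x1, x1) - var_latent
    var_eta2 = covariance(x2, x2) - var_latent
    return var_latent, var_eta1, var_eta2


if __name__ == "__main__":
    print(variance_components(*simulate_repeats()))
```

The two error variances are recovered separately, which reflects the remark that η₁ and η₂ need not be identically distributed.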
