Iterative proportional fitting

The iterative proportional fitting procedure is an iterative algorithm for estimating cell values of a contingency table such that the marginal totals remain fixed and the estimated table decomposes into an outer product.
IPF has been "re-invented" many times: by G.U. Yule in 1912 in relation to standardizing cross-tabulations, and by Kruithof in 1937 in relation to telephone traffic. It was expanded upon by Deming and Stephan in 1940 and has since seen various extensions and related research. A rigorous proof of convergence by means of differential geometry is due to Fienberg. He interpreted the family of contingency tables of constant crossproduct ratios as a particular manifold of constant interaction and showed that the IPFP is a fixed-point iteration on that manifold. However, he assumed strictly positive observations. Generalization to tables with zero entries is still considered a hard and only partly solved problem.
An exhaustive treatment of the algorithm and its mathematical foundations can be found in the book of Bishop et al. The first general proof of convergence, built on non-trivial measure-theoretic theorems and entropy minimization, is due to Csiszár.
More recent results on convergence and error behavior have been published by Pukelsheim and Simeone. They proved simple necessary and sufficient conditions for the convergence of the IPFP for arbitrary two-way tables by analysing an $L_1$-error function.
Other general algorithms can be modified to yield the same limit as the IPFP, for instance the Newton–Raphson method and
the EM algorithm. In most cases, IPFP is preferred due to its computational speed, numerical stability and algebraic simplicity.

Algorithm 1 (classical IPF)

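Given a two-way $(I \times J)$-table of counts $(x_{ij})$, where the cell values are assumed to be Poisson or multinomially distributed, we wish to estimate a decomposition $\hat{m}_{ij} = a_i b_j$ for all $i$ and $j$ such that $(\hat{m}_{ij})$ is the maximum likelihood estimate (MLE) of the expected values $(m_{ij})$, leaving the marginals $x_{i+} = \sum_j x_{ij}$ and $x_{+j} = \sum_i x_{ij}$ fixed. The assumption that the table factorizes in such a manner is known as the model of independence (I-model). Written as a log-linear model, this assumption reads $\log m_{ij} = u + v_i + w_j + z_{ij}$, where $m_{ij} := \mathbb{E}(x_{ij})$, $\sum_i v_i = \sum_j w_j = 0$, and the interaction term vanishes, that is, $z_{ij} = 0$ for all $i$ and $j$.
Choose initial values $\hat{m}_{ij}^{(0)} := 1$, and for $\eta \geq 1$ set
$$\hat{m}_{ij}^{(2\eta-1)} = \frac{\hat{m}_{ij}^{(2\eta-2)} \, x_{i+}}{\sum_{k=1}^{J} \hat{m}_{ik}^{(2\eta-2)}},$$
$$\hat{m}_{ij}^{(2\eta)} = \frac{\hat{m}_{ij}^{(2\eta-1)} \, x_{+j}}{\sum_{k=1}^{I} \hat{m}_{kj}^{(2\eta-1)}}.$$
Notes:

Convergence does not depend on the actual distribution. Distributional assumptions are only needed to infer that the limit $(\hat{m}_{ij}) := \lim_{\eta \to \infty} (\hat{m}_{ij}^{(\eta)})$ is indeed an MLE.
IPFP can be manipulated to generate any positive marginals by replacing $x_{i+}$ by the desired row marginal $u_i$ (analogously for the column marginals).
IPFP can be extended to fit the model of quasi-independence (Q-model), where $m_{ij} = 0$ is known a priori for $(i,j) \in S$. Only the initial values have to be changed: set $\hat{m}_{ij}^{(0)} = 0$ if $(i,j) \in S$ and 1 otherwise.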
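
The alternating row and column updates described above can be sketched in plain Python. This is only an illustrative sketch: the function name `ipf_classical` and its parameters (`seed`, `max_iter`, `tol`) are assumptions made for this example, not part of any library. Passing a seed table with zeros in known positions corresponds to the quasi-independence variant mentioned in the notes.

```python
def ipf_classical(row_totals, col_totals, seed=None, max_iter=1000, tol=1e-12):
    """Alternately rescale rows and columns of a table to match target totals.

    `seed` is the starting table; None means a table of ones (independence model).
    Zeros in the seed stay zero, which corresponds to quasi-independence.
    """
    n_rows, n_cols = len(row_totals), len(col_totals)
    if seed is None:
        m = [[1.0] * n_cols for _ in range(n_rows)]
    else:
        m = [[float(v) for v in row] for row in seed]

    for _ in range(max_iter):
        # Row step: scale row i so that its sum equals row_totals[i].
        for i in range(n_rows):
            row_sum = sum(m[i])
            if row_sum > 0:
                m[i] = [v * row_totals[i] / row_sum for v in m[i]]
        # Column step: scale column j so that its sum equals col_totals[j].
        for j in range(n_cols):
            col_sum = sum(m[i][j] for i in range(n_rows))
            if col_sum > 0:
                for i in range(n_rows):
                    m[i][j] *= col_totals[j] / col_sum
        # Column sums are now exact; stop once the row sums also match.
        if max(abs(sum(m[i]) - row_totals[i]) for i in range(n_rows)) < tol:
            break
    return m


# Marginals of the handedness table from the Example section.
fitted = ipf_classical([52, 48], [87, 13])
for row in fitted:
    print([round(v, 2) for v in row])
```

The sketch assumes that the row totals and the column totals add up to the same grand total; otherwise no table can match both sets of marginals.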
Algorithm 2 (factor estimation)

Assume the same setting as in the classical IPFP.
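Alternatively, we can estimate the row and column factors separately. With row totals $x_{i+}$ and column totals $x_{+j}$ of the observed table, choose initial values $\hat{b}_j^{(0)} := 1$, and for $\eta \geq 1$ set
$$\hat{a}_i^{(\eta)} = \frac{x_{i+}}{\sum_j \hat{b}_j^{(\eta-1)}}, \qquad \hat{b}_j^{(\eta)} = \frac{x_{+j}}{\sum_i \hat{a}_i^{(\eta)}}.$$
Setting $\hat{m}_{ij}^{(2\eta)} = \hat{a}_i^{(\eta)} \hat{b}_j^{(\eta)}$, the two variants of the algorithm are mathematically equivalent.
Notes:

In matrix notation, we can write $(\hat{m}_{ij}) = \hat{a} \hat{b}^T$, where $\hat{a} = (\hat{a}_1, \ldots, \hat{a}_I)^T = \lim_{\eta \to \infty} \hat{a}^{(\eta)}$ and $\hat{b} = (\hat{b}_1, \ldots, \hat{b}_J)^T = \lim_{\eta \to \infty} \hat{b}^{(\eta)}$.
The factorization is not unique, since $m_{ij} = a_i b_j = (\gamma a_i)(\frac{1}{\gamma} b_j)$ for all $\gamma > 0$.
The factor totals remain constant, i.e. $\sum_i \hat{a}_i^{(\eta)} = \sum_i \hat{a}_i^{(1)}$ for all $\eta \geq 1$ and $\sum_j \hat{b}_j^{(\eta)} = \sum_j \hat{b}_j^{(0)}$ for all $\eta \geq 0$.
To fit the Q-model, where $m_{ij} = 0$ a priori for $(i,j) \in S$, set $\delta_{ij} = 0$ if $(i,j) \in S$ and $\delta_{ij} = 1$ otherwise. Then
$$\hat{a}_i^{(\eta)} = \frac{x_{i+}}{\sum_j \delta_{ij} \hat{b}_j^{(\eta-1)}}, \qquad \hat{b}_j^{(\eta)} = \frac{x_{+j}}{\sum_i \delta_{ij} \hat{a}_i^{(\eta)}}, \qquad \hat{m}_{ij}^{(2\eta)} = \delta_{ij} \hat{a}_i^{(\eta)} \hat{b}_j^{(\eta)}.$$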

Obviously, the I-model is a particular case of the Q-model.
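
A minimal sketch of factor estimation follows, written so that one routine covers both the I-model and the Q-model. The name `ipf_factors`, the 0/1 mask argument `delta` and the fixed iteration count `n_iter` are assumptions made for illustration. The sketch further assumes that every row and column of the mask contains at least one 1, so that no denominator vanishes.

```python
def ipf_factors(row_totals, col_totals, delta=None, n_iter=50):
    """Estimate row factors a and column factors b; the fitted cell is delta * a_i * b_j."""
    n_rows, n_cols = len(row_totals), len(col_totals)
    if delta is None:
        # No structural zeros: this is the plain model of independence.
        delta = [[1] * n_cols for _ in range(n_rows)]
    b = [1.0] * n_cols
    for _ in range(n_iter):
        # Row factors from the current column factors.
        a = [row_totals[i] / sum(delta[i][j] * b[j] for j in range(n_cols))
             for i in range(n_rows)]
        # Column factors from the updated row factors.
        b = [col_totals[j] / sum(delta[i][j] * a[i] for i in range(n_rows))
             for j in range(n_cols)]
    m = [[delta[i][j] * a[i] * b[j] for j in range(n_cols)] for i in range(n_rows)]
    return a, b, m


# Independence model on the marginals of the handedness example.
a, b, m = ipf_factors([52, 48], [87, 13])

# Quasi-independence: cell (0, 1) is known a priori to be zero.
a_q, b_q, m_q = ipf_factors([5, 10], [9, 6], delta=[[1, 0], [1, 1]])
```

In the first call the column factors keep the total they started with (here 2), in line with the note on constant factor totals. In the second call the only table compatible with the mask and the marginals has rows (5, 0) and (4, 6), and the iteration approaches it.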

Algorithm 3 (RAS)

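The problem: let $M := (m_{ij}^{(0)}) \in \mathbb{R}^{I \times J}$ be the initial matrix with nonnegative entries, $u \in \mathbb{R}^I$ a vector of specified row marginals and $v \in \mathbb{R}^J$ a vector of column marginals. We wish to compute a matrix $\hat{M} = (\hat{m}_{ij}) \in \mathbb{R}^{I \times J}$ similar to $M$ and consistent with the predefined marginals, meaning
$$u_i = \sum_j \hat{m}_{ij} \quad \text{and} \quad v_j = \sum_i \hat{m}_{ij}.$$
Define the diagonalization operator $\operatorname{diag}: \mathbb{R}^k \to \mathbb{R}^{k \times k}$, which produces a matrix with its input vector on the main diagonal and zeros elsewhere. Then, for $\eta \geq 0$, set
$$M^{(2\eta+1)} = \operatorname{diag}(r^{(\eta+1)}) \, M^{(2\eta)}, \qquad M^{(2\eta+2)} = M^{(2\eta+1)} \operatorname{diag}(s^{(\eta+1)}),$$
where
$$r_i^{(\eta+1)} = \frac{u_i}{\sum_j m_{ij}^{(2\eta)}}, \qquad s_j^{(\eta+1)} = \frac{v_j}{\sum_i m_{ij}^{(2\eta+1)}}.$$
Finally, we obtain $\hat{M} = \lim_{\eta \to \infty} M^{(\eta)}$.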
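
The RAS iteration can be sketched without forming any diagonal matrix, because multiplying by a diagonal matrix from the left scales rows and from the right scales columns. The function name `ras`, the example seed matrix and the target marginals below are made up for illustration. The sketch assumes that every row and column of the seed has a positive sum and that the two target vectors have the same total.

```python
def ras(M, u, v, n_iter=200, tol=1e-12):
    """Scale the rows and columns of M until row sums match u and column sums match v."""
    m = [[float(x) for x in row] for row in M]
    n_rows, n_cols = len(u), len(v)
    for _ in range(n_iter):
        # r = u / row sums; multiplying by diag(r) from the left scales row i by r[i].
        r = [u[i] / sum(m[i]) for i in range(n_rows)]
        m = [[r[i] * m[i][j] for j in range(n_cols)] for i in range(n_rows)]
        # s = v / column sums; multiplying by diag(s) from the right scales column j by s[j].
        s = [v[j] / sum(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
        m = [[m[i][j] * s[j] for j in range(n_cols)] for i in range(n_rows)]
        # Column sums are exact after the s-step; stop when row sums agree as well.
        if max(abs(sum(m[i]) - u[i]) for i in range(n_rows)) < tol:
            break
    return m


# Hypothetical seed matrix and target marginals.
balanced = ras([[1, 2], [3, 4]], u=[4, 6], v=[5, 5])
```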

Discussion and comparison of the algorithms

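Although RAS seems to be the solution of an entirely different problem, it is in fact identical to the classical IPFP. In practice, one would not implement actual matrix multiplication, since only diagonal matrices are involved. Reducing the operations to the necessary ones, it is easy to see that RAS does the same as IPFP. The vaguely demanded 'similarity' can be explained as follows: IPFP maintains the crossproduct ratios, i.e.
$$\frac{m_{ij}^{(0)} m_{hk}^{(0)}}{m_{ik}^{(0)} m_{hj}^{(0)}} = \frac{m_{ij}^{(\eta)} m_{hk}^{(\eta)}}{m_{ik}^{(\eta)} m_{hj}^{(\eta)}} \quad \text{for all } \eta \geq 0 \text{ and } i \neq h, \; j \neq k,$$
since $m_{ij}^{(\eta)} = a_i^{(\eta)} b_j^{(\eta)} m_{ij}^{(0)}$ for suitable row factors $a_i^{(\eta)}$ and column factors $b_j^{(\eta)}$, which cancel in the ratio.
This property is sometimes called structure conservation and leads directly to the geometrical interpretation of contingency tables and the proof of convergence in the seminal paper of Fienberg.
Nevertheless, direct factor estimation is generally the most efficient way to carry out IPF: whereas classical IPFP needs
$$I J (2 + J) + I J (2 + I) = I^2 J + I J^2 + 4 I J$$
elementary operations in each iteration step (a row fit followed by a column fit), factor estimation needs only
$$I (1 + J) + J (1 + I) = 2 I J + I + J$$
operations, making it at least one order of magnitude faster than classical IPFP.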
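
Structure conservation can be checked numerically: scaling rows and columns by arbitrary positive factors leaves every crossproduct ratio unchanged. The helper `cross_product_ratio`, the seed matrix and the factors below are invented for this illustration.

```python
def cross_product_ratio(m, i, j, h, k):
    """Return (m[i][j] * m[h][k]) / (m[i][k] * m[h][j]) for rows i != h and columns j != k."""
    return (m[i][j] * m[h][k]) / (m[i][k] * m[h][j])


# Hypothetical positive seed matrix and arbitrary positive scaling factors.
seed = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 7.0],
        [2.0, 8.0, 1.0]]
row_factors = [0.5, 3.0, 1.7]
col_factors = [2.0, 0.25, 4.0]

# One combined row and column scaling, as produced by any number of IPF steps.
scaled = [[row_factors[i] * seed[i][j] * col_factors[j] for j in range(3)]
          for i in range(3)]

before = cross_product_ratio(seed, 0, 0, 1, 2)
after = cross_product_ratio(scaled, 0, 0, 1, 2)
print(before, after)  # the two values agree up to rounding error
```

The row factors of rows i and h and the column factors of columns j and k each appear once in the numerator and once in the denominator, which is why they cancel.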

Existence and uniqueness of MLEs

Necessary and sufficient conditions for the existence and uniqueness of MLEs are complicated in the general case, but sufficient conditions for 2-dimensional tables are simple:

the marginals of the observed table do not vanish and
the observed table is inseparable.

If unique MLEs exist, IPFP exhibits linear convergence in the worst case, but exponential convergence has also been observed. If a direct estimator exists, IPFP converges after 2 iterations. If unique MLEs do not exist, IPFP converges toward the so-called extended MLEs by design, but convergence may be arbitrarily slow and often computationally infeasible.
If all observed values are strictly positive, existence and uniqueness of MLEs and therefore convergence is ensured.

Goodness of fit

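To check whether the assumption of independence is adequate, one uses the Pearson X-squared statistic
$$X^2 = \sum_{i,j} \frac{(x_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$
or alternatively the likelihood-ratio test statistic
$$G^2 = 2 \sum_{i,j} x_{ij} \log \frac{x_{ij}}{\hat{m}_{ij}}.$$
Both statistics are asymptotically $\chi^2_r$-distributed, where $r = (I-1)(J-1)$ is the number of degrees of freedom.
That is, if the p-values $1 - \chi^2_r(X^2)$ and $1 - \chi^2_r(G^2)$, with $\chi^2_r(\cdot)$ denoting the distribution function, are not too small, there is no indication to discard the hypothesis of independence.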
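
Both statistics are straightforward to compute. The sketch below uses the observed and fitted values of the handedness table from the Example section. The function names are illustrative, and the p-value helper is deliberately restricted to one degree of freedom, where the chi-squared tail can be written with the complementary error function from the standard library. For other degrees of freedom a statistics library would be needed.

```python
import math


def pearson_x2(observed, fitted):
    """Pearson statistic: sum of (x - m)^2 / m over all cells."""
    return sum((x - m) ** 2 / m
               for obs_row, fit_row in zip(observed, fitted)
               for x, m in zip(obs_row, fit_row))


def likelihood_ratio_g2(observed, fitted):
    """Likelihood-ratio statistic: 2 * sum of x * log(x / m); cells with x = 0 add 0."""
    return 2.0 * sum(x * math.log(x / m)
                     for obs_row, fit_row in zip(observed, fitted)
                     for x, m in zip(obs_row, fit_row) if x > 0)


def upper_tail_df1(stat):
    """P(chi-squared with 1 degree of freedom > stat), using P(|Z| > sqrt(stat))."""
    return math.erfc(math.sqrt(stat / 2.0))


observed = [[43, 9], [44, 4]]
fitted = [[45.24, 6.76], [41.76, 6.24]]

x2 = pearson_x2(observed, fitted)
g2 = likelihood_ratio_g2(observed, fitted)
p_x2 = upper_tail_df1(x2)  # a 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_g2 = upper_tail_df1(g2)
print(round(x2, 3), round(g2, 3), round(p_x2, 3), round(p_g2, 3))
```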

Interpretation

If the rows correspond to different values of property A, and the columns correspond to different values of property B, and the hypothesis of independence is not discarded, the properties A and B are considered independent.

Example

Consider the following table of observations $(x_{ij})$:

	Right-handed	Left-handed	TOTAL
Male	43	9	52
Female	44	4	48
TOTAL	87	13	100

For executing the classical IPFP, we first initialize the matrix with ones, leaving the marginals untouched:

	Right-handed	Left-handed	TOTAL
Male	1	1	52
Female	1	1	48
TOTAL	87	13	100

Of course, the marginal sums do not correspond to the matrix anymore, but this is fixed in the next two iterations of IPFP. The first iteration deals with the row sums:

	Right-handed	Left-handed	TOTAL
Male	26	26	52
Female	24	24	48
TOTAL	87	13	100

Note that, by definition, the row sums always constitute a perfect match after odd iterations, as do the column sums for even ones. The subsequent iteration updates the matrix column-wise:

	Right-handed	Left-handed	TOTAL
Male	45.24	6.76	52
Female	41.76	6.24	48
TOTAL	87	13	100

Now, both row and column sums of the matrix match the given marginals again.
The variables are now independent, meaning the odds ratio is 1. This can be checked in either dimension: for both male and female, the odds of right-handed vs. left-handed are $87 : 13 \approx 6.69$, since $45.24 / 6.76 = 41.76 / 6.24 = 87 / 13$. Similarly, for both right-handed and left-handed, the odds of being male vs. female are $52 : 48 \approx 1.08$, since $45.24 / 41.76 = 6.76 / 6.24 = 52 / 48$.
For this 2×2 table an exact solution exists, so the iteration converges in a single pair of steps. In fact, the closed-form solution is simply the outer product of the marginal totals divided by the population size, which yields the same values as above:

	Right-handed	Left-handed	TOTAL
Male	87·52/100	13·52/100	52
Female	87·48/100	13·48/100	48
TOTAL	87	13	100
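
The closed-form table can be reproduced in a few lines. The variable names are chosen for this sketch only.

```python
observed = [[43, 9], [44, 4]]

row_totals = [sum(row) for row in observed]        # [52, 48]
col_totals = [sum(col) for col in zip(*observed)]  # [87, 13]
n = sum(row_totals)                                # 100

# Outer product of the marginal totals, divided by the population size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Under independence the odds ratio of the fitted table equals 1.
odds_ratio = (expected[0][0] * expected[1][1]) / (expected[0][1] * expected[1][0])
print(expected, odds_ratio)
```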

However, for more general problems, such as tables with structural zeros or with more than two dimensions, a closed-form solution is not always available, and multiple iteration steps are necessary.
The Pearson statistic for this table is $X^2 \approx 1.78$ with one degree of freedom, which gives a p-value of approximately 0.18, meaning that gender and handedness can be considered independent.

Implementation

The R package mipfp provides a multi-dimensional implementation of the traditional iterative proportional fitting procedure. The package allows the updating of an N-dimensional array with respect to given target marginal distributions.
Python has an equivalent package, ipfn, which can be installed via pip. The package supports numpy and pandas input objects.
