Tajima's D

Tajima's D is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima. Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.
The purpose of Tajima's D test is to distinguish between a DNA sequence evolving randomly and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression. A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism. The randomly evolving mutations are called "neutral", while mutations under selection are "non-neutral". For example, a mutation that causes prenatal death or severe disease would be expected to be under selection. In the population as a whole, the frequency of a neutral mutation fluctuates randomly through genetic drift.
The strength of genetic drift depends on population size. If a population is at a constant size with constant mutation rate, the population will reach an equilibrium of gene frequencies. This equilibrium has important properties, including the number of segregating sites, $S$, and the number of nucleotide differences between pairs of sampled sequences. To standardize the pairwise differences, the mean or 'average' number of pairwise differences is used. This is simply the sum of the pairwise differences divided by the number of pairs, and is often symbolized by $\pi$.
The purpose of Tajima's test is to identify sequences which do not fit the neutral theory model at equilibrium between mutation and genetic drift. In order to perform the test on a DNA sequence or gene, you need to sequence homologous DNA for at least 3 individuals. Tajima's statistic computes a standardized measure of the total number of segregating sites in the sampled DNA and the average number of mutations between pairs in the sample. The two quantities whose values are compared are both method of moments estimates of the population genetic parameter theta, and so are expected to equal the same value. If these two numbers only differ by as much as one could reasonably expect by chance, then the null hypothesis of neutrality cannot be rejected. Otherwise, the null hypothesis of neutrality is rejected.

Scientific explanation

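Under the neutral theory model, for a population at constant size at equilibrium:
$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 4N\mu$$
for diploid DNA, and
$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 2N\mu$$
for haploid.
In the above formulas, $S$ is the number of segregating sites, $n$ is the number of samples, $N$ is the effective population size, $\mu$ is the mutation rate at the examined genomic locus, and $i$ is the index of summation.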
But selection, demographic fluctuations and other violations of the neutral model will change the expected values of $S$ and $\pi$, so that they are no longer expected to be equal. The difference in the expectations for these two variables is the crux of Tajima's D test statistic.
$D$ is calculated by taking the difference between the two estimates of the population genetics parameter $\theta$. This difference is called $d$, and $D$ is calculated by dividing $d$ by the square root of its variance, $\sqrt{\hat{V}(d)}$.
Fumio Tajima demonstrated by computer simulation that the $D$ statistic described above could be modeled using a beta distribution. If the $D$ value for a sample of sequences is outside the confidence interval, then one can reject the null hypothesis of neutral mutation for the sequence in question.

Mathematical details

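$$D = \frac{d}{\sqrt{\hat{V}(d)}} = \frac{\hat{k} - \frac{S}{a_1}}{\sqrt{e_1 S + e_2 S(S-1)}}$$
where $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$, $e_1$ and $e_2$ are constants that depend only on the sample size $n$, and
$\hat{k}$ and $\frac{S}{a_1}$ are two estimates of the expected number of single nucleotide polymorphisms (SNPs) between two DNA sequences under the neutral mutation model in a sample of size $n$ from an effective population size $N$.
The first estimate is the average number of SNPs found in $\binom{n}{2}$ pairwise comparisons of sequences $(i,j)$ in the sample,
$$\hat{k} = \frac{\sum\sum_{i<j} k_{ij}}{\binom{n}{2}}.$$
The second estimate is derived from the expected value of $S$, the total number of polymorphisms in the sample,
$$E(S) = a_1 M.$$
Tajima defines $M = 4N\mu$, whereas Hartl & Clark use a different symbol to define the same parameter, $\theta = 4N\mu$.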
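The formula above can be turned into a few lines of code. The following is a minimal sketch, not a reference implementation: the function name `tajimas_d` and its argument names are illustrative, and the intermediate constants `a2`, `b1`, `b2`, `c1`, `c2` are the ones commonly given for Tajima's variance estimate. They do not appear in the text above, so they should be checked against the original paper before being relied upon.

```python
import math


def tajimas_d(n, num_seg_sites, mean_pairwise_diff):
    """Tajima's D from the sample size n, the number of segregating
    sites S, and the mean number of pairwise differences.

    Returns NaN when the variance estimate is not positive, e.g. when
    there are no segregating sites or too few sequences.
    """
    s = num_seg_sites
    # Harmonic-type sums over i = 1 .. n-1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # Constants of the variance estimate (assumed standard form).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # d is the difference between the two estimates of theta.
    d = mean_pairwise_diff - s / a1
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 1e-12:
        return float("nan")
    return d / math.sqrt(variance)


# Example call with n = 5 sequences, S = 4 sites, mean pairwise difference 2.
print(round(tajimas_d(5, 4, 2.0), 3))
```

Taking summary values rather than raw sequences keeps the function independent of how the alignment is stored; the caller computes $S$ and the mean pairwise difference separately.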

Example

Suppose you are a geneticist studying an unknown gene. As part of your research you get DNA samples from four random people, and you also sequence your own DNA, giving five sequences in total. For simplicity, you label your sequence as a string of zeroes, and for the other four people you put a zero when their DNA is the same as yours and a one when it is different.


                   1           2
Position 12345 67890 12345 67890
Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010

Notice the four polymorphic sites (positions 3, 7, 13 and 19, where someone differs from you). Now compare each pair of sequences and get the average number of polymorphisms between two sequences. There are "five choose two" (ten) comparisons that need to be done.

Person Y is you!

You vs A: 3 polymorphisms


Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010

You vs B: 2 polymorphisms


Person Y 00000 00000 00000 00000
Person B 00000 00000 00100 00010

You vs C: 2 polymorphisms


Person Y 00000 00000 00000 00000
Person C 00000 01000 00000 00010

You vs D: 3 polymorphisms


Person Y 00000 00000 00000 00000
Person D 00000 01000 00100 00010

A vs B: 1 polymorphism


Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010

A vs C: 3 polymorphisms


Person A 00100 00000 00100 00010
Person C 00000 01000 00000 00010

A vs D: 2 polymorphisms


Person A 00100 00000 00100 00010
Person D 00000 01000 00100 00010

B vs C: 2 polymorphisms


Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010

B vs D: 1 polymorphism


Person B 00000 00000 00100 00010
Person D 00000 01000 00100 00010

C vs D: 1 polymorphism


Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010

The average number of polymorphisms is $\frac{3+2+2+3+1+3+2+2+1+1}{10} = 2$.
The second estimate of the equilibrium is $M = S/a_1$.
Since there were $n=5$ individuals and $S=4$ segregating sites,
$a_1 = 1/1 + 1/2 + 1/3 + 1/4 = 2.08$
$M = 4/2.08 = 1.92$
The lower-case $d$ described above is the difference between these two numbers: the average number of polymorphisms found in pairwise comparison (2) and $M$. Thus $d = 2 - 1.92 = 0.08$.
Since this is a statistical test, you need to assess the significance of this value. A discussion of how to do this is provided below.
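The hand calculation above can be checked with a short script. This is a sketch under the assumption that the five sequences are stored as strings of `0` and `1` exactly as in the table; the names `seqs`, `hamming` and `pair_diffs` are illustrative only.

```python
from itertools import combinations

# The five sequences from the example, spaces removed.
seqs = {
    "Y": "00000 00000 00000 00000".replace(" ", ""),
    "A": "00100 00000 00100 00010".replace(" ", ""),
    "B": "00000 00000 00100 00010".replace(" ", ""),
    "C": "00000 01000 00000 00010".replace(" ", ""),
    "D": "00000 01000 00100 00010".replace(" ", ""),
}


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# All "five choose two" pairwise comparisons.
pair_diffs = {(p, q): hamming(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
mean_pairwise = sum(pair_diffs.values()) / len(pair_diffs)

# A site is segregating if more than one state is seen in its column.
seg_sites = sum(len(set(column)) > 1 for column in zip(*seqs.values()))

n = len(seqs)
a1 = sum(1.0 / i for i in range(1, n))
M = seg_sites / a1
d = mean_pairwise - M

for pair, diff in pair_diffs.items():
    print(pair, diff)
print("mean pairwise differences:", mean_pairwise)
print("segregating sites:", seg_sites)
print("a1 = %.2f, M = %.2f, d = %.2f" % (a1, M, d))
```

Only the unscaled difference $d$ is computed here; turning it into $D$ additionally requires dividing by the square root of the variance estimate.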

Interpreting Tajima's D

A negative Tajima's D signifies an excess of low frequency polymorphisms relative to expectation, indicating population size expansion and/or purifying selection. A positive Tajima's D signifies low levels of both low and high frequency polymorphisms, indicating a decrease in population size and/or balancing selection. However, calculating a conventional "p-value" associated with any Tajima's D value that is obtained from a sample is impossible. Briefly, this is because there is no way to describe the distribution of the statistic that is independent of the true, and unknown, theta parameter. To circumvent this issue, several options have been proposed.

Value of Tajima's D	Mathematical reason	Biological interpretation 1	Biological interpretation 2
Tajima's D=0	Theta-Pi equivalent to Theta-k. Average Heterozygosity= # of Segregating sites.	Observed variation similar to expected variation	Population evolving as per mutation-drift equilibrium. No evidence of selection
Tajima's D<0	Theta-Pi less than Theta-k. Fewer haplotypes than # of segregating sites.	Rare alleles abundant	Recent selective sweep, population expansion after a recent bottleneck, linkage to a swept gene
Tajima's D>0	Theta-Pi greater than Theta-k. More haplotypes than # of segregating sites.	Rare alleles scarce	Balancing selection, sudden population contraction

However, this interpretation should be made only if the D-value is deemed statistically significant.

Determining significance

When performing a statistical test such as Tajima's D, the critical question is whether the value calculated for the statistic is unexpected under a null process. For Tajima's D, the magnitude of the statistic is expected to increase the more the data deviates from a pattern expected under a population evolving according to the standard coalescent model.
Tajima found an empirical similarity between the distribution of the test statistic and a beta distribution with mean zero and variance one. He estimated theta by taking Watterson's estimator and dividing it by the number of samples. Simulations have shown this distribution to be conservative, and now that computing power is more readily available, this approximation is not frequently used.
A more nuanced approach was presented in a paper by Simonsen et al. These authors advocated constructing a confidence interval for the true theta value, and then performing a grid search over this interval to obtain the critical values at which the statistic is significant below a particular alpha value. An alternative approach is for the investigator to perform the grid search over the values of theta which they believe to be plausible based on their knowledge of the organism under study. Bayesian approaches are a natural extension of this method.
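The grid-search idea can be illustrated with a small simulation. The sketch below is written in the spirit of the approach described above and is not a reproduction of the procedure of Simonsen et al. It assumes the standard neutral coalescent with constant population size, no recombination and infinitely many sites. The function names, the grid of theta values and the number of replicates are all arbitrary choices made for illustration.

```python
import math
import random


def tajimas_d(n, s, pi):
    """Tajima's D; assumes n >= 4 and s >= 1 so the variance is positive."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    c1 = (n + 1) / (3.0 * (n - 1)) - 1.0 / a1
    c2 = (2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
          - (n + 2) / (a1 * n) + a2 / a1 ** 2)
    variance = (c1 / a1) * s + (c2 / (a1 ** 2 + a2)) * s * (s - 1)
    return (pi - s / a1) / math.sqrt(variance)


def simulate_pi_and_s(n, theta, rng):
    """One neutral coalescent replicate; returns (pi, S)."""
    counts = [1] * n  # sampled sequences descending from each lineage
    pi, s = 0.0, 0
    while len(counts) > 1:
        k = len(counts)
        t = rng.expovariate(k * (k - 1) / 2.0)  # time to next coalescence
        for c in counts:
            # Mutations hit each lineage as a Poisson process, rate theta/2.
            elapsed = rng.expovariate(theta / 2.0)
            while elapsed < t:
                s += 1
                # A mutation carried by c of n sequences separates c*(n-c) pairs.
                pi += 2.0 * c * (n - c) / (n * (n - 1))
                elapsed += rng.expovariate(theta / 2.0)
        i, j = rng.sample(range(k), 2)  # two lineages coalesce
        merged = counts[i] + counts[j]
        counts = [c for idx, c in enumerate(counts) if idx not in (i, j)]
        counts.append(merged)
    return pi, s


def null_critical_values(n, theta, reps=1000, alpha=0.05, seed=0):
    """Empirical two-sided critical values of D for one value of theta."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        pi, s = simulate_pi_and_s(n, theta, rng)
        if s > 0:  # D is undefined without segregating sites
            values.append(tajimas_d(n, s, pi))
    values.sort()
    m = len(values)
    return values[int(alpha / 2 * m)], values[int((1 - alpha / 2) * m)]


# Grid search: keep the most extreme bounds over the plausible theta values.
theta_grid = [1.0, 2.0, 5.0, 10.0]
bounds = [null_critical_values(10, th, reps=500, seed=1) for th in theta_grid]
lower = min(lo for lo, _ in bounds)
upper = max(hi for _, hi in bounds)
print("reject neutrality if D < %.2f or D > %.2f" % (lower, upper))
```

Taking the most extreme bound over the grid makes the test conservative for every theta value in the grid, at the cost of some power.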
A very rough rule of thumb for significance is that values greater than +2 or less than -2 are likely to be significant. This rule is based on an appeal to asymptotic properties of some statistics, and thus +/- 2 does not actually represent a critical value for a significance test.
Finally, genome wide scans of Tajima's D in sliding windows along a chromosomal segment are often performed. With this approach, those regions that have a value of D that greatly deviates from the bulk of the empirical distribution of all such windows are reported as significant. This method does not assess significance in the traditional statistical sense, but is quite powerful given a large genomic region, and is unlikely to falsely identify interesting regions of a chromosome if only the greatest outliers are reported.
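A sliding-window scan of this kind might be sketched as follows. The code assumes the data are a list of aligned, equal-length strings with no missing data. The function names, the window and step sizes, and the tail fraction used to pick outliers are hypothetical choices for illustration, not part of any established tool.

```python
import math
from collections import Counter


def tajimas_d(n, s, pi):
    """Tajima's D, or NaN when the variance estimate is not positive."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    c1 = (n + 1) / (3.0 * (n - 1)) - 1.0 / a1
    c2 = (2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
          - (n + 2) / (a1 * n) + a2 / a1 ** 2)
    variance = (c1 / a1) * s + (c2 / (a1 ** 2 + a2)) * s * (s - 1)
    if variance <= 1e-12:
        return float("nan")
    return (pi - s / a1) / math.sqrt(variance)


def window_stats(seqs, start, end):
    """Segregating sites and mean pairwise differences in [start, end)."""
    n = len(seqs)
    n_pairs = n * (n - 1) / 2.0
    s, pi = 0, 0.0
    for pos in range(start, end):
        counts = Counter(seq[pos] for seq in seqs)
        if len(counts) > 1:
            s += 1
            # Pairs that differ at this site, from the state counts.
            differing = (n * n - sum(c * c for c in counts.values())) / 2.0
            pi += differing / n_pairs
    return s, pi


def sliding_window_d(seqs, window, step):
    """List of (window start, D) along the alignment."""
    n = len(seqs)
    results = []
    for start in range(0, len(seqs[0]) - window + 1, step):
        s, pi = window_stats(seqs, start, start + window)
        results.append((start, tajimas_d(n, s, pi)))
    return results


def extreme_windows(results, tail_fraction=0.01):
    """Windows in the lowest and highest tails of the empirical distribution."""
    defined = sorted((d, start) for start, d in results if not math.isnan(d))
    k = max(1, int(len(defined) * tail_fraction))
    return defined[:k], defined[-k:]
```

For example, `sliding_window_d(seqs, window=1000, step=500)` followed by `extreme_windows(...)` would list the windows with the most negative and most positive values. As noted above, such outliers are candidates for follow-up rather than formally significant results.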
