Polygenic score


In genetics, a polygenic score (PGS), also called a polygenic risk score (PRS), genetic risk score, or genome-wide score, is a number based on variation in multiple genetic loci and their associated weights. It serves as the best prediction for the trait that can be made when taking into account variation in multiple genetic variants.
Polygenic scores are widely employed in animal, plant, and behavioral genetics for predicting and understanding genetic architectures. In humans, polygenic scores were originally computed in an effort to predict the prevalence and etiology of complex, heritable diseases, which are typically affected by many genetic variants that each contribute a small effect to overall risk. A genome-wide association study (GWAS) of such a polygenic trait is able to identify these individual genetic loci of small effect in a large enough sample, and various methods of aggregating the results can be used to form a polygenic score. This score will typically explain at least a few percent of a phenotype's variance, and can therefore be assumed to effectively incorporate a significant fraction of the genetic variants affecting that phenotype. A polygenic score can be used in several different ways: as a lower bound to test whether heritability estimates may be biased; as a measure of genetic overlap between traits, which might indicate, for example, shared genetic bases for groups of mental disorders; as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to natural selection, indicative of a soft selective sweep; in Mendelian randomization; to detect and control for the presence of genetic confounds in outcomes; or to investigate gene–environment interactions and correlations.
Polygenic scores are widely used in animal and plant breeding because they are effective at improving livestock and crops. Their use in human studies is increasing.

History

One of the first precursors to the modern polygenic score was proposed under the term marker-assisted selection (MAS) in 1990. Under MAS, breeders are able to increase the efficiency of artificial selection by estimating the regression coefficients of genetic markers that are correlated with differences in the trait of interest and assigning individual animals a "score" from this information. A major development of these fundamentals was proposed in 2001 by researchers who discovered that the use of a Bayesian prior could help to mitigate the problem of the number of markers being greater than the number of animals sampled.
These methods were first applied to humans in the late 2000s, starting with a proposal in 2007 that these scores could be used in human genetics to identify individuals at high risk for disease. This was successfully applied in empirical research for the first time in 2009 by researchers who organized a genome-wide association study of schizophrenia to construct scores of risk propensity. This study was also the first to use the term polygenic score for a prediction drawn from a linear combination of single-nucleotide polymorphism genotypes, which was able to explain 3% of the variance in schizophrenia.
Height was the first complex physical phenotype to be studied well enough to be predicted from the genome alone in humans. The first validation of height prediction was conducted in 2017: polygenic scores constructed from 500,000 participants were shown to predict height to within an inch.
Years of education was the first human cognitive phenotype to be successfully studied in a GWAS. The most recent study of this phenotype was the largest GWAS yet conducted as of 2018, with polygenic scores constructed from 1.1 million participants able to predict upwards of 10% of the variance in various cognitive traits.

Methods of construction

A polygenic score is constructed from the "weights" derived from a genome-wide association study. In a GWAS, a set of genetic markers is genotyped on a training sample, and effect sizes are estimated for each marker's association with the trait of interest. These weights are then used to assign individualized polygenic scores in an independent replication sample. The estimated score, $\hat{S}$, generally follows the form
$$\hat{S} = \sum_{j=1}^{m} X_j \hat{\beta}_j$$
where the $\hat{S}$ of an individual is equal to the weighted sum of the individual's marker genotypes, $X_j$, at $m$ SNPs. Weights are estimated using some form of regression analysis. Because the number of genomic variants is usually larger than the sample size, one cannot use OLS multiple regression. Researchers have proposed various methodologies that deal with this problem as well as how to generate the weights of the SNPs, $\hat{\beta}_j$, and how to determine which SNPs should be included.
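As a minimal sketch, the weighted sum described above can be computed directly. The function name, the effect sizes, and the genotypes below are hypothetical values chosen for illustration only; genotypes are assumed to be coded as counts (0, 1, or 2) of the effect allele.

```python
def polygenic_score(genotypes, weights):
    """Weighted sum of one individual's allele counts across m SNPs."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(x * b for x, b in zip(genotypes, weights))


# Hypothetical per-SNP effect size estimates for m = 4 SNPs (not from any study).
weights = [0.5, -0.25, 0.125, 0.0]

# Hypothetical effect-allele counts (0, 1, or 2) for two individuals.
individuals = {
    "A": [2, 0, 1, 1],
    "B": [0, 2, 2, 0],
}

scores = {name: polygenic_score(g, weights) for name, g in individuals.items()}
print(scores)
```

In practice the weights would come from a training sample and the genotypes from an independent sample, as the text describes.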

Naïve methods

The simplest, so-called "naïve" method of construction sets the weights equal to the coefficient estimates from a regression of the trait on each genetic variant separately. The included SNPs may be selected using an algorithm that attempts to ensure that each marker is approximately independent. This matters because genetic variants are often correlated with nearby variants, a common phenomenon called linkage disequilibrium that arises from the shared evolutionary history of neighboring genetic variants. Failing to account for this non-random association will typically reduce the score's predictive accuracy; for example, the weight of a causal variant will be attenuated if it is more strongly correlated with its neighbors than a null variant is. Further restriction can be achieved by testing multiple sets of SNPs selected at various significance thresholds, such as all SNPs that are genome-wide statistically significant hits, all SNPs with p < 0.05, or all SNPs with p < 0.50, and using the best-performing set for further analysis. Especially for highly polygenic traits, the best polygenic score will tend to use most or all SNPs.
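The following is a rough sketch of the pruning-and-thresholding idea, not any particular tool's algorithm. It assumes simulated data, a normal approximation for the p-values (reasonable only in large samples), and a greedy pruning rule with a hypothetical r² cutoff of 0.2; all function names are illustrative.

```python
import math
import random


def marginal_regression(x, y):
    """Regress y on a single SNP; return (slope, two-sided p-value).

    The p-value uses a normal approximation to the test statistic.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    rss = sum((yi - mean_y - slope * (xi - mean_x)) ** 2 for xi, yi in zip(x, y))
    z = slope / math.sqrt(rss / (n - 2) / sxx)
    return slope, math.erfc(abs(z) / math.sqrt(2))


def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((v - mean_a) ** 2 for v in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return sab * sab / (saa * sbb)


def prune_and_threshold(snps, y, p_max, r2_max=0.2):
    """Return {snp_index: weight} for SNPs with p < p_max, skipping any SNP
    that is correlated (r^2 > r2_max) with a more significant kept SNP."""
    stats = [(j, *marginal_regression(x, y)) for j, x in enumerate(snps)]
    stats.sort(key=lambda s: s[2])  # most significant first
    kept = {}
    for j, slope, p in stats:
        if p >= p_max:
            break
        if all(r_squared(snps[j], snps[k]) <= r2_max for k in kept):
            kept[j] = slope
    return kept


# Simulated example: SNP 0 is causal, SNP 1 is in perfect LD with it,
# and SNP 2 is an independent null variant.
random.seed(0)
n = 400
snp0 = [random.choice([0, 1, 2]) for _ in range(n)]
snp1 = list(snp0)
snp2 = [random.choice([0, 1, 2]) for _ in range(n)]
y = [0.8 * g + random.gauss(0, 1) for g in snp0]

weights = prune_and_threshold([snp0, snp1, snp2], y, p_max=5e-8)
print(weights)
```

Repeating the call with several values of `p_max` and comparing predictive performance in a held-out sample corresponds to the threshold search described above.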

Bayesian methods

Bayesian approaches, first proposed in concept in 2001, attempt to explicitly model preexisting genetic architecture, thereby accounting for the distribution of effect sizes with a prior that should improve the accuracy of a polygenic score. One of the most popular modern Bayesian methods, LDpred ("linkage disequilibrium prediction"), sets the weight for each SNP equal to the mean of its posterior distribution after linkage disequilibrium has been accounted for. LDpred tends to outperform simpler pruning-and-thresholding methods, especially at large sample sizes; for example, it improved the variance explained by a polygenic score for schizophrenia in a large data set from 20.1% to 25.3%.
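To illustrate how a prior shrinks estimated weights, the sketch below computes posterior means in a deliberately simplified setting. It is not LDpred: it assumes unlinked markers (no linkage disequilibrium), standardized genotypes and phenotype, and a Gaussian prior in which each of the m markers explains an equal share of the heritability. The function name and all numbers are hypothetical.

```python
def posterior_mean_weights(marginal_betas, n_samples, heritability):
    """Shrink marginal effect estimates toward zero.

    Assumed model: beta_j ~ N(0, h2 / m) a priori, and each marginal estimate
    is beta_j plus sampling noise with variance 1 / n_samples. The posterior
    mean is then the estimate multiplied by h2 / (h2 + m / n_samples).
    """
    m = len(marginal_betas)
    shrink = heritability / (heritability + m / n_samples)
    return [shrink * b for b in marginal_betas]


# Hypothetical marginal estimates for m = 1000 markers.
marginal_betas = [0.04, -0.02, 0.0, 0.01] * 250

# With h2 = 0.5 and n = 2000, the shrinkage factor is 0.5 / (0.5 + 0.5) = 0.5.
weights = posterior_mean_weights(marginal_betas, n_samples=2000, heritability=0.5)
print(weights[:4])
```

The shrinkage factor approaches 1 as the sample size grows relative to the number of markers, so large studies leave the estimates nearly unchanged.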

Penalized regression

Penalized regression methods, such as LASSO and ridge regression, can also be used to improve the accuracy of polygenic scores. Penalized regression can be interpreted as placing informative prior probabilities on how many genetic variants are expected to affect a trait and on the distribution of their effect sizes. In other words, these methods in effect "penalize" large coefficients in a regression model and shrink them conservatively. Ridge regression accomplishes this by shrinking the coefficients with a term that penalizes the sum of the squared coefficients. LASSO accomplishes something similar by penalizing the sum of the absolute values of the coefficients. Bayesian counterparts exist for LASSO and ridge regression, and other priors have been suggested and used; these can perform better in some circumstances. A multi-dataset, multi-method study found that, of 15 different methods compared across four datasets, minimum redundancy maximum relevance was the best-performing method. Furthermore, variable selection methods tended to outperform other methods. Variable selection methods do not use all the genomic variants available in a dataset, but attempt to select an optimal subset of variants to use. This leads to less overfitting but more bias.
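The different behavior of the two penalties is easiest to see in an idealized case. The sketch below assumes standardized, mutually uncorrelated predictors, for which both penalized solutions have a closed form in terms of the unpenalized estimates. The penalty is assumed to be scaled as (λ/2) times the sum of squared coefficients for ridge and λ times the sum of absolute coefficients for LASSO, added to the mean squared error divided by two. The function names and numbers are hypothetical.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge solution for uncorrelated, standardized predictors:
    every coefficient is scaled by the same factor 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in beta_ols]


def soft_threshold(b, lam):
    """Move b toward zero by lam, stopping at exactly zero."""
    if abs(b) <= lam:
        return 0.0
    return b - lam if b > 0 else b + lam


def lasso_shrink(beta_ols, lam):
    """LASSO solution for uncorrelated, standardized predictors:
    coefficients are soft-thresholded, so small ones become exactly zero."""
    return [soft_threshold(b, lam) for b in beta_ols]


# Hypothetical unpenalized estimates for three variants.
beta_ols = [0.5, -0.25, 0.0625]
lam = 0.25

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)
print("ridge:", ridge)
print("lasso:", lasso)
```

Ridge keeps every variant with a smaller weight, whereas LASSO drops the two weaker variants entirely, which is why LASSO also acts as a variable selection method. With correlated variants, as in real genotype data, neither solution has this simple form and an iterative solver is needed.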

Predictive validity

The benefit of polygenic scores is that they can be used to predict future outcomes in crops, livestock, and humans alike. Although the same basic concepts underlie these areas of prediction, they face different challenges that require different methodologies. The ability to produce very large families in nonhuman species, accompanied by deliberate selection, leads to a smaller effective population size, higher degrees of linkage disequilibrium, and a higher average genetic relatedness among individuals within a population. For example, members of plant and animal breeds that humans have effectively created, such as modern maize or domestic cattle, are all technically "related". In human genomic prediction, by contrast, unrelated individuals in large populations are selected to estimate the effects of common SNPs. Because of the smaller effective population size in livestock, the mean coefficient of relationship between any two individuals is likely high, and common SNPs will tag causal variants at greater physical distance than in humans; this is the major reason for lower SNP-based heritability estimates for humans compared to livestock. In both cases, however, sample size is key for maximizing the accuracy of genomic prediction.
While modern genomic prediction scoring in humans is generally referred to as a "polygenic score" or a "polygenic risk score", in livestock the more common term is "genomic estimated breeding value", or GEBV. Conceptually, a GEBV is the same as a polygenic score: a linear function of genetic variants that are each weighted by the apparent effect of the variant. Despite this, polygenic prediction in livestock is useful for a fundamentally different reason than in humans. In humans, a polygenic score is used to predict an individual's phenotype, while in livestock a GEBV is typically used to predict the offspring's average value of a phenotype of interest in terms of the genetic material inherited from a parent. In this way, a GEBV can be understood as the average of the offspring of an individual or pair of individual animals. GEBVs are also typically communicated in the units of the trait of interest. For example, the expected increase in milk production of the offspring of a specific parent compared to the offspring from a reference population might be a typical way of using a GEBV in dairy cow breeding and selection.
Some accuracy values are given in the sections below for comparison purposes. These are given in terms of correlations and have been converted from explained variance if given in that format in the source.
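The conversion mentioned here is a square root: if a score explains a proportion R² of the variance in a trait, the correlation between the score and the trait is the square root of R². A small sketch follows, using two variance figures quoted earlier in this article as inputs; the function name is hypothetical.

```python
import math


def variance_explained_to_correlation(r2):
    """Convert a proportion of variance explained (0 to 1) to a correlation."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be a proportion between 0 and 1")
    return math.sqrt(r2)


# 3% of variance (the 2009 schizophrenia score) and 10% of variance
# (the 2018 years-of-education score), as quoted in the History section.
for r2 in (0.03, 0.10):
    print(f"{r2:.0%} of variance -> r = {variance_explained_to_correlation(r2):.3f}")
```

A score that explains only a few percent of the variance can therefore still correspond to a correlation of roughly 0.17 to 0.32.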

In plants

The predictive value of polygenic scoring has large practical benefits for plant and animal breeding because it increases the precision of selection and allows for shorter generations, both of which speed up evolution. Genomic prediction with some version of polygenic scoring has been used in experiments on maize, small grains such as barley, wheat, oats and rye, and rice biparental families. In many cases, these predictions have been so successful that researchers have advocated for the use of genomic prediction in addressing the challenges posed by global population growth and climate change.

In humans

For humans, polygenic scores can be used to predict future disease susceptibility and for embryo selection. As of 2019, polygenic scores for well over a hundred phenotypes have been developed from genome-wide association statistics. These include scores that can be categorized as anthropometric, behavioral, cardiovascular, non-cancer illness, psychiatric/neurological, and response to treatment/medication.