[Population genetics] How to statistically compare haplotype and nucleotide diversity between populations











Does anyone has a method to test if haplotype and nucleotide diversities between samples are significantly different? - ResearchGate. https://www.researchgate.net/post/Does_anyone_has_a_method_to_test_if_haplotype_and_nucleotide_diversities_between_samples_are_significantly_different [accessed Nov 17, 2016].


Excerpt from the answers


Miguel Navascués

French National Institute for Agricultural Research


You could estimate theta (i.e. theta=4*Ne*u) for each of your samples and estimate the 95%CI 

(for instance, you could use some MCMC coalescent sampler for that, see ref: doi:10.1016/j.tree.2008.09.007). 

Overlapping CI would indicate that they are not significantly different.


Another way could be bootstrapping the samples to get some idea of the variance of your genetic diversity estimates 

and compare the resulting 95%CI.
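The bootstrap suggestion above can be sketched in Python as follows. This is a minimal illustration, not the answerer's own procedure: it assumes aligned sequences of equal length, Nei's sample-size-corrected haplotype diversity, nucleotide diversity as mean pairwise differences per site, and a simple percentile interval. The function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def haplotype_diversity(seqs):
    """Nei's haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))

def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (sequences must be aligned)."""
    n = len(seqs)
    length = len(seqs[0])
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2))
    return total / (n * (n - 1) / 2) / length

def bootstrap_ci(seqs, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample individuals with replacement."""
    rng = random.Random(seed)
    values = sorted(stat([rng.choice(seqs) for _ in seqs]) for _ in range(n_boot))
    lo = values[int(alpha / 2 * n_boot)]
    hi = values[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical toy samples from two populations (illustration only).
pop_a = ["AAAA", "AAAA", "AATT", "AATT", "AAAT", "AAAA"]
pop_b = ["AAAA", "AAAA", "AAAA", "AAAA", "AAAT", "AAAA"]
for name, pop in (("A", pop_a), ("B", pop_b)):
    print(name, "h:", bootstrap_ci(pop, haplotype_diversity),
          "pi:", bootstrap_ci(pop, nucleotide_diversity))
```

Comparing the two intervals for overlap follows the heuristic given in the answer; with samples this small the intervals will be wide.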


Or, as Artur Burzynski suggested, some general statistics could do the work. 

I have seen the Wilcoxon signed-rank test used for this, but I have not found the reference at this moment...



Estimates of demographic parameters (e.g. theta=4*Ne*u in the most simple model) from genetic data 

are statistical evaluations of those genetic diversity measures.

They take into account the stochasticity of sampling and of the underlying biological process (which is the objective of any statistical evaluation). 

They are informative about the differences in genetic diversity among populations. 

And they are routinely used by most of the people working on population genetics. 



Material cited in the answers


Analysis of Molecular Variance (AMOVA)

Peter Werner

http://userwww.sfsu.edu/efc/classes/biol710/amova/amova.htm


Population Differentiation 

When a population is divided into isolated subpopulations, there is less heterozygosity than there would be if the population was undivided. 

Founder effects acting on different demes generally lead to subpopulations with allele frequencies that are different from the larger population. 

Also, these demes are smaller in size than the larger population; 

since allele frequency in each generation represents a sample of the previous generation's allele frequency,

there will be greater sampling error in these small groups than there would be in a larger undifferentiated population. 

Hence, genetic drift will push these smaller demes toward different allele frequencies and 

allele fixation more quickly than would take place in a larger undifferentiated population.

 

Wright's F 

The decline in heterozygosity due to subdivision within a population 

has usually been quantified using an index known as Wright's F statistic, also known as the fixation index. 

The F statistic is a measure of the difference between the mean heterozygosity among the subdivisions in a population, 

and the potential frequency of heterozygotes if all members of the population mixed freely and non-assortatively (Hartl and Clark 1997). 

The fixation index ranges from 0 (indicating no differentiation between the overall population and its subpopulations) 

to a theoretical maximum of 1, though in practice the observed fixation index is much less than 1 even in highly differentiated populations.

 

Fixation indices can be determined for different hierarchical levels of a population structure, 

to indicate, for example, the degree of differentiation among demes within groups (FSG),

among groups of demes within the population (FGT), and among demes within the population (FST) (Hartl and Clark 1997).

 

To determine the fixation index, the mean heterozygosity at each level must be determined. 

For a locus with two alternate alleles, the frequency of one allele is symbolized as p and the frequency of the alternative allele is 1 – p. 

For a population subdivided to three hierarchical levels, the mean heterozygosity for each level is then determined as follows:


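Level of population hierarchy | Heterozygosity

Demes | HS = mean over demes of 2p(1 – p), with p the allele frequency in each deme

Groups of demes | HG = mean over groups of 2p(1 – p), with p the mean allele frequency in each group

Total population | HT = 2p(1 – p), with p the mean allele frequency in the total population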


For each level of the population hierarchy, the mean allele frequency p is determined, then the allele frequency is multiplied by 2(1 – p); 

this product is the frequency of heterozygotes for that allele if panmixia occurs at that hierarchical level.

 

Once the heterozygosity at each hierarchical level is determined, F statistics can be calculated (Hartl and Clark 1997):

 

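Level of population hierarchy | F-statistic

Among demes within group | FSG = (HG – HS) / HG

Among groups within population | FGT = (HT – HG) / HT

Among demes within population | FST = (HT – HS) / HT

Here HS, HG, and HT are the mean heterozygosities at the deme, group, and total-population levels.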
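As a worked illustration of the hierarchical calculation, the sketch below computes the three heterozygosities and the F-statistics for a single two-allele locus. It assumes equal deme sizes, so every deme gets the same weight, and it uses the standard Hartl and Clark form of the indices; the function names and the allele frequencies are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygote frequency 2p(1 - p) under panmixia."""
    return 2 * p * (1 - p)

def hierarchical_f(groups):
    """groups: list of groups, each a list of deme allele frequencies p.

    Assumes all demes are the same size, so each deme is weighted equally.
    """
    demes = [p for group in groups for p in group]
    n_demes = len(demes)
    # Deme level: average heterozygosity computed within each deme.
    h_s = sum(heterozygosity(p) for p in demes) / n_demes
    # Group level: heterozygosity at each group's mean frequency, weighted by deme count.
    h_g = sum(len(g) * heterozygosity(sum(g) / len(g)) for g in groups) / n_demes
    # Total level: heterozygosity at the overall mean frequency.
    h_t = heterozygosity(sum(demes) / n_demes)
    return {
        "F_SG": (h_g - h_s) / h_g,  # among demes within groups
        "F_GT": (h_t - h_g) / h_t,  # among groups within the population
        "F_ST": (h_t - h_s) / h_t,  # among demes within the population
    }

# Hypothetical allele frequencies: two groups of two demes each.
result = hierarchical_f([[0.1, 0.3], [0.7, 0.9]])
print(result)
```

For these frequencies the indices satisfy (1 – FSG)(1 – FGT) = 1 – FST, which is a quick consistency check on any hierarchical calculation of this kind.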

 

AMOVA

Wright's F is based upon comparison of gene frequencies among demes. Molecular data, however, reveal 

not only the frequency of molecular markers but also the number of mutational differences between different genes. 

A technique that could be used to estimate population differentiation by analyzing differences between molecular sequences 

rather than assumed Mendelian gene frequencies would therefore be very useful.

 

Analysis of Molecular Variance (AMOVA) is a method of estimating population differentiation directly from molecular data 

and testing hypotheses about such differentiation. 

A variety of molecular data – molecular marker data (for example, RFLP or AFLP), direct sequence data, 

or phylogenetic trees based on such molecular data – may be analyzed using this method (Excoffier, et al. 1992).

 

AMOVA treats any kind of raw molecular data as a Boolean vector pi, that is, a 1 × n matrix of 1s and 0s,

1 indicating the presence of a marker and 0 its absence. 

A marker could be a nucleotide base, a base sequence, a restriction fragment, or a mutational event (Excoffier, et al. 1992).

 

Euclidean distances between pairs of vectors are then calculated 

by subtracting the Boolean vector of one haplotype from another, according to the formula (pj – pk). 

If pj and pk are visualized as points in n-dimensional space indicated by the intersections of the values in each vector, 

with n being equal to the length of the vector, 

then the Euclidean distance is simply a scalar that is equal to the shortest distance between those two points. 

The squared Euclidean distances are then calculated using the equation δ²jk = (pj – pk)′ W (pj – pk). 

W is a weighting matrix; by default, it is an identity matrix and does not change the value of the final product; 

however, W can be a matrix with a number of values depending upon how one weights molecular change 

at different locations on a sequence or phylogenetic tree (Excoffier, et al. 1992).
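A minimal Python sketch of the weighted squared distance (pj – pk)′ W (pj – pk) described above. The function name and the example vectors and weights are hypothetical; when no W is supplied, the identity matrix is used, so the result is simply the number of markers at which the two haplotypes differ.

```python
def squared_distance(pj, pk, W=None):
    """Weighted squared Euclidean distance between two Boolean marker vectors."""
    diff = [a - b for a, b in zip(pj, pk)]
    n = len(diff)
    if W is None:
        # Default: identity weighting, every marker counts equally.
        W = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    # Quadratic form diff' W diff written out as a double sum.
    return sum(diff[r] * W[r][c] * diff[c] for r in range(n) for c in range(n))

# Two hypothetical haplotypes scored for presence (1) or absence (0) of four markers.
hap_j = [1, 0, 1, 1]
hap_k = [1, 1, 0, 1]
print(squared_distance(hap_j, hap_k))

# Hypothetical diagonal weights that count changes at markers 2 and 3 more heavily.
weights = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 1]]
print(squared_distance(hap_j, hap_k, weights))
```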

 

Squared Euclidean distances are calculated for all pairwise arrangements of Boolean vectors, 

which are then arranged into a matrix, and partitioned into submatrices corresponding to subdivisions within the population (Excoffier, et al. 1992):

  

They are arranged in such a way that the submatrices on the diagonal of the larger matrix are pairs of individuals in the same population 

while those on the off-diagonal represent pairs of individuals from different populations. 

The sums of the diagonals in the matrix and submatrices yield sums of squares for the various hierarchical levels of the population.
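The partitioning of the distance matrix can be sketched as follows, assuming a single level of structure (demes only, no groups) and identity weighting. It relies on the standard identity that a sum of squared deviations equals the sum of pairwise squared distances divided by the number of individuals. The function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations

def distance_matrix(haplotypes):
    """Squared Euclidean distances between Boolean vectors (identity weighting)."""
    return [[sum((a - b) ** 2 for a, b in zip(h1, h2)) for h2 in haplotypes]
            for h1 in haplotypes]

def ssd_partition(dist, pops):
    """Return (total, within-deme, among-deme) sums of squared deviations.

    dist: symmetric matrix of squared distances; pops: deme label per individual.
    """
    n_total = len(pops)
    # Total: all pairwise squared distances, divided by N.
    ssd_total = sum(dist[j][k] for j, k in combinations(range(n_total), 2)) / n_total
    # Within: pairs from the same deme (the diagonal submatrices), divided by deme size.
    ssd_within = 0.0
    for label in set(pops):
        members = [i for i, p in enumerate(pops) if p == label]
        pair_sum = sum(dist[j][k] for j, k in combinations(members, 2))
        ssd_within += pair_sum / len(members)
    return ssd_total, ssd_within, ssd_total - ssd_within

# Hypothetical data: two demes of two haplotypes each, scored at two markers.
haps = [[0, 0], [0, 1], [1, 1], [1, 1]]
labels = ["A", "A", "B", "B"]
print(ssd_partition(distance_matrix(haps), labels))
```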

 

These sums of squares can then be analyzed in a nested analysis of variance framework. 

A nested ANOVA differs from a simple ANOVA in that data is arranged hierarchically 

and mean squares are computed for groupings at all levels of the hierarchy. 

This allows for hypothesis tests of between-group and within-group differences at several hierarchical levels. 

The nested ANOVA framework for AMOVA is as follows (Excoffier, et al. 1992; Excoffier 2001):

 

Level of variation

Among haplotypes within demes

Among demes within groups

Among groups within population


The variance components can be used to calculate a series of statistics called phi-statistics (Φ), 

which summarize the degree of differentiation between population divisions and are analogous to F-statistics. 

Φ-statistics are derived as follows (Excoffier, et al. 1992; Excoffier 2001):

 

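Level of population hierarchy | Φ-statistic

Among demes within group | ΦSC = σ²b / (σ²b + σ²c)

Among groups within population | ΦCT = σ²a / σ²T

Among demes within population | ΦST = (σ²a + σ²b) / σ²T

Here σ²a, σ²b, and σ²c are the variance components among groups, among demes within groups, and among haplotypes within demes, and σ²T = σ²a + σ²b + σ²c.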


A Φ-statistic can be treated as a hypothesis about differentiation at that level of a population; 

for example, ΦST can be treated as a hypothesis about differentiation between the population and its component demes. 

These hypotheses can be tested using the null distribution of the variance components; 

if the variance of the subpopulations does not significantly differ from the null distribution of the variance of the population, 

the hypothesis that those subpopulations are differentiated from the larger population would be rejected.

 

Because the molecular data consist of Euclidean distances derived from vectors of 1s and 0s, 

the data are unlikely to follow a normal distribution. 

A null distribution is therefore computed by resampling of the data (Excoffier, et al. 1992). 

In each permutation, each individual is assigned to a randomly chosen population while holding the sample sizes constant. 

These permutations are repeated many times, eventually building a null distribution. 

Hypothesis testing is carried out relative to these resampling distributions.
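The permutation scheme can be sketched as below for the simplest case: one level of structure (demes only, no groups), so only ΦST is tested. The variance components use the usual one-way random-effects estimators, which is an assumption about the design rather than a reproduction of any particular program. Function names, the number of permutations, and the toy data are hypothetical.

```python
import random
from itertools import combinations

def phi_st(dist, pops):
    """PhiST from a squared-distance matrix and one deme label per individual."""
    n_total = len(pops)
    demes = {}
    for i, label in enumerate(pops):
        demes.setdefault(label, []).append(i)
    n_demes = len(demes)
    ssd_total = sum(dist[j][k] for j, k in combinations(range(n_total), 2)) / n_total
    ssd_within = sum(sum(dist[j][k] for j, k in combinations(m, 2)) / len(m)
                     for m in demes.values())
    ssd_among = ssd_total - ssd_within
    var_within = ssd_within / (n_total - n_demes)
    # n0: effective deme size for (possibly) unequal sample sizes.
    n0 = (n_total - sum(len(m) ** 2 for m in demes.values()) / n_total) / (n_demes - 1)
    var_among = (ssd_among / (n_demes - 1) - var_within) / n0
    return var_among / (var_among + var_within)

def permutation_test(dist, pops, n_perm=999, seed=0):
    """Shuffle deme labels (sample sizes stay fixed) to build the null distribution."""
    rng = random.Random(seed)
    observed = phi_st(dist, pops)
    labels = list(pops)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if phi_st(dist, labels) >= observed:
            hits += 1
    # The observed arrangement counts as one permutation.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: two demes of four haplotypes, three markers each.
haps = [[0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0],
        [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 1]]
labels = ["A"] * 4 + ["B"] * 4
dist = [[sum((a - b) ** 2 for a, b in zip(h1, h2)) for h2 in haps] for h1 in haps]
print(permutation_test(dist, labels))
```

Shuffling the label list keeps each deme's sample size constant, which matches the resampling rule described above.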

 

What assumptions are made about the data? 

The individuals from which haplotypes are sampled should be chosen independently and at random, of course. 

Since the null distributions are obtained by resampling, 

the Euclidean distances between haplotypes need not be assumed to be normally distributed or have homogeneity of variance.

 

Because of genetic drift, any one haplotype should not be assumed to be completely representative of variation among the whole genome. 

It is therefore important that the data are derived from an adequate number of markers or base pairs.

 

Certain assumptions are made about the nature of the population (Excoffier, et al. 1992), 

for example, that mating is entirely random and non-assortative and no inbreeding occurs. 

If non-random mating or inbreeding is occurring, it will result in lower heterozygosity, 

and if the rates of non-random mating or inbreeding differ between populations, fixation estimates will be confounded.

 

The effects of selection are not fully accounted for by this model. 

There is almost certain to be differing selective pressures among different subpopulations, 

and selection can have very different effects on different alleles and allele combinations. 

All variance among different allele frequencies due to genetic drift can be assumed to be 

the product of a degree of sampling error that is common to all alleles. 

However, selection acting on different alleles is non-random, 

hence any given between-population difference in the frequency of a given allele is potentially non-representative of allele frequency variation 

as a whole.

 

Again, use of a large number of markers makes it more likely that one is getting a representative cross-section of alleles. 

However, because of the non-random nature of the effects of selection on different allele frequencies, 

increasing the percentage of the genome that is sampled will not necessarily yield an unbiased estimate of allele differentiation across the whole genome, at least, not as readily as would be the case when compensating for the effects of differential genetic drift. 

Using neutral, non-selected genetic markers can be a useful means of avoiding the confounding effects of selection, 

if neutral markers can be identified.

 

AMOVA appears to be highly robust to the methods of estimating distance between haplotypes. 

Excoffier et al. (1992) examined the behavior of several different distance metrics. 

They constructed four different data sets using the same RFLP data, 

extracted from mtDNA sampled from 10 human populations in 5 geographical regions. The data sets were structured as follows:

 

D1:   The distance matrix was constructed from individual restriction site differences, with all restriction sites weighted equally.

D2:   The distance matrix was constructed from individual haplotypes that represented discrete restriction fragment patterns. 

Each haplotype consisted of one or more restriction fragments. 

All haplotype differences were weighted equally, regardless of whether a given haplotype difference represented one or several restriction fragment differences.

D3:   The distance matrix was derived from an un-rooted phylogenetic network constructed from a parsimonious arrangement of restriction site differences. 

When connections of equal length were possible, haplotypes that were regionally closer or those that did not represent a change from one rare haplotype to another were favored. 

Each step on this network is scored as 1 for a single mutational event.

D4:   The distance matrix was again derived from an un-rooted phylogenetic network constructed from a parsimonious arrangement of restriction site differences. 

In this case, a weighting matrix W was applied to the data, with the different weightings based on variation in nucleotide diversity at different restriction sites.

 

The results were as follows:

(Table from Excoffier, et al. 1992)

 

The Φ-statistics and the partitioning of the variance components are nearly identical for data sets D1, D3, and D4, 

indicating that AMOVA is robust to most data arrangements. 

Data set D2 showed somewhat lower (but still significant) values for ΦST and ΦCT, 

indicating that the grouping of restriction fragment data into discrete haplotypes represents some loss of information. 

It is possible that there is an analogous difference between direct sequence data and restriction fragment data, 

with restriction fragment data representing a loss of information when compared with direct sequence data.