Archaeogenetics is the reconstruction of ancient demography from patterns of gene differences in contemporary populations. Population size, population movements, and subdivision into partially isolated subpopulations leave characteristic signatures in the DNA of contemporary populations. New technologies for cheaply and rapidly examining DNA from human populations, along with new theories and methods from population genetics, have yielded important insights into human history. The literature in this field is of uneven quality; this article focuses on data and interpretations that are widely replicated and that have statistical support.
History of Human Numbers
Most theory about genetic diversity and population history describes neutral genes (the term "gene" is used loosely here to refer to any arbitrary DNA sequence). Much of the data in the literature derives from non-coding regions of the genome, because these are the regions most likely to have been unaffected by natural selection. Genetic diversity in population studies refers to the average difference between two genes chosen at random. In the simplest case, it is heterozygosity, the probability that two random genes are different from each other.
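The definition of heterozygosity given above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the literature cited here: the function name and the six-gene sample are hypothetical, and the two genes are assumed to be drawn with replacement.

```python
from collections import Counter


def heterozygosity(alleles):
    """Probability that two genes drawn at random (with replacement) differ."""
    n = len(alleles)
    counts = Counter(alleles)
    # The sum of squared allele frequencies is the chance that two draws match.
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return 1.0 - homozygosity


# Made-up sample of six genes at one site (illustrative only).
sample = ["A", "A", "G", "A", "G", "T"]
print(heterozygosity(sample))
```

A sample in which every gene is identical gives a heterozygosity of zero; the value rises as more variants are present at more even frequencies.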
Mutation introduces new diversity into a population, while genetic drift, the process by which each generation is effectively a sample drawn with replacement from the gene pool of the previous generation, causes loss of diversity. The rate of diversity gain is the mutation rate; in humans, in the nuclear genome (that is, the DNA contained in the cell nucleus), the mutation rate is usually taken to be 10⁻⁹ per nucleotide position per year. The rate of loss of diversity is proportional to the reciprocal of the effective size, Ne, of the population. Effective size is the size of an ideal population with statistical properties equivalent to those of a real population. Many people in human populations have not yet reached reproductive age and many others are past the age of reproduction; neither of these groups influences the effective size. For humans, the effective size of a population is usually estimated to be one third of the census size.
Effective size is the inverse of the rate of diversity loss, and since effective size may fluctuate over time, the average rate of diversity loss is the average of the reciprocal of effective size (that is, the long-term effective size is the harmonic mean rather than the ordinary mean). A population that fluctuates in size between 1,000 and 10,000, spending equal time at each, and thus has a mean size of 5,500, has a long-term effective size of about 1,800. Because of this, genetic diversity in a population is sensitive to long-term minima and less sensitive to maxima.
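The harmonic-mean calculation above can be checked directly. The following Python sketch assumes the population spends equal time at each listed size; the function name is illustrative.

```python
def long_term_effective_size(sizes):
    """Harmonic mean of effective sizes, one entry per equal span of time."""
    return len(sizes) / sum(1.0 / n for n in sizes)


sizes = [1_000, 10_000]
print(sum(sizes) / len(sizes))                  # ordinary mean: 5500.0
print(round(long_term_effective_size(sizes)))   # harmonic mean: 1818
```

Adding further periods at the larger size moves the result only slowly away from the minimum, which is why diversity mainly records the smallest sizes a population has passed through.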
Direct estimates from genes put long-term effective size for the human species in the range 10,000 to 20,000. Since there are approximately 6 billion humans alive today, this small effective size suggests that the number of human ancestors has been drastically smaller, consistent with a recent origin of our species from a small founding population. Fossil and archaeological evidence support such a founding event, and place it about 100,000 to 200,000 years ago.
In order to infer more about demographic history, it is necessary to introduce some results of coalescent theory, the theory of the history of a sample of genes from a population. Consider a sample of n genes drawn from a population. These n genes are tips of a tree of descent, called a coalescent tree: if one could follow their history backward in time, one would find that occasionally two of the genes are copies of the same gene in a parent in the previous generation, a coalescent event. Coalescent events reduce the number of ancestors of the sample: continuing backward, eventually one arrives at a single ancestor of all n genes, the most recent common ancestor (MRCA) of the sample.
A coalescent tree, with the vertical axis proportional to time, might look like Figure 1. There are six genes in the sample depicted, and as one follows them back in time the number of ancestors of the sample is six, then five after the first coalescence, then four, three, two, and finally the single common ancestor of all the genes. A tree like this, descending from a single random mating population of constant size Ne, has the following properties:
- (1) The expected time back to coalescence of any pair of genes is 2Ne generations.
- (2) The expected time to the MRCA is 4Ne generations for large sample size n.
- (3) The expected total branch length of the tree is 4NeΣ(1/i) generations, where the index of summation i goes from 1 to n-1.
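The three expectations listed above can be written as small Python functions. This is a sketch under the same assumptions as the text (a single random-mating population of constant size Ne). The finite-sample factor (1 - 1/n) in the time to the MRCA is the standard coalescent result, not stated in this article, and it approaches the large-sample value of 4Ne as n grows. The function names are illustrative.

```python
def expected_pair_coalescence(ne):
    """Expected generations back to the common ancestor of two genes."""
    return 2 * ne


def expected_tmrca(ne, n):
    """Expected generations back to the MRCA of n genes.

    Standard finite-sample form 4Ne(1 - 1/n); tends to 4Ne for large n.
    """
    return 4 * ne * (1 - 1 / n)


def expected_total_branch_length(ne, n):
    """Expected total branch length: 4Ne times the sum of 1/i, i = 1 .. n-1."""
    return 4 * ne * sum(1 / i for i in range(1, n))


ne = 10_000
for n in (2, 6, 100):
    print(n, expected_tmrca(ne, n), expected_total_branch_length(ne, n))
```

For a sample of two genes the three quantities agree with one another: the time to the MRCA equals the pairwise coalescence time, and the total branch length is twice that time.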
Mutations are rare and occur randomly in time and across sites on a gene. If u is the mutation rate per site per generation, then, corresponding to (1) and (3) above:
- The average pairwise difference between sequences is 4Neu.
- The expected total number of mutations in the set of sequences is 4NeuΣ(1/i).
Given knowledge of the mutation rate, either of the above two expressions provides an estimate of the effective size of population, Ne.
As an illustration of the pairwise-difference expression above, the Human Genome Project has found that a single nucleotide difference between chromosome pairs occurs on average once every 1,000 bases in the human genome, so the average pairwise difference is 10⁻³. With a generation time of 25 years, the mutation rate of 10⁻⁹ per year corresponds to a rate of 25 × 10⁻⁹ per generation, and setting 4Neu equal to 10⁻³ yields an estimate of human effective size of 10⁴. Thus, the human species has genetic diversity equivalent to that of a species whose effective size has been constant at 10 thousand, corresponding to a census size of 30 to 40 thousand people.
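The arithmetic of this estimate can be reproduced as follows. The sketch assumes a generation time of 25 years and a census size three times the effective size, both figures used elsewhere in this article; the function name is illustrative.

```python
def effective_size_from_diversity(pairwise_diff, mu_per_year, generation_years):
    """Solve (average pairwise difference) = 4 * Ne * u for Ne."""
    mu_per_generation = mu_per_year * generation_years
    return pairwise_diff / (4 * mu_per_generation)


ne = effective_size_from_diversity(
    pairwise_diff=1e-3,   # one difference per 1,000 bases
    mu_per_year=1e-9,     # mutation rate per site per year
    generation_years=25,  # assumed generation time
)
print(round(ne))      # 10000
print(round(3 * ne))  # census size at three times Ne: 30000
```

The estimate scales inversely with the assumed mutation rate, so halving the rate doubles the inferred effective size.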
Now consider a coalescent tree from a population that originated from a small number of founders and then grew rapidly to a large size. During the time when it was large, few coalescent events occurred, but before it grew the population was very small and coalescence was rapid. A gene tree from such a population might look like Figure 2, a pattern described as "star-" or "comb-" like. The total branch length of the tree is nearly proportional to the sample size n; the mean pairwise difference between samples is slightly greater than 2Tu where T is the time since the population growth occurred; and the top of the tree is only slightly earlier than T generations ago.
Differences among DNA sequences sampled from trees like those in Figure 1 and Figure 2 are the basis for inferences about ancient demography from DNA. First, a mutation that occurs in the tree of Figure 2 is likely to occur in one of the long terminal branches, so that it will be found only once in the sample; a mutation in the tree of Figure 1 is likely to occur near the top and hence would be represented in the sample with many copies. Hence an excess of singletons and rare types is an indicator of population expansion in the past. Second, all the pairwise differences among sequences from Figure 2 will be roughly equal, since the times separating the sequences are similar. Pairwise differences from Figure 1, on the other hand, will be erratic and will differ widely among themselves.
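The contrast in singleton counts can be made concrete with a short Python sketch. It relies on a standard neutral-theory expectation that is not stated in this article: in a constant-size population, mutations carried by i of the n sampled genes occur in proportion to 1/i (assuming the ancestral state of each site is known). The star tree is treated as an idealized limit in which every branch is terminal. Function names are illustrative.

```python
def singleton_fraction_constant_size(n):
    """Expected share of mutations found exactly once among n genes.

    Assumes a constant-size population, where mutations present in
    i copies occur in proportion 1/i for i = 1 .. n-1.
    """
    return 1.0 / sum(1.0 / i for i in range(1, n))


def singleton_fraction_star_tree(n):
    """Idealized star tree: all branches are terminal, so all mutations
    are singletons regardless of sample size."""
    return 1.0


for n in (6, 50):
    print(n, singleton_fraction_constant_size(n), singleton_fraction_star_tree(n))
```

Real expanding populations fall between the two cases, which is why the test is phrased as an excess of singletons relative to the constant-size expectation.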
The first genetic marker to be studied intensively with an interest in coalescences was human mitochondrial DNA (mtDNA), the DNA that is contained in the cell's mitochondria rather than its nucleus. It was found that the human mtDNA tree was like that of Figure 2, with the time of expansion T estimated to be 80,000 years ago. This pattern indicated that the human species had a focal origin, and that the genetic contribution to modern humans of most of the world population of archaic humans, like the Neanderthals of Europe, was either nonexistent or vanishingly small.
Unfortunately, subsequent studies of other genetic systems, using nuclear rather than mitochondrial DNA, have not confirmed this picture. The issue of when and how the human population grew from only a few thousand to 6 billion is the subject of lively current debate. Some nuclear genes show no evidence of population expansion, while others show mild evidence of expansion, consistent with population growth since the end of the last ice age (around 12,000 years ago). The contending hypotheses are:
- (A) There was a founding event and subsequent population expansion about 100,000 years ago, as suggested by mtDNA and some other genetic systems, but pervasive natural selection in the nuclear genome has obscured the signature of this event.
- (B) A new genotype appeared before 100,000 years ago and spread throughout the species, leading to replacement of some of the genome and incorporation of genes from archaic populations at other parts of the genome. According to this hypothesis, mtDNA, the Y chromosome, and some other parts of the nuclear genome underwent replacement, but much of the nuclear genome did not.
- (C) The major numerical expansion of humans has taken place since the last ice age. Many nuclear genes coalesce about 1.8 million years ago, around the time of the expansion of modern humans' precursor species, Homo erectus, out of Africa. This corresponds to 72,000 generations at 25 years per generation. Under a constant population size model, in which the expected time to the MRCA is 4Ne generations, this would imply an effective population size Ne of 72,000 ÷ 4 = 18,000, which lies comfortably within the range of genetic estimates of Ne. In other words, the small human effective size reflects a focal origin 1.8 million years ago. According to this hypothesis, the evidence from human mtDNA of massive expansion over the last 100,000 years is an artifact.
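The figure of 18,000 in hypothesis C follows from inverting the constant-size expectation that the MRCA lies 4Ne generations back. A minimal Python sketch, with an illustrative function name and the 25-year generation time used in the text:

```python
def ne_from_coalescence_time(years, generation_years=25):
    """Effective size implied by a coalescence time of 4Ne generations."""
    generations = years / generation_years
    return generations / 4


print(ne_from_coalescence_time(1_800_000))  # 18000.0
```

Because the result is proportional to the assumed coalescence time, any error in dating the coalescence carries over directly to the implied effective size.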
Published estimates of coalescence times of nuclear genes are generally well below 1.8 million years, but the absolute values of these estimates should not be taken seriously. They rely on knowledge of the mutation rate, which is calculated from chimpanzee-human differences and the assumed time since the two species separated. Problems with this calculation may lead to a substantial overestimation of the rate. Taking this into account, almost all of the data on coalescence times of nuclear genes are consistent with hypothesis C, but hypothesis C is inconsistent with the expansion signature in mitochondrial DNA and the evidence of expansion from some other families of markers. Scholars at the beginning of the twenty-first century found some version of hypothesis B to be the most promising.
Human Diversity in Detail
The example given above showing an estimate of human Ne of 10,000 was based on single nucleotide differences along a pair of genes. These differences are called Single Nucleotide Polymorphisms (SNPs), and they are one of several important classes of genetic markers. Another class is repeat polymorphisms, genes in which a DNA sequence is repeated a variable number of times. Many of these are in non-coding DNA, but they also occur in genes and affect the protein that the gene produces. Generically these are called Variable Number of Tandem Repeats (VNTRs). If the repeat motifs are very short (two to four bases), they are called Short Tandem Repeats (STRs) or microsatellites. Commonly used STRs for identification and for studies of evolution are tetranucleotide repeats, with four-base motifs, and dinucleotide repeats, with two-base motifs. Trinucleotide repeats are less likely to be neutral since, with three-base motifs, they can affect genes more easily.
The earlier example suggested that SNP density is a natural statistic for describing genetic diversity within a population. The corresponding natural statistic for VNTR loci is the variance of repeat length in a sample of genes. Mutations change repeat length by a small amount, so that the mean squared difference between two chromosomes, as well as the probability they are the same length, should be monotonically related to the time elapsed since the common ancestor of the chromosomes, hence to the effective size of the population.
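The two repeat-length statistics mentioned above are closely linked: averaged over all distinct pairs in a sample, the mean squared difference in repeat length is exactly twice the sample variance. The Python sketch below shows both for a made-up set of repeat counts; the function names and data are illustrative only.

```python
from itertools import combinations
from statistics import variance


def repeat_length_variance(lengths):
    """Sample variance of repeat counts at one locus."""
    return variance(lengths)


def mean_squared_difference(lengths):
    """Mean squared difference in repeat count over all distinct pairs."""
    pairs = list(combinations(lengths, 2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


# Made-up repeat counts for five chromosomes at one STR locus.
lengths = [10, 11, 11, 12, 14]
print(repeat_length_variance(lengths))
print(mean_squared_difference(lengths))
```

Either statistic can therefore stand in for the other when comparing diversity across populations at the same locus.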
Human within-population diversity is highest in Africa and declines as one moves away from Africa. This is seen in the scatterplot in Figure 3, taken from Henry C. Harpending and Elise Eller's 1999 work. The horizontal axis of the figure is genetic distance from Africa; the vertical axis is average heterozygosity, the probability that two STR genes differ in length, averaged over 60 short repeat polymorphisms. The plot shows the relationship between how genetically different a population is from the African average and within-population genetic diversity. Populations more different from Africans are less diverse, and the relationship is nearly linear. This pattern is thought to be part of the signature of the African origin of our species and the loss of diversity associated with repeated founder effects during colonization at the edge of the expansion. While Figure 3 only includes Old World populations, other data sources show that the decline continues into the New World: American Indian populations are 15 to 25 percent less diverse than African populations. Direct studies of SNP density are likely to shed further light on this pattern as additional data become available.
Just as within-population diversity describes how different two genes from the same population are on average, between-population diversity describes how much greater the average difference between genes from different populations is, relative to overall average gene differences. In other words, total diversity of a sample of populations can be partitioned into within- and between-population components in a way completely analogous to the analysis of variance in statistics.
The fraction of diversity between populations is conventionally written as Fst. Various kinds of genetic data have been used to estimate Fst. For large human populations like those in Figure 3, all the estimates are in the range of 10 to 15 percent. One interpretation of this is that if the world's peoples were to mate at random, average within-population diversity would increase by this amount; another interpretation is that Fst measures relative within-population similarity or shared genetic material. Thus the excess shared genetic material within human subpopulations relative to the whole world is 10 to 15 percent. In a famous early discussion of this point, Richard Lewontin (1972) emphasized that 10 to 15 percent is a small amount and drew the conclusion that differences among human populations or races are insignificant. On the other hand, the excess shared genetic material within a population between grandparent and grandchild is 12.5 percent (one-eighth), and society does not regard the genetic similarity between grandparent and grandchild as trivial.
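The partition of diversity described above can be sketched in Python. This is one simple formulation, assuming that Fst is computed as the shortfall of average within-population diversity relative to total diversity, with populations weighted equally. The function name and the diversity values are hypothetical and are not taken from any dataset discussed here.

```python
def fst(within_diversities, total_diversity):
    """Fraction of total diversity that lies between populations.

    Assumes equal weighting of populations when averaging
    within-population diversity.
    """
    mean_within = sum(within_diversities) / len(within_diversities)
    return (total_diversity - mean_within) / total_diversity


# Made-up heterozygosities for three populations and for the pooled sample.
within = [0.70, 0.66, 0.62]
total = 0.75
print(round(fst(within, total), 2))  # 0.12
```

When every population is as diverse as the pooled sample the value is zero, which corresponds to the random-mating case mentioned above.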
Harpending, Henry C., and Elise Eller. 1999. "Human Diversity and Its History." Biodiversity pp. 301–314.
Harpending, Henry C., and Alan Rogers. 2001. "Genetic Perspectives on Human Origins and Differentiation." Annual Review of Genomics and Human Genetics 1: 361–385.
Hudson, Richard R. 1990. "Gene Genealogies and the Coalescent Process." Oxford Surveys in Evolutionary Biology 7: 1–44.
Lewontin, Richard C. 1972. "The Apportionment of Human Diversity." In Evolutionary Biology, Vol. 6, ed. Theodosius H. Dobzhansky, Max K. Hecht, and William C. Steere. New York: Appleton-Century-Crofts.
Rogers, Alan R., and Lynn B. Jorde. 1995. "Genetic Evidence on the Origin of Modern Humans." Human Biology 67: 1–36.
Takahata, Naoyuki, Sang-Hee Lee, et al. 2001. "Testing Multiregionality of Modern Human Origins." Molecular Biology and Evolution 18: 172–183.
Henry C. Harpending