PopHuman is a population genomics-oriented genome browser, based on JBrowse, that contains a complete inventory of nucleotide diversity metrics, linkage disequilibrium, recombination rates and neutrality tests in sliding windows along the human genome estimated from the 1000 Genomes Project data using the PopGenome software.
For a more detailed explanation of the source data and how it has been processed for PopHuman, we recommend reading through the Help → Data description, Help → Genome accessibility, and Help → Population differentiation pages. In addition, Help → Tracks description defines all the available tracks for windows-based analyses in PopHuman, and Help → Integrative MKT explains the extended MKT method applied to both windows-based and gene-based analyses.
PopHuman is designed to help test evolutionary hypotheses from a population genetics perspective. Below are some examples of questions that can be answered using PopHuman. The questions are sequential, and the whole guide constitutes an example of how to analyze a specific region of the genome.
Our region of interest in this tutorial will be chr7:142566051..142586393, a genomic area of around 20 kb on chromosome 7, where the TRPV6 gene is located. TRPV6 is a well-studied protein coding gene that encodes a member of a family of multipass membrane proteins. The protein functions as an epithelial calcium channel involved in the absorption of dietary calcium in the intestine. It has been suggested that TRPV6 may have co-evolved with lactose tolerance, as fresh milk is a major source of calcium in European populations (Hughes et al. 2008). The region has experienced parallel selective sweeps in non-African populations, coinciding with the establishment of agriculture first in Europe about 10,000 years ago, and later in Asia.
To display the target region, type either its coordinates or the TRPV6 gene name into the corresponding box in the navigation bar, and then press the “Go” button (Figure 1). The region of interest can be highlighted by clicking the marker button located next to the genomic coordinates box and then clicking and dragging over the region. This allows you to explore broader genomic areas without losing track of the region of interest.
Figure 1. Genomic coordinates box and highlight utility button.
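When the same region has to be reused outside the browser (for example, to subset downloaded files), it can help to handle the coordinate string programmatically. The following is a minimal Python sketch, not part of PopHuman: the function names `parse_region` and `region_length` are hypothetical, and the sketch assumes that both coordinates are inclusive.

```python
import re

def parse_region(region: str) -> tuple[str, int, int]:
    """Split a 'chr:start..end' string into (chromosome, start, end)."""
    # Thousands separators such as '142,566,051' are tolerated and removed.
    match = re.fullmatch(r"(\w+):(\d+)\.\.(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end

def region_length(start: int, end: int) -> int:
    """Length in bp, assuming both coordinates are inclusive."""
    return end - start + 1

chrom, start, end = parse_region("chr7:142566051..142586393")
print(chrom, region_length(start, end))  # chr7 20343
```

The result, 20,343 bp, matches the "around 20 kb" size quoted for the tutorial region.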
The PopHuman browser includes a report for each gene annotation, as well as direct links to NCBI and UCSC. Hence, information regarding location, structure, sequence, gene ontology, gene expression, literature, etc. about the gene of interest can be accessed directly from the browser interface.
By right-clicking the gene of interest, a drop-down menu appears with several options (including “View details”, as well as the ones to search for the specific gene at both NCBI and UCSC) (Figure 2).
Figure 2. Drop-down menu with options for annotated genes.
PopHuman includes more than 1,000 tracks, comprising both general tracks of the human hg19 reference sequence and variation metrics for the 26 1000GP populations. Given the large number of tracks available, these can be filtered and selected using the “Select tracks” tool, which is displayed in the top left corner (below the navigation bar). The tool narrows your search until you find and select the track of interest, and the process can be repeated until all the desired tracks are selected. Filtering is normally performed by refining your search using the menu on the left, in which track features are classified into four major categories: Reference tracks, Data selection (by metapopulation and/or population), Variation statistics, and Visualization (only one window size, 10 kb, is available for now).
Tracks for the European CEU population (Utah Residents (CEPH) with Northern and Western European Ancestry) will be displayed throughout this tutorial. This population includes 99 individuals. To filter tracks for this population, navigate to the “Data selection” section from the left-side menu of the “Track selector” tab (located in the top left corner, below the navigation bar) and select Population: CEU (Figure 3).
Figure 3. Data selection: CEU population.
Natural selection leaves signatures in the genome that can now be identified by comparing polymorphic sequences within populations and divergent sequences between species:
Other signatures might include relatively high levels of divergence (high K), and low haplotype diversity (low hap_diversity_within).
In this section we are going to explore the nucleotide diversity and divergence levels on the target region, as well as the distribution of rare alleles (through the Tajima's D statistic) and the degree of linkage disequilibrium.
Use the “Select tracks” tool to select Pi, theta, hap_diversity_within, K, Kelly_ZnS (average of r²) and Tajima_D metrics, computed in 10 kb non-overlapping windows, for the CEU population (Figure 4).
Figure 4. Procedure to select certain tracks (in this case, Pi, theta and hap_diversity_within).
Figure 5 represents the target region (with the location of the gene TRPV6 highlighted in yellow) with 6 activated tracks: Pi, theta, Tajima D, K, hap_diversity_within, and Kelly_ZnS. This region shows a clear decrease in nucleotide diversity (Pi and theta, in blue), and a skew towards rare derived alleles (negative Tajima_D, in red). Furthermore, divergence (K, in brown) is increased in the specific region of the TRPV6 gene, as is r² (Kelly_ZnS, in pink), while haplotype diversity (hap_diversity_within, in blue) is relatively low, especially compared to African populations.
Figure 5. Region of interest with the location of the gene TRPV6 highlighted, with the tracks Pi, theta, Tajima D, K, hap_diversity_within, and Kelly_ZnS activated for the CEU population (10 kb window size).
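To make the relationship between Pi, theta and Tajima's D more concrete, the sketch below computes the three quantities for a single window from a list of derived-allele counts, using the standard formulas for nucleotide diversity, Watterson's estimator and Tajima's D. It is only an illustration under stated assumptions: the function name `diversity_summary` and the toy counts are hypothetical, the values are totals per window rather than per-site rates, and PopHuman's own tracks are computed with PopGenome, whose exact conventions may differ.

```python
import math

def diversity_summary(derived_counts, n):
    """Return (pi, theta_w, tajima_d) for one window.

    derived_counts: derived-allele count at each site (0..n).
    n: number of sampled chromosomes (at least 4 in this sketch).
    """
    if n < 4:
        raise ValueError("this sketch needs at least 4 chromosomes")
    # Only segregating sites contribute to the statistics.
    seg = [k for k in derived_counts if 0 < k < n]
    s = len(seg)
    if s == 0:
        return 0.0, 0.0, None
    # Pi: average number of pairwise differences.
    pi = sum(2 * k * (n - k) for k in seg) / (n * (n - 1))
    # Watterson's theta: segregating sites scaled by a harmonic number.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = s / a1
    # Constants for the variance term of Tajima's D.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    d = (pi - theta_w) / math.sqrt(variance)
    return pi, theta_w, d

# Toy data for 10 chromosomes: five singletons vs. five mid-frequency variants.
pi_rare, theta_rare, d_rare = diversity_summary([1, 1, 1, 1, 1], 10)
pi_mid, theta_mid, d_mid = diversity_summary([5, 5, 5, 5, 5], 10)
print(d_rare < 0, d_mid > 0)  # True True
```

An excess of rare alleles pulls Pi below theta and makes Tajima's D negative, which is the pattern described for the TRPV6 region in CEU.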
To test this hypothesis, PopHuman contains several statistics derived from the McDonald and Kreitman test (MKT) (McDonald and Kreitman 1991), including NI, α, and DoS, computed in sliding windows. Table 1 summarizes these statistics:
Table 1. Basic tests of neutrality for protein coding sequences in PopHuman.
| Statistic | Description | Formula | Interpretation |
|---|---|---|---|
| NI | Neutrality index, which summarizes the four values in a MKT (McDonald and Kreitman 1991) table as a ratio of ratios (Rand and Kann 1996). | NI = (Pi0f/Pi4f) / (K0f/K4f) | =1: neutral evolution; <1: positive selection; >1: negative selection |
| α (alpha, alpha_cor) | Proportion of substitutions that are adaptive based on the MK test (McDonald and Kreitman 1991). α represents the proportion of adaptive evolution (Charlesworth 1994; Smith and Eyre-Walker 2002). | α = 1 - ((P0f/P4f) / (D0f/D4f)) | =0: neutral evolution; <0: negative selection; >0: positive selection |
| FisherPval (Fisher1, Fisher2) | Fisher's p-value for the MKT (McDonald and Kreitman 1991) 2x2 contingency table, which contains Dneu (divergence in 4-fold coding sites), Dsel (divergence in 0-fold coding sites), Pneu (polymorphism in 4-fold coding sites) and Psel (polymorphism in 0-fold coding sites). | | >0.05: neutral evolution; <0.05: negative or positive selection |
| DoS | Direction of Selection: difference between the proportion of nonsynonymous divergence and nonsynonymous polymorphism (Stoletzki and Eyre-Walker 2011). | DoS = (K0f/(K0f+K4f)) - (Pi0f/(Pi0f+Pi4f)) | =0: neutral evolution; <0: negative selection; >0: positive selection |
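The formulas in Table 1 can be checked by hand on a single 2x2 MKT table. The sketch below is a minimal illustration, assuming that plain counts of polymorphic and divergent sites at 0-fold and 4-fold positions are used as inputs. The function names and the toy counts are hypothetical and are not PopHuman output, and the uncorrected α computed here (1 - NI) is not the same as the alpha_cor track.

```python
from math import comb

def mkt_statistics(p0, p4, d0, d4):
    """Return (NI, alpha, DoS) from counts of polymorphic (p) and divergent (d)
    sites at 0-fold (selected) and 4-fold (neutral) positions."""
    if p4 == 0 or d0 == 0 or d4 == 0:
        raise ValueError("NI is undefined when p4, d0 or d4 is zero")
    ni = (p0 / p4) / (d0 / d4)
    alpha = 1 - ni  # uncorrected alpha, as in the Table 1 formula
    dos = d0 / (d0 + d4) - p0 / (p0 + p4)
    return ni, alpha, dos

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))

# Toy counts: 2 and 42 polymorphic sites, 7 and 17 divergent sites.
ni, alpha, dos = mkt_statistics(p0=2, p4=42, d0=7, d4=17)
p_value = fisher_exact_two_sided(7, 17, 2, 42)
print(round(ni, 3), round(alpha, 3), round(dos, 3), p_value < 0.05)
```

With these toy counts NI is below 1 and both α and DoS are positive, so the three statistics agree on an excess of nonsynonymous divergence.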
Figure 6 below shows the target region with the estimates of NI, α (alpha_cor), Fisher’s P-value (Fisher2) and DoS, estimated in 10 kb non-overlapping windows by considering protein coding sites only. Two estimates predict that recurrent adaptive evolution has occurred in the region of the TRPV6 gene: NI = 0.46 (lower than 1); DoS = 0.18 (greater than 0). However, α = -0.25 (negative), and the Fisher Exact Test is not significant. In this case, the signal of positive selection is largely disguised by the stronger signal of negative selection maintaining the sequence and function of the gene.
Figure 6. Region of interest with NI, Mkalpha (alpha_cor), FisherPval (Fisher2) and DoS tracks selected, CEU population, 10 kb window size.
We can expand this information by browsing the results of the Integrative MKT (see Help → Integrative MKT) specifically for gene TRPV6.
Figure 7A shows a section of the Integrative MKT report for gene TRPV6, which contains valuable statistics. Note first the value of the ratio Ka/Ks (Li et al. 1985; Nei and Gojobori 1986), Ka/Ks = 0.48. Although still lower than 1 (indicative of negative selection), this value is quite high compared to those found in most protein coding sequences. As before, the signal of positive selection is largely disguised by the stronger signal of negative selection maintaining the sequence and function of the gene.
Below the first table there is the Derived Allele Frequency (DAF) plot (Figure 7B). It shows that, while the African YRI population contains alleles at medium frequencies, the frequency of the alleles in both the European CEU and the Asian CHB populations is skewed towards either rare or high frequency alleles, especially in the Asian population. This is in accordance with the negative Tajima’s D, Fu and Li’s D and F, and Fay and Wu’s H values we saw before. The high frequency alleles are those that did not reach fixation because recombination broke their linkage with the selected variant during the selective sweep, and the rare alleles are those that entered the population by mutation after the sweep.
Below this plot, different versions of the MKT (see Help → Integrative MKT) are shown for different functional regions of the gene (0-fold, 5’UTR, 3’UTR, intron, and intergenic flanking regions, all against 4-fold protein coding sites). The results show that both positive and negative selection have acted in this region.
Figure 7. (A) Section of the Integrative MKT report for the TRPV6 gene in CEU, CHB and YRI populations. (B) Derived Allele Frequency (DAF) plot for the TRPV6 gene in CEU, CHB and YRI populations.
In order to examine genetic variation at specific sites of interest, you might need to download raw VCF files with the subset of variants located in the region of interest. PopHuman incorporates a utility specifically designed for easy and fast downloading of such information. It can be accessed from the main browser interface by clicking the corresponding button, or from the “Resources” menu.
Once in the utility, the user is required to provide the genomic coordinates (chr, start, and end) of the target region (the coordinates of the region currently being displayed are taken by default), as well as the population(s) of interest (Figure 8). Clicking the “Download” button then downloads a zip file containing the requested VCF, which can be examined using an appropriate tool (e.g. VCFtools or BCFtools).
Figure 8. Download sequences utility: chr7:142363887..143174636 genomic region and CEU population selected.
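Besides VCFtools or BCFtools, a downloaded VCF can be inspected with a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name, the sample names S1 and S2 and the three records are hypothetical toy data, the genotype is assumed to be the first FORMAT field, and chromosome names are compared with any "chr" prefix removed, since the naming used in the downloaded file is not specified here.

```python
def alt_allele_frequencies(vcf_lines, chrom, start, end):
    """Return {position: non-reference allele frequency} for records in a region.

    Genotypes are read from the first FORMAT field (assumed to be GT);
    missing alleles ('.') are ignored.
    """
    def strip_chr(name):
        return name[3:] if name.startswith("chr") else name

    frequencies = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if strip_chr(fields[0]) != strip_chr(chrom) or not (start <= pos <= end):
            continue
        alt_count = total = 0
        for sample in fields[9:]:
            genotype = sample.split(":")[0]
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue
                total += 1
                alt_count += allele != "0"
        if total:
            frequencies[pos] = alt_count / total
    return frequencies

# Toy VCF with two hypothetical samples (S1, S2); not real 1000GP data.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2"]
records = [
    ["7", "142566100", ".", "A", "G", ".", "PASS", ".", "GT", "0|1", "1|1"],
    ["7", "142570000", ".", "C", "T", ".", "PASS", ".", "GT", "0|0", "0|1"],
    ["7", "142600000", ".", "G", "A", ".", "PASS", ".", "GT", "0|1", "0|0"],
]
toy_vcf = ["##fileformat=VCFv4.1", "\t".join(header)] + ["\t".join(r) for r in records]

freqs = alt_allele_frequencies(toy_vcf, "chr7", 142566051, 142586393)
print(freqs)  # {142566100: 0.75, 142570000: 0.25}
```

The third toy record lies outside the tutorial region and is therefore excluded from the result.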
To download track data from the browser interface, place the mouse over the track title (on the left), press the arrow which appears to the right of the track label, and select “Save track data”. In the dialog, define the region of interest, the output format (bedGraph, Wiggle or GFF3), and the file name, and press the “Save” button (Figure 9).
Figure 9. Downloading track data.
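A track saved in bedGraph format is a plain-text file with one window per line (chromosome, start, end, value), so it is easy to post-process. The sketch below is a minimal example under stated assumptions: the function name `read_bedgraph`, the track name and the values are invented for illustration and are not real PopHuman estimates.

```python
def read_bedgraph(lines):
    """Parse bedGraph lines into (chrom, start, end, value) tuples."""
    windows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and optional 'track'/'browser' header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        windows.append((chrom, int(start), int(end), float(value)))
    return windows

# Toy bedGraph content with invented values in 10 kb windows.
toy_bedgraph = """track type=bedGraph name="hypothetical_Pi_CEU"
chr7\t142550000\t142560000\t0.0011
chr7\t142560000\t142570000\t0.0004
chr7\t142570000\t142580000\t0.0002
chr7\t142580000\t142590000\t0.0009
"""

windows = read_bedgraph(toy_bedgraph.splitlines())
lowest = min(windows, key=lambda w: w[3])
print(lowest)  # ('chr7', 142570000, 142580000, 0.0002)
```

The same approach can be used to locate, for instance, the window with the lowest diversity or the most negative Tajima's D in a saved track.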
To upload your own tracks, navigate to Track → Open track file or URL from the navigation menu located in the top-left area of the browser interface (Figure 10). Next, browse for your local file or paste a remote URL in the corresponding box and customize the file type and display options. Finally, select whether you want to open your custom track immediately or add it to the track set, and press the “Open” button.
Figure 10. Uploading custom tracks.
PopHuman allows any particular browser instance to be saved so that it can be shared or visualized later on. This can be done through the “Share” button located in the top-right corner of the browser interface (Figure 11). Clicking it lets the user copy the corresponding URL, which will render exactly the same PopHuman instance that is currently displayed.
Figure 11. Saving the current browser instance.
PopHuman share URL with all activated tracks used in this guide: link
- Charlesworth, B. (1994) The effect of background selection against deleterious mutations on weakly selected, linked variants. Genet. Res. 63:213-27.
- Hughes, D. A., Tang, K., Strotmann, R., Schöneberg, T., Prenen, J., Nilius, B. and Stoneking, M. (2008) Parallel selection on TRPV6 in human populations. PLoS ONE 3(2):1-13.
- Li, W. H., et al. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150-74.
- McDonald, J. H. and Kreitman, M. (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.
- Nei, M. and Gojobori T. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.
- Rand, D. M., and Kann, L. M. (1996) Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Mol. Biol. Evol. 13:735-748.
- Smith, N. G. and Eyre-Walker, A. (2002) Adaptive protein evolution in Drosophila. Nature 415:1022-4.
- Stoletzki, N. and Eyre-Walker, A. (2011) Estimation of the Neutrality Index. Mol. Biol. Evol. 28:63-70.