{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Tutorial: The UseR Conference 2016\n", " \n", "\n", "### Genome-Wide Association Analysis and Post-Analytic Interrogation with R \n", "\n", " \n", "### Part 1\n", " \n", "\n", "#### Instructor: Professor Andrea S. Foulkes \n", "\n", "Department of Mathematics and Statistics, Mount Holyoke College" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Acknowledgements: \n", "\n", " Special thanks to Sara Nunez , Reserach Scientist, Mount Holyoke College, for work on the development of code snippets.\n", "\n", "Support for this tutorial was provided by:\n", "\n", "NIH/NHLBI R01 107196 (PI: Foulkes)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### A Jupyter notebook \n", "for this tutorial is available at:\n", "\n", " \n", "\n", " \n", "\n", "http://nbviewer.jupyter.org/urls/www.mtholyoke.edu/courses/afoulkes/UseR2016/UseR2016-Part1.ipynb" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Outline\n", " \n", "#### Part 1: Core analytic components of a standard GWAS\n", "\n", "1. Data preprocessing \n", "2. Data generation \n", "3. Association analysis\n", "4. Post-analytic visualization\n", "\n", " \n", "#### Part 2: Interrogation of regional and higher-order associations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Part 1: Core analytic components of a standard GWAS." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### I. Data preprocessing\n", " \n", "#### Step 1: Creation of R objects" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data files: \n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Download and unzip data needed for this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "library(downloader)\n", "download(\"https://www.mtholyoke.edu/courses/afoulkes/UseR2016_Tutorial_Files.zip\",\n", " \"UseR2016_Tutorial_Files.zip\")\n", "unzip(\"UseR2016_Tutorial_Files.zip\", exdir = \"/Users/afoulkes/UseR2016/\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Reading in SNP data:\n", "\n", "- Use the read.plink() function in the Bioconductor snpStats package to read the \\textit{.fam}, \\textit{.bim} and \\textit{.bed} files into R." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: survival\n", "Loading required package: Matrix\n" ] }, { "data": { "text/html": [ "$names = \n", "\t 1. 'genotypes' 2. \n", "\t 3. 'fam' 4. \n", "\t 5. 'map' 6. \n", " \n" ], "text/latex": [ "\\textbf{\\$names} = \\begin{enumerate*}\n", "\\item 'genotypes'\n", "\\item 'fam'\n", "\\item 'map'\n", "\\end{enumerate*}\n" ], "text/markdown": [ "**$names** = 1. 'genotypes'\n", "2. 'fam'\n", "3. 'map'\n", "\n", "\n" ], "text/plain": [ "$names\n", "[1] \"genotypes\" \"fam\" \"map\" \n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "library(snpStats)\n", "setwd(\"/Users/afoulkes/UseR2016/\")\n", "geno <- read.plink(\"UseR2016_Tutorial_Files/GWAStutorial\", \n", " na.strings = (\"-9\")) \n", "attributes(geno) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Manipulating data:\n", "\n", "- Obtain the genotype of the resulting list. This is a SnpMatrix object, a matrix of genotype data with a column for each SNP and a row for each study participant. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A SnpMatrix with 1401 rows and 861473 columns\n", "Row names: 10002 ... 11596 \n", "Col names: rs10458597 ... rs5970564 \n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
rs10458597rs12565286rs12082473rs3094315rs2286139rs11240776rs2980319
10002 2 2 2 2 2NA 2
100042021222
100052222222
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllll}\n", " & rs10458597 & rs12565286 & rs12082473 & rs3094315 & rs2286139 & rs11240776 & rs2980319\\\\\n", "\\hline\n", "\t10002 & 2 & 2 & 2 & 2 & 2 & NA & 2\\\\\n", "\t10004 & 2 & 0 & 2 & 1 & 2 & 2 & 2\\\\\n", "\t10005 & 2 & 2 & 2 & 2 & 2 & 2 & 2\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "1. 2\n", "2. 2\n", "3. 2\n", "4. 2\n", "5. 0\n", "6. 2\n", "7. 2\n", "8. 2\n", "9. 2\n", "10. 2\n", "11. 1\n", "12. 2\n", "13. 2\n", "14. 2\n", "15. 2\n", "16. NA\n", "17. 2\n", "18. 2\n", "19. 2\n", "20. 2\n", "21. 2\n", "\n", "\n" ], "text/plain": [ " rs10458597 rs12565286 rs12082473 rs3094315 rs2286139 rs11240776 rs2980319\n", "10002 2 2 2 2 2 NA 2\n", "10004 2 0 2 1 2 2 2\n", "10005 2 2 2 2 2 2 2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "genotype <- geno$genotypes\n", "print(genotype) \n", "genotypeNumeric <- as(genotype,\"numeric\")\n", "genotypeNumeric[1:3,1:7] " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ " \n", "\t 1. 1401 2. \n", "\t 3. 861473 4. \n", " \n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1401\n", "\\item 861473\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1401\n", "2. 861473\n", "\n", "\n" ], "text/plain": [ "[1] 1401 861473" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(genotype)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Manipulating data:\n", "\n", "- Obtain the SNP information table:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " chr SNP gen.dist position A1 A2\n", "rs10458597 1 rs10458597 0 564621 C\n", "rs12565286 1 rs12565286 0 721290 G C\n", "rs12082473 1 rs12082473 0 740857 T C\n", "rs3094315 1 rs3094315 0 752566 C T\n", "rs2286139 1 rs2286139 0 761732 C T\n", "rs11240776 1 rs11240776 0 765269 G A\n" ] } ], "source": [ "genoBim <- geno$map\n", "colnames(genoBim) <- c(\"chr\", \"SNP\", \"gen.dist\", \"position\", \"A1\", \"A2\")\n", "print(head(genoBim))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Clinical data components:\n", "\n", "- ID\n", "- coronary artery disease (CAD): 0 = control and 1 = case; \n", "- sex: 1 = male, 2 = female;\n", "- age (years); \n", "- triglyceride level (TG - mg/dL) \n", "- high-density lipoprotein cholesterol (HDL-C - mg/dL) \n", "- low-density lipoprotein cholesterol (LDL-C - mg/dL)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Reading in clinical data:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " FamID CAD sex age tg hdl ldl\n", "10002 10002 1 1 60 NA NA NA\n", "10004 10004 1 2 50 55 23 75\n", "10005 10005 1 1 55 105 37 69\n", "10007 10007 1 1 52 314 54 108\n", "10008 10008 1 1 58 161 40 94\n", "10009 10009 1 1 59 171 46 92\n" ] } ], "source": [ "clinical <- read.csv(\"UseR2016_Tutorial_Files/GWAStutorial_clinical.csv\",\n", " colClasses=c(\"character\", \"factor\", \"factor\", rep(\"numeric\", 4)))\n", "rownames(clinical) <- clinical$FamID\n", "print(head(clinical)) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### I. Data preprocessing\n", " \n", "#### Step 1: Creation of R objects$\\checkmark$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Step 2: SNP level filtering (part 1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Step 2: SNP level filtering (part 1)\n", "\n", "Once the data is loaded, we are ready to remove SNPs that fail to meet minimum criteria due to missing data, low variability or genotyping errors. \n", "\n", "We address these by first filtering on call rate and minor allele frequency (MAF)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNP level characteristics:\n", "\n", "snpStats provides functions, col.summary() and row.summary(), that return statistics on SNPs and samples, respectively." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", " CallsCall.rateCertain.callsRAFMAFP.AAP.ABP.BBz.HWE rs104585971398.0000000 0.9978587 1.0000000 1.0000000 0.0000000 0.0000000 0.0000000 1.0000000 NA rs125652861384.00000000 0.98786581 1.00000000 0.94833815 0.05166185 0.00433526 0.09465318 0.90101156 -1.26529432 rs120824731.369000e+039.771592e-011.000000e+009.985391e-011.460920e-030.000000e+002.921841e-039.970782e-015.413314e-02 rs30943151386.00000000 0.98929336 1.00000000 0.82178932 0.17821068 0.04761905 0.26118326 0.69119769 -4.03172248 rs22861391364.00000000 0.97359029 1.00000000 0.86217009 0.13782991 0.02199413 0.23167155 0.74633431 -0.93146122 rs112407761.269000e+039.057816e-011.000000e+009.988180e-011.182033e-030.000000e+002.364066e-039.976359e-014.215743e-02 \n" ], "text/latex": [ "\\begin{tabular}{r|lllllllll}\n", " & Calls & Call.rate & Certain.calls & RAF & MAF & P.AA & P.AB & P.BB & z.HWE\\\\\n", "\\hline\n", "\trs10458597 & 1398.0000000 & 0.9978587 & 1.0000000 & 1.0000000 & 0.0000000 & 0.0000000 & 0.0000000 & 1.0000000 & NA\\\\\n", "\trs12565286 & 1384.00000000 & 0.98786581 & 1.00000000 & 0.94833815 & 0.05166185 & 0.00433526 & 0.09465318 & 0.90101156 & -1.26529432\\\\\n", "\trs12082473 & 1.369000e+03 & 9.771592e-01 & 1.000000e+00 & 9.985391e-01 & 1.460920e-03 & 0.000000e+00 & 2.921841e-03 & 9.970782e-01 & 5.413314e-02\\\\\n", "\trs3094315 & 1386.00000000 & 0.98929336 & 1.00000000 & 0.82178932 & 0.17821068 & 0.04761905 & 0.26118326 & 0.69119769 & -4.03172248\\\\\n", "\trs2286139 & 1364.00000000 & 0.97359029 & 1.00000000 & 0.86217009 & 0.13782991 & 0.02199413 & 0.23167155 & 0.74633431 & -0.93146122\\\\\n", "\trs11240776 & 1.269000e+03 & 9.057816e-01 & 1.000000e+00 & 9.988180e-01 & 1.182033e-03 & 0.000000e+00 & 2.364066e-03 & 9.976359e-01 & 4.215743e-02\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " Calls Call.rate Certain.calls RAF MAF P.AA\n", "rs10458597 1398 0.9978587 1 1.0000000 0.000000000 0.00000000\n", "rs12565286 1384 0.9878658 1 0.9483382 0.051661850 0.00433526\n", "rs12082473 1369 0.9771592 1 0.9985391 0.001460920 0.00000000\n", "rs3094315 1386 0.9892934 1 0.8217893 0.178210678 0.04761905\n", "rs2286139 1364 0.9735903 1 0.8621701 0.137829912 0.02199413\n", "rs11240776 1269 0.9057816 1 0.9988180 0.001182033 0.00000000\n", " P.AB P.BB z.HWE\n", "rs10458597 0.000000000 1.0000000 NA\n", "rs12565286 0.094653179 0.9010116 -1.26529432\n", "rs12082473 0.002921841 0.9970782 0.05413314\n", "rs3094315 0.261183261 0.6911977 -4.03172248\n", "rs2286139 0.231671554 0.7463343 -0.93146122\n", "rs11240776 0.002364066 0.9976359 0.04215743" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "snpsum.col <- col.summary(genotype)\n", "head(snpsum.col)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Call rate and minor allele frequency (MAF) filtering:\n", "\n", "Using these summary statistics, we keep the subset of SNPs that meet our criteria for minimum call rate and MAF." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A SnpMatrix with 1401 rows and 658186 columns\n", "Row names: 10002 ... 11596 \n", "Col names: rs12565286 ... rs5970564 \n" ] } ], "source": [ "# Setting thresholds\n", "call <- 0.95\n", "minor <- 0.01\n", "\n", "# Filter on MAF and call rate\n", "use <- with(snpsum.col, (!is.na(MAF) & MAF > minor) & Call.rate >= call)\n", "use[is.na(use)] <- FALSE # Remove NA's as well\n", " \n", "# Subset genotype and SNP summary data for SNPs that pass call rate and MAF criteria\n", "genotype <- genotype[,use]\n", "snpsum.col <- snpsum.col[use,]\n", "\n", "print(genotype) # started with 861,473 SNPs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### I. Data preprocessing\n", " \n", "#### Step 1: Creation of R objects$\\checkmark$\n", "#### Step 2: SNP level filtering (part 1)$\\checkmark$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Step 3: Sample level filtering " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Step 3: Sample level filtering\n", "\n", "Next we remove individuals due to missing data, sample contamination, correlation (for population-based investigations), and racial/ethnic or gender ambiguity or discordance. \n", "\n", "We address these issues by filtering on call rate, heterozygosity, cryptic relatedness and duplicates, and ancestry." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Calculate SNP level call rate and heterozygosity:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", " Call.rateCertain.callsHeterozygosity 100020.98265541.00000000.3289825 100040.98915811.00000000.3242931 100050.99184271.00000000.3231825 100070.98610271.00000000.3241469 100080.98233331.00000000.3228218 100090.99130341.00000000.3213658 \n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " & Call.rate & Certain.calls & Heterozygosity\\\\\n", "\\hline\n", "\t10002 & 0.9826554 & 1.0000000 & 0.3289825\\\\\n", "\t10004 & 0.9891581 & 1.0000000 & 0.3242931\\\\\n", "\t10005 & 0.9918427 & 1.0000000 & 0.3231825\\\\\n", "\t10007 & 0.9861027 & 1.0000000 & 0.3241469\\\\\n", "\t10008 & 0.9823333 & 1.0000000 & 0.3228218\\\\\n", "\t10009 & 0.9913034 & 1.0000000 & 0.3213658\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " Call.rate Certain.calls Heterozygosity\n", "10002 0.9826554 1 0.3289825\n", "10004 0.9891581 1 0.3242931\n", "10005 0.9918427 1 0.3231825\n", "10007 0.9861027 1 0.3241469\n", "10008 0.9823333 1 0.3228218\n", "10009 0.9913034 1 0.3213658" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "snpsum.row <- row.summary(genotype)\n", "head(snpsum.row)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Calculating the F statistic (inbreeding coefficient):\n", "\n", "\\begin{equation*}\n", "|F|=(1−O/E)\n", "\\end{equation*}\n", "\n", "where O and E are the observed and expected proportions of heterozygous genotypes for a given sample based on the MAFs." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "MAF <- snpsum.col$MAF\n", "callmatrix <- !is.na(genotype)\n", "hetExp <- callmatrix %*% (2*MAF*(1-MAF))\n", "hetObs <- with(snpsum.row, Heterozygosity*(ncol(genotype))*Call.rate)\n", "snpsum.row$hetF <- 1-(hetObs/hetExp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Call rate and heterozygosity filtering:\n", "\n", "Using these summary statistics, we keep the subset of individuals (samples) that meet our criteria for call rate and heterozygosity." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 subjects will be removed due to low sample call rate or inbreeding coefficient.\n" ] } ], "source": [ "# Setting thresholds\n", "sampcall <- 0.95 # Sample call rate cut-off\n", "hetcutoff <- 0.1 # Inbreeding coefficient cut-off\n", "\n", "# Filter on MAF and heterozygosity\n", "sampleuse <- with(snpsum.row, !is.na(Call.rate) & Call.rate > sampcall & abs(hetF) <= hetcutoff)\n", "sampleuse[is.na(sampleuse)] <- FALSE # remove NA's as well\n", "cat(nrow(genotype)-sum(sampleuse), \"subjects will be removed due to low sample call rate or inbreeding coefficient.\\n\") \n", "\n", "# Subset genotype and clinical data for subjects who pass call rate and heterozygosity crtieria\n", "genotype <- genotype[sampleuse,]\n", "clinical<- clinical[ rownames(genotype), ]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### IBD analysis for relatedness and duplicates:\n", "\n", "- We use the snpgdsIBDMoM(), snpgdsIBDSelection() and ibdcoeff() functions in the SNPRelate package to perform identity-by-descent (IBD) analysis. This package requires that the data be transformed into a GDS format file. \n", "- Methods of moments IBD analysis is performed on only a subset of SNPs that are in linkage equilibrium by iteratively removing adjacent SNPs that exceed an LD threshold in a sliding window using the snpgdsLDpruning() function also in the SNPRelate package." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### IBD analysis for relatedness and duplicates:\n", "\n", "- More information on implementing this approach is provided on the stat-gen.org site:\n", "\n", " http://www.stat-gen.org/tut/tut_preproc.html\n", "\n", "- Relatedness filtering was performed previously and therefore none of the$1401$individuals in the available data are expected to be related or duplicated." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### PCA analysis for ancestry filtering:\n", "\n", "- Recall, population substructure refers to a population consisting of subgroups within which there is random mating but between which there is little mixing or gene flow (population stratification) or to a population in which mating occurs between subgroups with different allelic distributions (admixture). \n", "\n", "- We use the snpgdsPCA() function in the SNPRelate package to perform principal component analysis (PCA) and then plot the first two principal components to identify cluster and outliers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### PCA analysis for ancestry filtering:\n", "\n", "- More information on implementing this approach is provided on the stat-gen.org site:\n", "\n", " http://www.stat-gen.org/tut/tut_preproc.html\n", "\n", "- Our samples are reasonably homogeneous from European ancestry, and therefore none of the$1401$individuals are filtered for ancestry." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### I. Data preprocessing\n", " \n", "#### Step 1: Creation of R objects$\\checkmark$\n", "#### Step 2: SNP level filtering (part 1)$\\checkmark$\n", "#### Step 3: Sample level filtering$\\checkmark$" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Step 4: SNP level filtering (part 2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Step 4: SNP level filtering (part 2)\n", "\n", "Final SNP level filtering is based on Hardy-Weinberg equilibrium" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Hardy-Weinberg equilibrium (HWE): \n", "- HWE denotes independence of alleles at a single site between two homologous chromosomes.\n", "- Recall *Linkage Disequilibrium (LD)* is allelic association across sites on a single homolog.\n", "\n", "- *Example*: \n", "\n", " Consider a bi-allelic SNP with genotypes AA, Aa and aa: \n", " \n", " HWE implies$p_{AA}=p_A^2$,$p_{aa}=p_a^2$and$p_{Aa}=p_Ap_a$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Violations of Hardy-Weinberg equilibrium (HWE) \n", "\n", "Can indicate:\n", "- Presence of population substructure (non-random mating)\n", "- Genotyping errors\n", "\n", "We test for HWE at the SNP level within CAD controls, and remove SNPs with corresponding$p<1 \\times 10^{−6}$." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1296 SNPs will be removed due to high HWE.\n", "A SnpMatrix with 1401 rows and 656890 columns\n", "Row names: 10002 ... 11596 \n", "Col names: rs12565286 ... rs28729663 \n" ] } ], "source": [ "# Setting threshold\n", "hardy <- 10^-6 \n", "\n", "# Filter on HWE\n", "CADcontrols <- clinical[ clinical$CAD==0, 'FamID' ]\n", "snpsum.colCont <- col.summary( genotype[CADcontrols,] )\n", "HWEuse <- with(snpsum.colCont, !is.na(z.HWE) & ( abs(z.HWE) < abs( qnorm(hardy/2) ) ) )\n", "rm(snpsum.colCont)\n", "\n", "HWEuse[is.na(HWEuse)] <- FALSE # Remove NA's as well\n", "cat(ncol(genotype)-sum(HWEuse),\"SNPs will be removed due to high HWE.\\n\") \n", "\n", "# Subset genotype and SNP summary data for SNPs that pass HWE criteria\n", "genotype <- genotype[,HWEuse]\n", "print(genotype) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### I. Data preprocessing\n", " \n", "#### Step 1: Creation of R objects $\\checkmark$\n", "#### Step 2: SNP level filtering (part 1) $\\checkmark$\n", "#### Step 3: Sample level filtering $\\checkmark$\n", "#### Step 4: SNP level filtering (part 2) $\\checkmark$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### II. Data generation\n", " \n", "#### Step 5: Principal component analysis" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Principal component analysis:\n", "\n", "Principal components are re-calculted to be included as covariates in the GWA models. \n", "\n", "These serve to adjust for any remaining substructure that may act as confounders in the SNP-trait association analysis. \n", "\n", "As with Ancestry filtering we calculate PCs using the snpgdsPCA() function in the SNPRelate package, after performing LD pruning once again on the filtered genotype data set. Further details are found on the stat-gen.org site:\n", "\n", "http://www.stat-gen.org/tut/tut_generation.html\n", "\n", "The first 10 principal components are retained for the GWA models." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Principal components read in from data file:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
FamIDpc1pc2pc3pc4pc5pc6pc7pc8pc9pc10
1 1.000200e+04 7.764870e-03 1.448038e-02-6.315881e-04 2.866464e-03-1.883914e-02 9.680646e-03 2.764681e-02-6.645818e-03-2.342975e-02 1.049231e-02
2 1.000400e+04-1.204511e-02-7.231015e-03-3.001290e-03-1.079727e-02-7.770540e-03-4.645751e-03 1.806108e-03-3.087891e-03-1.833242e-03-4.538746e-03
3 1.000500e+04-1.670293e-02-5.347697e-03 1.444984e-02-6.151058e-04 3.451702e-02 3.870855e-02 2.057908e-02-1.226551e-02 3.592690e-03-2.287043e-03
4 1.000700e+04-9.537235e-03 4.556977e-03 2.683566e-03 1.662557e-02-2.363142e-04 5.514627e-03 1.595889e-02 2.797546e-02 2.977718e-02-7.461255e-03
5 1.000800e+04-1.539211e-02-2.446933e-03 2.050879e-02-5.724177e-03-3.969623e-03 5.354244e-03-7.269312e-04 2.701471e-02 1.067216e-02-3.352997e-03
6 1.000900e+04-1.512386e-02-2.353917e-03 2.136045e-02 6.915653e-03 4.006776e-02 2.322248e-02 1.524852e-02 1.329685e-02 2.274635e-02 1.314389e-02
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllllllll}\n", " & FamID & pc1 & pc2 & pc3 & pc4 & pc5 & pc6 & pc7 & pc8 & pc9 & pc10\\\\\n", "\\hline\n", "\t1 & 1.000200e+04 & 7.764870e-03 & 1.448038e-02 & -6.315881e-04 & 2.866464e-03 & -1.883914e-02 & 9.680646e-03 & 2.764681e-02 & -6.645818e-03 & -2.342975e-02 & 1.049231e-02\\\\\n", "\t2 & 1.000400e+04 & -1.204511e-02 & -7.231015e-03 & -3.001290e-03 & -1.079727e-02 & -7.770540e-03 & -4.645751e-03 & 1.806108e-03 & -3.087891e-03 & -1.833242e-03 & -4.538746e-03\\\\\n", "\t3 & 1.000500e+04 & -1.670293e-02 & -5.347697e-03 & 1.444984e-02 & -6.151058e-04 & 3.451702e-02 & 3.870855e-02 & 2.057908e-02 & -1.226551e-02 & 3.592690e-03 & -2.287043e-03\\\\\n", "\t4 & 1.000700e+04 & -9.537235e-03 & 4.556977e-03 & 2.683566e-03 & 1.662557e-02 & -2.363142e-04 & 5.514627e-03 & 1.595889e-02 & 2.797546e-02 & 2.977718e-02 & -7.461255e-03\\\\\n", "\t5 & 1.000800e+04 & -1.539211e-02 & -2.446933e-03 & 2.050879e-02 & -5.724177e-03 & -3.969623e-03 & 5.354244e-03 & -7.269312e-04 & 2.701471e-02 & 1.067216e-02 & -3.352997e-03\\\\\n", "\t6 & 1.000900e+04 & -1.512386e-02 & -2.353917e-03 & 2.136045e-02 & 6.915653e-03 & 4.006776e-02 & 2.322248e-02 & 1.524852e-02 & 1.329685e-02 & 2.274635e-02 & 1.314389e-02\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " FamID pc1 pc2 pc3 pc4 pc5\n", "1 10002 0.007764870 0.014480384 -0.0006315881 0.0028664643 -0.0188391406\n", "2 10004 -0.012045108 -0.007231015 -0.0030012896 -0.0107972693 -0.0077705400\n", "3 10005 -0.016702930 -0.005347697 0.0144498361 -0.0006151058 0.0345170160\n", "4 10007 -0.009537235 0.004556977 0.0026835662 0.0166255657 -0.0002363142\n", "5 10008 -0.015392106 -0.002446933 0.0205087909 -0.0057241772 -0.0039696226\n", "6 10009 -0.015123858 -0.002353917 0.0213604518 0.0069156529 0.0400677558\n", " pc6 pc7 pc8 pc9 pc10\n", "1 0.009680646 0.0276468057 -0.006645818 -0.023429747 0.010492314\n", "2 -0.004645751 0.0018061075 -0.003087891 -0.001833242 -0.004538746\n", "3 0.038708551 0.0205790788 -0.012265508 0.003592690 -0.002287043\n", "4 0.005514627 0.0159588869 0.027975455 0.029777180 -0.007461255\n", "5 0.005354244 -0.0007269312 0.027014714 0.010672162 -0.003352997\n", "6 0.023222478 0.0152485234 0.013296852 0.022746352 0.013143889" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pca10 <- read.csv(\"UseR2016_Tutorial_Files/PCA_out.csv\")\n", "head(pca10)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### II. Data generation\n", " \n", "#### Step 5: Principal component analysis $\\checkmark$\n", "\n", "#### Step 6: SNP Imputation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Step 6: SNP Imputation\n", "\n", "SNP imputation is used to extend analysis to non-typed SNPs, rare variants and SNPs that were removed during filtering. \n", "\n", "Genotype imputation requires reference data from homogeneous sample. Sources for these data include HapMap and 1000 Genomes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Imputation of SNPs:\n", "\n", "The snp.imputation() function in the snpStats package is used to derive 'rules' representing a predictive model for genotypes of untyped SNPs that is based on near-by typed SNPs. The impute.snps() function, also in the snpStats package, is then applied to estimate expected posterior values of the non-typed SNPs. \n", "\n", "An implementation example is provided on the stat-gen.org site:\n", "\n", "http://www.stat-gen.org/tut/tut_generation.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### II. Data generation\n", " \n", "#### Step 5: Principal component analysis $\\checkmark$\n", "#### Step 6: SNP imputation $\\checkmark$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### III. GWAS analysis\n", " \n", "#### Step 7: Association analysis for typed SNPs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Step 7: Association analysis for typed SNPs\n", "\n", "The Genome-Wide Association Analysis (GWAA) function in R:\n", "\n", "- Parallel implementation of generalized linear model fitting on each SNP using the doParallel package.\n", "- Returns \"SNP\", \"Estimate\", \"Std.Error\", \"t-value\" and \"p-value\".\n", "- An additive model of association (trend test) is evaluated; data are manipulated to always count the number of minor alleles.\n", "- Gaussian (linear) or binomial (logistic) families (among others) can be selected." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The GWAA() function is available in the downloaded tutorial folder and can be read into R as follows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "source(\"UseR2016_Tutorial_Files/GWAA.R\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Additional details on specifying the parameters of the GWAA() function are provided at: http://www.stat-gen.org/tut/tut_analysis.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The GWAA() function writes output to a .txt file. An example of this output is in the GWAStutorialout.txt file:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
SNPEstimateStd.Errort.valuep.value
1rs12565286 -0.11006048440612 0.0811307149802178-1.35658220728063 0.1751533944922
2rs3094315 -0.08774214467582560.0458130124264677 -1.91522320905346 0.0556863412136138
3rs2286139 -0.1253968023206270.0533236836341205-2.35161552568339 0.0188456227223414
4rs2980319 -0.0973866846369160.0531366340089406-1.83275976081831 0.0670677893265018
5rs2980300 -0.08068613306415050.0521019732939761 -1.54861952365783 0.121720599080485
6rs11240777 0.003296683641471230.0438024109222104 0.0752626070589099 0.940017405693539
\n" ], "text/latex": [ "\\begin{tabular}{r|lllll}\n", " & SNP & Estimate & Std.Error & t.value & p.value\\\\\n", "\\hline\n", "\t1 & rs12565286 & -0.11006048440612 & 0.0811307149802178 & -1.35658220728063 & 0.1751533944922 \\\\\n", "\t2 & rs3094315 & -0.0877421446758256 & 0.0458130124264677 & -1.91522320905346 & 0.0556863412136138 \\\\\n", "\t3 & rs2286139 & -0.125396802320627 & 0.0533236836341205 & -2.35161552568339 & 0.0188456227223414\\\\\n", "\t4 & rs2980319 & -0.097386684636916 & 0.0531366340089406 & -1.83275976081831 & 0.0670677893265018\\\\\n", "\t5 & rs2980300 & -0.0806861330641505 & 0.0521019732939761 & -1.54861952365783 & 0.121720599080485 \\\\\n", "\t6 & rs11240777 & 0.00329668364147123 & 0.0438024109222104 & 0.0752626070589099 & 0.940017405693539 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ " SNP Estimate Std.Error t.value p.value\n", "1 rs12565286 -0.110060484 0.08113071 -1.35658221 0.17515339\n", "2 rs3094315 -0.087742145 0.04581301 -1.91522321 0.05568634\n", "3 rs2286139 -0.125396802 0.05332368 -2.35161553 0.01884562\n", "4 rs2980319 -0.097386685 0.05313663 -1.83275976 0.06706779\n", "5 rs2980300 -0.080686133 0.05210197 -1.54861952 0.12172060\n", "6 rs11240777 0.003296684 0.04380241 0.07526261 0.94001741" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "GWASout <- read.table(\"UseR2016_Tutorial_Files/GWAStutorialout.txt\", \n", " header=TRUE, colClasses=c(\"character\", rep(\"numeric\",4)))\n", "head(GWASout)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### III. GWAS analysis\n", " \n", "#### Step 7: Association analysis for typed SNPs $\\checkmark$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### IV. Post-analytic interrogation\n", " \n", "#### Step 10: Visualization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The results of a GWA analysis are typically visualized using a Manhattan plot:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
SNPEstimateStd.Errort.valuep.valueNeg_logPchrposition
1rs10000012 -0.02085290094662520.0544108133922203 -0.383249204460648 0.701597949161597 0.153911689124775 4 1357325
2rs1000002 0.03665472523381010.03523749467771281.04021939042658 0.298433967867149 0.52515174682373 3 183635768
3rs1000003 -0.02929796356718280.0509472366476217 -0.57506482186312 0.565347973331718 0.247684160357594 3 98342907
4rs10000033 0.02337480415042060.03552719384432460.657941188736883 0.510693039327769 0.291840061475068 4 139599898
5rs10000036 0.02126911707137570.03757186210596470.56609164090377 0.571434580103119 0.243033482025455 4 139219262
6rs10000037 -0.02557369319108390.0442322654635925 -0.578168287856148 0.563251030147267 0.249298005308766 4 38924330
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllll}\n", " & SNP & Estimate & Std.Error & t.value & p.value & Neg\\_logP & chr & position\\\\\n", "\\hline\n", "\t1 & rs10000012 & -0.0208529009466252 & 0.0544108133922203 & -0.383249204460648 & 0.701597949161597 & 0.153911689124775 & 4 & 1357325 \\\\\n", "\t2 & rs1000002 & 0.0366547252338101 & 0.0352374946777128 & 1.04021939042658 & 0.298433967867149 & 0.52515174682373 & 3 & 183635768 \\\\\n", "\t3 & rs1000003 & -0.0292979635671828 & 0.0509472366476217 & -0.57506482186312 & 0.565347973331718 & 0.247684160357594 & 3 & 98342907 \\\\\n", "\t4 & rs10000033 & 0.0233748041504206 & 0.0355271938443246 & 0.657941188736883 & 0.510693039327769 & 0.291840061475068 & 4 & 139599898 \\\\\n", "\t5 & rs10000036 & 0.0212691170713757 & 0.0375718621059647 & 0.56609164090377 & 0.571434580103119 & 0.243033482025455 & 4 & 139219262 \\\\\n", "\t6 & rs10000037 & -0.0255736931910839 & 0.0442322654635925 & -0.578168287856148 & 0.563251030147267 & 0.249298005308766 & 4 & 38924330 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ " SNP Estimate Std.Error t.value p.value Neg_logP chr\n", "1 rs10000012 -0.02085290 0.05441081 -0.3832492 0.7015979 0.1539117 4\n", "2 rs1000002 0.03665473 0.03523749 1.0402194 0.2984340 0.5251517 3\n", "3 rs1000003 -0.02929796 0.05094724 -0.5750648 0.5653480 0.2476842 3\n", "4 rs10000033 0.02337480 0.03552719 0.6579412 0.5106930 0.2918401 4\n", "5 rs10000036 0.02126912 0.03757186 0.5660916 0.5714346 0.2430335 4\n", "6 rs10000037 -0.02557369 0.04423227 -0.5781683 0.5632510 0.2492980 4\n", " position\n", "1 1357325\n", "2 183635768\n", "3 98342907\n", "4 139599898\n", "5 139219262\n", "6 38924330" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Find the -log_10 of the p-values\n", "GWASout$Neg_logP <- -log10(GWASout$p.value)\n", "\n", "# Merge output with genoBim by SNP name to add position and chromosome number\n", "GWASout <- merge(GWASout, genoBim[,c(\"SNP\", \"chr\", \"position\")])\n", "head(GWASout)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "library(Cairo)\n", "GWASout\$type <- \"typed\"\n", "source(\"UseR2016_Tutorial_Files/GWAS_Manhattan.R\")\n", "GWAS_Manhattan(GWASout)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " " ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.3.0" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 } }, "nbformat": 4, "nbformat_minor": 0 }