Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
project:opensnp [2015/06/06 13:19] – [openSNP (open genetics data)] gedankenstueckeproject:opensnp [2015/06/06 18:31] (current) – [Team] gedankenstuecke
Line 3: Line 3:
  
   - trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users   - trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users
-  - trying to recreate the main graphics of this publication, a graph of a principal component analysis which shows that genetic variation clusters well according to geography:   http://www.nature.com/nature/journal/v456/n7218/fig_tab/nature07331_F1.html+  - trying to recreate [[http://www.nature.com/nature/journal/v456/n7218/fig_tab/nature07331_F1.html|the main graphics of this publication]], a graph of a principal component analysis which shows that genetic variation clusters well according to geography:   
  
  
 ==== Doing a PCA w/ openSNP data ==== ==== Doing a PCA w/ openSNP data ====
-  * Decided to only use data from 23andMe +[[https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png|{{https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png?350}}]]
-  *  +
-===== Data =====+
  
-  * List and link your actual and ideal data sources.+This is done in multiple steps with the help of PLINK and SmartPCA:
  
 +===Data Selection===
 +    * Use only complete 23andMe data sets -> Remove all files < 15 mb in size and that are not 23andMe
 +    * Convert the 23andMe files into PLINKs binary format //plink --23file input-file F1 ID1 I 1 --out output-file-prefix//
 +    * This will give you a directory full of BED/BIM/FAM/LOG files. The .log files can already be useful, e.g. to figure out the chromosomal sex ratio in our sample: 
 +<code>
 +bastian@phix ~/eth_hackdays/plink_binaries  ᐅ cat *.log|grep Inferred|sort|uniq -c
 +    578 Inferred sex: female.
 +    845 Inferred sex: male.
 +</code>
 +    * Try to merge those BED files into a single one: //plink -bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp//
 +      * but it crashed, as some of the SNPs are tri-allelic (instead of only having two different variants at the position, 3 are measured). This is most likely an artefact of the data, and thus have to be removed
 +      * Fortunately the crashing merge-list command gives you a list of broken SNPs: //merged-23andme-opensnp-merge.missnp//, this again can be used by PLINK to remove those from the single individual files before doing the merge.
 +      * For this you can use //run_plink_remove_snps.py// from the GitHub-repo. It remove the ~2k SNPs that do not work as expected. 
 +      * With those new files you can then run the merge step: //plink --bfile ../plink_binaries_filtered/2-filtered --merge-list ../opensnp-fun/plink_merge_list_filtered.txt --make-bed --out merged-23andme-opensnp-filtered//
 +      * This should work just fine now, the resulting logfile should give you ± this: 
 +
 +<code>
 +Total genotyping rate is 0.482581.
 +1626942 variants and 1423 people pass filters and QC.
 +Note: No phenotypes present
 +</code>
 +
 +    * Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again. 
 +      * For the parameters we use the ones provided in the paper cited above: A SNP-Window Size of 50, an increment of 5 SNPs and a r2 of 0.8: 
 +        * //plink --bfile merged-23andme-opensnp-filtered --indep-pairwise 50 5 0.8 --make-bed --out merged-23andme-opensnp-filtered-noLD//
 +      * The resulting out-files can then again be used to remove the listed SNPs using PLINK
 +===PCA===
 +So far for the data processing, now we can go for the PCA itself.  
 +    * For this we can use //smartpca//, which is part of the [[https://github.com/chrchang/eigensoft/|EIGENSOFT]] package. 
 +    * The documentation is kind of lacking for smartpca, but this seems to work:
 +<code>
 +smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen
 +
 +
 +-i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. 
 +-l -o -p -e are the output files for smartpca. The -o gives just the prefix. 
 +</code>
 +    * With that done you should get a //test.log// which ends with //##end of smartpca run//
 +
 +===Results & Viz: R + ggplot ===
 +With that done we can use the //test.pca.evec// which was generated for plotting with R & ggplot2. 
 +
 +<code>
 +library(ggplot2)
 +d = read.table("test.pca.evec", as.is = T)
 +ggplot(d,aes(x=d[,2],y=d[,3],color=d[,4])) + geom_point() + scale_x_continuous("PC1") + scale_y_continuous("PC2") + scale_color_continuous("PC3")
 +</code>
 +
 +This should yield the upper of the two plots linked below. In an even more quick & dirty approach (and totally undocumented) I associated the genotyping files with the phenotypes given by the openSNP users, here the eye color. 
 +
 +  * Example Results
 +      * https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png
 +      * https://dl.dropboxusercontent.com/u/170329/pca-opensnp-eyecolor.png
 +
 +=== TO-DO ===
 +  * The plots here were done on all SNPs, not the subsampled ones (as I didn't notice that it wasn't done automatically…). Should re-run it on the less autocorrelated set
 +  * Right now the scripts for generating all the data are messy at best and can't be run automatically, should be fixed
 +  * Include the 1000Genomes data, not done so far.
 +  * Go wild!
 +===== Data =====
 +
 +  * [[https://opensnp.org|openSNP]]
 +  * [[1000genomes.org|1000 Genomes]]
 ===== Team ===== ===== Team =====
  
   * [[https://twitter.com/gedankenstuecke|@gedankenstuecke]]   * [[https://twitter.com/gedankenstuecke|@gedankenstuecke]]
   * [[https://twitter.com/ciyer|@ciyer]]   * [[https://twitter.com/ciyer|@ciyer]]
 +  * [[https://twitter.com/podehaye|@podehaye]]
   * and other team members   * and other team members
 ===== Links ===== ===== Links =====
  
   * Repo for doing PCA on openSNP: https://github.com/ciyer/opensnp-fun   * Repo for doing PCA on openSNP: https://github.com/ciyer/opensnp-fun
-   +  * [[https://github.com/gedankenstuecke/opensnp-fun|@gedankenstuecke's fork of ciyer's repo]] 
-{{tag>status:concept needs:dev needs:design needs:data needs:expert}}+  * [[http://www.nature.com/nature/journal/v456/n7218/fig_tab/nature07331_F1.html|The figure we tried to reproduce]] 
 +  * [[https://www.cog-genomics.org/plink2/strat|PLINK manual]] 
 + 
 +{{tag>status:demo needs:dev needs:design needs:data needs:expert}}
  • project/opensnp.1433589589.txt.gz
  • Last modified: 2015/06/06 13:19
  • by gedankenstuecke