Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revisionBoth sides next revision
project:opensnp [2015/06/06 13:44] – [Data] gedankenstueckeproject:opensnp [2015/06/06 18:29] – [Doing a PCA w/ openSNP data] gedankenstuecke
Line 7: Line 7:
  
 ==== Doing a PCA w/ openSNP data ==== ==== Doing a PCA w/ openSNP data ====
 +[[https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png|{{https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png?350}}]]
 +
 This is done in multiple steps with the help of PLINK and SmartPCA: This is done in multiple steps with the help of PLINK and SmartPCA:
  
-  * Data Selection:+===Data Selection===
     * Use only complete 23andMe data sets -> Remove all files < 15 mb in size and that are not 23andMe     * Use only complete 23andMe data sets -> Remove all files < 15 mb in size and that are not 23andMe
     * Convert the 23andMe files into PLINKs binary format //plink --23file input-file F1 ID1 I 1 --out output-file-prefix//     * Convert the 23andMe files into PLINKs binary format //plink --23file input-file F1 ID1 I 1 --out output-file-prefix//
 +    * This will give you a directory full of BED/BIM/FAM/LOG files. The .log files can already be useful, e.g. to figure out the chromosomal sex ratio in our sample: 
 +<code>
 +bastian@phix ~/eth_hackdays/plink_binaries  ᐅ cat *.log|grep Inferred|sort|uniq -c
 +    578 Inferred sex: female.
 +    845 Inferred sex: male.
 +</code>
     * Try to merge those BED files into a single one: //plink -bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp//     * Try to merge those BED files into a single one: //plink -bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp//
-      * but it crashed, as some of the SNPs are tri-allelic (instead of only having two different variants at the position, 3 are measured). This is most likely an artefact of the data, and thus we removed those with…+      * but it crashed, as some of the SNPs are tri-allelic (instead of only having two different variants at the position, 3 are measured). This is most likely an artefact of the data, and thus have to be removed 
 +      * Fortunately the crashing merge-list command gives you a list of broken SNPs: //merged-23andme-opensnp-merge.missnp//, this again can be used by PLINK to remove those from the single individual files before doing the merge. 
 +      * For this you can use //run_plink_remove_snps.py// from the GitHub-repo. It remove the ~2k SNPs that do not work as expected.  
 +      * With those new files you can then run the merge step: //plink --bfile ../plink_binaries_filtered/2-filtered --merge-list ../opensnp-fun/plink_merge_list_filtered.txt --make-bed --out merged-23andme-opensnp-filtered// 
 +      * This should work just fine now, the resulting logfile should give you ± this: 
  
-In the end: https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png+<code> 
 +Total genotyping rate is 0.482581. 
 +1626942 variants and 1423 people pass filters and QC. 
 +Note: No phenotypes present 
 +</code>
  
 +    * Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again. 
 +      * For the parameters we use the ones provided in the paper cited above: A SNP-Window Size of 50, an increment of 5 SNPs and a r2 of 0.8: 
 +        * //plink --bfile merged-23andme-opensnp-filtered --indep-pairwise 50 5 0.8 --make-bed --out merged-23andme-opensnp-filtered-noLD//
 +      * The resulting out-files can then again be used to remove the listed SNPs using PLINK
 +===PCA===
 +So far for the data processing, now we can go for the PCA itself.  
 +    * For this we can use //smartpca//, which is part of the [[https://github.com/chrchang/eigensoft/|EIGENSOFT]] package. 
 +    * The documentation is kind of lacking for smartpca, but this seems to work:
 +<code>
 +smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen
 +
 +
 +-i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. 
 +-l -o -p -e are the output files for smartpca. The -o gives just the prefix. 
 +</code>
 +    * With that done you should get a //test.log// which ends with //##end of smartpca run//
 +
 +===Results & Viz: R + ggplot ===
 +With that done we can use the //test.pca.evec// which was generated for plotting with R & ggplot2. 
 +
 +<code>
 +library(ggplot2)
 +d = read.table("test.pca.evec", as.is = T)
 +ggplot(d,aes(x=d[,2],y=d[,3],color=d[,4])) + geom_point() + scale_x_continuous("PC1") + scale_y_continuous("PC2") + scale_color_continuous("PC3")
 +</code>
 +
 +This should yield the upper of the two plots linked below. In an even more quick & dirty approach (and totally undocumented) I associated the genotyping files with the phenotypes given by the openSNP users, here the eye color. 
 +
 +  * Example Results
 +      * https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png
 +      * https://dl.dropboxusercontent.com/u/170329/pca-opensnp-eyecolor.png
 +
 +=== TO-DO ===
 +  * The plots here were done on all SNPs, not the subsampled ones (as I didn't notice that it wasn't done automatically…). Should re-run it on the less autocorrelated set
 +  * Right now the scripts for generating all the data are messy at best and can't be run automatically, should be fixed
 +  * Include the 1000Genomes data, not done so far.
 +  * Go wild!
 ===== Data ===== ===== Data =====
  
Line 29: Line 82:
  
   * Repo for doing PCA on openSNP: https://github.com/ciyer/opensnp-fun   * Repo for doing PCA on openSNP: https://github.com/ciyer/opensnp-fun
-   +  * [[https://github.com/gedankenstuecke/opensnp-fun|@gedankenstuecke's fork of ciyer's repo]] 
-{{tag>status:concept needs:dev needs:design needs:data needs:expert}}+  * [[http://www.nature.com/nature/journal/v456/n7218/fig_tab/nature07331_F1.html|The figure we tried to reproduce]] 
 +  * [[https://www.cog-genomics.org/plink2/strat|PLINK manual]] 
 + 
 +{{tag>status:demo needs:dev needs:design needs:data needs:expert}}
  • project/opensnp.txt
  • Last modified: 2015/06/06 18:31
  • by gedankenstuecke