Differences
This shows you the differences between two versions of the page.
project:opensnp [2015/06/06 13:44] – [Data] gedankenstuecke | project:opensnp [2015/06/06 18:31] (current) – [Team] gedankenstuecke | ||
---|---|---|---|
Line 7: | Line 7: | ||
==== Doing a PCA w/ openSNP data ==== | ==== Doing a PCA w/ openSNP data ==== | ||
+ | [[https:// | ||
+ | |||
This is done in multiple steps with the help of PLINK and SmartPCA: | This is done in multiple steps with the help of PLINK and SmartPCA: | ||
- | * Data Selection: | + | ===Data Selection=== |
* Use only complete 23andMe data sets: remove all files that are smaller than 15 MB or that are not 23andMe files | * Use only complete 23andMe data sets: remove all files that are smaller than 15 MB or that are not 23andMe files | ||
* Convert the 23andMe files into PLINK's binary format: //plink --23file input-file F1 ID1 I 1 --out output-file-prefix// | * Convert the 23andMe files into PLINK's binary format: //plink --23file input-file F1 ID1 I 1 --out output-file-prefix// | ||
+ | * This will give you a directory full of BED/ | ||
+ | < | ||
+ | bastian@phix ~/ | ||
+ | 578 Inferred sex: female. | ||
+ | 845 Inferred sex: male. | ||
+ | </ | ||
* Try to merge those BED files into a single one: //plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp// | * Try to merge those BED files into a single one: //plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp// | ||
- | * but it crashed, as some of the SNPs are tri-allelic (three variants are measured at the position instead of only two). This is most likely an artefact of the data, and thus we removed those with… | + | * This merge crashes, because some of the SNPs are tri-allelic (three variants are measured at the position instead of only two). This is most likely an artefact of the data, so those SNPs have to be removed |
+ | * Fortunately the crashing merge-list command gives you a list of broken SNPs: // | ||
+ | * For this you can use // | ||
+ | * With those new files you can then run the merge step: //plink --bfile ../ | ||
+ | * This should work just fine now; the resulting log file should show roughly this: | ||
- | In the end: https://dl.dropboxusercontent.com/u/ | + | < |
+ | Total genotyping rate is 0.482581. | ||
+ | 1626942 variants and 1423 people pass filters and QC. | ||
+ | Note: No phenotypes present | ||
+ | </code> | ||
+ | * Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again. | ||
+ | * For the parameters we use the ones provided in the paper cited above: a window size of 50 SNPs, a step of 5 SNPs and an r² threshold of 0.8: | ||
+ | * //plink --bfile merged-23andme-opensnp-filtered --indep-pairwise 50 5 0.8 --make-bed --out merged-23andme-opensnp-filtered-noLD// | ||
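+ | * The resulting //.prune.in// and //.prune.out// lists can then be used in another PLINK run to actually remove the listed SNPs (the //--make-bed// output of the step above still contains all SNPs) | ||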
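The file-handling parts of the steps above can be sketched in Python. This is a minimal sketch, not the scripts actually used here: the function names, the `23andme` file-name marker, the reading of "15 MB" as 15 × 1024 × 1024 bytes and the output prefixes are all assumptions made for illustration. The only PLINK options relied on are `--bfile`, `--indep-pairwise`, `--extract`, `--make-bed` and `--out`, plus the one-prefix-per-line form of a `--merge-list` file.

```python
from pathlib import Path

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes.
MIN_SIZE_BYTES = 15 * 1024 * 1024


def select_complete_23andme(raw_dir, marker="23andme", min_size=MIN_SIZE_BYTES):
    """Return raw genotype files that look like complete 23andMe exports.

    Assumption: 23andMe files can be recognised by `marker` in the file name.
    """
    return sorted(
        p for p in Path(raw_dir).iterdir()
        if p.is_file()
        and marker in p.name.lower()
        and p.stat().st_size >= min_size
    )


def write_merge_list(bed_dir, list_path):
    """Write one fileset prefix per line for every complete BED/BIM/FAM trio."""
    prefixes = []
    for bed in sorted(Path(bed_dir).glob("*.bed")):
        # Skip filesets where the conversion left an incomplete trio behind.
        if bed.with_suffix(".bim").exists() and bed.with_suffix(".fam").exists():
            prefixes.append(str(bed.with_suffix("")))
    Path(list_path).write_text("".join(f"{p}\n" for p in prefixes))
    return prefixes


def prune_commands(in_prefix, list_prefix, out_prefix, window=50, step=5, r2=0.8):
    """Build two PLINK calls: compute the LD prune lists, then apply them."""
    find = ["plink", "--bfile", in_prefix,
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", list_prefix]
    # --extract keeps only the SNPs listed in <list_prefix>.prune.in
    apply = ["plink", "--bfile", in_prefix,
             "--extract", f"{list_prefix}.prune.in",
             "--make-bed", "--out", out_prefix]
    return find, apply
```

The two argument lists returned by `prune_commands` could be handed to `subprocess.run` one after the other. Keeping the list computation and the filtering as separate calls makes it explicit that `--indep-pairwise` on its own only writes the lists.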
+ | ===PCA=== | ||
+ | So much for the data processing; now we can run the PCA itself. | ||
+ | * For this we can use // | ||
+ | * The documentation is kind of lacking for smartpca, but this seems to work: | ||
+ | < | ||
+ | smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen | ||
+ | |||
+ | |||
+ | -i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. | ||
+ | -l -o -p -e are the output files for smartpca. The -o gives just the prefix. | ||
+ | </ | ||
+ | * With that done you should get a // | ||
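To look at smartpca's eigenvector output outside of R, a small parser can help. The sketch below assumes the usual EIGENSOFT `.evec` layout: a first line starting with `#eigvals:` followed by the eigenvalues, then one line per sample with the sample ID, one coordinate per component and a trailing population label. That layout and the function name `parse_evec` are assumptions for illustration, so check them against the file you actually get.

```python
def parse_evec(text):
    """Parse EIGENSOFT-style .evec text into (eigenvalues, rows).

    Each row is (sample_id, [coordinates...], label).
    """
    eigenvalues, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0].startswith("#"):
            # Header line: "#eigvals:" followed by one eigenvalue per component.
            eigenvalues = [float(x) for x in fields[1:]]
            continue
        # First field is the sample ID, last field the population label.
        sample_id, *coords, label = fields
        rows.append((sample_id, [float(c) for c in coords], label))
    return eigenvalues, rows
```

Called as `parse_evec(open(path).read())`, this gives the first two coordinates of every sample as `rows[i][1][:2]`, which is what a PC1 vs. PC2 scatter plot needs.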
+ | |||
+ | ===Results & Viz: R + ggplot === | ||
+ | With that done we can use the // | ||
+ | |||
+ | < | ||
+ | library(ggplot2) | ||
+ | d = read.table(" | ||
+ | ggplot(d, | ||
+ | </ | ||
+ | |||
+ | This should yield the upper of the two plots linked below. In an even quicker and dirtier (and totally undocumented) approach, I associated the genotyping files with the phenotypes provided by the openSNP users, here the eye color. | ||
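Since the phenotype step is undocumented, here is one way it could look; this is a hypothetical reconstruction, not the approach that produced the plots. It assumes the phenotypes are available as CSV text with a `sample_id` column that matches the IDs in the PCA output and an `eye_color` column. Both column names and both function names are made up for illustration.

```python
import csv
import io


def attach_phenotype(pca_rows, phenotype_csv, column="eye_color", missing="NA"):
    """Add a phenotype label to (sample_id, pc1, pc2) rows.

    Samples without an entry, or with an empty entry, get `missing`.
    """
    reader = csv.DictReader(io.StringIO(phenotype_csv))
    lookup = {
        row["sample_id"]: ((row.get(column) or "").strip() or missing)
        for row in reader
    }
    return [(sid, pc1, pc2, lookup.get(sid, missing)) for sid, pc1, pc2 in pca_rows]


def to_tsv(labelled_rows, column="eye_color"):
    """Serialise labelled rows as a tab-separated table with a header line."""
    lines = [f"sample_id\tPC1\tPC2\t{column}"]
    lines += [f"{sid}\t{pc1}\t{pc2}\t{label}" for sid, pc1, pc2, label in labelled_rows]
    return "\n".join(lines) + "\n"
```

The resulting table can be read in R with `read.table(..., header = TRUE, sep = "\t")` and the phenotype column mapped to the point color in ggplot.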
+ | |||
+ | * Example Results | ||
+ | * https:// | ||
+ | * https:// | ||
+ | |||
+ | === TO-DO === | ||
+ | * The plots here were made with all SNPs, not the LD-pruned subset (I didn't notice that the pruning isn't applied automatically…). The PCA should be re-run on the less correlated set | ||
+ | * Right now the scripts for generating all the data are messy at best and can't be run automatically, | ||
+ | * Include the 1000Genomes data, not done so far. | ||
+ | * Go wild! | ||
===== Data ===== | ===== Data ===== | ||
Line 25: | Line 78: | ||
* [[https:// | * [[https:// | ||
* [[https:// | * [[https:// | ||
+ | * [[https:// | ||
* and other team members | * and other team members | ||
===== Links ===== | ===== Links ===== | ||
* Repo for doing PCA on openSNP: https:// | * Repo for doing PCA on openSNP: https:// | ||
- | + | | |
- | {{tag> | + | * [[http:// |
+ | * [[https:// | ||
+ | |||
+ | {{tag> |