Differences
This shows you the differences between two versions of the page.
| project:opensnp [2015/06/06 13:43] – [Doing a PCA w/ openSNP data] gedankenstuecke | project:opensnp [2015/06/06 18:31] (current) – [Team] gedankenstuecke | ||
|---|---|---|---|
| Line 7: | Line 7: | ||
| ==== Doing a PCA w/ openSNP data ==== | ==== Doing a PCA w/ openSNP data ==== | ||
| + | [[https:// | ||
| + | |||
| This is done in multiple steps with the help of PLINK and SmartPCA: | This is done in multiple steps with the help of PLINK and SmartPCA: | ||
| - | * Data Selection: | + | ===Data Selection=== |
| * Use only complete 23andMe data sets: remove all files that are smaller than 15 MB or that are not 23andMe files | * Use only complete 23andMe data sets: remove all files that are smaller than 15 MB or that are not 23andMe files | ||
| * Convert the 23andMe files into PLINK's binary format: //plink --23file input-file F1 ID1 I 1 --out output-file-prefix// | * Convert the 23andMe files into PLINK's binary format: //plink --23file input-file F1 ID1 I 1 --out output-file-prefix// | ||
| + | * This will give you a directory full of BED/ | ||
| + | < | ||
| + | bastian@phix ~/ | ||
| + | 578 Inferred sex: female. | ||
| + | 845 Inferred sex: male. | ||
| + | </ | ||
| * Try to merge those BED files into a single one: //plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp// | * Try to merge those BED files into a single one: //plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp// | ||
| - | * but it crashed, as some of the SNPs are tri-allelic (instead of only having two different variants at the position, 3 are measured). This is most likely an artefact of the data, and thus we removed those with… | + | * but it crashed, as some of the SNPs are tri-allelic (three variants are measured at the position instead of only two). This is most likely an artefact of the data, so these SNPs have to be removed |
| + | * Fortunately the crashing merge-list command gives you a list of broken SNPs: // | ||
| + | * For this you can use // | ||
| + | * With those new files you can then run the merge step: //plink --bfile ../ | ||
| + | * This should work just fine now; the resulting logfile should contain roughly this: | ||
| - | In the end: https://dl.dropboxusercontent.com/u/ | + | <code> |
| + | Total genotyping rate is 0.482581. | ||
| + | 1626942 variants and 1423 people pass filters and QC. | ||
| + | Note: No phenotypes present | ||
| + | </code> | ||
| - | ===== Data ===== | + | * Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again. |
| + | * For the parameters we use the ones provided in the paper cited above: a SNP window size of 50, a step size of 5 SNPs, and an r² threshold of 0.8: | ||
| + | * //plink --bfile merged-23andme-opensnp-filtered --indep-pairwise 50 5 0.8 --make-bed --out merged-23andme-opensnp-filtered-noLD// | ||
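| + | * PLINK does not apply the pruning in this run; it writes the SNP lists to //.prune.in// and //.prune.out// files, which can then be used in another PLINK run to remove the listed SNPs (for example with //--exclude//) | ||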
| + | ===PCA=== | ||
| + | So much for the data processing; now we can run the PCA itself. | ||
| + | * For this we can use // | ||
| + | * The documentation is kind of lacking for smartpca, but this seems to work: | ||
| + | < | ||
| + | smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen | ||
| - | * List and link your actual and ideal data sources. | ||
| + | -i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. | ||
| + | -l -o -p -e are the output files for smartpca. The -o gives just the prefix. | ||
| + | </ | ||
| + | * With that done you should get a // | ||
| + | |||
| + | ===Results & Viz: R + ggplot === | ||
| + | With that done we can use the // | ||
| + | |||
| + | < | ||
| + | library(ggplot2) | ||
| + | d = read.table(" | ||
| + | ggplot(d, | ||
| + | </ | ||
| + | |||
| + | This should yield the upper of the two plots linked below. In an even quicker and dirtier (and totally undocumented) approach, I associated the genotyping files with the phenotypes given by the openSNP users, in this case eye color. | ||
| + | |||
| + | * Example Results | ||
| + | * https:// | ||
| + | * https:// | ||
| + | |||
| + | === TO-DO === | ||
| + | * The plots here were done on all SNPs, not the subsampled ones (as I didn't notice that it wasn't done automatically…). The PCA should be re-run on the less autocorrelated set. | ||
| + | * Right now the scripts for generating all the data are messy at best and can't be run automatically, | ||
| + | * Include the 1000 Genomes data; not done so far. | ||
| + | * Go wild! | ||
| + | ===== Data ===== | ||
| + | |||
| + | * [[https:// | ||
| + | * [[1000genomes.org|1000 Genomes]] | ||
| ===== Team ===== | ===== Team ===== | ||
| * [[https:// | * [[https:// | ||
| * [[https:// | * [[https:// | ||
| + | * [[https:// | ||
| * and other team members | * and other team members | ||
| ===== Links ===== | ===== Links ===== | ||
| * Repo for doing PCA on openSNP: https:// | * Repo for doing PCA on openSNP: https:// | ||
| - | + | | |
| - | {{tag> | + | * [[http:// |
| + | * [[https:// | ||
| + | |||
| + | {{tag> | ||