Differences
This shows you the differences between two versions of the page.
project:opensnp [2015/06/06 13:19] – [openSNP (open genetics data)] gedankenstuecke | project:opensnp [2015/06/06 18:31] (current) – [Team] gedankenstuecke | ||
---|---|---|---|
Line 3: | Line 3: | ||
- trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users | - trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users | ||
- | - trying to recreate the main graphics of this publication, | + | - trying to recreate |
==== Doing a PCA w/ openSNP data ==== | ==== Doing a PCA w/ openSNP data ==== | ||
- | * Decided to only use data from 23andMe | + | [[https:// |
- | * | + | |
- | ===== Data ===== | + | |
- | * List and link your actual and ideal data sources. | + | This is done in multiple steps with the help of PLINK and SmartPCA: |
+ | ===Data Selection=== | ||
+ | * Use only complete 23andMe data sets: remove all files that are smaller than 15 MB or that are not from 23andMe | ||
+ | * Convert the 23andMe files into PLINK's binary format: //plink --23file input-file F1 ID1 I 1 --out output-file-prefix// | ||
+ | * This will give you a directory full of BED/ | ||
+ | < | ||
+ | bastian@phix ~/ | ||
+ | 578 Inferred sex: female. | ||
+ | 845 Inferred sex: male. | ||
+ | </ | ||
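The selection and conversion steps above could be sketched in shell roughly as follows. This is a minimal sketch, not the script that was actually used: the function names, the `*23andme*` file name pattern and the idea of tallying `Inferred sex` lines from the PLINK logs are assumptions. Only the `plink --23file ... I 1 --out ...` call mirrors the command given above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the "*23andme*" file name pattern are assumptions.

# List 23andMe files in directory $1 that are at least $2 bytes large.
select_complete_23andme() {
    find "$1" -maxdepth 1 -type f -name '*23andme*' -size "+$(( $2 - 1 ))c" | sort
}

# Convert one raw file ($1) into a PLINK binary fileset with prefix $2,
# using family ID $3 and individual ID $4.
# The trailing "I 1" arguments are copied from the command shown above.
convert_23andme() {
    plink --23file "$1" "$3" "$4" I 1 --out "$2"
}

# Tally the "Inferred sex" lines over all PLINK logs in directory $1.
count_inferred_sex() {
    grep -h 'Inferred sex' "$1"/*.log | sort | uniq -c
}

# Hypothetical usage (directory names are placeholders):
#   select_complete_23andme raw 15000000 | while read -r f; do
#       id=$(basename "$f" .txt)
#       convert_23andme "$f" "bed/$id" "$id" "$id"
#   done
#   count_inferred_sex bed
```

Keeping the size filter in its own function makes it easy to check which files would be dropped before any conversion is started.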
+ | * Try to merge those BED files into a single one: //plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp// | ||
+ | * This first attempt crashes, because some of the SNPs are tri-allelic (three different variants are measured at the position instead of only two). This is most likely an artefact of the data, so these SNPs have to be removed | ||
+ | * Fortunately the crashing merge-list command gives you a list of broken SNPs: // | ||
+ | * For this you can use // | ||
+ | * With those new files you can then run the merge step: //plink --bfile ../ | ||
+ | * This should work fine now; the resulting log file should contain roughly this: | ||
+ | |||
+ | < | ||
+ | Total genotyping rate is 0.482581. | ||
+ | 1626942 variants and 1423 people pass filters and QC. | ||
+ | Note: No phenotypes present | ||
+ | </ | ||
+ | |||
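The merge preparation described above could look roughly like this. It is a sketch under assumptions: the helper names and directory layout are made up, and the name of the file listing the broken SNPs is not given here, so it is passed in as a parameter. `--merge-list`, `--exclude`, `--bfile` and `--make-bed` are regular PLINK flags.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and directory layout are assumptions.

# Print one fileset prefix per line for every .bed file in directory $1.
# PLINK's --merge-list reads a line with a single token as the prefix
# of a binary (bed/bim/fam) fileset.
write_merge_list() {
    local bed
    for bed in "$1"/*.bed; do
        [ -e "$bed" ] || continue   # directory contains no .bed files
        printf '%s\n' "${bed%.bed}"
    done
}

# Drop the SNPs listed in file $2 from fileset $1 and write the result to prefix $3.
exclude_snps() {
    plink --bfile "$1" --exclude "$2" --make-bed --out "$3"
}

# Hypothetical usage (file and directory names are placeholders):
#   write_merge_list bed > bed_files_to_merge_list.txt
#   for p in $(write_merge_list bed); do
#       exclude_snps "$p" broken_snps.txt "filtered/$(basename "$p")"
#   done
```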
+ | * Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again. | ||
+ | * For the parameters we use the ones provided in the paper cited above: a window size of 50 SNPs, a step size of 5 SNPs and an r² threshold of 0.8: | ||
+ | * //plink --bfile merged-23andme-opensnp-filtered --indep-pairwise 50 5 0.8 --make-bed --out merged-23andme-opensnp-filtered-noLD// | ||
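+ | * The resulting output files list the SNPs to keep (//.prune.in//) and the SNPs to remove (//.prune.out//); they can then be used in a further PLINK run to actually drop the correlated SNPs | ||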
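Since `--indep-pairwise` only writes the pruning lists and does not filter the data by itself, the pruning can be sketched as two PLINK calls. This is a minimal sketch, not the original workflow: the function name and the `PLINK` environment variable (which allows a dry run with `PLINK=echo`) are assumptions, while the parameters 50, 5 and 0.8 are the ones named above.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the PLINK override variable are assumptions.

# $1 = input fileset prefix, $2 = output fileset prefix
ld_prune() {
    local src="$1" dst="$2"
    # Step 1: compute the pruning lists ($dst.prune.in and $dst.prune.out).
    "${PLINK:-plink}" --bfile "$src" --indep-pairwise 50 5 0.8 --out "$dst" &&
    # Step 2: keep only the SNPs listed in .prune.in and write a new binary fileset.
    "${PLINK:-plink}" --bfile "$src" --extract "$dst.prune.in" --make-bed --out "$dst"
}

# Hypothetical usage:
#   ld_prune merged-23andme-opensnp-filtered merged-23andme-opensnp-filtered-noLD
# Dry run that only prints the arguments of both calls:
#   PLINK=echo ld_prune merged-23andme-opensnp-filtered merged-23andme-opensnp-filtered-noLD
```

Using `--extract` with the `.prune.in` list is equivalent to using `--exclude` with the `.prune.out` list; either way the second call is what actually reduces the SNP set.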
+ | ===PCA=== | ||
+ | So much for the data processing; now we can run the PCA itself. | ||
+ | * For this we can use // | ||
+ | * The documentation is kind of lacking for smartpca, but this seems to work: | ||
+ | < | ||
+ | smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen | ||
+ | |||
+ | |||
+ | -i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. | ||
+ | -l -o -p -e are the output files for smartpca. The -o gives just the prefix. | ||
+ | </ | ||
+ | * With that done you should get a // | ||
+ | |||
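To get the first two principal components into a plain table for plotting, something like the following could be used. This is a sketch that assumes the usual layout of a smartpca eigenvector file: a header line starting with `#eigvals:`, followed by one line per sample with the sample ID, the component values and a population label. The function name and the output column names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes an eigenvector file with a "#eigvals:" header line and
# one "sampleID PC1 PC2 ... label" line per sample.

# Print a tab-separated table "id PC1 PC2" for the eigenvector file $1.
evec_to_tsv() {
    awk 'BEGIN { OFS = "\t"; print "id", "PC1", "PC2" }
         $1 !~ /^#/ && NF >= 3 { print $1, $2, $3 }' "$1"
}

# Hypothetical usage (file names are placeholders):
#   evec_to_tsv eigenvector-file > pca.tsv
```

A table like this can be loaded with `read.table(..., header = TRUE)` in R.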
+ | ===Results & Viz: R + ggplot === | ||
+ | With that done we can use the // | ||
+ | |||
+ | < | ||
+ | library(ggplot2) | ||
+ | d = read.table(" | ||
+ | ggplot(d, | ||
+ | </ | ||
+ | |||
+ | This should yield the upper of the two plots linked below. In an even quicker and dirtier (and totally undocumented) step, I associated the genotyping files with the phenotypes given by the openSNP users, here the eye color. | ||
+ | |||
+ | * Example Results | ||
+ | * https:// | ||
+ | * https:// | ||
+ | |||
+ | === TO-DO === | ||
+ | * The plots here were done on all SNPs, not on the LD-pruned subset (I didn't notice that the pruning wasn't applied automatically…). The PCA should be re-run on the less autocorrelated set. | ||
+ | * Right now the scripts for generating all the data are messy at best and can't be run automatically, | ||
+ | * Include the 1000 Genomes data (not done so far). | ||
+ | * Go wild! | ||
+ | ===== Data ===== | ||
+ | |||
+ | * [[https:// | ||
+ | * [[1000genomes.org|1000 Genomes]] | ||
===== Team ===== | ===== Team ===== | ||
* [[https:// | * [[https:// | ||
* [[https:// | * [[https:// | ||
+ | * [[https:// | ||
* and other team members | * and other team members | ||
===== Links ===== | ===== Links ===== | ||
* Repo for doing PCA on openSNP: https:// | * Repo for doing PCA on openSNP: https:// | ||
- | + | | |
- | {{tag> | + | * [[http:// |
+ | * [[https:// | ||
+ | |||
+ | {{tag> |