openSNP (open genetics data)
We split into two groups:
- trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users
- trying to recreate the main graphics of this publication, a graph of a principal component analysis which shows that genetic variation clusters well according to geography:
Doing a PCA w/ openSNP data
Data Selection
- Use only complete 23andMe data sets → Remove all files < 15 mb in size and that are not 23andMe
- Convert the 23andMe files into PLINKs binary format plink –23file input-file F1 ID1 I 1 –out output-file-prefix
- This will give you a directory full of BED/BIM/FAM/LOG files. The .log files can already be useful, e.g. to figure out the chromosomal sex ratio in our sample:
bastian@phix ~/eth_hackdays/plink_binaries ᐅ cat *.log|grep Inferred|sort|uniq -c 578 Inferred sex: female. 845 Inferred sex: male.
- Try to merge those BED files into a single one: plink -bfile a_single_file –merge-list bed_files_to_merge_list.txt –make-bed –out merged_23andme_opensnp
- but it crashed, as some of the SNPs are tri-allelic (instead of only having two different variants at the position, 3 are measured). This is most likely an artefact of the data, and thus have to be removed
- Fortunately the crashing merge-list command gives you a list of broken SNPs: merged-23andme-opensnp-merge.missnp, this again can be used by PLINK to remove those from the single individual files before doing the merge.
- For this you can use run_plink_remove_snps.py from the GitHub-repo. It remove the ~2k SNPs that do not work as expected.
- With those new files you can then run the merge step: plink –bfile ../plink_binaries_filtered/2-filtered –merge-list ../opensnp-fun/plink_merge_list_filtered.txt –make-bed –out merged-23andme-opensnp-filtered
- This should work just fine now, the resulting logfile should give you ± this:
Total genotyping rate is 0.482581. 1626942 variants and 1423 people pass filters and QC. Note: No phenotypes present
- Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again.
- For the parameters we use the ones provided in the paper cited above: A SNP-Window Size of 50, an increment of 5 SNPs and a r2 of 0.8:
- plink –bfile merged-23andme-opensnp-filtered –indep-pairwise 50 5 0.8 –make-bed –out merged-23andme-opensnp-filtered-noLD
- The resulting out-files can then again be used to remove the listed SNPs using PLINK
PCA
So far for the data processing, now we can go for the PCA itself.
- For this we can use smartpca, which is part of the EIGENSOFT package.
- The documentation is kind of lacking for smartpca, but this seems to work:
smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen -i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. -l -o -p -e are the output files for smartpca. The -o gives just the prefix.
- With that done you should get a test.log which ends with ##end of smartpca run
Results & Viz: R + ggplot
With that done we can use the test.pca.evec which was generated for plotting with R & ggplot2.
library(ggplot2) d = read.table("test.pca.evec", as.is = T) ggplot(d,aes(x=d[,2],y=d[,3],color=d[,4])) + geom_point() + scale_x_continuous("PC1") + scale_y_continuous("PC2") + scale_color_continuous("PC3")
This should yield the upper of the two plots linked below. In an even more quick & dirty approach (and totally undocumented) I associated the genotyping files with the phenotypes given by the openSNP users, here the eye color.
- Example Results
TO-DO
- The plots here were done on all SNPs, not the subsampled ones (as I didn't notice that it wasn't done automatically…). Should re-run it on the less autocorrelated set
- Right now the scripts for generating all the data are messy at best and can't be run automatically, should be fixed
- Include the 1000Genomes data, not done so far.
- Go wild!
Data
Team
- and other team members
Links
- Repo for doing PCA on openSNP: https://github.com/ciyer/opensnp-fun