openSNP (open genetics data)
We split into two groups:
- One group tried to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set to see whether any variants are significantly overrepresented among openSNP users.
- The other group tried to recreate the main figure of this publication, a principal component analysis (PCA) plot showing that genetic variation clusters well according to geography.
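To make the first group's question concrete, here is a minimal sketch of a per-variant test for overrepresentation: an allelic chi-square test on a 2x2 table of allele counts (openSNP vs. 1000 Genomes). The function name and the counts in the usage note are purely illustrative and are not taken from the project. A real GWAS would also need multiple-testing correction and quality control, and PLINK's `--assoc` performs a comparable case/control test directly.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are the two groups (e.g. openSNP and 1000 Genomes), columns are the
    alternative and reference allele counts. Assumes every row and column
    total is non-zero. Returns (chi2, p_value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of equal allele frequencies.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, `allelic_chi_square(30, 70, 10, 90)` compares a made-up variant seen on 30 of 100 alleles in one group against 10 of 100 in the other; identical frequencies in both groups give a statistic of 0 and a p-value of 1.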
Doing a PCA with openSNP data
This is done in multiple steps with the help of PLINK and SmartPCA:
- Data Selection:
- Use only complete 23andMe data sets → Remove all files that are smaller than 15 MB or that are not from 23andMe
- Convert the 23andMe files into PLINK's binary format: plink --23file input-file F1 ID1 I 1 --out output-file-prefix
- Try to merge those BED files into a single one: plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp
- The merge crashed, because some of the SNPs are tri-allelic (three different variants are measured at the position instead of only two). This is most likely an artefact of the data, and thus we removed those with…
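The selection, conversion and merge steps above can be sketched as a small script that only builds the PLINK command lines; it does not run them. This is a rough illustration and not the team's actual script. It assumes that 23andMe files can be recognised by "23andme" in the file name, that "15 MB" means decimal megabytes, and that each sample gets its own family and individual ID (F1/ID1, F2/ID2, …). All function names and the `sampleN` output prefixes are hypothetical.

```python
from pathlib import Path

# "15 MB" read as decimal megabytes (assumption).
MIN_SIZE_BYTES = 15 * 1000 * 1000


def select_23andme_files(data_dir, min_size=MIN_SIZE_BYTES):
    """Return the 23andMe files in data_dir that are at least min_size bytes."""
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_file()
        and "23andme" in p.name.lower()  # assumed naming convention
        and p.stat().st_size >= min_size
    )


def conversion_command(input_file, index, out_dir):
    """Build the plink --23file call for one sample (hypothetical ID scheme)."""
    prefix = Path(out_dir) / f"sample{index}"
    return ["plink", "--23file", str(input_file),
            f"F{index}", f"ID{index}", "I", "1",
            "--out", str(prefix)]


def write_merge_list(prefixes, list_path):
    """Write one 'bed bim fam' line per fileset for plink --merge-list."""
    lines = [f"{p}.bed {p}.bim {p}.fam" for p in prefixes]
    Path(list_path).write_text("\n".join(lines) + "\n")


def merge_command(first_prefix, list_path, out_prefix="merged_23andme_opensnp"):
    """Build the plink merge call: one base fileset plus the merge list."""
    return ["plink", "--bfile", str(first_prefix),
            "--merge-list", str(list_path),
            "--make-bed", "--out", out_prefix]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once PLINK is installed. The first converted fileset serves as the `--bfile` base, and the remaining ones go into the merge list.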
The resulting PCA plot: https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png
Data
Team
- and other team members
Links
- Repo for doing PCA on openSNP: https://github.com/ciyer/opensnp-fun