openSNP (open genetics data)

We split into two groups:

  1. trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users
  2. trying to recreate the main figure of this publication: a principal component analysis plot showing that genetic variation clusters well according to geography.

Doing a PCA w/ openSNP data

This is done in multiple steps with the help of PLINK and SmartPCA:

Data Selection

bastian@phix ~/eth_hackdays/plink_binaries  ᐅ cat *.log|grep Inferred|sort|uniq -c
    578 Inferred sex: female.
    845 Inferred sex: male.
Total genotyping rate is 0.482581.
1626942 variants and 1423 people pass filters and QC.
Note: No phenotypes present
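
The exact PLINK calls behind this data set are not recorded here. As a rough sketch only, a filter plus LD-pruning pass with PLINK 1.9 could look like the following. The thresholds (--geno 0.1, --maf 0.05, --indep-pairwise 50 5 0.2), the function names and the file prefixes are assumptions for illustration, not the values used for the run above. Setting DRY_RUN=1 only prints the commands.

```shell
# Sketch of a PLINK 1.9 filter + LD-pruning pass.
# All thresholds and prefixes are assumptions, not the values used above.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# filter_and_prune IN_PREFIX OUT_PREFIX (hypothetical helper)
filter_and_prune() {
  in_prefix=$1
  out_prefix=$2
  # Drop variants with more than 10% missing calls and variants with MAF < 5%.
  run plink --bfile "$in_prefix" --geno 0.1 --maf 0.05 \
      --make-bed --out "$out_prefix-filtered"
  # List variants in approximate linkage equilibrium
  # (window of 50 variants, step 5, r^2 threshold 0.2).
  run plink --bfile "$out_prefix-filtered" --indep-pairwise 50 5 0.2 \
      --out "$out_prefix-prune"
  # Keep only the variants that survived pruning.
  run plink --bfile "$out_prefix-filtered" \
      --extract "$out_prefix-prune.prune.in" \
      --make-bed --out "$out_prefix-filtered-noLD"
}

# Example: DRY_RUN=1; filter_and_prune merged-23andme-opensnp merged-23andme-opensnp
```

The last step writes the .bed/.bim/.fam triple that a PCA run would then read.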

PCA

So much for the data processing; now we can run the PCA itself.

smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen


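-i, -a and -b take the corresponding PLINK .bed, .bim and .fam files.
-l, -o, -p and -e name the output files of smartpca: the log, the principal components, the plot prefix and the eigenvalues. The -o name also acts as a prefix: the eigenvectors are written to that name plus .evec (here test.pca.evec).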

Results & Viz: R + ggplot

With that done, we can plot the generated test.pca.evec with R and ggplot2.

library(ggplot2)
# read.table skips the "#eigvals:" line as a comment; V1 = sample ID, V2..V4 = PC1..PC3
d <- read.table("test.pca.evec", as.is = TRUE)
ggplot(d, aes(x = V2, y = V3, color = V4)) + geom_point() +
  scale_x_continuous("PC1") + scale_y_continuous("PC2") + scale_color_continuous("PC3")
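
If the .evec file should go into other tools as well, it may help to turn it into a CSV with a header first. The following is a minimal sketch, assuming the usual layout of a "#eigvals:" line followed by one row per sample with the sample ID first, the PCs in between and a label in the last column. The function name evec_to_csv is made up for this example.

```shell
# evec_to_csv FILE: print "id,PC1,...,PCk,label" as CSV on stdout.
# Assumes: first column = sample ID, last column = label, PCs in between.
evec_to_csv() {
  awk '
    NF == 0   { next }                 # skip empty lines
    $1 ~ /^#/ { next }                 # skip the "#eigvals:" line
    !header_done {
      n_pc = NF - 2                    # columns between the ID and the label
      line = "id"
      for (i = 1; i <= n_pc; i++) line = line ",PC" i
      print line ",label"
      header_done = 1
    }
    {
      line = $1                        # rebuild the row with commas
      for (i = 2; i <= NF; i++) line = line "," $i
      print line
    }
  ' "$1"
}

# Example: evec_to_csv test.pca.evec > test.pca.csv
```

The number of PC columns is taken from the first data row, so the sketch does not depend on how many components smartpca was asked for.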

This should yield the upper of the two plots linked below. In an even quicker and dirtier (and totally undocumented) approach, I associated the genotyping files with the phenotypes reported by the openSNP users, here eye color.
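
Since the phenotype step was never written down, the following is only a guess at how it could be done, not the procedure that produced the plot. It assumes a hypothetical tab-separated file (here eye_color.tsv) with the sample ID in the first column and the eye color in the second, using the same IDs as the first column of the .evec file. The function name label_evec is also made up.

```shell
# label_evec PHENO_TSV EVEC: replace the last column of every evec row with
# the phenotype of that sample ("NA" if the sample has none).
# Assumes the IDs in PHENO_TSV match column 1 of the evec file.
label_evec() {
  awk '
    NR == FNR {                        # first file: the phenotype table
      split($0, f, "\t")
      gsub(/ /, "_", f[2])             # keep the label a single token
      pheno[f[1]] = f[2]
      next
    }
    $1 ~ /^#/ || NF == 0 { print; next }   # pass the "#eigvals:" line through
    {
      $NF = ($1 in pheno) ? pheno[$1] : "NA"
      print
    }
  ' "$1" "$2"
}

# Example: label_evec eye_color.tsv test.pca.evec > test.pca.eyecolor.evec
```

The output keeps the .evec layout, so it can be read with read.table as before and the last column can be mapped to the point color.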

TO-DO

Data

Team