openSNP (open genetics data)

We split into two groups:

  1. trying to run a GWAS, comparing the genotypes of openSNP users to the 1000 Genomes data set, to see whether there are significantly overrepresented variants in openSNP users
  2. trying to recreate the main graphics of this publication, a graph of a principal component analysis which shows that genetic variation clusters well according to geography:

Doing a PCA w/ openSNP data

This is done in multiple steps with the help of PLINK and SmartPCA:

Data Selection

  • Use only complete 23andMe data sets → Remove all files < 15 mb in size and that are not 23andMe
  • Convert the 23andMe files into PLINKs binary format plink –23file input-file F1 ID1 I 1 –out output-file-prefix
  • This will give you a directory full of BED/BIM/FAM/LOG files. The .log files can already be useful, e.g. to figure out the chromosomal sex ratio in our sample:
bastian@phix ~/eth_hackdays/plink_binaries  ᐅ cat *.log|grep Inferred|sort|uniq -c
    578 Inferred sex: female.
    845 Inferred sex: male.
  • Try to merge those BED files into a single one: plink -bfile a_single_file –merge-list bed_files_to_merge_list.txt –make-bed –out merged_23andme_opensnp
    • but it crashed, as some of the SNPs are tri-allelic (instead of only having two different variants at the position, 3 are measured). This is most likely an artefact of the data, and thus have to be removed
    • Fortunately the crashing merge-list command gives you a list of broken SNPs: merged-23andme-opensnp-merge.missnp, this again can be used by PLINK to remove those from the single individual files before doing the merge.
    • For this you can use run_plink_remove_snps.py from the GitHub-repo. It remove the ~2k SNPs that do not work as expected.
    • With those new files you can then run the merge step: plink –bfile ../plink_binaries_filtered/2-filtered –merge-list ../opensnp-fun/plink_merge_list_filtered.txt –make-bed –out merged-23andme-opensnp-filtered
    • This should work just fine now, the resulting logfile should give you ± this:
Total genotyping rate is 0.482581.
1626942 variants and 1423 people pass filters and QC.
Note: No phenotypes present
  • Great! Now we can remove SNPs that are too highly correlated with other ones close by, using PLINK once again.
    • For the parameters we use the ones provided in the paper cited above: A SNP-Window Size of 50, an increment of 5 SNPs and a r2 of 0.8:
      • plink –bfile merged-23andme-opensnp-filtered –indep-pairwise 50 5 0.8 –make-bed –out merged-23andme-opensnp-filtered-noLD
    • The resulting out-files can then again be used to remove the listed SNPs using PLINK

PCA

So far for the data processing, now we can go for the PCA itself.

  • For this we can use smartpca, which is part of the EIGENSOFT package.
  • The documentation is kind of lacking for smartpca, but this seems to work:
smartpca.perl -i merged-23andme-opensnp-filtered-noLD.bed -a merged-23andme-opensnp-filtered-noLD.bim -b merged-23andme-opensnp-filtered-noLD.fam -l test.log -o test.pca -p test.plot -e test.eigen


-i / -a / -b are all fed with the corresponding bed/bim/fam files of PLINK. 
-l -o -p -e are the output files for smartpca. The -o gives just the prefix. 
  • With that done you should get a test.log which ends with ##end of smartpca run

Results & Viz: R + ggplot

With that done we can use the test.pca.evec which was generated for plotting with R & ggplot2.

library(ggplot2)
d = read.table("test.pca.evec", as.is = T)
ggplot(d,aes(x=d[,2],y=d[,3],color=d[,4])) + geom_point() + scale_x_continuous("PC1") + scale_y_continuous("PC2") + scale_color_continuous("PC3")

This should yield the upper of the two plots linked below. In an even more quick & dirty approach (and totally undocumented) I associated the genotyping files with the phenotypes given by the openSNP users, here the eye color.

TO-DO

  • The plots here were done on all SNPs, not the subsampled ones (as I didn't notice that it wasn't done automatically…). Should re-run it on the less autocorrelated set
  • Right now the scripts for generating all the data are messy at best and can't be run automatically, should be fixed
  • Include the 1000Genomes data, not done so far.
  • Go wild!

Data

Team