We split into two groups:

  1. Running a GWAS that compares the genotypes of openSNP users to the 1000 Genomes data set, to see whether any variants are significantly overrepresented in openSNP users.
  2. Recreating the main graphic of this publication: a principal component analysis plot showing that genetic variation clusters well according to geography.

This is done in multiple steps with the help of PLINK and SmartPCA:

  • Data Selection:
    • Use only complete 23andMe data sets → remove all files that are smaller than 15 MB or that are not 23andMe files
    • Convert the 23andMe files into PLINK's binary format: plink --23file input-file F1 ID1 I 1 --out output-file-prefix
    • Try to merge those BED files into a single one: plink --bfile a_single_file --merge-list bed_files_to_merge_list.txt --make-bed --out merged_23andme_opensnp
      • The merge crashed because some of the SNPs are tri-allelic (three different variants were measured at the position instead of only two). This is most likely an artefact of the data, so we removed those with…
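
The file selection and the PLINK calls above could be scripted roughly as follows. This is only a sketch under a few assumptions, not the script we actually ran: "15 MB" is read as 15,000,000 bytes, a file counts as 23andMe if its leading "#" comment lines mention "23andMe", and each sample gets its own family and individual ID (the fid and iid parameters are hypothetical; the notes only quote the template values F1 and ID1). The helper names are made up for illustration, and the functions only build the command lines, they do not run PLINK.

```python
from pathlib import Path
from typing import List

# Assumption: "15 MB" is interpreted as 15,000,000 bytes.
MIN_SIZE_BYTES = 15 * 1000 * 1000


def looks_like_23andme(path: Path) -> bool:
    """Heuristic (assumption): the leading '#' comment lines mention 23andMe."""
    with path.open("r", errors="ignore") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # header is over, stop reading
            if "23andme" in line.lower():
                return True
    return False


def select_files(directory: Path, min_size: int = MIN_SIZE_BYTES) -> List[Path]:
    """Keep only files that are large enough and look like 23andMe exports."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.stat().st_size >= min_size and looks_like_23andme(p)
    )


def convert_command(input_file: Path, fid: str, iid: str, prefix: str) -> List[str]:
    """Build the --23file call; the trailing 'I' and '1' are copied from the notes."""
    return ["plink", "--23file", str(input_file), fid, iid, "I", "1", "--out", prefix]


def write_merge_list(prefixes: List[str], merge_list: Path) -> None:
    """Write one binary fileset prefix per line, as --merge-list expects."""
    merge_list.write_text("".join(f"{prefix}\n" for prefix in prefixes))


def merge_command(first_prefix: str, merge_list: Path,
                  out_prefix: str = "merged_23andme_opensnp") -> List[str]:
    """Build the merge call: one fileset via --bfile, the rest via --merge-list."""
    return ["plink", "--bfile", first_prefix, "--merge-list", str(merge_list),
            "--make-bed", "--out", out_prefix]
```

Each returned list can be passed to subprocess.run(cmd, check=True) if plink is on the PATH. The merge list should contain every prefix except the one given to --bfile.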

In the end: https://dl.dropboxusercontent.com/u/170329/pca-opensnp.png

  • Last modified: 2015/06/06 13:44
  • by gedankenstuecke