project:diplomatic_documents_and_swiss_newspapers_in_1914

Differences

This shows you the differences between two versions of the page.

project:diplomatic_documents_and_swiss_newspapers_in_1914 [2015/02/28 16:23] – [Documentation] attereb   project:diplomatic_documents_and_swiss_newspapers_in_1914 [2015/03/01 15:46] (current) – [Team] giovanni
Line 12: Line 12:
   * [[https://www.dropbox.com/s/pyuuuv2ms7ushua/data.zip?dl=0|Data.zip (CC BY 4.0)]] (DropBox)   * [[https://www.dropbox.com/s/pyuuuv2ms7ushua/data.zip?dl=0|Data.zip (CC BY 4.0)]] (DropBox)
   * [[https://github.com/aoloe/glamhack-dodis-gdl-jdg|github]]   * [[https://github.com/aoloe/glamhack-dodis-gdl-jdg|github]]
- 
 ===== Documentation ===== ===== Documentation =====
  
Line 19: Line 18:
 In this context, we first cleaned the data, for example by removing short strings of characters and stopwords; this cleansing step is necessary to reduce noise in the corpus. We then prepared and compared TF-IDF vectors of words and LSI topics, and represented each article in the corpus as such. Finally, we indexed the whole corpus of Le Temps to prepare it for similarity queries. The last step was to build an interface to search the corpus by entering a text (e.g., a Dodis summary). In this context, we first cleaned the data, for example by removing short strings of characters and stopwords; this cleansing step is necessary to reduce noise in the corpus. We then prepared and compared TF-IDF vectors of words and LSI topics, and represented each article in the corpus as such. Finally, we indexed the whole corpus of Le Temps to prepare it for similarity queries. The last step was to build an interface to search the corpus by entering a text (e.g., a Dodis summary).
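The cleaning, vectorising, indexing and querying steps described above can be illustrated with a minimal sketch. It is not the project's actual code (which is in the linked github repository): it uses only the Python standard library, it covers the TF-IDF part only and leaves out LSI, and the stopword list, the minimum token length and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "et", "un", "une", "en"}
# Assumed threshold for what counts as a "short string of characters".
MIN_TOKEN_LENGTH = 3


def tokenize(text):
    """Lowercase the text, keep runs of letters, drop short tokens and stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) >= MIN_TOKEN_LENGTH and t not in STOPWORDS]


def to_vector(tokens, idf):
    """Build a length-normalised TF-IDF vector (a dict) from a list of tokens."""
    counts = Counter(t for t in tokens if t in idf)
    vector = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in vector.values()))
    return {term: weight / norm for term, weight in vector.items()} if norm else {}


def build_index(documents):
    """Compute IDF weights for the corpus and one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(documents) / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def query(text, idf, vectors):
    """Rank documents by cosine similarity to the query text, best match first."""
    q = to_vector(tokenize(text), idf)
    scores = [
        (i, sum(weight * vec.get(term, 0.0) for term, weight in q.items()))
        for i, vec in enumerate(vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented example sentences, not taken from Le Temps or Dodis.
    articles = [
        "Le Conseil fédéral décide la mobilisation générale de l'armée",
        "Le marché du fromage attire les visiteurs",
        "La neutralité suisse et la mobilisation de l'armée en août",
    ]
    idf, vectors = build_index(articles)
    for index, score in query("mobilisation armée suisse", idf, vectors):
        print(index, round(score, 3))
```

Because every vector is normalised to unit length, the dot product in `query` is the cosine similarity, so a long article is not favoured over a short one simply for containing more words.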
  
-Difficulties were not only technical. For example, the data are massive: we first worked on fifteen days of data, then on three months. Moreover, some Dodis documents were classified (i.e. non-public) at the time, so some of the decisions do not appear in the newspaper articles. We also used the [[http://textometrie.ens-lyon.fr/?lang=en|TXM software]], a platform for lexicometry and statistical text analysis, to explore both text corpora (the Dodis metadata and the newspapers) and to detect the frequencies of significant words and their presence in both corpora +Difficulties were not only technical. For example, the data are massive: we first worked on fifteen days of data, then on three months. Moreover, some Dodis documents were classified (i.e. non-public) at the time, so some of the decisions do not appear in the newspaper articles. We also used the [[http://textometrie.ens-lyon.fr/?lang=en|TXM software]], a platform for lexicometry and statistical text analysis, to explore both text corpora (the Dodis metadata and the newspapers) and to detect the frequencies of significant words and their presence in both corpora.  
 + 
 +===== Dodis Map ===== 
 + 
 +{{:project:project:dodis_map.png?400|}} 
 + 
 + 
 +  * [[http://t.preus.se/dodis-map/#/|demo]] 
 +  * [[https://github.com/tpreusse/dodis-map/tree/master|github]] 
 + 
 ===== Team ===== ===== Team =====
  
Line 27: Line 36:
   * [[http://t.preus.se/|Thomas Preusse]]   * [[http://t.preus.se/|Thomas Preusse]]
   * [[http://ideale.ch/|Ale Rimoldi]]   * [[http://ideale.ch/|Ale Rimoldi]]
-  * [[user:yrochat|Yannick Rochat]] +  * [[http://people.epfl.ch/yannick.rochat?lang=fr|Yannick Rochat]]
  
 +{{tag> glam}}
  • project/diplomatic_documents_and_swiss_newspapers_in_1914.1425137013.txt.gz
  • Last modified: 2015/02/28 16:23
  • by attereb