project:diplomatic_documents_and_swiss_newspapers_in_1914

Differences

This shows you the differences between two versions of the page.

project:diplomatic_documents_and_swiss_newspapers_in_1914 [2015/02/28 14:47] – [Diplomatic Documents and Swiss Newspapers in 1914] giovanni
project:diplomatic_documents_and_swiss_newspapers_in_1914 [2015/03/01 15:46] (current) – [Team] giovanni
Line 6: Line 6:
   - The text similarity search of the corpora. We train two models on the Le Temps corpus: Term Frequency Inverse Document Frequency (TF-IDF) and Latent Semantic Indexing (LSI) with 25 topics. We then develop a web interface for text similarity search over the corpus and test it with Dodis summaries and full-text documents.
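
As a rough illustration of the similarity search described in this item, here is a minimal sketch of TF-IDF weighting with cosine-similarity ranking, using only the Python standard library. It is not the project's code (that lives in the linked github repository): the function names, the smoothed IDF formula, and the three toy "articles" are all assumptions made up for the example. The LSI model with 25 topics is not shown; it would add a truncated SVD on top of TF-IDF vectors like these.

```python
import math
from collections import Counter

# Toy stand-ins for newspaper articles (invented for illustration only).
ARTICLES = [
    "the federal council orders general mobilisation of the army",
    "grain prices rise on the geneva market",
    "the council discusses neutrality and the mobilisation of troops",
]


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def to_vector(tokens, idf):
    """Build an L2-normalised sparse TF-IDF vector; unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def build_tfidf_index(documents):
    """Return (idf, vectors) for a list of raw document strings."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumed variant): never zero, even for ubiquitous terms.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [to_vector(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    """Dot product of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return math.fsum(v * b.get(t, 0.0) for t, v in a.items())


def search(query, idf, vectors, top_n=3):
    """Rank indexed documents by cosine similarity to the query text."""
    q = to_vector(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(s, 3)) for s, i in scored[:top_n]]


idf, vectors = build_tfidf_index(ARTICLES)
for doc_id, score in search("mobilisation of the swiss army", idf, vectors):
    print(doc_id, score, ARTICLES[doc_id])
```

In the workflow described above, the query would be a Dodis summary or full-text document instead of a short phrase, and the index would cover the whole newspaper corpus.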
  
-{{:project:jdg_dodis_hackathon.png?400|}}
+{{:project:jdg_dodis_hackathon.png?400|}} {{:project:2015-02-28_15.08.18.jpg?400|}}
  
 ===== Data and source code =====
Line 12: Line 12:
   * [[https://www.dropbox.com/s/pyuuuv2ms7ushua/data.zip?dl=0|Data.zip (CC BY 4.0)]] (DropBox)
   * [[https://github.com/aoloe/glamhack-dodis-gdl-jdg|github]]
- 
 ===== Documentation =====
  
-In this project, we want to connect newspaper articles from Journal de Genève (a Genevan daily newspaper) and a sample of the Diplomatic Documents in Switzerland database (Dodis). The goal is to query the Dodis descriptive metadata and look for what appears in the press within a given time interval, by comparing occurrences across both data sets. Thus, we should be able to examine if and how the written press reflected what was happening at the diplomatic level. The time interval for this project is the summer of 1914.
+In this project, we want to connect newspaper articles from Journal de Genève (a Genevan daily newspaper) and the Gazette de Lausanne to a sample of the Diplomatic Documents in Switzerland database (Dodis). The goal is to query the Dodis descriptive metadata and look for what appears in the press within a given time interval, by comparing occurrences across both data sets. Thus, we should be able to examine if and how the written press reflected what was happening at the diplomatic level. The time interval for this project is the summer of 1914.
  
-In this context, at first we cleaned the data, for example by removing noise caused by short strings of characters and stopwords. The cleansing is a huge work, the pre-processing step of journal indexing. We compared small sizes vectors of words and paired words in the index with their occurrences. Semantic groups were thus recomposed (thesaurus).
+In this context, we first cleaned the data, for example by removing noise caused by short strings of characters and stopwords. This cleansing is a necessary step to reduce noise in the corpus. We then prepared TF-IDF vectors of words and LSI topics and represented each article in the corpus as such. Next, we indexed the whole Le Temps corpus to prepare it for similarity queries. The last step was to build an interface to search the corpus by entering a text (e.g., a Dodis summary).
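
The cleaning step mentioned in this paragraph (dropping short strings and stopwords) could look roughly like the sketch below. It is only an illustration under stated assumptions: the tiny French stopword list, the minimum length of three characters, and the `clean` function name are hypothetical and are not taken from the project's repository.

```python
import re

# Hypothetical, deliberately tiny French stopword list (a real one is much longer).
STOPWORDS = frozenset({"le", "la", "les", "de", "des", "du", "et", "un", "une", "en", "au", "aux"})

# Assumed threshold: tokens shorter than this are treated as OCR noise.
MIN_LENGTH = 3


def clean(text, stopwords=STOPWORDS, min_length=MIN_LENGTH):
    """Lowercase, keep runs of letters only, then drop short tokens and stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


print(clean("Le Conseil fédéral et la mobilisation de l'armée, 1914"))
```

The regular expression keeps accented letters but discards digits and punctuation, so the elided article in "l'armée" becomes a one-letter token that the length filter removes.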
 + 
 +Difficulties were not only technical. For example, the data are massive: we first worked on fifteen days of data, then on three months. Moreover, some Dodis documents were classified (i.e. non-public) at the time, so some of the decisions do not appear in the newspaper articles. We also used the [[http://textometrie.ens-lyon.fr/?lang=en|TXM software]], a platform for lexicometry and statistical text analysis, to explore both text corpora (the Dodis metadata and the newspapers) and to detect the frequencies of significant words and their presence in both corpora.
 + 
 +===== Dodis Map ===== 
 + 
 +{{:project:project:dodis_map.png?400|}} 
 + 
 + 
 +  * [[http://t.preus.se/dodis-map/#/|demo]] 
 +  * [[https://github.com/tpreusse/dodis-map/tree/master|github]]
  
-Difficulties were not only technical. For example, the data are massive : we started doing this work on fifteen days, then on three months. Moreover, some documents were classified at the time, therefore some of the decisions don't appear in the newspapers articles. 
  
 ===== Team =====
Line 28: Line 36:
   * [[http://t.preus.se/|Thomas Preusse]]
   * [[http://ideale.ch/|Ale Rimoldi]]
-  * [[user:yrochat|Yannick Rochat]]
+  * [[http://people.epfl.ch/yannick.rochat?lang=fr|Yannick Rochat]]
- +
-{{:project:20150228_115443.jpg?300|}} +
- +
-===== Links ===== +
  
-  * Blog or forum posts ...
+{{tag> glam}}
-  * Tools you used ... +
-   +
-{{tag>status:concept needs:dev needs:design needs:data needs:expert}}+
  • project/diplomatic_documents_and_swiss_newspapers_in_1914.txt
  • Last modified: 2015/03/01 15:46
  • by giovanni