project:jung_rilke_correspondance_network

Joint project bringing together three separate projects: the Rilke correspondence, the Jung correspondence, and the ETH Library.

  • List and link your actual and ideal data sources.

ACTUAL

Comment: The Rilke data is cleaner than the Jung data. Some cleaning is needed to make them match: 1) separate sender and receiver, then clean up and cluster the names (OpenRefine); 2) clean up the dates and put them in the format the IT developers need (Perl); 3) clean up the placenames and match them to geolocators (Dariah-DE); 4) match senders and receivers to Wikidata where possible (OpenRefine; problems with file volume). A sketch of steps 1 and 2 follows below.
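The team is doing these steps with OpenRefine and Perl; purely as an illustration, here is a minimal Python sketch of steps 1 and 2. The file name, the column names (correspondents, date, place), and the separator patterns are assumptions, not the actual layout of the Jung data.

```python
import csv
import re

# Hypothetical input: a 'correspondents' column holding "Sender an Empfänger"
# or "Sender to Recipient"; the real separators in the Jung data may differ.
SEPARATORS = re.compile(r"\s+(?:an|to|->)\s+", re.IGNORECASE)

def split_correspondents(value):
    """Split a combined sender/receiver cell into (sender, receiver)."""
    parts = SEPARATORS.split(value, maxsplit=1)
    return (parts[0].strip(), parts[1].strip()) if len(parts) == 2 else (value.strip(), "")

def normalize_date(value):
    """Coerce day.month.year spellings (e.g. 17.9.1912) to ISO 8601."""
    m = re.match(r"(\d{1,2})\.(\d{1,2})\.(\d{4})", value)
    if m:
        day, month, year = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return value  # leave anything unrecognized for manual review

with open("jung_letters.csv", newline="", encoding="utf-8") as src, \
     open("jung_letters_clean.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["sender", "receiver", "date", "place"])
    writer.writeheader()
    for row in reader:
        sender, receiver = split_correspondents(row["correspondents"])
        writer.writerow({"sender": sender, "receiver": receiver,
                         "date": normalize_date(row["date"]), "place": row["place"]})
```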

IDEAL

DATA after cleaning:

https://github.com/basimar/hackathon17_jungrilke

  • Description of steps and issues in Process (please correct and refine).

Objective: provide a framework for correspondence, defining a database that can be used not just by these two projects but by others as well, and that works well with visualization software in order to see correspondence networks.
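The field list was still being defined at the time of writing. Purely to illustrate the kind of shared schema such a framework might use, here is a sketch in Python/SQLite; every file, table, and column name is hypothetical.

```python
import sqlite3

# Hypothetical schema for a shared correspondence database; the project's
# actual field definitions were still under discussion.
conn = sqlite3.connect("correspondence.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS person (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    wikidata_qid TEXT            -- Wikidata Q code, e.g. 'Q42'; NULL if unmatched
);
CREATE TABLE IF NOT EXISTS letter (
    id          INTEGER PRIMARY KEY,
    sender_id   INTEGER REFERENCES person(id),
    receiver_id INTEGER REFERENCES person(id),
    sent_date   TEXT,             -- ISO 8601, e.g. '1912-09-17'
    place       TEXT,
    lat         REAL,
    lon         REAL,
    source      TEXT              -- 'jung' or 'rilke'
);
""")
conn.commit()
```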

The main issue with the Jung correspondence is data quality. Sender and recipient are in one column, and data cleaning is still needed. The dates also need both cleaning for consistency and transformation to meet the developers' specs (Basil is doing this in Perl). For geolocation, Dariah-DE was tried first, but it did not seem able to handle the large file, so we switched to OpenStreetMap.
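A minimal sketch of geocoding placenames against OpenStreetMap's public Nominatim service; this is an assumed approach rather than the team's exact setup, and the file and column names are hypothetical. The one-request-per-second pause and the User-Agent header follow Nominatim's usage policy.

```python
import csv
import json
import time
import urllib.parse
import urllib.request

NOMINATIM = "https://nominatim.openstreetmap.org/search?format=json&limit=1&q="

def geocode(place):
    """Return (lat, lon) for a placename via Nominatim, or (None, None)."""
    url = NOMINATIM + urllib.parse.quote(place)
    req = urllib.request.Request(url, headers={"User-Agent": "jung-rilke-hackathon"})
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)
    if results:
        return results[0]["lat"], results[0]["lon"]
    return None, None

with open("jung_letters_clean.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        lat, lon = geocode(row["place"])
        print(row["place"], lat, lon)
        time.sleep(1)  # Nominatim asks for at most one request per second
```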

For matching senders and recipients to Wikidata Q codes, OpenRefine was used. Issues were encountered with the large second file, with recovering Q codes after successful matching, and with the need for scholarly expertise to identify people who lack clear identification.
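OpenRefine does this matching interactively through its reconciliation service. As a rough programmatic analogue, candidate Q codes can also be fetched with the standard Wikidata API action wbsearchentities; as noted above, a human still has to pick the correct candidate. A minimal sketch:

```python
import json
import urllib.parse
import urllib.request

API = ("https://www.wikidata.org/w/api.php?action=wbsearchentities"
       "&language=en&format=json&limit=5&search=")

def candidate_qids(name):
    """Return up to five (Q code, label, description) candidates for a name."""
    req = urllib.request.Request(API + urllib.parse.quote(name),
                                 headers={"User-Agent": "jung-rilke-hackathon"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in data.get("search", [])]

# A scholar still has to decide which candidate is the right person.
for qid, label, desc in candidate_qids("Rainer Maria Rilke"):
    print(qid, label, desc)
```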

Issues with the target database: the fields are defined, and SQL databases and visualization programs are being evaluated. How, and whether, to integrate with Wikidata is still not clear.

Issues: individual letters are too detailed to be Wikidata items, although the senders and recipients appear to have the notability and networks to make matching them worthwhile. We are trying to keep options open.

While the IT team builds the database to be used with the visualization tool, the data is being cleaned and Q codes are being extracted. The team took the cleaned CSV files, converted them to SQL, then to JSON, roughly as sketched below.
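A minimal sketch of that CSV → SQL → JSON chain, assuming SQLite and hypothetical file, table, and column names; the team's actual scripts and database may differ.

```python
import csv
import json
import sqlite3

conn = sqlite3.connect("letters.db")
conn.execute("""CREATE TABLE IF NOT EXISTS letters
                (sender TEXT, receiver TEXT, sent_date TEXT, place TEXT)""")

# CSV -> SQL: load the cleaned letters into the database.
with open("jung_letters_clean.csv", newline="", encoding="utf-8") as f:
    rows = [(r["sender"], r["receiver"], r["date"], r["place"])
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO letters VALUES (?, ?, ?, ?)", rows)
conn.commit()

# SQL -> JSON: dump the table for the visualization tool.
letters = [{"sender": s, "receiver": r, "date": d, "place": p}
           for s, r, d, p in conn.execute("SELECT * FROM letters")]
with open("letters.json", "w", encoding="utf-8") as f:
    json.dump(letters, f, ensure_ascii=False, indent=2)
```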

Doing all of this at once poses some project management challenges, since several people may be working on the same files to clean different data. All the files need to be integrated at the end.

Additional issues encountered: - The Wikidata Q codes that OpenRefine linked to seem to have disappeared. Instructions on reconciliation are here: https://github.com/OpenRefine/OpenRefine/wiki/reconciliation (after matching, the Q code can be pulled into its own column with the GREL expression cell.recon.match.id).

- The second file, with over 16,000 lines, appears to be too big for OpenRefine to match with Q codes. Proposed solution: split it into several files (see the sketch below). We are also attempting to solve this by increasing the RAM allotted to OpenRefine in its ini file (the -Xmx setting).
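A minimal splitter sketch in Python; the chunk size and file names are arbitrary assumptions.

```python
import csv

CHUNK = 4000  # rows per slice; small enough for OpenRefine to reconcile

with open("jung_letters_clean.csv", newline="", encoding="utf-8") as src:
    reader = csv.reader(src)
    header = next(reader)
    part, writer, out = 0, None, None
    for i, row in enumerate(reader):
        if i % CHUNK == 0:          # start a new slice, repeating the header
            if out:
                out.close()
            part += 1
            out = open(f"jung_letters_part{part}.csv", "w",
                       newline="", encoding="utf-8")
            writer = csv.writer(out)
            writer.writerow(header)
        writer.writerow(row)
    if out:
        out.close()
```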

- Visualization: three tools are being tested: 1) Palladio (Stanford), where there are concerns about limits on large files; 2) Viseyes; and 3) Gephi.
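Gephi can import a plain edge list from a CSV with Source, Target, and Weight columns. A sketch that aggregates the cleaned letters into such a weighted edge list (file and column names assumed):

```python
import csv
from collections import Counter

# Count letters per (sender, receiver) pair to weight the network edges.
edges = Counter()
with open("jung_letters_clean.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row["sender"] and row["receiver"]:
            edges[(row["sender"], row["receiver"])] += 1

# Gephi's spreadsheet importer understands Source/Target/Weight columns.
with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for (sender, receiver), weight in edges.items():
        writer.writerow([sender, receiver, weight])
```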

- Ensuring that the files from the different projects respect the same structure in their final, cleaned-up versions.

- A scholar is needed to decide which of OpenRefine's proposed Q codes is correct; this requires specialist knowledge.

Please add yourself to the list

Flor Méchain (Wikimedia CH): working on cleaning and matching with Wikidata Q codes using OpenRefine.

Lena Heizman (Dodis / histHub): Mentoring with OpenRefine.

Hugo Martin

Samantha Weiss

Michael Gasser (Archives, ETH Library): provider of the dataset C. G. Jung correspondence

Irina Schubert

Sylvie Béguelin

Basil Marti

Jérome Zbinden

Deborah Kyburz

Paul Varé

Laurel Zuckerman

Christiane Sibille (Dodis / histHub)

Adrien Zemma

Dominik Sievi
