project:big_data_analytics

Differences

project:big_data_analytics [2017/09/15 17:06] – [Big Data Analytics (bibliographical data)] j.martinelli
project:big_data_analytics [2017/09/18 10:12] (current) – Correct a few typos andrea
Line 3: Line 3:
 We try to analyse bibliographical data using big data technology (flink, elasticsearch, metafacture).
  
-- more info will follow...
+Here is a first sketch of what we're aiming at:
  
-here a first sketch of what we're aiming at:
+{{:project:plakat_v01.jpg?direct&600|}}
  
-{{:project:plakat_v01.jpg?direct&800|}}
+===== Datasets =====
  
-===== Data =====
+We use bibliographical metadata:
  
-We use biographical metadata: 
+**Swissbib bibliographical data** [[https://www.swissbib.ch/]] 
-  crossref [[https://www.crossref.org/]] 
+  * Catalog of all the Swiss University Libraries, the Swiss National Library, etc. 
-  * edoc [[http://edoc.unibas.ch/]] 
+  * 960 libraries / 23 repositories (library networks) 
-  * swissbib [[https://www.swissbib.ch/]]
+  * ca. 30 million records 
 +  * MARC21 XML Format 
 +  * → raw data stored in MongoDB 
 +  * → transformed and clustered data stored in CBS (central library system)  
 + 
 + 
 +**edoc** [[http://edoc.unibas.ch/]] 
 +  * Institutional repository of the University of Basel (document server, open access publications) 
 +  * ca. 50'000 records 
 +  * JSON File 
 + 
 +**crossref** [[https://www.crossref.org/]] 
 +  * Digital Object Identifier (DOI) Registration Agency 
 +  * ca. 90 million records (we only use 30 million) 
 +  * JSON scraped from API 
 + 
 +===== Use Cases ===== 
 + 
 +=== Swissbib === 
 + 
 +__Librarian__:  
 + 
 +- To prioritize which of our holdings should be digitized most urgently, I want to know which of them cannot be found anywhere else. 
 + 
 +- We would like to have a list of all the DVDs in swissbib. 
 + 
 +- What is special about the holdings of a given library or institution? Can we derive a profile? 
 +  
 +__Data analyst__:  
 + 
 +- I want to get to know my data better, and to do so faster.  
 + 
 +→ E.g. I want to know which records have no entry for 'year of publication'. I want to analyze whether these records should be sent through the CBS merging process. I therefore also want to know whether these records contain other 'relevant' fields as defined by CBS (e.g. ISBN). A visualization tool might be useful for analyzing the results. 
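
The check described in this use case can be sketched in plain Python against a MARC21 XML export. This is only an illustration under stated assumptions, not the project's Flink/Metafacture pipeline: it assumes the year of publication sits in 260$c or 264$c and the ISBN in 020$a, and the sample records and their identifiers are made up.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Assumption: publication year is recorded in 260$c or 264$c, ISBN in 020$a.
YEAR_FIELDS = [("260", "c"), ("264", "c")]
RELEVANT_FIELDS = {"isbn": ("020", "a")}


def subfield_values(record, tag, code):
    """Return all non-empty values of subfield `code` in datafields `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [sf.text.strip() for sf in record.findall(path, MARC_NS)
            if sf.text and sf.text.strip()]


def records_without_year(xml_text):
    """Yield (record id, relevant fields present) for records lacking a year."""
    root = ET.fromstring(xml_text)
    for record in root.iter(f"{{{MARC_NS['marc']}}}record"):
        if any(subfield_values(record, t, c) for t, c in YEAR_FIELDS):
            continue  # a year is present, so the record is not of interest
        ctrl = record.find("marc:controlfield[@tag='001']", MARC_NS)
        rec_id = ctrl.text if ctrl is not None else None
        present = [name for name, (t, c) in RELEVANT_FIELDS.items()
                   if subfield_values(record, t, c)]
        yield rec_id, present


# Made-up sample data for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">rec-1</controlfield>
    <datafield tag="020" ind1=" " ind2=" "><subfield code="a">9783161484100</subfield></datafield>
  </record>
  <record>
    <controlfield tag="001">rec-2</controlfield>
    <datafield tag="260" ind1=" " ind2=" "><subfield code="c">2017</subfield></datafield>
  </record>
</collection>"""

if __name__ == "__main__":
    for rec_id, present in records_without_year(SAMPLE):
        print(rec_id, present)
```

Extending `RELEVANT_FIELDS` with further tag/subfield pairs would cover the other fields that CBS treats as relevant; which fields those are is not specified here.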
 + 
 +=== edoc === 
 + 
 +Goal: Enrichment. I want to add missing identifiers (e.g. DOIs, ORCID, funder IDs) to the edoc dataset. 
 + 
 +→ Match the two datasets by author and title 
 + 
 +→ Quality of the matches? (score) 
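
A minimal sketch of how matching by author and title could produce a score, assuming each record is a dict with a `title` string and an `authors` list in "Surname, Given" form. The record layout, the 0.7 title weight, the 0.85 threshold and the DOIs below are all illustrative assumptions, not values used by the project.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def author_surnames(authors):
    """Reduce 'Surname, Given' strings to a set of normalized surnames."""
    return {normalize(a.split(",")[0]) for a in authors if a.strip()}


def match_score(rec_a, rec_b, title_weight=0.7):
    """Score in [0, 1]: weighted title similarity plus surname overlap (Jaccard)."""
    title_sim = SequenceMatcher(
        None, normalize(rec_a["title"]), normalize(rec_b["title"])
    ).ratio()
    a = author_surnames(rec_a["authors"])
    b = author_surnames(rec_b["authors"])
    author_sim = len(a & b) / len(a | b) if a | b else 0.0
    return title_weight * title_sim + (1 - title_weight) * author_sim


def best_match(record, candidates, threshold=0.85):
    """Return (score, candidate) for the best candidate at or above threshold, else None."""
    scored = [(match_score(record, c), c) for c in candidates]
    if not scored:
        return None
    score, cand = max(scored, key=lambda pair: pair[0])
    return (score, cand) if score >= threshold else None


# Made-up records; the DOIs are placeholders, not real identifiers.
edoc_record = {"title": "Big Data Analytics for Libraries.", "authors": ["Muster, Anna"]}
candidates = [
    {"title": "Big data analytics for libraries", "authors": ["Muster, A."],
     "doi": "10.0000/example-1"},
    {"title": "An unrelated study", "authors": ["Other, B."],
     "doi": "10.0000/example-2"},
]

if __name__ == "__main__":
    print(best_match(edoc_record, candidates))
```

The score makes the quality question concrete: matches near the threshold can be routed to manual review, while high-scoring ones can contribute their identifiers directly.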
 + 
 +===== Tools ===== 
 + 
 +**elasticsearch** [[https://www.elastic.co/de/]] 
 + 
 +Java-based search engine; results are exported as JSON 
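
As an illustration of the JSON exchanged with Elasticsearch, the following sketch builds a search body that selects documents lacking a given field, using the query DSL's `bool`/`must_not`/`exists` clauses. The field name `publication_year` is hypothetical; the real name depends on how the records are mapped in the index.

```python
import json


def missing_field_query(field, size=100):
    """Build a search body matching documents that have no value for `field`."""
    return {
        "size": size,
        "query": {"bool": {"must_not": [{"exists": {"field": field}}]}},
    }


# "publication_year" is a hypothetical field name; use the one from your index mapping.
body = missing_field_query("publication_year")
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the index's `_search` endpoint, and the hits come back as JSON.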
 + 
 +**Flink** [[https://flink.apache.org/]] 
 + 
 +Open-source stream processing framework 
 + 
 +**Metafacture** [[https://culturegraph.github.io/]],  
 +[[https://github.com/dataramblers/hackathon17/wiki#metafacture]] 
 + 
 +Tool suite for metadata processing and transformation 
 + 
 +**Zeppelin** [[https://zeppelin.apache.org/]] 
 + 
 +Visualisation of the results 
 + 
 +===== How to get there ===== 
 + 
 +=== Use case 1: Swissbib === 
 + 
 +{{:project:swissbib_infra.jpg?direct&300|}} 
 + 
 +=== Use case 2: edoc === 
 + 
 +{{:project:edoc-infra.jpg?direct&300|}} 
 +===== Links ===== 
 + 
 +Data Ramblers Project Wiki [[https://github.com/dataramblers/hackathon17/wiki]]
  
 ===== Team =====
Line 30: Line 102:
  
      
-{{tag>status:concept}}
+{{tag>status:concept glam}}
  
  • Last modified: 2017/09/15 17:06
  • by j.martinelli