project:big_data_analytics

  
- more info will follow...

- see also:
  - https://github.com/dataramblers
  - https://github.com/dataramblers/hackathon17/wiki
  
Here is a first sketch of what we're aiming at:
  
{{:project:plakat_v01.jpg?direct&600|}}
  
===== Datasets =====
  
We use bibliographical metadata:

**Swissbib bibliographical data** [[https://www.swissbib.ch/]]
  * Catalog of all the Swiss university libraries, the Swiss National Library, etc.
  * 960 libraries / 23 library networks (Bibliotheksverbunde)
  * ca. 30 million records
  * MARC21 XML format
  * → raw data stored in MongoDB (see the loading sketch below)
  * → transformed and clustered data stored in CBS (central library system)
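A minimal sketch of the "raw data → MongoDB" step above, assuming pymarc and pymongo with a local MongoDB; the file, database and collection names are placeholders, not the project's actual setup:

<code python>
from pymarc import parse_xml_to_array
from pymongo import MongoClient

# Parse a MARC21 XML export (placeholder file name) into pymarc Record objects
records = parse_xml_to_array("swissbib_export.xml")

# Connect to a local MongoDB (placeholder database/collection names)
client = MongoClient("mongodb://localhost:27017")
collection = client["swissbib"]["records"]

# Record.as_dict() yields a JSON-like structure of leader, control and data fields
collection.insert_many([record.as_dict() for record in records])
</code>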

**edoc** [[http://edoc.unibas.ch/]]
  * Institutional repository of the University of Basel (document server, Open Access publications)
  * ca. 50,000 records
  * JSON file

**crossref** [[https://www.crossref.org/]]
  * Digital Object Identifier (DOI) registration agency
  * ca. 90 million records (we only use 30 million)
  * JSON scraped from the API (see the harvesting sketch below)
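A minimal sketch of how the Crossref JSON could be scraped, using the public Crossref REST API (api.crossref.org/works) with cursor-based deep paging; the row count, record limit and output file name are arbitrary choices for this example:

<code python>
import json
import requests

BASE_URL = "https://api.crossref.org/works"

def harvest(max_records=1000, rows=100, out_path="crossref_sample.jsonl"):
    """Page through /works and write one JSON record per line."""
    cursor = "*"
    fetched = 0
    with open(out_path, "w", encoding="utf-8") as out:
        while fetched < max_records:
            resp = requests.get(BASE_URL, params={"rows": rows, "cursor": cursor})
            resp.raise_for_status()
            message = resp.json()["message"]
            items = message["items"]
            if not items:          # no more results
                break
            for item in items:
                out.write(json.dumps(item) + "\n")
            fetched += len(items)
            cursor = message["next-cursor"]   # token for the next page

if __name__ == "__main__":
    harvest()
</code>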

===== Use cases =====

=== Swissbib ===

__Librarian__:

- To prioritize which of our holdings should be digitized most urgently, I want to know which of our holdings cannot be found anywhere else.

- We would like to have a list of all the DVDs in swissbib.

- What is special about the holdings of a given library or institution? What is its profile?

__Data analyst__:

- I want to get to know my data better, and to be faster.

→ e.g. I want to know which records have no entry for 'year of publication'. I want to analyse whether these records should be sent through the CBS merging process. Therefore I also want to know whether these records contain other 'relevant' fields defined by CBS (e.g. ISBN). A visualization tool might be useful for analysing the results (see the query sketch below).
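A hedged sketch of such a query against Elasticsearch, assuming the Swissbib records are indexed there; the index name ("swissbib") and field names ("publication_year", "isbn") are hypothetical and depend on the actual mapping of the MARC21 data:

<code python>
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch instance
INDEX = "swissbib"             # hypothetical index name

# Records with no entry for 'year of publication'
missing_year = {"bool": {"must_not": {"exists": {"field": "publication_year"}}}}
total = requests.get(f"{ES}/{INDEX}/_count", json={"query": missing_year}).json()["count"]

# Of those, how many still carry another 'relevant' field, e.g. an ISBN?
missing_year_with_isbn = {
    "bool": {
        "must_not": {"exists": {"field": "publication_year"}},
        "must": {"exists": {"field": "isbn"}},
    }
}
relevant = requests.get(f"{ES}/{INDEX}/_count", json={"query": missing_year_with_isbn}).json()["count"]

print(f"{total} records without a publication year, {relevant} of them with an ISBN")
</code>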

=== edoc ===

Goal: enrichment. I want to add missing identifiers (e.g. DOIs, ORCID iDs, funder IDs) to the edoc dataset.

→ Match the two datasets by author and title (see the matching sketch below)

→ How good are the matches? (matching score)
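A minimal sketch of the matching idea, assuming simple string similarity on normalised author and title strings; the 0.7/0.3 weighting and the 0.9 threshold are arbitrary assumptions, not project-defined values:

<code python>
from difflib import SequenceMatcher

def normalise(s: str) -> str:
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(s.lower().split())

def match_score(edoc_rec: dict, other_rec: dict) -> float:
    """Combine title and author similarity into one score between 0 and 1."""
    title_sim = SequenceMatcher(
        None, normalise(edoc_rec["title"]), normalise(other_rec["title"])
    ).ratio()
    author_sim = SequenceMatcher(
        None, normalise(edoc_rec["author"]), normalise(other_rec["author"])
    ).ratio()
    return 0.7 * title_sim + 0.3 * author_sim

# Toy records with placeholder values (the DOI is not a real identifier)
edoc = {"title": "Big Data Analytics of Bibliographical Data", "author": "Anna Muster"}
cref = {"title": "Big data analytics of bibliographical data",
        "author": "Anna Muster", "DOI": "10.1000/example"}

score = match_score(edoc, cref)
if score > 0.9:
    print(f"candidate match (score {score:.2f}), enrich edoc record with DOI {cref['DOI']}")
</code>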

===== Tools =====

**elasticsearch** [[https://www.elastic.co/de/]]

Java-based search engine; results exported as JSON

**Flink** [[https://flink.apache.org/]]

Open-source stream-processing framework

**Metafacture** [[https://culturegraph.github.io/]], [[https://github.com/dataramblers/hackathon17/wiki#metafacture]]

Tool suite for metadata processing and transformation

**Zeppelin** [[https://zeppelin.apache.org/]]

Web-based notebook for visualising the results

===== How to get there =====

=== Use case 1: Swissbib ===

{{:project:swissbib_infra.jpg?direct&300|}}

=== Use case 2: edoc ===

{{:project:edoc-infra.jpg?direct&300|}}

===== Links =====

Data Ramblers Project Wiki [[https://github.com/dataramblers/hackathon17/wiki]]
  
===== Team =====
  
      
{{tag>status:concept glam}}
  