project:big_data_analytics

We try to analyse bibliographical data using big data technology (Flink, Elasticsearch, Metafacture).
  
Here is a first sketch of what we're aiming at:
  
{{:project:plakat_v01.jpg?direct&600|}}
===== Datasets =====
  
We use bibliographical metadata:
  
**Swissbib bibliographical data** [[https://www.swissbib.ch/]]
  * Catalog of all the Swiss University Libraries, the Swiss National Library, etc.
  * 960 Libraries / 23 repositories (Bibliotheksverbunde)
  * ca. 30 million records
  * JSON scraped from API
  
===== Use Cases =====
  
=== Swissbib ===
__Data analyst__:
  
- I want to get to know my data better, and I want to be faster.
  
→ e.g. I want to know which records don't have any entry for 'year of publication'. I want to analyze whether these records should be sent through the merging process of CBS. Therefore I also want to know whether these records contain other 'relevant' fields defined by CBS (e.g. ISBN). To analyze the results, a visualization tool might be useful.
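
A minimal sketch of how such a check could look, assuming the records are indexed in Elasticsearch; the index name "swissbib" and the field names "publicationYear" and "isbn" are placeholders, not the actual Swissbib mapping:

<code python>
from elasticsearch import Elasticsearch

# Placeholder host, index and field names; adjust to the actual Swissbib mapping.
es = Elasticsearch("http://localhost:9200")

# Records without any 'year of publication' entry
missing_year = {"bool": {"must_not": [{"exists": {"field": "publicationYear"}}]}}
print(es.count(index="swissbib", body={"query": missing_year})["count"])

# Of those, how many still carry an ISBN (one of the 'relevant' fields defined by CBS)?
missing_year_with_isbn = {
    "bool": {
        "must": [{"exists": {"field": "isbn"}}],
        "must_not": [{"exists": {"field": "publicationYear"}}],
    }
}
print(es.count(index="swissbib", body={"query": missing_year_with_isbn})["count"])
</code>

The resulting counts could then be handed to Zeppelin (or any other plotting tool) for the visualization mentioned above.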
  
=== edoc ===
  
Goal: Enrichment. I want to add missing identifiers (e.g. DOIs, ORCID, funder IDs) to the edoc dataset.
  
→ Match the two datasets by author and title
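
A rough sketch of what such a match could look like; the record layout, the field names and the similarity threshold are assumptions, not the actual edoc or Swissbib schema:

<code python>
from difflib import SequenceMatcher

def normalize(s):
    """Lower-case and collapse whitespace for a tolerant comparison."""
    return " ".join(s.lower().split())

def is_match(edoc_rec, other_rec, threshold=0.9):
    """Treat two records as the same work if the author matches and the titles are near-identical."""
    same_author = normalize(edoc_rec["author"]) == normalize(other_rec["author"])
    title_similarity = SequenceMatcher(
        None, normalize(edoc_rec["title"]), normalize(other_rec["title"])
    ).ratio()
    return same_author and title_similarity >= threshold

# Hypothetical records; in the real pipeline they would come from the edoc export
# and from the dataset that carries the identifiers (e.g. DOIs).
edoc_record = {"author": "Doe, Jane", "title": "On the Analysis of Library Metadata"}
candidate = {"author": "doe, jane",
             "title": "On the analysis of library metadata",
             "doi": "10.1234/example"}

if is_match(edoc_record, candidate):
    edoc_record["doi"] = candidate["doi"]  # enrich the edoc record with the missing DOI
print(edoc_record)
</code>

Over the full datasets the candidates would not be compared pairwise but looked up via an index (e.g. an Elasticsearch query on the title), with the fuzzy comparison applied only to the few candidates returned.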
  
===== Tools =====

**elasticsearch** [[https://www.elastic.co/de/]]

Java-based search engine; results are returned as JSON

**Flink** [[https://flink.apache.org/]]

Open-source stream processing framework

**Metafacture** [[https://culturegraph.github.io/]], [[https://github.com/dataramblers/hackathon17/wiki#metafacture]]

Tool suite for metadata processing and transformation

**Zeppelin** [[https://zeppelin.apache.org/]]

Visualisation of the results

===== How to get there =====

=== Use case 1: Swissbib ===

{{:project:swissbib_infra.jpg?direct&300|}}

=== Use case 2: edoc ===

{{:project:edoc-infra.jpg?direct&300|}}

===== Links =====

Data Ramblers Project Wiki [[https://github.com/dataramblers/hackathon17/wiki]]

===== Team =====
  
  
      
{{tag>status:concept glam}}
  