project:big_data_analytics

Differences

This shows you the differences between two versions of the page.

project:big_data_analytics [2017/09/16 17:50]
waddell
project:big_data_analytics [2017/09/18 10:12] (current)
andrea Correct a few typos
Line 3: Line 3:
 We try to analyse bibliographical data using big data technology (flink, elasticsearch,​ metafacture). We try to analyse bibliographical data using big data technology (flink, elasticsearch,​ metafacture).
  
-- more info will follow... +Here a first sketch of what we're aiming at:
- +
-- see also: +
-  - https://​github.com/​dataramblers +
-  - https://​github.com/​dataramblers/​hackathon17/​wiki +
- +
-here a first sketch of what we're aiming at:+
  
 {{:​project:​plakat_v01.jpg?​direct&​600|}} {{:​project:​plakat_v01.jpg?​direct&​600|}}
Line 15: Line 9:
 ===== Datasets ===== ===== Datasets =====
  
-We use biographical ​metadata:+We use bibliographical ​metadata:
  
 **Swissbib bibliographical data** [[https://​www.swissbib.ch/​]] **Swissbib bibliographical data** [[https://​www.swissbib.ch/​]]
-  * Catalog of all the Swiss University Libraries, the Swiss Nationallibrary, etc.+  * Catalog of all the Swiss University Libraries, the Swiss National Library, etc.
   * 960 Libraries / 23 repositories (Bibliotheksverbunde)   * 960 Libraries / 23 repositories (Bibliotheksverbunde)
   * ca. 30 Mio records   * ca. 30 Mio records
Line 36: Line 30:
   * JSON scraped from API   * JSON scraped from API
  
-===== Usecases ​=====+===== Use Cases =====
  
 === Swissbib === === Swissbib ===
Line 50: Line 44:
 __Data analyst__: ​ __Data analyst__: ​
  
-- I wan‘t ​to get to know better my data. And be faster. ​+- I want to get to know better my data. And be faster. ​
  
-→  e.g. I want to know which records don‘t have any entry for ‚year of publication‘. I want to analyse, if these records should be sent through the merging process of CBS. There fore I also want to know, if these records contain other ‚relevant‘ fields, ​definded ​by CBS (e.g. ISBN, etc.). To analyze the results a visualization tool might be useful.+→  e.g. I want to know which records don‘t have any entry for ‚year of publication‘. I want to analyze, if these records should be sent through the merging process of CBS. Therefore ​I also want to know, if these records contain other ‚relevant‘ fields, ​defined ​by CBS (e.g. ISBN, etc.). To analyze the resultsa visualization tool might be useful.
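
The data analyst use case above (find records with no year of publication, then check whether they still carry other relevant fields) could be sketched roughly as follows. This is only an illustration on plain Python dictionaries: the field names `year_of_publication`, `isbn` and `id` are hypothetical placeholders, not the actual Swissbib or CBS schema, and the real list of relevant fields is defined by CBS.

```python
from typing import Iterable

# Hypothetical field names; the real Swissbib/CBS schema may differ.
YEAR_FIELD = "year_of_publication"
RELEVANT_FIELDS = ("isbn",)  # placeholder; CBS defines the real list


def is_empty(value) -> bool:
    """Treat missing values, None, empty strings and empty lists as 'no entry'."""
    return value is None or value == "" or value == []


def missing_year_report(records: Iterable[dict]) -> list[dict]:
    """For each record without a year, list which relevant fields it still has."""
    report = []
    for record in records:
        if not is_empty(record.get(YEAR_FIELD)):
            continue  # record has a year of publication, skip it
        present = [f for f in RELEVANT_FIELDS if not is_empty(record.get(f))]
        report.append({"id": record.get("id"), "relevant_fields": present})
    return report


# Made-up sample records, for illustration only.
sample = [
    {"id": "r1", "year_of_publication": "1999", "isbn": "978-3-16-148410-0"},
    {"id": "r2", "isbn": "978-3-16-148410-0"},
    {"id": "r3", "year_of_publication": ""},
]
result = missing_year_report(sample)
for row in result:
    print(row["id"], row["relevant_fields"])
```

Records that come back with a non-empty `relevant_fields` list would be the candidates worth a closer look before deciding whether to send them through the merging process; the counts per field could then be fed into a visualization tool.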
  
 === edoc === === edoc ===
Line 66: Line 60:
 **elasticsearch** [[https://​www.elastic.co/​de/​]] **elasticsearch** [[https://​www.elastic.co/​de/​]]
  
-JAVA based searchengine, results exported in JSON +JAVA based search engine, results exported in JSON 
  
 **Flink** [[https://​flink.apache.org/​]] **Flink** [[https://​flink.apache.org/​]]
  • project/big_data_analytics.txt
  • Last modified: 2017/09/18 10:12
  • by andrea