Differences
This shows you the differences between two versions of the page.
project:big_data_analytics [2017/09/16 15:01] – [Data] j.martinelli | project:big_data_analytics [2017/09/18 10:12] (current) – Correct a few typos andrea | ||
---|---|---|---|
Line 3: | Line 3: | ||
We try to analyse bibliographical data using big data technology (flink, elasticsearch, | We try to analyse bibliographical data using big data technology (flink, elasticsearch, | ||
- | - more info will follow... | + | Here is a first sketch of what we're aiming at: |
- | + | ||
- | - see also: | + | |
- | - https:// | + | |
- | - https:// | + | |
- | + | ||
- | here a first sketch of what we're aiming at: | + | |
{{: | {{: | ||
Line 15: | Line 9: | ||
===== Datasets ===== | ===== Datasets ===== | ||
- | We use biographical | + | We use bibliographical |
**Swissbib bibliographical data** [[https:// | **Swissbib bibliographical data** [[https:// | ||
- | * Catalog of all the Swiss University Libraries, the Swiss Nationallibrary, etc. | + | * Catalog of all the Swiss University Libraries, the Swiss National Library, etc. |
* 960 Libraries / 23 repositories (Bibliotheksverbunde) | * 960 Libraries / 23 repositories (Bibliotheksverbunde) | ||
* ca. 30 Mio records | * ca. 30 Mio records | ||
Line 35: | Line 29: | ||
* ca. 90 Mio records (we only use 30 Mio) | * ca. 90 Mio records (we only use 30 Mio) | ||
* JSON scraped from API | * JSON scraped from API | ||
+ | |||
+ | ===== Use Cases ===== | ||
+ | |||
+ | === Swissbib === | ||
+ | |||
+ | __Librarian__: | ||
+ | |||
+ | - For prioritizing which of our holdings should be digitized most urgently, I want to know which of our holdings are nowhere else to be found. | ||
+ | |||
+ | - We would like to have a list of all the DVDs in swissbib. | ||
+ | |||
+ | - What is special about the holdings of some library/ | ||
+ | |||
+ | __Data analyst__: | ||
+ | |||
+ | - I want to get to know my data better, and to do so faster. | |
+ | |||
+ | → e.g. I want to know which records don't have any entry for 'year of publication'. I want to analyze whether these records should be sent through the merging process of CBS. Therefore I also want to know whether these records contain other 'relevant' fields, as defined by CBS (e.g. ISBN). To analyze the results, a visualization tool might be useful. | |
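The data analyst check above could be sketched roughly as follows. This is only an illustration: the field names (`year_of_publication`, `isbn`, `title`, `author`, `id`) and the list of "relevant" fields are assumptions, since neither the actual Swissbib record schema nor the CBS definition is given on this page.

```python
# Hypothetical field names; the real Swissbib/CBS schema is not given here.
YEAR_FIELD = "year_of_publication"
RELEVANT_FIELDS = ("isbn", "title", "author")  # assumed stand-in for the CBS list


def is_empty(value):
    """Treat None and blank strings as missing values."""
    return value is None or (isinstance(value, str) and not value.strip())


def records_missing_year(records):
    """Return one summary per record that has no year of publication.

    Each summary lists which of the assumed 'relevant' fields are filled,
    so one can judge whether the record is still worth merging.
    """
    result = []
    for rec in records:
        if is_empty(rec.get(YEAR_FIELD)):
            present = [f for f in RELEVANT_FIELDS if not is_empty(rec.get(f))]
            result.append({"id": rec.get("id"), "relevant_fields": present})
    return result


if __name__ == "__main__":
    sample = [
        {"id": 1, "year_of_publication": "1999", "isbn": "x"},
        {"id": 2, "isbn": "978-3", "title": " "},
        {"id": 3, "year_of_publication": ""},
    ]
    for summary in records_missing_year(sample):
        print(summary)
```

The summaries could then be counted per field and handed to a visualization tool.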
+ | |||
+ | === edoc === | ||
+ | |||
+ | Goal: Enrichment. I want to add missing identifiers (e.g. DOIs, ORCID, funder IDs) to the edoc dataset. | ||
+ | |||
+ | → Match the two datasets by author and title | ||
+ | |||
+ | → Quality of the matches? (score) | ||
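A minimal sketch of what matching by author and title with a quality score might look like. It is not the project's actual pipeline: the string similarity from Python's standard `difflib`, the 0.6/0.4 weighting, the 0.85 threshold and the record keys (`title`, `author`, `doi`) are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def similarity(a, b):
    """String similarity in [0, 1] computed on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a, rec_b, title_weight=0.6):
    """Weighted mean of title and author similarity (weights are assumed)."""
    title_sim = similarity(rec_a["title"], rec_b["title"])
    author_sim = similarity(rec_a["author"], rec_b["author"])
    return title_weight * title_sim + (1 - title_weight) * author_sim


def best_match(record, candidates, threshold=0.85):
    """Return (candidate, score) for the best match, or (None, score)
    when even the best candidate stays below the assumed threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = match_score(record, cand)
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


if __name__ == "__main__":
    edoc_record = {"title": "Big Data in Libraries", "author": "Meier, Anna"}
    other_dataset = [
        {"title": "big data in libraries.", "author": "Meier, Anna",
         "doi": "10.0000/example"},  # made-up identifier for illustration
        {"title": "Something else", "author": "Other"},
    ]
    print(best_match(edoc_record, other_dataset))
```

Keeping the score next to each match lets a reviewer inspect borderline pairs before any identifier is copied into the edoc dataset. At the scale of tens of millions of records, comparing every pair is not feasible, so candidates would first need to be narrowed down, for example with a search index.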
+ | |||
+ | ===== Tools ===== | ||
+ | |||
+ | **elasticsearch** [[https:// | ||
+ | |||
+ | Java-based search engine; results are exported as JSON | |
+ | |||
+ | **Flink** [[https:// | ||
+ | |||
+ | Open-source stream processing framework | |
+ | |||
+ | **Metafacture** [[https:// | ||
+ | [[https:// | ||
+ | |||
+ | Tool suite for metadata processing and transformation | |
+ | |||
+ | **Zeppelin** [[https:// | ||
+ | |||
+ | Visualisation of the results | ||
+ | |||
+ | ===== How to get there ===== | ||
+ | |||
+ | === Use case 1: Swissbib === | |
+ | |||
+ | {{: | ||
+ | |||
+ | === Use case 2: edoc === | |
+ | |||
+ | {{: | ||
+ | ===== Links ===== | ||
+ | |||
+ | Data Ramblers Project Wiki [[https:// | ||
===== Team ===== | ===== Team ===== | ||
Line 50: | Line 102: | ||
| | ||
- | {{tag> | + | {{tag> |