Differences
This shows you the differences between two versions of the page.
| project:big_data_analytics [2017/09/16 15:06] – [Usecases] j.martinelli | project:big_data_analytics [2017/09/18 10:12] (current) – Correct a few typos andrea | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| We try to analyse bibliographical data using big data technology (flink, elasticsearch, | We try to analyse bibliographical data using big data technology (flink, elasticsearch, | ||
| - | - more info will follow... | + | Here is a first sketch of what we're aiming at: |
| - | + | ||
| - | - see also: | + | |
| - | - https:// | + | |
| - | - https:// | + | |
| - | + | ||
| - | here a first sketch of what we're aiming at: | + | |
| {{: | {{: | ||
| Line 15: | Line 9: | ||
| ===== Datasets ===== | ===== Datasets ===== | ||
| - | We use biographical | + | We use bibliographical |
| **Swissbib bibliographical data** [[https:// | **Swissbib bibliographical data** [[https:// | ||
| - | * Catalog of all the Swiss University Libraries, the Swiss Nationallibrary, etc. | + | * Catalog of all the Swiss University Libraries, the Swiss National Library, etc. |
| * 960 Libraries / 23 repositories (Bibliotheksverbunde) | * 960 Libraries / 23 repositories (Bibliotheksverbunde) | ||
| * ca. 30 million records | * ca. 30 million records | ||
| Line 36: | Line 30: | ||
| * JSON scraped from API | * JSON scraped from API | ||
| - | ===== Usecases | + | ===== Use Cases ===== |
| === Swissbib === | === Swissbib === | ||
| - | * Librarian: For prioritizing which of our holdings should be digitized most urgently, I want to know which of our holdings are nowhere else to be found. | + | __Librarian__: |
| - | * Library: | + | |
| - | | + | - For prioritizing which of our holdings should be digitized most urgently, I want to know which of our holdings are nowhere else to be found. |
| - | * Data analyst: I wan‘t | + | |
| + | - We would like to have a list of all the DVDs in swissbib. | ||
| + | |||
| + | - What is special about the holdings of some library/ | ||
| + | |||
| + | __Data analyst__: | ||
| + | |||
| + | - I want to get to know my data better, and to do so faster. | ||
| + | |||
| + | → e.g. I want to know which records don't have any entry for 'year of publication'. I want to analyze whether these records should be sent through the merging process of CBS. Therefore | ||
| === edoc === | === edoc === | ||
| - | Goal: Enrichment. I want to add missing | + | Goal: Enrichment. I want to add missing |
| → Match the two datasets by author and title | → Match the two datasets by author and title | ||
| → Quality of the matches? (score) | → Quality of the matches? (score) | ||
| + | |||
| + | ===== Tools ===== | ||
| + | |||
| + | **elasticsearch** [[https:// | ||
| + | |||
| + | Java-based search engine; results are exported as JSON | ||
| + | |||
| + | **Flink** [[https:// | ||
| + | |||
| + | Open-source stream processing framework | ||
| + | |||
| + | **Metafacture** [[https:// | ||
| + | [[https:// | ||
| + | |||
| + | Tool suite for metadata processing and transformation | ||
| + | |||
| + | **Zeppelin** [[https:// | ||
| + | |||
| + | Visualisation of the results | ||
| + | |||
| + | ===== How to get there ===== | ||
| + | |||
| + | === Use case 1: Swissbib === | ||
| + | |||
| + | {{: | ||
| + | |||
| + | === Use case 2: edoc === | ||
| + | |||
| + | {{: | ||
| + | ===== Links ===== | ||
| + | |||
| + | Data Ramblers Project Wiki [[https:// | ||
| + | |||
| ===== Team ===== | ===== Team ===== | ||
| Line 66: | Line 102: | ||
| | | ||
| - | {{tag> | + | {{tag> |
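
The edoc use case in this revision (match the two datasets by author and title, then judge the quality of each match with a score) could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard-library `difflib` instead of the Flink/elasticsearch tooling the page lists, and the record fields `author` and `title`, the function names, and the `0.9` threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not lower the score."""
    return " ".join(text.lower().split())


def match_score(record_a, record_b):
    """Mean string similarity of the author and title fields, between 0.0 and 1.0.

    The field names "author" and "title" are assumptions for this sketch.
    """
    author = SequenceMatcher(
        None, normalize(record_a["author"]), normalize(record_b["author"])
    ).ratio()
    title = SequenceMatcher(
        None, normalize(record_a["title"]), normalize(record_b["title"])
    ).ratio()
    return (author + title) / 2


def best_matches(edoc_records, candidate_records, threshold=0.9):
    """For each edoc record, keep the best-scoring candidate at or above the threshold."""
    matches = []
    for edoc in edoc_records:
        scored = [(match_score(edoc, cand), cand) for cand in candidate_records]
        if not scored:
            continue
        score, cand = max(scored, key=lambda pair: pair[0])
        if score >= threshold:
            matches.append((edoc, cand, score))
    return matches
```

Comparing every record pair is quadratic, so this only suits small samples; on the full datasets a blocking step or a search index would be needed before scoring.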