--- project:big_data_analytics [2017/09/15 17:06] – [Big Data Analytics (bibliographical data)] j.martinelli
+++ project:big_data_analytics [2017/09/18 10:12] (current) – Correct a few typos andrea
@@ Line 3: / Line 3: @@
 We try to analyse bibliographical data using big data technology (flink, elasticsearch, metafacture).
-- more info will follow...
+Here is a first sketch of what we're aiming at:
-here a first sketch of what we're aiming at:
+{{:project:plakat_v01.jpg?direct&600|}}
-{{:project:plakat_v01.jpg?direct&800|}}
+===== Datasets =====
-===== Data =====
+We use bibliographical metadata:
-We use biographical metadata:
+**Swissbib bibliographical data** [[https://www.swissbib.ch/]]
-  * crossref [[https://www.crossref.org/]]
+  * Catalog of all the Swiss University Libraries, the Swiss National Library, etc.
-  * edoc [[http://edoc.unibas.ch/]]
+  * 960 libraries / 23 repositories (library networks)
-  * swissbib [[https://www.swissbib.ch/]]
+  * approx. 30 million records
+  * MARC21 XML format
+  * → raw data stored in MongoDB
+  * → transformed and clustered data stored in CBS (central library system)
+**edoc** [[http://edoc.unibas.ch/]]
+  * Institutional repository of the University of Basel (document server, open access publications)
+  * approx. 50,000 records
+  * JSON File
+**crossref** [[https://www.crossref.org/]]
+  * Digital Object Identifier (DOI) Registration Agency
+  * approx. 90 million records (we use only 30 million)
+  * JSON scraped from API
+===== Use Cases =====
+=== Swissbib ===
+__Librarian__:
+- To prioritize which of our holdings should be digitized most urgently, I want to know which of them cannot be found anywhere else.
+- We would like to have a list of all the DVDs in swissbib.
+- What is special about the holdings of a given library or institution (its profile)?
+__Data analyst__:
+- I want to get to know my data better, and to do so faster.
+→ e.g. I want to know which records don't have any entry for 'year of publication'. I want to analyze whether these records should be sent through the merging process of CBS. Therefore I also want to know whether these records contain other 'relevant' fields, as defined by CBS (e.g. ISBN). A visualization tool might be useful for analyzing the results.
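The data analyst question above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are hypothetical flat dictionaries (the real swissbib data is MARC21 XML and would have to be parsed first), and the keys `year`, `isbn`, `title` and `author` as well as the `RELEVANT_FIELDS` list are stand-ins, not the fields actually defined by CBS.

```python
from collections import Counter

# Assumed stand-in for the list of 'relevant' fields defined by CBS.
RELEVANT_FIELDS = ("isbn", "title", "author")


def is_missing(record, field):
    """True if the field is absent, None, or only whitespace."""
    value = record.get(field)
    return value is None or str(value).strip() == ""


def records_without_year(records):
    """Return the records that have no usable 'year' entry."""
    return [r for r in records if is_missing(r, "year")]


def relevant_field_coverage(records):
    """Count, per relevant field, how many records actually carry a value."""
    counts = Counter()
    for record in records:
        for field in RELEVANT_FIELDS:
            if not is_missing(record, field):
                counts[field] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    {"id": "a1", "title": "Faust", "author": "Goethe", "year": "1808", "isbn": ""},
    {"id": "a2", "title": "Untitled map", "author": "", "year": "", "isbn": ""},
    {"id": "a3", "title": "Data Mining", "author": "Han", "isbn": "9780123814791"},
]

no_year = records_without_year(sample)
coverage = relevant_field_coverage(no_year)
print([r["id"] for r in no_year], dict(coverage))
```

The coverage counts are the kind of aggregate that could then be handed to a visualization tool.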
+=== edoc ===
+Goal: Enrichment. I want to add missing identifiers (e.g. DOIs, ORCID, funder IDs) to the edoc dataset.
+→ Match edoc against crossref by author and title
+→ How good are the matches? (compute a score)
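One possible way to match by author and title and attach a score is sketched below. It is not the project's actual pipeline: it uses Python's standard `difflib.SequenceMatcher` as a simple string similarity, and the record layout, the `title_weight` of 0.7, the `threshold` of 0.85 and the placeholder DOIs are all assumptions chosen for illustration.

```python
import difflib
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    kept = "".join(c for c in decomposed if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a, rec_b, title_weight=0.7):
    """Weighted mean of title and author similarity (weight is an assumption)."""
    title_sim = similarity(rec_a["title"], rec_b["title"])
    author_sim = similarity(rec_a["author"], rec_b["author"])
    return title_weight * title_sim + (1 - title_weight) * author_sim


def best_match(record, candidates, threshold=0.85):
    """Return (candidate, score), or (None, score) if nothing reaches the threshold."""
    if not candidates:
        return None, 0.0
    scored = [(match_score(record, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0])
    return (candidate, score) if score >= threshold else (None, score)


# Hypothetical records; the DOIs are placeholders, not real identifiers.
edoc_record = {"title": "Big Data Analytics for Libraries", "author": "Müller, Anna"}
crossref_records = [
    {"title": "Big data analytics for libraries.", "author": "Muller, Anna",
     "doi": "10.0000/example-1"},
    {"title": "Small data in archives", "author": "Meier, Hans",
     "doi": "10.0000/example-2"},
]

match, score = best_match(edoc_record, crossref_records)
if match is not None:
    edoc_record["doi"] = match["doi"]  # enrichment step
print(edoc_record.get("doi"), round(score, 2))
```

Comparing every edoc record with every crossref record in this way does not scale to millions of records; in practice a search index would first narrow down the candidates, and a score like this would only rank the shortlist.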
+===== Tools =====
+**elasticsearch** [[https://www.elastic.co/de/]]
+Java-based search engine; results are exported as JSON
+**Flink** [[https://flink.apache.org/]]
+Open-source stream processing framework
+**Metafacture** [[https://culturegraph.github.io/]],
+[[https://github.com/dataramblers/hackathon17/wiki#metafacture]]
+Tool suite for metadata processing and transformation
+**Zeppelin** [[https://zeppelin.apache.org/]]
+Visualisation of the results
+===== How to get there =====
+=== Use case 1: Swissbib ===
+{{:project:swissbib_infra.jpg?direct&300|}}
+=== Use case 2: edoc ===
+{{:project:edoc-infra.jpg?direct&300|}}
+===== Links =====
+Data Ramblers Project Wiki [[https://github.com/dataramblers/hackathon17/wiki]]
 ===== Team =====
@@ Line 30: / Line 102: @@
-{{tag>status:concept}}
+{{tag>status:concept glam}}