===== Big Data Analytics (bibliographical data) =====

We try to analyse bibliographical data using big data technology (Flink, Elasticsearch, Metafacture).

Here is a first sketch of what we're aiming at:

{{:project:plakat_v01.jpg?direct&600|}}

===== Datasets =====

We use bibliographical metadata:

**Swissbib bibliographical data** [[https://www.swissbib.ch/]]

  * Catalog of all the Swiss university libraries, the Swiss National Library, etc.
  * 960 libraries / 23 repositories (library networks)
  * ca. 30 million records
  * MARC21 XML format
  * → raw data stored in MongoDB
  * → transformed and clustered data stored in CBS (central library system)

**edoc** [[http://edoc.unibas.ch/]]

  * Institutional repository of the University of Basel (document server, open access publications)
  * ca. 50,000 records
  * JSON file

**crossref** [[https://www.crossref.org/]]

  * Digital Object Identifier (DOI) registration agency
  * ca. 90 million records (we only use 30 million)
  * JSON scraped from the API

===== Use Cases =====

=== Swissbib ===

__Librarian__:

  - To prioritize which of our holdings should be digitized most urgently, I want to know which of our holdings cannot be found anywhere else.
  - We would like to have a list of all the DVDs in Swissbib.
  - What is special about the holdings of a given library or institution? What is its profile?

__Data analyst__:

  - I want to get to know my data better, and faster. → E.g. I want to know which records have no entry for "year of publication". I want to analyze whether these records should be sent through the merging process of CBS. Therefore I also want to know whether these records contain other "relevant" fields defined by CBS (e.g. ISBN). A visualization tool might be useful for analyzing the results.

=== edoc ===

Goal: enrichment. I want to add missing identifiers (e.g. DOIs, ORCID, funder IDs) to the edoc dataset.

→ Match the two datasets (edoc and crossref) by author and title

→ Quality of the matches? (score)

===== Tools =====

**Elasticsearch** [[https://www.elastic.co/de/]]

Java-based search engine; results are exported as JSON

**Flink** [[https://flink.apache.org/]]

Open-source stream processing framework

**Metafacture** [[https://culturegraph.github.io/]], [[https://github.com/dataramblers/hackathon17/wiki#metafacture]]

Tool suite for metadata processing and transformation

**Zeppelin** [[https://zeppelin.apache.org/]]

Visualisation of the results

===== How to get there =====

=== Use case 1: Swissbib ===

{{:project:swissbib_infra.jpg?direct&300|}}

=== Use case 2: edoc ===

{{:project:edoc-infra.jpg?direct&300|}}

===== Links =====

Data Ramblers Project Wiki [[https://github.com/dataramblers/hackathon17/wiki]]

===== Team =====

  * Data Ramblers [[https://github.com/dataramblers]]
  * Dominique Blaser
  * Jean-Baptiste Genicot
  * Günter Hipler
  * Jacqueline Martinelli
  * Rémy Meja
  * Andrea Notroff
  * Sebastian Schüpbach
  * T
  * Silvia Witzig

{{tag>status:concept glam}}
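
===== Sketch: scoring author/title matches =====

The edoc use case asks for matches by author and title together with a quality score. The following is only a minimal sketch of how such a score could be computed, not the project's actual implementation: it uses Python's standard-library ''difflib'' instead of Flink or Elasticsearch, and the field names (''title'', ''author'', ''doi''), the weight, the threshold and the sample records are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """String similarity in [0, 1] on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(edoc_rec, crossref_rec, title_weight=0.7):
    """Weighted mean of title and author similarity (the weight is an assumption)."""
    title = similarity(edoc_rec["title"], crossref_rec["title"])
    author = similarity(edoc_rec["author"], crossref_rec["author"])
    return title_weight * title + (1 - title_weight) * author


def best_match(edoc_rec, candidates, threshold=0.85):
    """Return (candidate, score) for the best candidate, or (None, score) below the threshold."""
    scored = [(match_score(edoc_rec, c), c) for c in candidates]
    if not scored:
        return None, 0.0
    score, best = max(scored, key=lambda pair: pair[0])
    return (best, score) if score >= threshold else (None, score)


# Invented sample records; the DOI is a placeholder, not a real identifier.
edoc_record = {"title": "Big Data in Libraries", "author": "Müller, Anna"}
crossref_candidates = [
    {"title": "Medieval manuscripts of Basel", "author": "Keller, Peter",
     "doi": "10.0000/placeholder-1"},
    {"title": "Big data in libraries.", "author": "Muller, Anna",
     "doi": "10.0000/placeholder-2"},
]

match, score = best_match(edoc_record, crossref_candidates)
if match is not None:
    print(f"enrich edoc record with DOI {match['doi']} (score {score:.2f})")
```

Returning the score alongside the match lets records just below the threshold be reviewed by hand instead of being silently dropped. With 30 million crossref records, comparing every pair is not feasible, so the candidates would first have to be narrowed down, for example by a search query against an index.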