Differences
This shows you the differences between two versions of the page.
project:hds_out_of_the_box [2016/07/02 12:37] – [Historical Dictionary of Switzerland Out of the Box] maudehrmann | project:hds_out_of_the_box [2016/07/04 15:12] (current) – [Exploring bibliographic enrichment with OpenRefine] pmau | ||
---|---|---|---|
Line 1: | Line 1: | ||
===== Historical Dictionary of Switzerland Out of the Box ===== | ===== Historical Dictionary of Switzerland Out of the Box ===== | ||
- | The [[http:// | + | The [[http:// |
- | The digital edition comprises about XXXX articles organized in 4 main headword groups: \\ | + | The HDS digital edition comprises about 36.000 |
- Biographies, | - Biographies, | ||
- Families, \\ | - Families, \\ | ||
- | - Geographical | + | - Geographical |
- Thematic contributions. | - Thematic contributions. | ||
- | Additionally, each articles | + | Beyond the encyclopaedic description of entities/ |
+ | |||
===== Data ===== | ===== Data ===== | ||
- | Historical Dictionary of Switzerland: [[http:// | + | We have the following data:\\ |
- | | + | |
+ | | ||
+ | | ||
+ | * article titles\\ | ||
+ | | ||
| | ||
===== Goals ===== | ===== Goals ===== | ||
+ | Our projects revolve around **linking the HDS to external data** and aim at:\\ | ||
+ | - **Entity linking towards HDS**\\ The objective is to link named entity mentions discovered in historical Swiss newspapers to their corresponding HDS articles.\\ | ||
- | ===== Named Entities Recognition ===== | + | - **Exploring reference citation of HDS articles**\\ The objective is to reconcile HDS bibliographic data contained in articles with SwissBib. |
- | ===== Bibliographic | + | |
+ | ===== Named Entity Recognition ===== | ||
+ | |||
+ | We used web-services to annotate text with named entities: | ||
+ | - Dandelion\\ | ||
+ | - Alchemy\\ | ||
+ | - OpenCalais \\ | ||
+ | |||
+ | |||
+ | {{: | ||
+ | |||
+ | Named entity mentions (persons and places) are matched against entity labels of HDS entries and directly linked when only one HDS entry exists. | ||
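The matching rule described above can be sketched as follows. This is a minimal illustration rather than the code used in the project: the entry identifiers, the sample labels and the case-insensitive normalisation are all assumptions made for the example.

```python
from collections import defaultdict


def build_label_index(entries):
    """Map each normalised label to the set of HDS entry ids carrying it."""
    index = defaultdict(set)
    for entry_id, label in entries:
        index[label.casefold().strip()].add(entry_id)
    return index


def link_mention(mention, index):
    """Return an entry id only when exactly one HDS entry matches the mention."""
    candidates = index.get(mention.casefold().strip(), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # no match, or ambiguous: left unlinked


# Hypothetical entries (ids and labels are invented for illustration).
entries = [
    ("hds-1", "Henri Dunant"),
    ("hds-2", "Lausanne"),
    ("hds-3", "Jean Favre"),
    ("hds-4", "Jean Favre"),
]
index = build_label_index(entries)
```

With this rule, a label shared by two entries (here "Jean Favre") is deliberately left unlinked, which is where a disambiguation step would come in.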
+ | |||
+ | Further developments would include: | ||
+ | - handling name variants, e.g. 'W.A. Mozart' | ||
+ | - real disambiguation by comparing the newspaper article context with the HDS article context (a first simple similarity could be tf-idf based)\\ | ||
+ | - working with a more refined NER output which comprises information about name components (first, middle, last names)\\ | ||
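As a rough idea of what the tf-idf based disambiguation mentioned above could look like, here is a small self-contained sketch. It was not part of the project: the tokenisation, the smoothed idf formula, the function names and the sample texts are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; a deliberately naive tokeniser."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse tf-idf vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed idf (an assumption; other weightings are possible).
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(context, candidates):
    """Rank candidate HDS articles by similarity to the newspaper context."""
    ids = list(candidates)
    vectors = tfidf_vectors([context] + [candidates[i] for i in ids])
    scores = [(i, cosine(vectors[0], vec)) for i, vec in zip(ids, vectors[1:])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical newspaper context and candidate article texts.
context = "The general commanded the Swiss army during the mobilisation"
candidates = {
    "hds-a": "Officer and general of the Swiss army, elected during the war mobilisation",
    "hds-b": "Painter and engraver active in Geneva",
}
ranked = rank_candidates(context, candidates)
```

The highest-ranked candidate would then be linked, possibly only above a similarity threshold that would have to be tuned on annotated data.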
+ | |||
+ | === Some statistics === | ||
+ | In the 23.622 articles of the year 1914 in «Le Temps digital archive» we linked 90.603 entities pointing to 1.417 articles of the «Historical Dictionary of Switzerland». | ||
+ | |||
+ | {{: | ||
+ | |||
+ | |||
+ | === Web Interface === | ||
+ | |||
+ | We developed a simple web interface for searching in the corpus and displaying the texts with the links.\\ | ||
+ | It consists of 3 views: | ||
+ | |||
+ | |||
+ | 1. Home\\ | ||
+ | {{: | ||
+ | \\ | ||
+ | 2. Search\\ | ||
+ | {{: | ||
+ | \\ | ||
+ | 3. Article with links to HDS, Wikipedia and dbpedia\\ | ||
+ | {{: | ||
+ | \\ | ||
+ | |||
+ | === Further work === | ||
+ | Further work would include: | ||
+ | - evaluating and improving the method.\\ | ||
+ | - applying the method to the Historical Dictionary of Switzerland itself for internal linking.\\ | ||
+ | |||
+ | |||
+ | |||
+ | ===== Bibliographic | ||
We work on the list of references in all articles of the HDS, with three goals: | We work on the list of references in all articles of the HDS, with three goals: | ||
- | - Finding all the sources which are cited in the HDS (several sources are cited multiple times). | + | - Finding all the sources which are cited in the HDS (several sources are cited multiple times) |
- | - Link all the sources with the SwissBib catalog, if possible. | + | - Link all the sources with the SwissBib catalog, if possible |
- | - Interactively explore the citation | + | - Interactively explore the citation |
- | The dataset: lists of references in every HDS article: | + | The dataset |
{{: | {{: | ||
Line 46: | Line 102: | ||
{{: | {{: | ||
+ | |||
+ | ==== Exploring bibliographic enrichment with OpenRefine ==== | ||
+ | |||
+ | Bibliographic data in the HDS citations is unfortunately not structured. There is no logical separation between work title, publication year, page numbers, etc. other than typographical convention. Furthermore, | ||
+ | |||
+ | === Examples of unstructured data issues === | ||
+ | |||
+ | * // | ||
+ | * A. Niederer, «Vergleichende Bemerkungen zur ethnolog. und zur volkskundl. Arbeitsweise», | ||
+ | * //La visite des églises du diocèse de L. en 1453, hg. von A. Wildermann et al., 1993// - The subject of the dictionary entry is often abbreviated in the related citations. In this example, " | ||
+ | * //Stat. Jb. des Kt. L., 2002- // - In this example, " | ||
+ | |||
+ | === OpenRefine workflow === | ||
+ | |||
+ | After several attempts, it was established that combining several keywords from the reference title with the authors (without initials) produced the best results for querying swissbib. The following GREL expression can be applied to the OpenRefine column (using Edit column -> Add column based on this column) that contains the contents of the < | ||
+ | |||
+ | < | ||
+ | join(with(value.split(" | ||
+ | </ | ||
+ | |||
+ | Note that the above expression combines the <PUB> column (accessed through value) and the <AUT> column (containing the author' | ||
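For readers who prefer to prototype outside OpenRefine, the same idea (a few title keywords plus the author name without initials) can be sketched in Python. This is not a translation of the GREL expression: the keyword limit, the minimum word length, the rule that drops abbreviated words and the function names are assumptions made for this example.

```python
import re

# Matches initials such as "A." or "J.-P." (one or two letters before each dot).
INITIALS = re.compile(r"(?:[^\W\d_]{1,2}\.-?)+")


def strip_initials(author):
    """Drop initials and keep the remaining name parts."""
    parts = [p for p in author.split() if not INITIALS.fullmatch(p)]
    return " ".join(parts)


def title_keywords(title, max_words=4, min_length=4):
    """Keep the first few long, unabbreviated words of a title."""
    tokens = re.findall(r"[^\W\d_]+\.?", title)
    keywords = [t for t in tokens if not t.endswith(".") and len(t) >= min_length]
    return keywords[:max_words]


def build_query(title, author):
    """Combine title keywords and the author name into one search string."""
    terms = title_keywords(title)
    name = strip_initials(author)
    if name:
        terms.append(name)
    return " ".join(terms)


query = build_query(
    "Vergleichende Bemerkungen zur ethnolog. und zur volkskundl. Arbeitsweise",
    "A. Niederer",
)
```

Dropping abbreviated words such as "ethnolog." avoids sending tokens to the catalogue that are unlikely to match a full title.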
+ | |||
+ | Swissbib queries can return Dublin Core, MARC XML or MARC in JSON format. Dublin core is the easiest to manipulate, but unfortunately it does not contain the entirety of the returned record. To access the full record, it is necessary to use either MARC XML or MARC JSON. | ||
+ | |||
+ | To query swissbib and return Dublin Core, use (using Edit column -> Add column by fetching URLs): | ||
+ | |||
+ | < | ||
+ | replace(" | ||
+ | </ | ||
+ | |||
+ | To get MARC XML, use | ||
+ | |||
+ | < | ||
+ | replace(" | ||
+ | </ | ||
+ | |||
+ | To get MARC JSON, use | ||
+ | |||
+ | < | ||
+ | replace(" | ||
+ | </ | ||
+ | |||
+ | Any of these queries seems to return good results. The returned data must then be parsed to extract the required fields. For example, the following GREL expression extracts the title from the swissbib data when it is returned as Dublin Core: | ||
+ | |||
+ | < | ||
+ | if(value.parseHtml().select(" | ||
+ | </ | ||
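The same extraction can be tried outside OpenRefine. The sketch below uses Python's standard XML parser and returns nothing when no title is present, mirroring the guard in the GREL expression. The wrapper elements and the sample record are invented for illustration; only the Dublin Core element namespace is standard.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def first_dc_title(xml_text):
    """Return the first non-empty dc:title in the response, or None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # not well-formed XML (e.g. an error page)
    node = root.find(".//dc:title", DC_NS)
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


# Hypothetical response; the real wrapper elements may differ.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title>Example title</dc:title>
    <dc:date>1993</dc:date>
  </record>
</records>"""
```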
+ | |||
+ | All the above operations can be reproduced on an OpenRefine project containing HDS citation data by using [[https:// | ||
+ | |||
+ | === Link back to swissbib === | ||
+ | |||
+ | Using the above queries, it is possible to receive the swissbib record ID that corresponds to a citation entry. Unfortunately, | ||
+ | |||
+ | For example, looking at the following returned result (in MARC XML format): | ||
+ | |||
+ | <code xml> | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | <record xmlns: | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | (...) | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | (...) | ||
+ | </ | ||
+ | </ | ||
+ | < | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | we find that this record has the internal swissbib ID of 215650557 but we know we cannot use it to retrieve this record in the future, since it can change. Instead, we have to use the ID of the source catalogue, in this case RERO, found in the MARC field 035$a: | ||
+ | |||
+ | <code xml> | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | </ | ||
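Pulling the source catalogue ID out of a MARC XML record can be sketched as follows. The record below is a made-up minimal example, and the "(CATALOGUE)identifier" layout of the 035$a value is an assumption based on the RERO case discussed here; the namespace is the standard MARC21 slim one.

```python
import re
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def source_ids(marc_xml):
    """Collect every 035$a value found in the MARC XML text."""
    root = ET.fromstring(marc_xml)
    ids = []
    for field in root.iter("{%s}datafield" % MARC_URI):
        if field.get("tag") != "035":
            continue
        for sub in field.findall("marc:subfield[@code='a']", MARC_NS):
            if sub.text and sub.text.strip():
                ids.append(sub.text.strip())
    return ids


def split_source_id(value):
    """Split '(CATALOGUE)identifier' into its two parts (assumed layout)."""
    match = re.fullmatch(r"\((\w+)\)(.+)", value)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical record; the identifier is invented for illustration.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(RERO)R000000000</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>"""
```

Storing both the catalogue code and the identifier makes it possible to rebuild a stable link later, even if the internal swissbib ID has changed in the meantime.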
+ | |||
+ | A link back to swissbib can be constructed as | ||
+ | |||
+ | < | ||
+ | https:// | ||
+ | </ | ||
+ | |||
+ | (note that the parentheses around " | ||
+ | |||
+ | === Further work === | ||
+ | This is only the first step of a broader effort within the HDS:\\ | ||
+ | * identify each reference precisely within an article (an ID attribute must be generated)\\ | ||
+ | * collect references separately for each language\\ | ||
+ | * clean and refine the collected data\\ | ||
+ | * set up a querying workflow that keeps the ID of the matched target in a reference catalog\\ | ||
+ | * replace each matching occurrence in the HDS articles with a reference to an external catalog\\ | ||
===== Team ===== | ===== Team ===== | ||
- | * [[user: | + | * Pierre-Marie Aubertel |
- | * [[user: | + | * Francesco Beretta |
- | * [[user:maudehrmann]] | + | * Giovanni Colavizza |
+ | * Maud Ehrmann | ||
+ | * [[https:// | ||
+ | * Jonas Schneider | ||
+ | |||
+ | |||
+ | |||
+ | |||
| | ||
- | {{tag> | + | {{tag> |