
Add how to build a concordance of RVK <-> Library Union IDs
Originates from hbz/lobid-resources#1058.
Created for a lightning talk at the Dini-Kim-Workshop 2020.
dr0i committed Apr 30, 2020
1 parent 85883d4 commit c372986
Showing 5 changed files with 237 additions and 0 deletions.
56 changes: 56 additions & 0 deletions Concordance-RVK-Verbundbibliothek/README.md
@@ -0,0 +1,56 @@
Create a JSON concordance file of Verbundbibliothek IDs <-> RVK
====================
This shows how to load the `culturegraph aggregate MARC21 XML` file,
select a Library Union Catalog based on an ISIL, get the IDs of the
libraries with the associated [RVK](https://de.wikipedia.org/wiki/Regensburger_Verbundklassifikation) notations, and generate a JSON bulk file that can be indexed into Elasticsearch.

# Installation
Until the next metafacture release (> 5.1.0, coming this year) the easiest way is to
load this prebundle:
```bash
wget http://lobid.org/download/tmp/dini-kim-2020_rvk/metafacture-core-rvk-dist.zip
unzip metafacture-core-rvk-dist.zip
cd metafacture-core-rvk-dist
```
# Run
```bash
bash flux.sh culturegraph_to_Rvk-Verbundbibliothek_concordance_jsonbulk.flux
```
The parameter after `flux.sh` is the path to the flux file, e.g. the
flux from this repo. Just make sure that the files used in the flux (input, morph)
reside in the same directory as the flux itself (as they do in this repo).

Get the real aggregate data dump (~7 GB) from somewhere and adjust the input path in the flux.

# Index
_This should work for all Elasticsearch versions < 8.0, where the "index type" setting is still valid._

The generated `bulk.ndjson` looks like this:

> {"index":{"_index":"cgrvk","_type":"rvk"}}
> {"rvk":["CI 1100","5,1"],"hbzId":"HT018839495, HT018625006"}

This is Elasticsearch's bulk format: each odd-numbered line carries the index
metadata, and the following even-numbered line holds the actual document to be indexed.
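The two-line pattern is easy to produce yourself; this is a hypothetical sketch (the function name is invented, not part of this repo) that emits the same shape for one concordance entry:

```python
import json

def to_bulk_lines(rvk_list, hbz_ids, index="cgrvk", doc_type="rvk"):
    """Emit one action/metadata line followed by one document line,
    matching Elasticsearch's bulk NDJSON format."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    doc = json.dumps({"rvk": rvk_list, "hbzId": ", ".join(hbz_ids)})
    return action + "\n" + doc + "\n"

print(to_bulk_lines(["CI 1100", "5,1"], ["HT018839495", "HT018625006"]), end="")
```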

Make sure your Elasticsearch is up and running. Index:
```bash
curl -XPOST --header 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson 'http://localhost:9200/_bulk'
```
*Note*: uploads are limited by the Elasticsearch HTTP receive buffer size (default 100 MB). See the script `bulkIndexingEs.sh` for how to split
`bulk.ndjson` and index the chunks when it exceeds that limit, e.g. when using
culturegraph's complete aggregate data.

Test-query:
```bash
curl 'localhost:9200/cgrvk/_search?q=hbzId:HT018625006&pretty=true'
```

# Enrichment
The resulting elasticsearch index can be used to enrich your data.

*Note*: as you may have quite a lot of records (several million), don't use
HTTP requests when doing lookups against the index; use Elasticsearch's native
`TransportClient` instead, thus avoiding the HTTP overhead for performance reasons.
Elasticsearch provides client libraries for nearly all programming languages.
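As a minimal sketch of the enrichment step itself (hypothetical code, not part of this repo; a plain dict stands in for the lookup against the `cgrvk` index):

```python
# Hypothetical enrichment sketch: the dict stands in for a lookup
# against the cgrvk index (in production you would use a native
# Elasticsearch client instead of per-record HTTP calls).
concordance = {
    "HT018839495": ["CI 1100", "5,1"],
    "HT018625006": ["CI 1100", "5,1"],
}

def enrich(record):
    """Attach RVK notations to a record keyed by its hbzId, if known."""
    rvk = concordance.get(record.get("hbzId"))
    if rvk:
        record["rvk"] = rvk
    return record

print(enrich({"hbzId": "HT018839495", "title": "Totalité et infini"}))
```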
@@ -0,0 +1,112 @@
<?xml version="1.0" encoding="UTF-8"?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
<marc:leader>00000nam a2200000 a 4500</marc:leader>
<marc:controlfield tag="001">CG_1_2019-12-08T10:23:41.073Z</marc:controlfield>
<marc:controlfield tag="003">DE-101</marc:controlfield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(AT-OBV)990034557380203331</marc:subfield>
<marc:subfield code="8">7\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2="4" ind1="2" tag="689">
<marc:subfield code="a">Quelle</marc:subfield>
<marc:subfield code="A">f</marc:subfield>
<marc:subfield code="8">7\p</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:leader>00000nam a2200000 a 4500</marc:leader>
<marc:controlfield tag="001">CG_2_2019-12-08T10:23:41.073Z</marc:controlfield>
<marc:controlfield tag="003">DE-101</marc:controlfield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(AT-OBV)990032216710203331</marc:subfield>
<marc:subfield code="8">4\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(DE-101)962757853</marc:subfield>
<marc:subfield code="8">6\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(DE-605)HT018839495</marc:subfield>
<marc:subfield code="8">1\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(DE-605)HT018625006</marc:subfield>
<marc:subfield code="8">9\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(DE-607)HT01862500i7</marc:subfield>
<marc:subfield code="8">9\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="084">
<marc:subfield code="a">CI 1100</marc:subfield>
<marc:subfield code="2">rvk</marc:subfield>
<marc:subfield code="0">(DE-625)18356:</marc:subfield>
<marc:subfield code="0">(DE-603)407647848</marc:subfield>
<marc:subfield code="8">3\p</marc:subfield>
<marc:subfield code="8">5\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="084">
<marc:subfield code="a">5,1</marc:subfield>
<marc:subfield code="2">ssgn</marc:subfield>
<marc:subfield code="8">10\p</marc:subfield>
<marc:subfield code="8">11\p</marc:subfield>
<marc:subfield code="8">12\p</marc:subfield>
<marc:subfield code="8">13\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2="7" ind1="1" tag="600">
<marc:subfield code="0">(DE-588)118572350</marc:subfield>
<marc:subfield code="0">(DE-627)135606780</marc:subfield>
<marc:subfield code="0">(DE-576)20901315X</marc:subfield>
<marc:subfield code="a">Lévinas, Emmanuel</marc:subfield>
<marc:subfield code="d">1906-1995</marc:subfield>
<marc:subfield code="2">gnd</marc:subfield>
<marc:subfield code="8">11\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2="4" ind1=" " tag="650">
<marc:subfield code="a">Ethics</marc:subfield>
<marc:subfield code="8">11\p</marc:subfield>
<marc:subfield code="8">12\p</marc:subfield>
<marc:subfield code="8">13\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2="0" ind1=" " tag="653">
<marc:subfield code="a">Phenomenology</marc:subfield>
<marc:subfield code="8">10\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1="0" tag="689">
<marc:subfield code="5">DE-101</marc:subfield>
<marc:subfield code="5">DE-101</marc:subfield>
<marc:subfield code="8">6\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1="1" tag="689">
<marc:subfield code="5">DE-605</marc:subfield>
<marc:subfield code="8">7\p</marc:subfield>
<marc:subfield code="8">9\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1="2" tag="689">
<marc:subfield code="5">DE-605</marc:subfield>
<marc:subfield code="8">7\p</marc:subfield>
<marc:subfield code="8">9\p</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:leader>00000nam a2200000 a 4500</marc:leader>
<marc:controlfield tag="001">CG_2_2019-12-08T10:23:42.074Z</marc:controlfield>
<marc:controlfield tag="003">DE-101</marc:controlfield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(DE-605)HT013166356</marc:subfield>
<marc:subfield code="8">1\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2=" " ind1=" " tag="035">
<marc:subfield code="a">(DE-605)HT018625006</marc:subfield>
<marc:subfield code="8">9\p</marc:subfield>
</marc:datafield>
<marc:datafield ind2="7" ind1="1" tag="600">
<marc:subfield code="0">(DE-588)118572350</marc:subfield>
<marc:subfield code="0">(DE-627)135606780</marc:subfield>
<marc:subfield code="0">(DE-576)20901315X</marc:subfield>
<marc:subfield code="a">Lévinas, Emmanuel</marc:subfield>
<marc:subfield code="d">1906-1995</marc:subfield>
<marc:subfield code="2">gnd</marc:subfield>
<marc:subfield code="8">11\p</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>
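For orientation, the `035 $a` identifiers this workflow cares about can be pulled from such MARC XML with standard tooling; a hypothetical sketch (inlining a one-record sample in the same namespace):

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:datafield tag="035" ind1=" " ind2=" ">
      <marc:subfield code="a">(DE-605)HT018839495</marc:subfield>
    </marc:datafield>
  </marc:record>
</marc:collection>"""

def f035a(record):
    """Return all 035 $a values of a MARC XML record element."""
    return [sf.text
            for df in record.findall("marc:datafield[@tag='035']", NS)
            for sf in df.findall("marc:subfield[@code='a']", NS)]

root = ET.fromstring(SAMPLE)
for rec in root.findall("marc:record", NS):
    print(f035a(rec))
```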
10 changes: 10 additions & 0 deletions Concordance-RVK-Verbundbibliothek/bulkIndexingEs.sh
@@ -0,0 +1,10 @@
#!/bin/bash
# Upload sizes are limited by the Elasticsearch HTTP receive buffer size (default 100 MB).
# 1. Split bulk.ndjson.
#    Because every two lines form one complete bulk index request, the
#    number of lines per chunk must be even. E.g.:
split --lines=1000000 bulk.ndjson
# 2. Now all the resulting chunk files (x*) can be indexed:
for i in x*; do
  echo "$i"
  curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@$i" > /dev/null 2>&1
done
@@ -0,0 +1,30 @@
// This flux filters the ISIL DE-605 out of the culturegraph aggregate
// MARC XML using morph-cg-to-es.xml and builds a JSON bulk
// file, a snippet of which:
//
//{"index":{"_index":"cgrvk","_type":"rvk"}}
//{"rvk":["CI 1100","5,1"],"hbzId":"HT018839495, HT018625006"}
//
// This file can be loaded into an Elasticsearch index via curl:
//
// curl -XPOST --header 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson 'http://localhost:9200/_bulk'

default outfile = "bulk.ndjson";
default infile = FLUX_DIR + "aggregate_auslieferung_20191212.small.marcxml";
default morphfile = FLUX_DIR + "morph-cg-to-es.xml";


infile|
open-file|
decode-xml|
split-xml-elements(topLevelElement="marc:collection",elementName="record")|
literal-to-object|
read-string|
decode-xml|
handle-marcxml|
filter(morphfile)|
morph(morphfile)|
encode-json|
json-to-elasticsearch-bulk(type="rvk", index="cgrvk")|
write(outfile);

29 changes: 29 additions & 0 deletions Concordance-RVK-Verbundbibliothek/morph-cg-to-es.xml
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<metamorph xmlns="http://www.culturegraph.org/metamorph" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1">
<!-- Filter hbz ids from culturegraph XML. Build a hbzId-rvk concordance -->
<!-- structure of marc xml: $field$indicator1$indicator2$subfield -->
<rules>
<!-- ####################### -->
<!-- ####### Get rvk and hbz id -->
<!-- ####################### -->
<combine name="@rvk" value="${rvk}" >
<data source="084??.a" name="rvk"/>
</combine>
<combine name="@hbzId" value="${id}">
<data source="035??.a" name="id">
<regexp match="^\(DE-605\)(.*)" format="${1}"/>
</data>
</combine>
<combine name="rvk" value="${rvk}" >
<data source="@hbzId"/>
<data source="@rvk" name="rvk"/>
</combine>
<combine name="hbzId" value="${hbzId}">
<concat delimiter=", " name="hbzId" >
<data source="@hbzId"/>
</concat>
<data source="@rvk" />
</combine>
</rules>
</metamorph>
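In plain code, the morph's core logic — keep only `035 $a` values matching `^\(DE-605\)(.*)`, join them, and pair them with the `084 $a` notations — corresponds roughly to this hypothetical sketch (the function name and signature are invented for illustration):

```python
import re

DE605 = re.compile(r"^\(DE-605\)(.*)")

def record_to_concordance(f035a_values, f084a_values):
    """Mirror the metamorph rules: extract DE-605 IDs from 035 $a values
    and emit nothing unless both IDs and 084 notations are present."""
    hbz_ids = [m.group(1) for v in f035a_values if (m := DE605.match(v))]
    if not hbz_ids or not f084a_values:
        return None
    return {"rvk": f084a_values, "hbzId": ", ".join(hbz_ids)}

# Values taken from the second record of the example MARC XML above:
print(record_to_concordance(
    ["(AT-OBV)990032216710203331", "(DE-101)962757853",
     "(DE-605)HT018839495", "(DE-605)HT018625006", "(DE-607)HT01862500i7"],
    ["CI 1100", "5,1"]))
```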
