Add how to build a concordance of RVK <-> Library Union IDs
Originates from hbz/lobid-resources#1058. Created for a lightning talk at Dini-Kim-Workshop 2020.
Showing 5 changed files with 239 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
Create a json file of concordance of Verbundbibliotheks-IDs <-> RVK | ||
==================== | ||
This is an example of how to load the `culturegraph aggregate MARC21 XML` file, | ||
select a Library Union Catalog based on its ISIL, associate the IDs of the | ||
libraries with the [RVK](https://de.wikipedia.org/wiki/Regensburger_Verbundklassifikation) and generate a JSON bulk file which can be indexed into Elasticsearch. | ||
|
||
- [Create a json file of concordance of Verbundbibliotheks-IDs <-> RVK](#create-a-json-file-of-concordance-of-verbundbibliotheks-ids-----rvk) | ||
- [Installation](#installation) | ||
- [Run](#run) | ||
- [Index](#index) | ||
- [Enrichment](#enrichment) | ||
|
||
# Installation | ||
Until the next metafacture release (> 5.1.0, coming this year) the easiest way is to | ||
download this prebuilt bundle: | ||
```bash | ||
wget http://lobid.org/download/tmp/dini-kim-2020_rvk/metafacture-core-rvk-dist.zip | ||
unzip metafacture-core-rvk-dist.zip | ||
cd metafacture-core-rvk-dist | ||
``` | ||
# Run | ||
```bash | ||
bash flux.sh culturegraph_to_Rvk-Verbundbibliothek_concordance_jsonbulk.flux | ||
``` | ||
The parameter after `flux.sh` is the path to the flux file, e.g. the | ||
flux from this repo. Just make sure that the files used in the flux (input, morph) | ||
reside in the same directory as the flux itself (as they do in this repo). | ||
|
||
Get the real aggregate data dump (~7 GB) from somewhere and adjust the `infile` path in the flux accordingly. | ||
|
||
# Index | ||
_This should work at least for all Elasticsearch versions < 8.0, where the index type (`_type`) is still accepted._ | ||
The generated `bulk.ndjson` looks like this: | ||
|
||
> {"index":{"_index":"cgrvk","_type":"rvk"}} | ||
> {"rvk":["CI 1100","5,1"],"hbzId":"HT018839495, HT018625006"} | ||
This is Elasticsearch's bulk format: each odd-numbered line holds the index | ||
metadata, and the following even-numbered line holds the actual data to be indexed. | ||
|
||
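Before uploading, the alternating structure described above can be checked locally. The following is a minimal Python sketch, not part of the commit; the function name `check_bulk_pairs` is made up for illustration, and it only assumes that every action line is an `index` action, as in the sample.

```python
import json


def check_bulk_pairs(lines):
    """Return the number of documents if lines alternate action/data, else raise ValueError."""
    lines = [ln for ln in lines if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("odd number of lines: an action line lacks its document")
    for i in range(0, len(lines), 2):
        action = json.loads(lines[i])   # odd-numbered line: index metadata
        doc = json.loads(lines[i + 1])  # even-numbered line: the data itself
        if "index" not in action:
            raise ValueError(f"line {i + 1} is not an index action")
        if not isinstance(doc, dict):
            raise ValueError(f"line {i + 2} is not a JSON object")
    return len(lines) // 2


sample = [
    '{"index":{"_index":"cgrvk","_type":"rvk"}}',
    '{"rvk":["CI 1100","5,1"],"hbzId":"HT018839495, HT018625006"}',
]
print(check_bulk_pairs(sample))  # prints 1
```

For a real file, pass the open file object instead of the list, e.g. `check_bulk_pairs(open("bulk.ndjson", encoding="utf-8"))`; note that this sketch reads the whole file into memory.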
Make sure your Elasticsearch is up and running. Index: | ||
``` | ||
curl -XPOST --header 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson 'http://localhost:9200/_bulk' | ||
``` | ||
*Note*: upload sizes are limited by Elasticsearch's maximum HTTP request size (`http.max_content_length`, default 100 MB). See the script `bulkIndexingEs.sh` for how to split | ||
the `bulk.ndjson` and index the parts when you use culturegraph's complete aggregate data. | ||
|
||
Test-query: | ||
```bash | ||
curl 'localhost:9200/cgrvk/_search?q=hbzId:HT018625006' | ||
``` | ||
|
||
# Enrichment | ||
The resulting Elasticsearch index can be used to enrich your data. | ||
*Note*: as you may have quite a lot of records (several million), don't use | ||
HTTP requests for lookups against the index but a native `TransportClient` | ||
of Elasticsearch, which avoids the HTTP overhead and thus performs better. Elasticsearch | ||
provides client libraries for nearly all programming languages. |
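To illustrate what an enrichment lookup has to deal with, here is a small Python sketch that works directly on the lines of `bulk.ndjson` instead of an Elasticsearch index. It is an illustration under assumptions, not the approach recommended above: the helper names `load_concordance` and `enrich` are hypothetical, and the input record shape (`hbzId` holding a single ID) is assumed. The point it shows is that `hbzId` in the concordance is one comma-separated string, so a lookup by a single ID has to split it (or rely on the index's text analysis).

```python
import json


def load_concordance(lines):
    """Map every single hbz ID to the notations of the bundle it belongs to."""
    lookup = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if "hbzId" not in record:
            continue  # action line such as {"index": {...}}
        notations = record.get("rvk", [])
        if isinstance(notations, str):
            # Assumption: a single value may be serialized as a plain string.
            notations = [notations]
        for hbz_id in record["hbzId"].split(", "):
            lookup[hbz_id] = notations
    return lookup


def enrich(record, lookup):
    """Return a copy of record, with an 'rvk' field added if its hbzId is known."""
    enriched = dict(record)
    notations = lookup.get(record.get("hbzId"))
    if notations:
        enriched["rvk"] = notations
    return enriched


bulk_lines = [
    '{"index":{"_index":"cgrvk","_type":"rvk"}}',
    '{"rvk":["CI 1100","5,1"],"hbzId":"HT018839495, HT018625006"}',
]
lookup = load_concordance(bulk_lines)
print(enrich({"hbzId": "HT018625006", "title": "example"}, lookup))
```

An in-memory dictionary like this only suits small samples; for the complete aggregate data the index remains the intended lookup target.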
112 changes: 112 additions & 0 deletions
Concordance-RVK-Verbundbibliothek/aggregate_auslieferung_20191212.small.marcxml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim"> | ||
<marc:record> | ||
<marc:leader>00000nam a2200000 a 4500</marc:leader> | ||
<marc:controlfield tag="001">CG_1_2019-12-08T10:23:41.073Z</marc:controlfield> | ||
<marc:controlfield tag="003">DE-101</marc:controlfield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(AT-OBV)990034557380203331</marc:subfield> | ||
<marc:subfield code="8">7\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2="4" ind1="2" tag="689"> | ||
<marc:subfield code="a">Quelle</marc:subfield> | ||
<marc:subfield code="A">f</marc:subfield> | ||
<marc:subfield code="8">7\p</marc:subfield> | ||
</marc:datafield> | ||
</marc:record> | ||
<marc:record> | ||
<marc:leader>00000nam a2200000 a 4500</marc:leader> | ||
<marc:controlfield tag="001">CG_2_2019-12-08T10:23:41.073Z</marc:controlfield> | ||
<marc:controlfield tag="003">DE-101</marc:controlfield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(AT-OBV)990032216710203331</marc:subfield> | ||
<marc:subfield code="8">4\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(DE-101)962757853</marc:subfield> | ||
<marc:subfield code="8">6\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(DE-605)HT018839495</marc:subfield> | ||
<marc:subfield code="8">1\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(DE-605)HT018625006</marc:subfield> | ||
<marc:subfield code="8">9\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(DE-607)HT01862500i7</marc:subfield> | ||
<marc:subfield code="8">9\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1=" " tag="084"> | ||
<marc:subfield code="a">CI 1100</marc:subfield> | ||
<marc:subfield code="2">rvk</marc:subfield> | ||
<marc:subfield code="0">(DE-625)18356:</marc:subfield> | ||
<marc:subfield code="0">(DE-603)407647848</marc:subfield> | ||
<marc:subfield code="8">3\p</marc:subfield> | ||
<marc:subfield code="8">5\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1=" " tag="084"> | ||
<marc:subfield code="a">5,1</marc:subfield> | ||
<marc:subfield code="2">ssgn</marc:subfield> | ||
<marc:subfield code="8">10\p</marc:subfield> | ||
<marc:subfield code="8">11\p</marc:subfield> | ||
<marc:subfield code="8">12\p</marc:subfield> | ||
<marc:subfield code="8">13\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2="7" ind1="1" tag="600"> | ||
<marc:subfield code="0">(DE-588)118572350</marc:subfield> | ||
<marc:subfield code="0">(DE-627)135606780</marc:subfield> | ||
<marc:subfield code="0">(DE-576)20901315X</marc:subfield> | ||
<marc:subfield code="a">Lévinas, Emmanuel</marc:subfield> | ||
<marc:subfield code="d">1906-1995</marc:subfield> | ||
<marc:subfield code="2">gnd</marc:subfield> | ||
<marc:subfield code="8">11\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2="4" ind1=" " tag="650"> | ||
<marc:subfield code="a">Ethics</marc:subfield> | ||
<marc:subfield code="8">11\p</marc:subfield> | ||
<marc:subfield code="8">12\p</marc:subfield> | ||
<marc:subfield code="8">13\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2="0" ind1=" " tag="653"> | ||
<marc:subfield code="a">Phenomenology</marc:subfield> | ||
<marc:subfield code="8">10\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1="0" tag="689"> | ||
<marc:subfield code="5">DE-101</marc:subfield> | ||
<marc:subfield code="5">DE-101</marc:subfield> | ||
<marc:subfield code="8">6\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1="1" tag="689"> | ||
<marc:subfield code="5">DE-605</marc:subfield> | ||
<marc:subfield code="8">7\p</marc:subfield> | ||
<marc:subfield code="8">9\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2=" " ind1="2" tag="689"> | ||
<marc:subfield code="5">DE-605</marc:subfield> | ||
<marc:subfield code="8">7\p</marc:subfield> | ||
<marc:subfield code="8">9\p</marc:subfield> | ||
</marc:datafield> | ||
</marc:record> | ||
<marc:record> | ||
<marc:leader>00000nam a2200000 a 4500</marc:leader> | ||
<marc:controlfield tag="001">CG_2_2019-12-08T10:23:42.074Z</marc:controlfield> | ||
<marc:controlfield tag="003">DE-101</marc:controlfield> | ||
<marc:subfield code="a">(DE-605)HT013166356</marc:subfield> | ||
<marc:subfield code="8">1\p</marc:subfield> | ||
<marc:datafield ind2=" " ind1=" " tag="035"> | ||
<marc:subfield code="a">(DE-605)HT018625006</marc:subfield> | ||
<marc:subfield code="8">9\p</marc:subfield> | ||
</marc:datafield> | ||
<marc:datafield ind2="7" ind1="1" tag="600"> | ||
<marc:subfield code="0">(DE-588)118572350</marc:subfield> | ||
<marc:subfield code="0">(DE-627)135606780</marc:subfield> | ||
<marc:subfield code="0">(DE-576)20901315X</marc:subfield> | ||
<marc:subfield code="a">Lévinas, Emmanuel</marc:subfield> | ||
<marc:subfield code="d">1906-1995</marc:subfield> | ||
<marc:subfield code="2">gnd</marc:subfield> | ||
<marc:subfield code="8">11\p</marc:subfield> | ||
</marc:datafield> | ||
</marc:record> | ||
</marc:collection> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# Upload sizes are limited by Elasticsearch's maximum HTTP request size (default 100 MB). | ||
# 1. Split bulk.ndjson. | ||
# Every two lines form one complete bulk index request, so each chunk must | ||
# contain an even number of lines. E.g.: | ||
split --lines=1000000 bulk.ndjson | ||
# 2. Now index all the resulting files (xaa, xab, ...); the response body is discarded: | ||
for i in x*; do | ||
echo "$i" | ||
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@$i" >/dev/null | ||
done |
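The reason for the even line count can be made explicit with a short Python sketch of the same chunking idea. It is an illustration only; `chunk_bulk_lines` is a hypothetical helper, not something the script uses. Each chunk is cut after a whole number of action/data pairs, so no document is separated from its metadata line.

```python
def chunk_bulk_lines(lines, pairs_per_chunk):
    """Yield lists of lines, each holding only whole action/data pairs."""
    size = 2 * pairs_per_chunk  # always even, so pairs are never torn apart
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


lines = ["action1", "doc1", "action2", "doc2", "action3", "doc3"]
for chunk in chunk_bulk_lines(lines, pairs_per_chunk=2):
    print(len(chunk), chunk)
```

`split --lines=1000000` behaves the same way because 1000000 is even; an odd value would start some chunks with a data line.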
30 changes: 30 additions & 0 deletions
...nce-RVK-Verbundbibliothek/culturegraph_to_Rvk-Verbundbibliothek_concordance_jsonbulk.flux
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
// This flux uses morph-cg-to-es.xml to filter the records with the ISIL DE-605 out of | ||
// the culturegraph aggregate MARCXML and builds a JSON bulk | ||
// file. A snippet of it: | ||
// | ||
//{"index":{"_index":"cgrvk","_type":"rvk"}} | ||
//{"rvk":["CI 1100","5,1"],"hbzId":"HT018839495, HT018625006"} | ||
// | ||
// This file can be loaded into an Elasticsearch index using curl: | ||
// | ||
// curl -XPOST --header 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson 'http://localhost:9200/_bulk' | ||
|
||
default outfile = "bulk.ndjson"; | ||
default infile = FLUX_DIR + "aggregate_auslieferung_20191212.small.marcxml"; | ||
default morphfile = FLUX_DIR + "morph-cg-to-es.xml"; | ||
|
||
|
||
infile| | ||
open-file| | ||
decode-xml| | ||
split-xml-elements(topLevelElement="marc:collection",elementName="record")| | ||
literal-to-object| | ||
read-string| | ||
decode-xml| | ||
handle-marcxml| | ||
filter(morphfile)| | ||
morph(morphfile)| | ||
encode-json| | ||
json-to-elasticsearch-bulk(type="rvk", index="cgrvk")| | ||
write(outfile); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<metamorph xmlns="http://www.culturegraph.org/metamorph" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
version="1"> | ||
<!-- Filter hbz ids from culturegraph XML. Build a hbzId-rvk concordance --> | ||
<!-- structure of marc xml: $field$indicator1$indicator2$subfield --> | ||
<rules> | ||
<!-- ####################### --> | ||
<!-- ####### Get rvk and hbz id --> | ||
<!-- ####################### --> | ||
<combine name="@rvk" value="${rvk}" > | ||
<data source="084??.a" name="rvk"/> | ||
</combine> | ||
<combine name="@hbzId" value="${id}"> | ||
<data source="035??.a" name="id"> | ||
<regexp match="^\(DE-605\)(.*)" format="${1}"/> | ||
</data> | ||
</combine> | ||
<combine name="rvk" value="${rvk}" > | ||
<data source="@hbzId"/> | ||
<data source="@rvk" name="rvk"/> | ||
</combine> | ||
<combine name="hbzId" value="${hbzId}"> | ||
<concat delimiter=", " name="hbzId" > | ||
<data source="@hbzId"/> | ||
</concat> | ||
<data source="@rvk" /> | ||
</combine> | ||
</rules> | ||
</metamorph> |
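What these rules select can be approximated in a few lines of Python. The sketch below is not Metamorph and does not claim to reproduce its exact `combine` semantics; it is written only to reproduce the sample output quoted in the README for a trimmed copy of the second sample record. The function name `concordance_entry` is made up for illustration. Like the rule `084??.a`, it takes every 084 `$a` regardless of `$2`, which is why the `ssgn` value `5,1` appears under `rvk`.

```python
import re
import xml.etree.ElementTree as ET

MARC = "{http://www.loc.gov/MARC21/slim}"
HBZ_ID = re.compile(r"^\(DE-605\)(.*)")  # same pattern as in the morph


def concordance_entry(record):
    """Build {'rvk': [...], 'hbzId': '...'} from one marc:record element, or None."""
    notations, hbz_ids = [], []
    for field in record.iter(MARC + "datafield"):
        tag = field.get("tag")
        for sub in field.iter(MARC + "subfield"):
            if sub.get("code") != "a":
                continue
            text = sub.text or ""
            if tag == "084":
                notations.append(text)  # every 084 $a, whatever $2 says
            elif tag == "035":
                match = HBZ_ID.match(text)
                if match:
                    hbz_ids.append(match.group(1))  # ID without the ISIL prefix
    if not notations or not hbz_ids:
        return None  # assumption: records lacking either part are dropped
    return {"rvk": notations, "hbzId": ", ".join(hbz_ids)}


SAMPLE = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield ind2=" " ind1=" " tag="035">
    <marc:subfield code="a">(DE-101)962757853</marc:subfield>
  </marc:datafield>
  <marc:datafield ind2=" " ind1=" " tag="035">
    <marc:subfield code="a">(DE-605)HT018839495</marc:subfield>
  </marc:datafield>
  <marc:datafield ind2=" " ind1=" " tag="035">
    <marc:subfield code="a">(DE-605)HT018625006</marc:subfield>
  </marc:datafield>
  <marc:datafield ind2=" " ind1=" " tag="084">
    <marc:subfield code="a">CI 1100</marc:subfield>
    <marc:subfield code="2">rvk</marc:subfield>
  </marc:datafield>
  <marc:datafield ind2=" " ind1=" " tag="084">
    <marc:subfield code="a">5,1</marc:subfield>
    <marc:subfield code="2">ssgn</marc:subfield>
  </marc:datafield>
</marc:record>"""

print(concordance_entry(ET.fromstring(SAMPLE)))
# {'rvk': ['CI 1100', '5,1'], 'hbzId': 'HT018839495, HT018625006'}
```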