1058-enrichWithCulturegraphRvkWithFix #1921

TobiasNx · 2023-10-12T13:34:34Z

I tried to update the draft from https://github.com/hbz/lobid-resources/tree/1058-enrichWithCulturegraphRvk to a version that works with Fix.
I merged master into this branch.

In contrast to the morph approach:

We do not need a separate filter step since the fix already can do this.

Also I do not know if we need:

lobid-resources/src/main/java/org/lobid/resources/run/CulturegraphXmlFilterHbzToJson.java

Lines 30 to 45 in d12b7a3

    
    public static void main(String... args) {
        final FileOpener opener = new FileOpener();
        opener.setReceiver(new XmlDecoder())
                .setReceiver(
                        new XmlElementSplitter("marc:collection", "record")) //
                .setReceiver(new LiteralToObject())
                .setReceiver(new ObjectThreader<String>())//
                .addReceiver(receiverThread()); // one thread for it's working
                                                // on one file atm
        opener.process(new File(args[0]).getAbsolutePath());
        try {
            opener.closeStream();
        } catch (final NullPointerException e) {
            // ignore, see https://github.com/hbz/lobid-resources/issues/1030
        }
    }

The decode-marcXml mechanism correctly identified single records.

TobiasNx · 2023-10-12T14:04:41Z

Somehow it does not find the fix file yet.

TobiasNx · 2023-10-16T09:26:14Z

Additionally, we would need zdbId as a mapping parameter, because new ZDB resources don't get hbzIds anymore.

Filters out all resources belonging to hbz, gets the RVK and builds an Elasticsearch bulk JSON file from this.
- use master-snapshot of metafacture to omit id key for Elasticsearch index
- add morph converting rules from marcxml to json
- add tests
- add runner

This is a prerequisite for #1058.

dr0i · 2024-05-28T14:59:41Z

Your proposal re reducing complexity is taken into account. Test data is updated. A CSV export is introduced. I assume the concordance table won't be too big (<100 MB), so we could use this, as it's very performant.
Re CSV: can you @TobiasNx change the Fix so that:

* `id` is always in the first column
* `id` always consists of exactly one `almaMmsId`
* all ids get their own rows (that is, also for records where more than one `id` is present; btw.: those are probably all duplicates?)
* the RVKs are appended to one string, so that we always have a two-column table?

TobiasNx · 2024-05-29T08:25:50Z

> Your proposal re reducing complexity is taken into account. Test data is updated. A CSV export is introduced. I assume the concordance table won't be too big (<100 MB), so we could use this, as it's very performant. Re CSV: can you @TobiasNx change the Fix so that:
>
> * `id` is always in the first column

This is only possible with encode-csv (or with encode-json alone); the `json-to-elasticsearch-bulk(type="rvk", index="cgrvk")` command changes the field order within the JSON record. See: metafacture/metafacture-examples@239469d

> * `id` always consists of _one_ `almaMmsId`
>
> * all ids get their own rows (that is, also for records where more than one `id` is present; btw.: those are probably all duplicates?)

I don't think that this is possible.
The records are not always duplicates. See here: http://lobid.org/resources/search?q=almaMmsId%3A990063057720206441+OR+almaMmsId%3A990019247190206441+OR+almaMmsId%3A990063668050206441 (3 records: one duplicate, one translation).
Also, I am not sure how records with multiple ids could be split into individual rows, at least not at the level of Fix.

> * the RVKs are appended to _one_ string, so that we always have a two-column table?

dr0i · 2024-05-31T13:28:38Z

> id is always in the first column

a) As said in #1921 (comment), this should be done, and only makes sense, when creating a CSV. If it's not possible using one Fix, could you provide a second one?

b) > all id's have their rows

Hm, shouldn't that be possible using triples somehow? Anyway. Then we should make the first ID the only ID (the only value for column one).

(did so in eae4a69)

dr0i · 2024-05-31T14:59:51Z

The build of the concordance just started, based on the 9.2 GB file aggregate_20240507.marcxml.gz mentioned in #1058 (comment).
At first glance, there are many duplicate RVK entries, e.g.:

"990191558260206441","HU 8423, HU 8424, HU 8423, HU 8424, HU 8424, HU 8423, HU 8424"

Can you prevent these duplicates via the Fix, @TobiasNx?
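To illustrate the kind of deduplication asked for here, the following is a minimal sketch in plain Java. It is not the Fix that ended up in this PR; the class name `RvkDeduplicator` and the method `dedupe` are hypothetical and only show the intended effect on the RVK column of the row quoted above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class RvkDeduplicator {

    // Splits a comma-separated RVK string, drops repeated notations while
    // keeping the order of first appearance, and joins the rest again.
    static String dedupe(String rvkColumn) {
        Set<String> unique = Arrays.stream(rvkColumn.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
        return String.join(", ", unique);
    }

    public static void main(String[] args) {
        // RVK column of the example row from the comment above
        System.out.println(dedupe(
                "HU 8423, HU 8424, HU 8423, HU 8424, HU 8424, HU 8423, HU 8424"));
        // prints: HU 8423, HU 8424
    }
}
```

A `LinkedHashSet` is used here so that the first-seen order of the notations is preserved; whether order matters for the concordance is not stated in this thread.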

TobiasNx · 2024-06-03T11:54:45Z

I found a way to create a single record for every id in a record: metafacture/metafacture-examples@06c1955
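The actual solution is the Fix in the linked commit. As a rough illustration of the idea only, here is a plain-Java sketch that expands one record with several ids into one two-column CSV row per id. The class `ConcordanceRows`, its methods and the sample ids are hypothetical and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ConcordanceRows {

    // Builds one two-column CSV row per id; every row carries the same
    // RVK string, so each id can serve as a unique key.
    static List<String> rowsPerId(List<String> almaMmsIds, List<String> rvks) {
        String rvkColumn = String.join(", ", rvks);
        List<String> rows = new ArrayList<>();
        for (String id : almaMmsIds) {
            rows.add(quote(id) + "," + quote(rvkColumn));
        }
        return rows;
    }

    // Wraps a value in double quotes and doubles embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // "id-1" and "id-2" are made-up ids; the RVK values are borrowed
        // from the example row earlier in this thread.
        rowsPerId(List.of("id-1", "id-2"), List.of("HU 8423", "HU 8424"))
                .forEach(System.out::println);
    }
}
```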

TobiasNx · 2024-06-03T13:42:22Z

src/main/resources/rvk/cg-to-rvk-json.fix

Why do we need two fixes?

In JSON we don't need a separate record for every id, since the search is done by the search engine. A CSV, in contrast, needs a single unique key. JSON is more performant when using search engines; CSV is the only way to go when using tables.
It could also be, if the CSV works well, that we can get rid of JSON altogether.
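To make the "single unique key" point concrete, the following sketch loads two-column rows of the form shown earlier in this thread into an in-memory map keyed by id. This is an assumption about how such a concordance could be consumed, not code from lobid-resources; `ConcordanceLookup` and `load` are hypothetical names, and the parser only handles the simple `"id","rvk string"` shape without embedded quotes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
class ConcordanceLookup {

    // Parses lines of the form "id","RVK, RVK" into a map keyed by id.
    // Lines that do not consist of two quoted columns are skipped.
    static Map<String, String> load(List<String> csvLines) {
        Map<String, String> rvkById = new HashMap<>();
        for (String line : csvLines) {
            int sep = line.indexOf("\",\"");
            boolean quoted = line.startsWith("\"") && line.endsWith("\"");
            if (!quoted || sep < 1 || sep + 3 > line.length() - 1) {
                continue;
            }
            String id = line.substring(1, sep);
            String rvk = line.substring(sep + 3, line.length() - 1);
            rvkById.put(id, rvk);
        }
        return rvkById;
    }

    public static void main(String[] args) {
        Map<String, String> rvkById = load(List.of(
                "\"990191558260206441\",\"HU 8423, HU 8424\""));
        System.out.println(rvkById.get("990191558260206441"));
    }
}
```

With one row per id, a lookup is a single map access. If several rows shared an id, later rows would silently overwrite earlier ones in this sketch.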

dr0i · 2024-06-04T11:56:47Z

I am going to merge this.
Note that the branch name is misleading: a new PR will follow in which we not only generate the concordance but also enrich our lobid-resources data with it.

TobiasNx assigned dr0i Oct 12, 2023

TobiasNx requested a review from dr0i October 12, 2023 13:34

TobiasNx changed the base branch from master to 1058-enrichWithCulturegraphRvk October 12, 2023 13:36

TobiasNx changed the base branch from 1058-enrichWithCulturegraphRvk to master October 12, 2023 13:36

TobiasNx force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch 2 times, most recently from b56f03b to 2a661a1 Compare October 12, 2023 13:48

TobiasNx force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch from 1d18dc6 to 76ad5df Compare October 16, 2023 09:20

TobiasNx marked this pull request as ready for review October 16, 2023 09:29

TobiasNx marked this pull request as draft October 16, 2023 14:32

dr0i and others added 12 commits May 28, 2024 10:05

Add filter to omit empty records

ed0344f

Not all input records are of interest. They are passed empty. With this filter empty records are ignored, not passed. See #1058.

Add rule to make id AND rvk mandatory

f535aa3

hbz-Ids will be concatenated into one field delimited by a space. - shrink unnecessary test data - update test See #1058.

Add junit test

95d87a6

See #1058.

Test runner

573cf05

See #1058.

WIP

d9235a3

Use fix instead of morph for culturegraph #1058

3454ae9

- We do not need a separate filter step since the fix already can do this.

Update logger dependencies

bdbdefa

Related to #1813

Catch FileNotFoundException #1058

87f63af

Catch Exception (#1058)

3fdcc5c

Remove metafacture-triples; add metafacture-elasticsearch (#1058)

8fb1fe3

Update test data (#1058)

26a55a7

This reflects if the almaMmsId is properly ETLed. It is :)

dr0i force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch from 8352ce0 to 26a55a7 Compare May 28, 2024 13:01

dr0i added 3 commits May 28, 2024 15:30

Format

554949b

Reduce complexity of flow (#1058)

848c948

Follows metafacture/metafacture-examples#8.

Generate also CSV from CultureGraph as hbz-RVK concordance (#1085)

f6c150f

dr0i assigned TobiasNx and unassigned dr0i May 28, 2024

dr0i added 2 commits May 31, 2024 16:21

Enforce two columns' CSV; set ID to first column (#1058)

eae4a69

Ensure exactly one ID. We silently drop the others atm.

Exclude tmp directory for editorconfig

ee6ad10

These files are generated by ES when doing tests and may violate the editorconfig rules.

dr0i force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch from fe57781 to ee6ad10 Compare May 31, 2024 14:38

Create record for each ID; separate CSV and JSON (#1058)

9c4a21a

TobiasNx commented Jun 3, 2024

src/main/resources/rvk/cg-to-rvk-json.fix (outdated)

Update src/main/resources/rvk/cg-to-rvk-json.fix

e297026

Co-authored-by: TobiasNx <61879957+TobiasNx@users.noreply.github.com>

dr0i assigned dr0i and unassigned TobiasNx Jun 4, 2024

dr0i approved these changes Jun 4, 2024

dr0i marked this pull request as ready for review June 4, 2024 11:58

dr0i merged commit eb223f8 into master Jun 4, 2024
2 checks passed

dr0i deleted the 1058-enrichWithCulturegraphRvkWithFix branch June 4, 2024 12:03

dr0i mentioned this pull request Jun 4, 2024

Enrich with RVK based on Culturegraph #1058

Closed

5 tasks
