Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1058-enrichWithCulturegraphRvkWithFix #1921

Merged
merged 19 commits into from
Jun 4, 2024

Conversation

TobiasNx
Copy link
Contributor

@TobiasNx TobiasNx commented Oct 12, 2023

See #1058

I tried to update the draft from https://github.com/hbz/lobid-resources/tree/1058-enrichWithCulturegraphRvk to a version working with fix.
I merged the master into this branch.

In contrast to the morph approach:

  • We do not need a separate filter step since the fix already can do this.

Also I do not know if we need:

public static void main(String... args) {
final FileOpener opener = new FileOpener();
opener.setReceiver(new XmlDecoder())
.setReceiver(
new XmlElementSplitter("marc:collection", "record")) //
.setReceiver(new LiteralToObject())
.setReceiver(new ObjectThreader<String>())//
.addReceiver(receiverThread()); // one thread for it's working
// on one file atm
opener.process(new File(args[0]).getAbsolutePath());
try {
opener.closeStream();
} catch (final NullPointerException e) {
// ignore, see https://github.com/hbz/lobid-resources/issues/1030
}
}

The decode-marcXml mechanism correctly identified single records.

@TobiasNx TobiasNx requested a review from dr0i October 12, 2023 13:34
@TobiasNx TobiasNx changed the base branch from master to 1058-enrichWithCulturegraphRvk October 12, 2023 13:36
@TobiasNx TobiasNx changed the base branch from 1058-enrichWithCulturegraphRvk to master October 12, 2023 13:36
@TobiasNx TobiasNx force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch 2 times, most recently from b56f03b to 2a661a1 Compare October 12, 2023 13:48
@TobiasNx
Copy link
Contributor Author

Somehow it does not find the fix file yet.

@TobiasNx TobiasNx force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch from 1d18dc6 to 76ad5df Compare October 16, 2023 09:20
@TobiasNx
Copy link
Contributor Author

Additionally we would need zdbId as mapping parameter, because new zdb resources dont get hbzIds anymore.

@TobiasNx TobiasNx marked this pull request as ready for review October 16, 2023 09:29
@TobiasNx TobiasNx marked this pull request as draft October 16, 2023 14:32
dr0i and others added 12 commits May 28, 2024 10:05
Filters out all resources belonging to hbz, get the RVK and build an
 lasticsearch bulk json file from this.

- use master-snapshot of metafacture to ommit id key for elasticsearch index
- add morph converting rules from marcxml to json
- add tests
- add runner

This is a prerequesite for #1058.
Not all input records are of interest. They are passed empty. With this filter
empty records are ignored, not passed.

See #1058.
hbz-Ids will be concatenated into one field delimited by a space.

- shrink unnecessary test data
- update test

See #1058.
- We do not need a separate filter step since the fix already can do this.
This reflects if the almaMmsId is properly ETLed.
It is :)
@dr0i dr0i force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch from 8352ce0 to 26a55a7 Compare May 28, 2024 13:01
@dr0i
Copy link
Member

dr0i commented May 28, 2024

Your proposal re reducing complexity is taken into account. Test data is updated. A CSV export is introduced - I propose that the concordance table won't be too big (<100MB) so we could use this as it's very performant.
Re CSV: can you @TobiasNx change the FIX so that:

  • id is always in the first column
  • id consists always of one almaMmsId
  • all id's have their rows (as in: also those records where more than one id (btw.: probably those are all doublettes?) are present)
  • the RVK's are appendend to one string so that we always have a two-column table ?

@dr0i dr0i assigned TobiasNx and unassigned dr0i May 28, 2024
@TobiasNx
Copy link
Contributor Author

Your proposal re reducing complexity is taken into account. Test data is updated. A CSV export is introduced - I propose that the concordance table won't be too big (<100MB) so we could use this as it's very performant. Re CSV: can you @TobiasNx change the FIX so that:

* `id` is always in the first column

This is only possible for encode-csv, (or encode-json alone) the json-to-elasticsearch-bulk(type="rvk", index="cgrvk")| command changes the order within the json record. See: metafacture/metafacture-examples@239469d

* `id` consists always of _one_  `almaMmsId`

* all `id's` have their rows (as in: also those records where more than one `id` (btw.: probably those are all doublettes?) are present)

i dont think that this is possible.
The records are no always duplicates: See here: http://lobid.org/resources/search?q=almaMmsId%3A990063057720206441+OR+almaMmsId%3A990019247190206441+OR+almaMmsId%3A990063668050206441 (3 records, one duplicate, one translation)
Also I am not sure how splitting the records with multiple ids in individual rows is possible. At least not at the level of fix.

* the `RVK's` are appendend to _one_ string so that we always have a two-column table ?

@dr0i
Copy link
Member

dr0i commented May 31, 2024

  • id is always in the first column

a) As said in #1921 (comment) this should be done and makes only sense when creating a CSV. If it's not possible using one FIX could you provide a second one?

b) > all id's have their rows
Hm, should'nt that be possible with using triples somehow ? Anyway. Then we shoud make the first ID to the only ID (the only value for column one).

(did so in eae4a69)

dr0i added 2 commits May 31, 2024 16:21
Ensure exactly one ID. We silently drop the others atm.
These files are generated by ES when doing tests and may
violate the editorconfig rules.
@dr0i dr0i force-pushed the 1058-enrichWithCulturegraphRvkWithFix branch from fe57781 to ee6ad10 Compare May 31, 2024 14:38
@dr0i
Copy link
Member

dr0i commented May 31, 2024

The build of the concordance just started, based on the 9,2 GB file aggregate_20240507.marcxml.gz mentioned in #1058 (comment).
At a first glance - there are many doublette RVK entries, e.g.:

"990191558260206441","HU 8423, HU 8424, HU 8423, HU 8424, HU 8424, HU 8423, HU 8424"

Can you prevent these doublettes via the FIX @TobiasNx ?

@TobiasNx
Copy link
Contributor Author

TobiasNx commented Jun 3, 2024

I found a way how to create single records for every id there is in a record: metafacture/metafacture-examples@06c1955

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two fixes?

Copy link
Member

@dr0i dr0i Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In json we don't need a record for every id on its own - the search is done by the search engine, in contrast to a csv where we need a single unique key. The json is more performant when using search engines, the csv is the only way to go when using tables.
Could also be, if the csv is working great, that we can get rid of jsonaltogether.

Co-authored-by: TobiasNx <61879957+TobiasNx@users.noreply.github.com>
@dr0i dr0i assigned dr0i and unassigned TobiasNx Jun 4, 2024
@dr0i
Copy link
Member

dr0i commented Jun 4, 2024

I am going to merge this.
Note that the branch name is not good, resp. that a new PR is going to be made where we not only generate the concordance but enrich our lobid-resources data with it.

@dr0i dr0i marked this pull request as ready for review June 4, 2024 11:58
@dr0i dr0i merged commit eb223f8 into master Jun 4, 2024
2 checks passed
@dr0i dr0i deleted the 1058-enrichWithCulturegraphRvkWithFix branch June 4, 2024 12:03
@dr0i dr0i mentioned this pull request Jun 4, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants