1058-enrichWithCulturegraphRvkWithFix #1921
b56f03b to 2a661a1
Somehow it does not find the fix file yet.
1d18dc6 to 76ad5df
Additionally, we would need zdbId as a mapping parameter, because new zdb resources don't get hbzIds anymore.
Filters out all resources belonging to hbz, gets the RVK and builds an Elasticsearch bulk JSON file from this.
- use master-snapshot of metafacture to omit id key for elasticsearch index
- add morph converting rules from marcxml to json
- add tests
- add runner
This is a prerequisite for #1058.
Not all input records are of interest. They are passed empty. With this filter empty records are ignored, not passed. See #1058.
hbz-Ids will be concatenated into one field delimited by a space.
- shrink unnecessary test data
- update test
See #1058.
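A minimal sketch of the concatenation this commit describes, written in plain Java rather than the Morph rules the branch actually uses. The class name, method name and the example IDs are hypothetical and only illustrate the "one field, space-delimited" idea.

```java
import java.util.List;
import java.util.stream.Collectors;

public class HbzIdJoiner {

    // Hypothetical helper: joins all non-blank hbz IDs of one record
    // into a single field value, delimited by a space.
    public static String joinHbzIds(List<String> hbzIds) {
        return hbzIds.stream()
                .map(String::trim)
                .filter(id -> !id.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The IDs below are made up for illustration.
        System.out.println(joinHbzIds(List.of("HT012345678", "TT001234567")));
    }
}
```

Blank entries are skipped here so that the resulting field never contains double spaces; whether the real transformation needs that is an assumption.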
See #1058.
See #1058.
- We do not need a separate filter step since the fix already can do this.
Related to #1813
This reflects if the almaMmsId is properly ETLed. It is :)
8352ce0 to 26a55a7
Your proposal regarding reducing complexity is taken into account. Test data is updated. A CSV export is introduced - I assume that the concordance table won't be too big (<100MB), so we could use this as it's very performant.
This is only possible for encode-csv, (or encode-json alone) the
I don't think that this is possible.
a) As said in #1921 (comment), this should be done and only makes sense when creating a CSV. If it's not possible using one FIX, could you provide a second one?
b) > all (did so in eae4a69)
Ensure exactly one ID. We silently drop the others atm.
These files are generated by ES when doing tests and may violate the editorconfig rules.
fe57781 to ee6ad10
The build of the concordance just started, based on the 9.2 GB file
Can you prevent these duplicates via the FIX, @TobiasNx?
I found a way to create a single record for every id in a record: metafacture/metafacture-examples@06c1955
Why do we need two fixes?
In json we don't need a record for every id on its own - the search is done by the search engine, in contrast to a csv where we need a single unique key. The json is more performant when using search engines, the csv is the only way to go when using tables.
Could also be, if the csv is working great, that we can get rid of json altogether.
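To illustrate why the csv needs one row per id, here is a rough sketch in plain Java (not the Fix used in this PR). It assumes a record carries several ids and several RVK notations; the class name, method names, column layout and sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcordanceCsv {

    // Hypothetical: emits one CSV row per id, so that every id is a
    // unique key pointing to the same RVK notations of the record.
    public static List<String> toCsvRows(List<String> ids, List<String> rvk) {
        String rvkField = String.join(",", rvk);
        List<String> rows = new ArrayList<>();
        for (String id : ids) {
            rows.add(quote(id) + "," + quote(rvkField));
        }
        return rows;
    }

    // Quotes a CSV field and doubles embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        toCsvRows(List.of("id1", "id2"), List.of("AB 1234", "CD 5678"))
                .forEach(System.out::println);
    }
}
```

For the json bulk file the same record could stay a single document holding all ids, since the search engine resolves lookups on any of them.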
Co-authored-by: TobiasNx <61879957+TobiasNx@users.noreply.github.com>
I am going to merge this.
See #1058
I tried to update the draft from https://github.com/hbz/lobid-resources/tree/1058-enrichWithCulturegraphRvk to a version working with fix.
I merged the master into this branch.
In contrast to the morph approach:
Also I do not know if we need:
lobid-resources/src/main/java/org/lobid/resources/run/CulturegraphXmlFilterHbzToJson.java
Lines 30 to 45 in d12b7a3
The decode-marcXml mechanism correctly identified single records.