Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encode-marcxml should combine split leader fields of decode-marc21 to create valid marc xml #527

Closed
TobiasNx opened this issue Apr 22, 2024 · 17 comments · Fixed by #538
Closed
Assignees
Labels

Comments

@TobiasNx
Copy link
Contributor

TobiasNx commented Apr 22, 2024

encode-marcxml creates invalid marc data when processing decoded-marc21 with separated leader elements.

It creates invalid marc xml data. See the leader spec here in the schema: https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd

encode-marcxml should behave in the same way as encode-marc21, if the data is seperated the encoder should combine them.

Status quo:
https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

decode-marc21(emitLeaderAsWhole="false")` -> encode-marcxml => the separated elements are kept:

		<marc:leader>p</marc:leader>
		<marc:leader>a</marc:leader>
		<marc:leader>m</marc:leader>
		<marc:leader> </marc:leader>
		<marc:leader>a</marc:leader>
		<marc:leader> </marc:leader>
		<marc:leader>c</marc:leader>
		<marc:leader> </marc:leader>

leader should be combined.

Originally posted by @TobiasNx in #526 (comment)

@TobiasNx TobiasNx added the Bug label Apr 22, 2024
@dr0i
Copy link
Member

dr0i commented Apr 22, 2024

Asking @blackwinter @fsteeg @hagbeck @maipet @TobiasNx : this would change the default behaviour - how should we procede:

a) just fix it and mention the fix as we normally do
b) note also in https://github.com/metafacture/metafacture-core/blob/master/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/MarcXmlEncoder.java the change of the behaviour. Somewhere else?
c) should we also provide a possibility for backward compatibility, i.e. allowing the creation of the invalid MARC XML?
d) do we have to increase the major release version (to 7.0.0) ?

@TobiasNx
Copy link
Contributor Author

TobiasNx commented Apr 22, 2024

In my opinion this should similar be handled as #401. It is a bug, It should not be backwards compatible.

@dr0i dr0i self-assigned this Apr 22, 2024
@dr0i dr0i added this to Metafacture Apr 22, 2024
@dr0i dr0i moved this to Working in Metafacture Apr 22, 2024
@dr0i
Copy link
Member

dr0i commented Apr 23, 2024

decode-marc21(emitLeaderAsWhole="false") (where this prarameter set to false is the default) outputs not all necessary elements. We could:
a) enhance decode-marc21 to output all the subfields that are needed to compose the leader in encode-marxml.
Then enhance encode-marxml to parse these subfields.
b) make decode-marc21(emitleaderaswhole="true") as default. And bail out in encode-marxml when leader is encountered as separated subfields with message like e.g. "Please provide one marc:leader$marcleader</marc:leader>"

I would really like to go with b) which saves us (and the code) much extra work in comparison to a).
It's true then that still decode-marc21(emitLeaderAsWhole="false") would not work (well - it would not generate a wrong MARC XML but no MARC XML at all) but as this would no longer be the default I think it's very unlikely that someone explicitly sets the parameter (decode-marc21(emitLeaderAsWhole="false")) - and if so, the Exception message would be clear enough.
WDYT?

@dr0i dr0i moved this from Working to Review in Metafacture Apr 23, 2024
@dr0i dr0i removed their assignment Apr 23, 2024
@TobiasNx
Copy link
Contributor Author

TobiasNx commented Apr 24, 2024

In my opinion decode-marc21 should not be changed at all.
I thought you could reuse something from encode-marc21 for encode-marcxml, because encode-marc21 can handle the "missing" elements. (Probably with some JAVA stuff here)

All leader info should already be available either since they are a) static, b) can be generated by the encoder (as #524) or c) provided by the decode-marc21.

LeaderPos Name Variant
00-04 Record length b) generated by encoder as in #524
05 record status c) leader.status
06 record type c) leader.type
07 bibliographic level c) leader.bibliographicLevel
08 type of control c) leader.typeOfControl
09 character coding scheme c) leader.characterCodingScheme
10 Indicator count a) always 2
11 Subfield code count a) always 2
12-16 Base address of data b) generated by encoder as in #524
17 encoding level c) leader.encodingLevel
18 descriptive cataloging form c) leader.catalogingForm
19 multipart resource record level c) leader.multipartLevel
20 Length of the length-of-field portion a) always 4
21 Length of the starting-character-position portion a) always 5
22 Length of the implementation-defined portion a) always 0
23 Undefined a) always 0

encode-marc21 can create the leader even with the missing leader positions since the missing elements are either generated (see #524) or static see
https://github.com/metafacture/metafacture-core/blob/bd719d2ee87834461fd8594a7a1b69db49658e02/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/Marc21Constants.java#L31-37

@dr0i
Copy link
Member

dr0i commented Apr 25, 2024

I know this all. My proposal is, if it's true that we always need decode-marc21 to pipe to encode-marcxml we could

go with b) which saves us (and the code) much extra work in comparison to a).

@TobiasNx
Copy link
Contributor Author

TobiasNx commented Apr 25, 2024

Talked to @dr0i and @maipet the easy solution is instead of generating the Leader Pos 00-04 and Pos 12-16 is just to set them to 0 if the leader is provided incomplete and in separated elements:
like this:
leader = 0000[leader.status][leader.type][leader.bibliographicLevel][leader.typeOfControl][leader.characterCodingScheme]2200000[leader.encodingLevel][leader.catalogingForm][leader.multipartLevel]4500

That also echos voices from the literature:

„The construction of the element in a MARCXML record has some minor nuances. Typically, a MARCXML element takes, as content, the leader string unchanged from the way it would appear in a traditional MARC record serialization for magnetic tape, including any spaces. However, implementers should recognize, especially when creating MARCXML records from scratch rather than transforming existing MARC records, that a few character positions of the element are ambiguous in a MARCXML serialization. In traditional MARC, the first five characters of the leader string (character positions 0 to 4) are meant to hold the length of the record. They can be used for this purpose in MARCXML, but XML-applications processing MARCXML records do not use this information; they have other (more reliable) options to determine the „length of an XML-serialized MARC record or to move on to the next element in a MARCXML instance. If transformed from a traditional MARC record, these first five characters often are left as is (i.e., representing the length of the record when it was serialized according to the tape-based traditional MARC record format). In other instances, these first five characters of the leader may be set to all zeros in the MARCXML record (the authors’ personal preference). Similar logic is applied to leader character positions 12 to 16, which are designed to hold the base address for a traditionally serialized MARC record on tape. Finally, leader character positions 20 to 23 make up the entry map for the MARC record directory. Since directory entries are not relevant to the MARCXML serialization of the record, these character positions are either left as they would appear in the traditional MARC serialization (always “4500”) or left blank (i.e., four spaces).“

from: XML for Catalogers and Metadata Librarians by Timothy W. Cole and Myung-Ja Han, https://dl.acm.org/doi/10.5555/2531760

and examples from DNB, ALMA and others:
https://d-nb.info/1165051362
https://lobid.org/marcxml/990045392320206443
https://wiki.deutsche-digitale-bibliothek.de/pages/viewpage.action?pageId=78513719

@TobiasNx TobiasNx assigned dr0i and unassigned TobiasNx Apr 26, 2024
@dr0i dr0i moved this from Review to Selected in Metafacture Apr 29, 2024
@dr0i dr0i moved this from Selected to Working in Metafacture Apr 29, 2024
@dr0i
Copy link
Member

dr0i commented Apr 29, 2024

We discussed offline: we have to properly parse the leader, e.g. because the order of the leader's subfields cannot be ensured when e.g. using a fix to build the leader. The easiest way ensuring a valid MARC21 as XML is to reuse the eloborate mechanisms encode-marc21 provides. A wrapper could be implemented as abbrevation for doing this .
So, we would make this:

| decode-marc21
| encode-marc21
| decode-marc21(emitleaderaswhole="true")
| encode-marcxml

to this:

| decode-marc21
| encode-marcxml

i.e. encode-marcxml does | encode-marc21 | decode-marc21(emitleaderaswhole="true") | encode-marcxml .
Never mind performance. We can tackle that if someone complains - and maybe that's not even an issue.

dr0i added a commit that referenced this issue May 6, 2024
Marc21XmlEncoder acts as a wrapper. It makes use of Marc21Encoder, Marc21Decoder
and MarcXmlEncoder to ensure a proper MarcXml, especially regarding the leader.
Also - in contrast to MarcXmlEncoder - the record id (field 001) is mandatory.
dr0i added a commit that referenced this issue May 6, 2024
blackwinter added a commit that referenced this issue Jun 19, 2024
See also #336.

Co-authored-by: Pascal Christoph <pascal.christoph@hbz-nrw.de>
@blackwinter blackwinter moved this from Working to Review in Metafacture Jun 19, 2024
@dr0i dr0i closed this as completed in #538 Jun 20, 2024
@github-project-automation github-project-automation bot moved this from Review to Done in Metafacture Jun 20, 2024
@TobiasNx
Copy link
Contributor Author

TobiasNx commented Jun 25, 2024

It seems that the initial usecase is not solved. The leader seems to come out wrong:
https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

		<marc:leader>pam a c </marc:leader>
		<marc:leader>pam a c pam a c </marc:leader>
		<marc:leader>pam a c pam a c pam a c </marc:leader>
		<marc:leader>pam a c pam a c pam a c pam a c </marc:leader>

@TobiasNx TobiasNx reopened this Jun 25, 2024
@dr0i
Copy link
Member

dr0i commented Jun 25, 2024

@blackwinter can you have a look? Unfortunately I've already released this buggy version.
note to self: don't forget to always review functional

@TobiasNx
Copy link
Contributor Author

Also the changes introduced a new bug: https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%28emitLeaderAsWhole%3D%22true%22%29%0A%7C+encode-marcxml%0A%7C+print%0A%3B (this worked before)

		<marc:leader>02602pam a2200529 c 4500</marc:leader>
		<marc:leader>02602pam a2200529 c 450001387pam a2200349 c 4500</marc:leader>
		<marc:leader>02602pam a2200529 c 450001387pam a2200349 c 450001266pam a2200361 c 4500</marc:leader>
		...

@dr0i dr0i assigned blackwinter and unassigned dr0i Jun 25, 2024
@blackwinter
Copy link
Member

The leader seems to come out wrong

Also the changes introduced a new bug

These are actually the same bug. It's fixed in 32beffb.

@blackwinter
Copy link
Member

note to self: don't forget to always review functional

Absolutely!

@TobiasNx
Copy link
Contributor Author

TobiasNx commented Jun 25, 2024

I tested it locally the leader is not repeated and concatenated anymore which is great!!

It seems that the initial usecase is not solved. The leader seems to come out wrong: test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

		<marc:leader>pam a c </marc:leader>
		<marc:leader>pam a c pam a c </marc:leader>
		<marc:leader>pam a c pam a c pam a c </marc:leader>
		<marc:leader>pam a c pam a c pam a c pam a c </marc:leader>

now

                <marc:leader>nam a c </marc:leader>
                <marc:leader>nam a c </marc:leader>
                <marc:leader>nam a c </marc:leader>

But this still outputs a wrong leader because it misses the counted elements:
#524

It should look something like: <marc:leader>02602pam a2200529 c 4500</marc:leader>
It should have 24 characters.

@blackwinter
Copy link
Member

But this still outputs a wrong leader because it misses the counted elements:

You haven't set ensureCorrectMarc21Xml="true".

@TobiasNx
Copy link
Contributor Author

Great: with this option this looks good!! <marc:leader>01269nam a2200373 c 4500</marc:leader>

@TobiasNx
Copy link
Contributor Author

+1

TobiasNx added a commit that referenced this issue Jun 26, 2024
Instead of just appending the incoming leader elements, this adds the static values for Pos: 00-04, 10-16, 20-23.

By using `0`as default value for Pos: 00-04 and 12-16.

#527 (comment)
TobiasNx added a commit that referenced this issue Jun 26, 2024
Instead of just appending the incoming leader elements, this adds the static values for Pos: 00-04, 10-16, 20-23.

By using `0`as default value for Pos: 00-04 and 12-16.

#527 (comment)
TobiasNx added a commit that referenced this issue Nov 19, 2024
If the incoming leader is 8 Metafacture is now able to create correct leaders for marc xml with static default values.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
3 participants