This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Add encoding option to marcxml.record_to_xml #105

Open
aaronhelton opened this issue Aug 8, 2017 · 4 comments

Comments

@aaronhelton

aaronhelton commented Aug 8, 2017

Unless I am missing something with respect to the marcxml functionality, the record_to_xml function seems to return text encoded in us-ascii, which causes problems when systems are expecting utf-8 encoding. Tracing this to its source, I found that xml.etree.ElementTree.tostring takes an optional encoding parameter, which defaults to us-ascii. I propose adding an optional encoding parameter to marcxml.record_to_xml that is passed through to its call to ET.tostring.
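
For readers who want to reproduce the behavior without pymarc, here is a minimal sketch using only the standard library. The `subfield` element is a stand-in built by hand for illustration; in pymarc the node would come from record_to_xml_node. It assumes Python 3, where ET.tostring returns bytes for any named encoding.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for one MARCXML subfield (not produced by pymarc).
subfield = ET.Element("subfield", {"code": "a"})
subfield.text = "Nouvelles-Hébrides"

# No encoding argument: ElementTree falls back to us-ascii, so every
# non-ASCII character is written as a numeric character reference.
ascii_xml = ET.tostring(subfield)

# Explicit utf-8: the same characters are written as UTF-8 bytes.
utf8_xml = ET.tostring(subfield, encoding="utf-8")

print(ascii_xml)
print(utf8_xml)
```

The first result contains `&#233;` in place of the accented character, while the second contains the UTF-8 bytes for it.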

In my local fork, I have made the following change:

def record_to_xml(record, quiet=False, namespace=False, encoding='us-ascii'):
  node = record_to_xml_node(record, quiet, namespace)
  return ET.tostring(node, encoding=encoding)

Without the change, my output for record_to_xml on UTF-8 strings that contain diacritics looks like this:

<record>
    <leader>          22        4500</leader>
    <datafield ind1=" " ind2=" " tag="246">
      <subfield code="a">Nouvelles-H&#233;brides, communiqu&#233;s par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
      <subfield code="b">Lois et r&#232;glements promulgu&#233;s pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et r&#233;glementer la distribution des stup&#233;fiants, amend&#233;e par le Protocole du 11 d&#233;cembre 1946</subfield>
    </datafield>
</record>

And the resulting file ends up with a us-ascii encoding, which causes import of the record to fail on the MARC-based system we are using.

With the change, I get output that looks like this when I pass the optional encoding:

<record>
	<leader>          22        4500</leader>
	<datafield ind1=" " ind2=" " tag="246">
		<subfield code="a">Nouvelles-Hébrides, communiqués par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
		<subfield code="b">Lois et règlements promulgués pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et réglementer la distribution des stupéfiants, amendée par le Protocole du 11 décembre 1946</subfield>
	</datafield>
</record>

I invoke as follows:

out_file.write(marcxml.record_to_xml(record,encoding='utf-8'))

And the resulting file ends up with a utf-8 encoding.
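
A rough sketch of the write step, for anyone trying this on Python 3: there, ET.tostring with a named encoding returns bytes, so the output file has to be opened in binary mode. The element, the temporary path, and the variable names below are illustrative assumptions, not part of pymarc.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hand-built stand-in for a record node (pymarc would supply the real one).
record = ET.Element("record")
subfield = ET.SubElement(record, "subfield", {"code": "a"})
subfield.text = "réglementer"

# Serialize to UTF-8 bytes.
xml_bytes = ET.tostring(record, encoding="utf-8")

# Write the bytes as-is; binary mode avoids a second, implicit encoding step.
path = os.path.join(tempfile.mkdtemp(), "record.xml")
with open(path, "wb") as out_file:
    out_file.write(xml_bytes)

# Read the file back as UTF-8 text to confirm the characters survived.
with open(path, encoding="utf-8") as in_file:
    round_tripped = in_file.read()
```

On Python 2, the result of ET.tostring is a plain str, so writing it to a file opened in text mode also works.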

Note that I tried forcing encoding to utf-8 at each successive level beginning with the open() function and working backward to the record itself. The only thing I found that actually works is to pass an encoding parameter in this particular function. If I am missing something (obvious or not), I'd be interested in correcting my oversight.

The change looks trivial to me and preserves the default functionality, but I don't know if there are tests that depend on it.

@edsu
Owner

edsu commented Aug 8, 2017

I'm curious what MARC-based system you are using that rejected the record with the unicode character entities.

It seems to me that utf-8 should be the default encoding for XML so hard coding ET.tostring(node, encoding='utf-8') should be fine.

@aaronhelton
Author

We're using Invenio, but I don't really know what the internals are doing, since I don't have access to the source code our vendor is maintaining.

It's not that the system rejected the unicode characters. It's that the output file ended up with a MIME encoding of us-ascii (as reported by the file --mime-encoding command), and the Invenio batch uploader module rejected it as not being encoded properly.
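
To illustrate the distinction, here is a small sketch (standard library only, Python 3 assumed, hand-built element) showing that the default output consists purely of ASCII bytes, which is plausibly why a byte-level check such as file --mime-encoding reports us-ascii, even though both serializations parse back to the same text.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for one subfield.
subfield = ET.Element("subfield", {"code": "b"})
subfield.text = "Lois et règlements promulgués"

escaped = ET.tostring(subfield)               # default us-ascii output
utf8 = ET.tostring(subfield, encoding="utf-8")

# Iterating over bytes yields integers; ASCII is everything below 128.
escaped_is_pure_ascii = all(byte < 128 for byte in escaped)
utf8_is_pure_ascii = all(byte < 128 for byte in utf8)

# An XML parser sees identical content in both serializations.
same_text = ET.fromstring(escaped).text == ET.fromstring(utf8).text

print(escaped_is_pure_ascii, utf8_is_pure_ascii, same_text)
```

So the two files are equivalent as XML, and only a consumer that inspects the raw bytes before parsing would treat them differently.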

Agreed that utf-8 should be default for XML encoding, which is why I find the function description in ET.tostring so strange:

xml.etree.ElementTree.tostring(element, encoding="us-ascii", method="xml", *, short_empty_elements=True)

Generates a string representation of an XML element, including all subelements. element is an Element instance. encoding is the output encoding (default is US-ASCII). Use encoding="unicode" to generate a Unicode string (otherwise, a bytestring is generated). method is either "xml", "html" or "text" (default is "xml"). short_empty_elements has the same meaning as in ElementTree.write(). Returns an (optionally) encoded string containing the XML data.
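
As a quick illustration of the bytes-versus-text distinction in that description, a minimal sketch (Python 3, hand-built element, illustrative variable names):

```python
import xml.etree.ElementTree as ET

leader = ET.Element("leader")
leader.text = "décembre"

# encoding="unicode" returns a str with the characters left unescaped.
as_text = ET.tostring(leader, encoding="unicode")

# Any real encoding name returns bytes in that encoding.
as_bytes = ET.tostring(leader, encoding="utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
```

The "unicode" form is handy when the caller wants to choose the file encoding later, for example by writing the string to a file opened with encoding="utf-8".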

@edsu
Owner

edsu commented Aug 8, 2017

If you have time to put together a pull request for the change and an accompanying test I would be grateful.

@aaronhelton
Author

This might cover it, but I admit I am still new to writing tests and may have taken the wrong approach.

https://github.com/dag-hammarskjold-library/pymarc/tree/marcxml-encode
