Skip to content
This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

General: Fix loading of unused chars in xml format #2729

Merged
merged 5 commits into from
Feb 15, 2022

Conversation

iLLiCiTiT
Copy link
Member

Brief description

Class ElementTree in xml parser don't know how to handle all escaped values which cause parse error.

Description

Not sure what is proper fix. Propably would be to modify xml parser which is more complicated or define all possible espace values (e.g. from this source). It currently breaks loading of data for some exr.

Changes

  • replace & in some unused xml ampresand characters with & so ElementTree can parse it

@BigRoy
Copy link
Collaborator

BigRoy commented Feb 15, 2022

@iLLiCiTiT does this resolve itself when parsing from a unicode string? Or see this other topic about it.

@iLLiCiTiT
Copy link
Member Author

Encoding is not an issue in this case. The issue is that one attribute has value with escaped xml value but ElementTree can't handle that.

Example

<attrib name="tech_details_color_space" type="string">&#02;</attrib>

The &#02; should be hexadecimal string 0x02 but the & raises ParseError.

@antirotor
Copy link
Member

antirotor commented Feb 15, 2022

the only predefined character sequences in xml are:

Name Character Unicode code point (decimal) Standard Name
quot " U+0022 (34) XML 1.0 quotation mark
amp & U+0026 (38) XML 1.0 ampersand
apos ' U+0022 (39) XML 1.0 apostrophe (1.0: apostrophe-quote)
lt < U+003C (60) XML 1.0 less-than sign
qt > U+003E (3E) XML 1.0 greater-than sign

everything else is illegal except character and entity references:

'&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'

but:

Well-formedness constraint: Legal Character

Characters referred to using character references must match the production for Char.

And that is defined as:

Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]	/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

so &#02; is really invalid character in xml

So my point is that replacing ampersands is not trivial, you would need to validate the value

Copy link
Member

@antirotor antirotor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid this is not enough, see my comment. This must be solved with character ranges.

@iLLiCiTiT
Copy link
Member Author

iLLiCiTiT commented Feb 15, 2022

This must be solved with character ranges.

To be honest I don't know how. Right now this breaks extract review because xml export from oiiotool put &#02; into some irrelevant node which is not used. If you know how, do it.

So my point is that replacing ampersands is not trivial, you would need to validate the value

I'm trying to find values from XML_UNUSED_CHARS in the output and only for them replace the ampresand.
They should be probably renamed to HTML_UNUSED_CHARS....

@BigRoy
Copy link
Collaborator

BigRoy commented Feb 15, 2022

Maybe it makes more sense to rely on BeautifulSoup or lxml since they appear to have some possible solutions to this? It does add dependencies. :(

@antirotor
Copy link
Member

or just find with regex all character references, parse them and escape it only if it doesnt fit into range for Char - and feed it to xml parser then?

@iLLiCiTiT
Copy link
Member Author

Only these are valid ranges? #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

@antirotor
Copy link
Member

antirotor commented Feb 15, 2022

Only these are valid ranges? #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

this is in xml specs.

@iLLiCiTiT
Copy link
Member Author

Modified to use regex which checks if xml from oiiotool contain valid values and replace ampresand of invalid values. The loaded value matches string value from xml text. These values are metadata that are not needed for us so I would not care about their real value until they're needed.

<attrib name="tech_details_color_space" type="string">&#02;</attrib>

Is loaded as node with &#02; text value.

Copy link
Member

@antirotor antirotor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add there comment what is a result of this - this will affect all character entities, even the valid one. This fix is quick and dirty, but I can imagine that someone will try in the future get something from these information and he'll hit it?

openpype/lib/transcoding.py Show resolved Hide resolved
iLLiCiTiT and others added 2 commits February 15, 2022 17:41
Co-authored-by: Ondřej Samohel <33513211+antirotor@users.noreply.github.com>
@iLLiCiTiT iLLiCiTiT merged commit 1b1c614 into develop Feb 15, 2022
@mkolar mkolar added the type: bug Something isn't working label Feb 16, 2022
@iLLiCiTiT iLLiCiTiT deleted the bugfix/oiio_xml_parse_ampresand_values branch February 16, 2022 12:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants