General: Fix loading of unused chars in xml format #2729

iLLiCiTiT · 2022-02-15T11:19:25Z

Brief description

The ElementTree class in the XML parser doesn't know how to handle all escaped values, which causes a parse error.

Description

Not sure what the proper fix is. Probably it would be to modify the XML parser, which is more complicated, or to define all possible escape values (e.g. from this source). It currently breaks loading of data for some EXR files.

Changes

Replace & with &amp; in some unused XML ampersand character references so ElementTree can parse them.

BigRoy · 2022-02-15T11:33:49Z

@iLLiCiTiT does this resolve itself when parsing from a unicode string? Or see this other topic about it.

iLLiCiTiT · 2022-02-15T12:25:52Z

Encoding is not an issue in this case. The issue is that one attribute has a value containing an escaped XML character reference, which ElementTree can't handle.

Example

<attrib name="tech_details_color_space" type="string">&#02;</attrib>

The &#02; should be the character with hexadecimal code 0x02, but the & raises ParseError.
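
The failure can be reproduced with the standard library alone. This is a minimal sketch, assuming the oiiotool output contains the attribute shown above; `SAMPLE` and `try_parse` are illustrative names, not part of the OpenPype code.

```python
import xml.etree.ElementTree as ET

# The attribute from the example above; "&#02;" refers to the control
# character U+0002, which XML 1.0 does not allow in a document.
SAMPLE = '<attrib name="tech_details_color_space" type="string">&#02;</attrib>'


def try_parse(xml_text):
    """Return the parsed element, or None if ElementTree rejects the text."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The parser reports the position of the offending reference.
        print("ParseError:", exc)
        return None


result = try_parse(SAMPLE)
```

A reference to an allowed character, such as `&#65;`, goes through the same function without an error.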

antirotor · 2022-02-15T14:42:26Z

The only predefined entities in XML are:

Name	Character	Unicode code point (decimal)	Standard	Description
quot	"	U+0022 (34)	XML 1.0	quotation mark
amp	&	U+0026 (38)	XML 1.0	ampersand
apos	'	U+0027 (39)	XML 1.0	apostrophe (1.0: apostrophe-quote)
lt	<	U+003C (60)	XML 1.0	less-than sign
gt	>	U+003E (62)	XML 1.0	greater-than sign

everything else is illegal except character and entity references:

'&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'

but:

Well-formedness constraint: Legal Character

Characters referred to using character references must match the production for Char.

And that is defined as:

Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]	/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

so &#02; really is an invalid character in XML

So my point is that replacing ampersands is not trivial, you would need to validate the value
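
The Char production quoted above can be expressed as a small range check. This is only a sketch of the rule from the XML 1.0 specification; the function name `is_xml_char` is hypothetical.

```python
def is_xml_char(codepoint):
    """Return True if the code point matches the XML 1.0 Char production.

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
             | [#x10000-#x10FFFF]
    """
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )
```

With this check, `is_xml_char(0x02)` is false, which is why the reference from the example is rejected, while tab, line feed, and carriage return are the only accepted code points below 0x20.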

antirotor

I am afraid this is not enough, see my comment. This must be solved with character ranges.

iLLiCiTiT · 2022-02-15T15:25:36Z

This must be solved with character ranges.

To be honest, I don't know how. Right now this breaks extract review because the XML export from oiiotool puts &#02; into some irrelevant node which is not used. If you know how, do it.

So my point is that replacing ampersands is not trivial, you would need to validate the value

I'm trying to find values from XML_UNUSED_CHARS in the output and replace the ampersand only for them.
They should probably be renamed to HTML_UNUSED_CHARS....

BigRoy · 2022-02-15T15:49:41Z

Maybe it makes more sense to rely on BeautifulSoup or lxml since they appear to have some possible solutions to this? It does add dependencies. :(

antirotor · 2022-02-15T16:00:41Z

Or just find all character references with a regex, parse them, and escape only those that don't fit into the range for Char, and then feed the result to the XML parser?
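
The suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged into openpype/lib/transcoding.py: `CHAR_REF_REGEX`, `_is_xml_char`, and `escape_invalid_char_refs` are hypothetical names, and the input is assumed to be the XML text already decoded to a string.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#2;) and hexadecimal (&#x2;) character references.
CHAR_REF_REGEX = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def _is_xml_char(codepoint):
    """Check a code point against the XML 1.0 Char production."""
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def escape_invalid_char_refs(xml_text):
    """Replace '&' with '&amp;' in references that point outside Char."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if _is_xml_char(codepoint):
            # Valid reference: leave it for the parser to resolve.
            return match.group(0)
        # Invalid reference: escape the ampersand so it survives as text.
        return "&amp;" + match.group(0)[1:]

    return CHAR_REF_REGEX.sub(_replace, xml_text)


sample = '<attrib name="tech_details_color_space" type="string">&#02;</attrib>'
element = ET.fromstring(escape_invalid_char_refs(sample))
print(repr(element.text))
```

In this sketch the invalid reference ends up as literal text in the node rather than as the character it names, so anyone who later needs the real value has to decode it separately.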

iLLiCiTiT · 2022-02-15T16:01:44Z

openpype/lib/transcoding.py

antirotor · 2022-02-15T16:15:49Z

Only these are valid ranges? #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

this is in xml specs.

iLLiCiTiT · 2022-02-15T16:30:09Z

Modified to use a regex which checks whether the XML from oiiotool contains valid values and replaces the ampersand of invalid values. The loaded value matches the string value from the XML text. These values are metadata that we don't need, so I would not care about their real value until they're needed.

<attrib name="tech_details_color_space" type="string">&#02;</attrib>

Is loaded as a node with &#02; as its text value.

antirotor

I would just add a comment there describing the result of this: it will affect all character entities, even the valid ones. This fix is quick and dirty, but I can imagine that someone will try to get something from this information in the future and will hit it.

openpype/lib/transcoding.py

Co-authored-by: Ondřej Samohel <33513211+antirotor@users.noreply.github.com>

fix loading of unused chars in xml format

fcb38b8

iLLiCiTiT self-assigned this Feb 15, 2022

iLLiCiTiT requested review from 64qam, antirotor and jakubjezek001 February 15, 2022 11:19

iLLiCiTiT added the backend label Feb 15, 2022

antirotor suggested changes Feb 15, 2022

use regex rather then explicit values

7e1203e

antirotor reviewed Feb 15, 2022

antirotor reviewed Feb 15, 2022

use single regex

646eb2e

antirotor suggested changes Feb 15, 2022

iLLiCiTiT and others added 2 commits February 15, 2022 17:41

Added warning comment

7a81e8d

Co-authored-by: Ondřej Samohel <33513211+antirotor@users.noreply.github.com>

hound fix

985a6c8

antirotor approved these changes Feb 15, 2022

iLLiCiTiT merged commit 1b1c614 into develop Feb 15, 2022

mkolar added the type: bug Something isn't working label Feb 16, 2022

iLLiCiTiT deleted the bugfix/oiio_xml_parse_ampresand_values branch February 16, 2022 12:42
