XML elements out of order after serializing #213

Closed
ansFourtyTwo opened this issue Jul 13, 2020 · 11 comments
Labels
bug Something isn't working mixed content

Comments

@ansFourtyTwo

ansFourtyTwo commented Jul 13, 2020

After creating DataClasses for the ReqIF schema with

xsdata https://www.omg.org/spec/ReqIF/20110401/reqif.xsd --package reqif.models --ns-struct

I tried to parse an example ReqIF file and serialize it back to XML. It seems that elements are stored internally in lists, and serializing does not preserve their original order.

This snippet is from the original document:

<reqif-xhtml:div>
<reqif-xhtml:strong>Titel</reqif-xhtml:strong>: 				Some title<reqif-xhtml:br/>
<reqif-xhtml:strong>Bearbeiter</reqif-xhtml:strong>: 			Some name<reqif-xhtml:br/>
...
</reqif-xhtml:div>

This is the corresponding section from the serialized document

<ns1:div xml:space="preserve">
<ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/><ns1:br xml:space="preserve"/>
<ns1:strong xml:space="preserve">Titel</ns1:strong>: 				Some title
<ns1:strong xml:space="preserve">Bearbeiter</ns1:strong>: 			Some name
...
</ns1:div>

This is the code I used for this back and forth parsing and serializing:

from pathlib import Path

from xsdata.formats.dataclass.parsers.xml import XmlParser
from xsdata.formats.dataclass.serializers.xml import XmlSerializer
from xsdata.formats.dataclass.parsers.config import ParserConfig
from reqif.models.omg_org_spec_req_if_20110401_reqif import ReqIf

config = ParserConfig(fail_on_unknown_properties=True)
parser = XmlParser(config=config)
reqif = parser.from_path(Path('./KuFu_Testfunktion_00270cf8.reqif'), ReqIf)

serializer = XmlSerializer(pretty_print=True, encoding='ascii')

with open('./KuFu_Testfunktion_out.reqif.reqif', 'w') as f:
    f.write(serializer.render(reqif))
@ansFourtyTwo
Author

Another issue is that the xml namespace was never declared, neither in the original document nor in the serialized one. Thus, xml:space="preserve" is not recognized as a valid attribute. But this might be a separate issue.
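
For reference, the xml prefix is reserved by the XML Namespaces specification and is bound to http://www.w3.org/XML/1998/namespace without needing a declaration. A minimal check with the standard library parser (not xsdata or lxml; the sample element is made up for illustration):

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# No xmlns:xml declaration anywhere, yet the document parses cleanly.
elem = ET.fromstring('<div xml:space="preserve">text</div>')

# The parser resolves the reserved prefix to the fixed namespace URI.
space = elem.get(f"{{{XML_NS}}}space")
print(space)
```

If a consuming tool rejects the attribute, the cause is more likely its schema handling than a missing declaration.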

@ansFourtyTwo
Author

If you need a real *.reqif file for testing, let me know and I'll prepare one. The one I use contains some sensitive data.

@tefra
Owner

tefra commented Jul 13, 2020

Yes please, a sample XML would be great. It seems like something with the sequential flag is maybe not being generated properly.

@ansFourtyTwo
Author

ansFourtyTwo commented Jul 13, 2020

Here is an example of such a *.reqif file:
example.reqif.txt

Please note that I added the ".txt" extension, as otherwise GitHub wouldn't have let me upload the file.

@tefra
Owner

tefra commented Jul 13, 2020

As a sanity check, you can grab the namespace registry from the parser and pass it to the serializer.render method so the output uses the same prefixes, which makes it easier to compare.

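from pathlib import Path
from xsdata.formats.dataclass.parsers.xml import XmlParser
from xsdata.formats.dataclass.serializers.xml import XmlSerializer
from reqif.models.omg_org_spec_req_if_20110401_reqif import ReqIf

parser = XmlParser()
obj = parser.from_path(Path("example.reqif.txt"), ReqIf)
serializer = XmlSerializer(pretty_print=True, encoding="ascii")
# Reuse the prefixes collected while parsing.
output = serializer.render(obj, parser.namespaces)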

  1. I think the xml namespace is implied and lxml makes sure its declaration is omitted. I have to check with lxml, but I can see the namespace map being prepared correctly before it is passed to lxml.

  2. All those xml:space="preserve" attributes are fixed-value attributes, and the serializer renders them regardless of the namespace they belong to. Maybe we could make that optional in the future, but it's not important right now.

  3. For the ordering, let's take a look at how, for example, DATATYPES is defined:

      <xsd:element maxOccurs="1" minOccurs="0" name="DATATYPES">
        <xsd:complexType>
          <xsd:choice maxOccurs="unbounded" minOccurs="0">
            <xsd:element name="DATATYPE-DEFINITION-BOOLEAN" type="REQIF:DATATYPE-DEFINITION-BOOLEAN"/>
            <xsd:element name="DATATYPE-DEFINITION-DATE" type="REQIF:DATATYPE-DEFINITION-DATE"/>
            <xsd:element name="DATATYPE-DEFINITION-ENUMERATION" type="REQIF:DATATYPE-DEFINITION-ENUMERATION"/>
            <xsd:element name="DATATYPE-DEFINITION-INTEGER" type="REQIF:DATATYPE-DEFINITION-INTEGER"/>
            <xsd:element name="DATATYPE-DEFINITION-REAL" type="REQIF:DATATYPE-DEFINITION-REAL"/>
            <xsd:element name="DATATYPE-DEFINITION-STRING" type="REQIF:DATATYPE-DEFINITION-STRING"/>
            <xsd:element name="DATATYPE-DEFINITION-XHTML" type="REQIF:DATATYPE-DEFINITION-XHTML"/>
          </xsd:choice>
        </xsd:complexType>
      </xsd:element>

Any of DATATYPE-DEFINITION-BOOLEAN, DATATYPE-DEFINITION-DATE, DATATYPE-DEFINITION-ENUMERATION, etc. can appear zero to unlimited times inside the DATATYPES element. The order is not restricted through an <xsd:sequence>, which implies that the order is not important. That's why the serializer uses the order of the fields as they are defined in the schema to build the XML tree.

Also, since these fields can appear more than once, they are generated as lists. The serializer goes through all the values of one list before moving on to the next field.
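
To illustrate the effect, here is a minimal sketch of field-by-field rendering using only the standard library. The Div class and render_by_field are hypothetical stand-ins, not xsdata's generated model or its serializer:

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Div:
    # Hypothetical stand-in for a generated model: one list per element name.
    br: List[str] = field(default_factory=list)
    strong: List[str] = field(default_factory=list)


def render_by_field(obj) -> List[str]:
    """Emit every value of one field before moving to the next field."""
    tags = []
    for f in fields(obj):
        for _ in getattr(obj, f.name):
            tags.append(f.name)
    return tags


# The source document interleaved the elements: strong, br, strong, br.
div = Div(br=["", ""], strong=["Titel", "Bearbeiter"])
rendered = render_by_field(div)
print(rendered)
```

Because each list only remembers the order within its own element name, the interleaving from the source document cannot be recovered at render time.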

In a nutshell, that's the normal behavior. I also ran the output through XSD validation and it passes without issues:

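from pathlib import Path

from lxml import etree
from xsdata.formats.dataclass.parsers.xml import XmlParser
from xsdata.formats.dataclass.serializers.xml import XmlSerializer
from reqif.models.omg_org_spec_req_if_20110401_reqif import ReqIf

parser = XmlParser()
obj = parser.from_path(Path("example.reqif.txt"), ReqIf)
serializer = XmlSerializer(pretty_print=True, encoding="ascii")
tree = serializer.render_tree(obj, parser.namespaces)
# Validate the serialized tree against the ReqIF schema.
xmlschema = etree.XMLSchema(etree.parse("schemas/reqif.xsd"))
xmlschema.assertValid(tree)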

Are you seeing any specific errors from the server side that's consuming these xml files?

@ansFourtyTwo
Author

ansFourtyTwo commented Jul 14, 2020

@tefra
Thank you for the hint about passing parser.namespaces to serializer.render(). It makes the document clearer.

Regarding ordering, I think we are talking about two different things. For the DATATYPES elements, I agree that ordering is not important.

The snippets I posted in the initial issue description, however, refer to XHTML elements, i.e. <div>, <strong> and <br/>. The result is (X)HTML-formatted text, so ordering matters there: changing it produces differently formatted output.

The serialized document may still be valid, but it looks different when opened.
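
A small standard-library sketch of why this matters: both documents below contain the same elements and text, but flattening them in document order gives different results. The sample strings and the flatten helper are made up for illustration and use unprefixed tags for brevity:

```python
import xml.etree.ElementTree as ET

original = "<div><b>Titel</b>: Some title<br/><b>Bearbeiter</b>: Some name<br/></div>"
reordered = "<div><br/><br/><b>Titel</b>: Some title<b>Bearbeiter</b>: Some name</div>"


def flatten(xml_text: str) -> str:
    """Walk in document order; emit text and a newline for each <br/>."""
    parts = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "br":
            parts.append("\n")
        if elem.text:
            parts.append(elem.text)
        if elem.tail:
            parts.append(elem.tail)
    return "".join(parts)


print(repr(flatten(original)))
print(repr(flatten(reordered)))
```

In the reordered version the line breaks all come first and the two labelled lines run together, which matches what a viewer would display.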

@tefra
Owner

tefra commented Jul 14, 2020

Oh, you mean this part from the sample, right?

<xhtml:b>Titel</xhtml:b> : Test Titel  <xhtml:br/>
 <xhtml:b>Bearbeiter</xhtml:b> : Test Bearbeiter  <xhtml:br/>
 <xhtml:b>Abt./OE</xhtml:b> : Test Abteilung  <xhtml:br/>
 <xhtml:b>Telefon</xhtml:b> : Test Telefon  <xhtml:br/>
 <xhtml:b>E-Mail:</xhtml:b>  some.body@example.com  <xhtml:br/>
 <xhtml:b>Erstausgabe:</xhtml:b>  09.03.2020  <xhtml:br/>
 <xhtml:b>Datum &#196;nderungsstand</xhtml:b> : 09.03.2020  <xhtml:br/>
 <xhtml:b>&#196;nderungsstand</xhtml:b> : V1.0

In the xsdata output, all the br elements are rendered first:

                  <ns1:br xml:space="preserve"/>
                  <ns1:br xml:space="preserve"/>
                  <ns1:br xml:space="preserve"/>
                  <ns1:br xml:space="preserve"/>
                  <ns1:br xml:space="preserve"/>
                  <ns1:br xml:space="preserve"/>
                  <ns1:br xml:space="preserve"/>
                  <ns1:b xml:space="preserve">Titel</ns1:b>: Test Titel
                  <ns1:b xml:space="preserve">Bearbeiter</ns1:b>: Test Bearbeiter
                  <ns1:b xml:space="preserve">Abt./OE</ns1:b>: Test Abteilung
                  <ns1:b xml:space="preserve">Telefon</ns1:b>: Test Telefon
                  <ns1:b xml:space="preserve">E-Mail:</ns1:b>some.body@example.com
                  <ns1:b xml:space="preserve">Erstausgabe:</ns1:b>09.03.2020
                  <ns1:b xml:space="preserve">Datum &#196;nderungsstand</ns1:b>: 09.03.2020
                  <ns1:b xml:space="preserve">&#196;nderungsstand</ns1:b>: V1.0

@tefra added the bug and mixed content labels Jul 14, 2020
@ansFourtyTwo
Author

ansFourtyTwo commented Jul 14, 2020 via email

@tefra closed this as completed in a0c4efe Jul 18, 2020
@tefra
Owner

tefra commented Jul 18, 2020

The issue was that when a class has both a mixed content field and a field that exactly matches an element's qualified name, the exact match always took priority, which shouldn't happen.
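
Roughly, the change in priority can be pictured like this. It is only a sketch of the idea, not the actual parser code: pick_field, prefer_mixed, and the "content" field name are hypothetical.

```python
from typing import Dict, List, Optional


def pick_field(
    qname: str,
    exact: Dict[str, str],
    mixed: Optional[str],
    prefer_mixed: bool,
) -> str:
    """Return the name of the field that should receive a child element."""
    if prefer_mixed and mixed is not None:
        # One ordered list keeps elements and text in document order.
        return mixed
    if qname in exact:
        return exact[qname]
    if mixed is not None:
        return mixed
    raise KeyError(qname)


exact_fields = {"b": "b", "br": "br"}  # hypothetical per-element list fields
children: List[str] = ["b", "br", "b"]

before = [pick_field(t, exact_fields, "content", prefer_mixed=False) for t in children]
after = [pick_field(t, exact_fields, "content", prefer_mixed=True) for t in children]
print(before)
print(after)
```

With the exact match winning, children are scattered across separate lists and their relative order is lost. Routing them to the single mixed content list keeps the order.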

I refactored a lot of the mixed content handling. You will need to regenerate your models because a new metadata key named mixed was introduced.

The sample you provided is now rendered correctly. Give it a try from master and let me know if it works as expected in other use cases as well.

              <THE-VALUE>
                <xhtml:p xml:space="preserve"><xhtml:b xml:space="preserve">Titel</xhtml:b>
                  : Test Titel
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">Bearbeiter</xhtml:b>: Test Bearbeiter
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">Abt./OE</xhtml:b>: Test Abteilung
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">Telefon</xhtml:b>: Test Telefon
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">E-Mail:</xhtml:b>some.body@example.com
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">Erstausgabe:</xhtml:b>09.03.2020
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">Datum &#196;nderungsstand</xhtml:b>:
                  09.03.2020
                  <xhtml:br xml:space="preserve"/>
                  <xhtml:b xml:space="preserve">&#196;nderungsstand</xhtml:b>: V1.0</xhtml:p>
              </THE-VALUE>
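
For checking other files, one option is to compare the element order of the source and the round-tripped output while ignoring prefixes. This is a standard-library sketch with made-up sample strings. local_tags is a hypothetical helper and only compares element names, not text:

```python
import xml.etree.ElementTree as ET


def local_tags(xml_text: str) -> list:
    """List element local names in document order, ignoring prefixes."""
    return [el.tag.rsplit("}", 1)[-1] for el in ET.fromstring(xml_text).iter()]


source = '<x:p xmlns:x="urn:example"><x:b>Titel</x:b>: Test<x:br/><x:b>Name</x:b></x:p>'
output = '<n:p xmlns:n="urn:example"><n:b>Titel</n:b>: Test<n:br/><n:b>Name</n:b></n:p>'

same_order = local_tags(source) == local_tags(output)
print(same_order)
```

For real files, the two strings would be the original document and the result of serializer.render.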

Thank you for reporting @ansFourtyTwo, this issue helped to improve mixed content handling a lot!

@ansFourtyTwo
Author

ansFourtyTwo commented Jul 21, 2020

Hi @tefra

The file I currently work with now looks good. Thank you very much. Once again, you are doing a great job. I have decided to use xsdata for one of my projects, as we do a lot of XML parsing. I hope that at some point I can dig deeper into your code and contribute some code of my own.

All the best.

@tefra
Owner

tefra commented Jul 21, 2020

Thank you @ansFourtyTwo,

This library is something I have wanted to create since the original suds SOAP client stopped being maintained. It gives me great pleasure to see other people use it!
