Streaming elements not directly below the root #32

Open
lpil opened this issue Jan 31, 2018 · 3 comments
Comments

lpil commented Jan 31, 2018

Hello! First, thank you for this library. I'm very happy with the performance improvement over xmerl.

I have a large XML document that I wish to stream. It looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tns:Envelope xmlns:tns="urn:Service:Integration">
  <tns:Body>
    <tns:RangeResponse>
      <tns:RangeResult>
        <tns:Detail> ...more content here... </tns:Detail>
        <tns:Detail> ...more content here... </tns:Detail>
        <tns:Detail> ...more content here... </tns:Detail>
        <tns:Detail> ...more content here... </tns:Detail>
        <tns:Detail> ...more content here... </tns:Detail>
        <tns:Detail> ...more content here... </tns:Detail>

        ... and so on...

      </tns:RangeResult>
    </tns:RangeResponse>
  </tns:Body>
</tns:Envelope>

I wish to process each Detail element in turn, getting messages like so:

{'$gen_event', {xmlstreamelement, {xmlel, "Detail", [], ...}}}

However, when I stream this document by passing each line to fxml_stream:parse/2, I get a single message containing the whole Body element, which means the Detail elements are parsed and processed eagerly rather than one at a time.

Is it possible to instruct fast_xml to use RangeResult as the root?
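
To make the request concrete, here is a minimal sketch of depth-based streaming. It is not fast_xml's API: it uses Python's standard `xml.etree.ElementTree.iterparse` purely as an illustration, and the name `stream_at_depth`, its `emit_depth` parameter, and the shortened `SAMPLE` document are assumptions made for this example. With a depth of 1 the whole Body arrives as one element, which matches the behaviour described above; with a depth of 4 each Detail arrives on its own.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the document above (hypothetical Detail contents).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tns:Envelope xmlns:tns="urn:Service:Integration">
  <tns:Body>
    <tns:RangeResponse>
      <tns:RangeResult>
        <tns:Detail>a</tns:Detail>
        <tns:Detail>b</tns:Detail>
        <tns:Detail>c</tns:Detail>
      </tns:RangeResult>
    </tns:RangeResponse>
  </tns:Body>
</tns:Envelope>"""


def stream_at_depth(source, emit_depth):
    """Yield each complete element found `emit_depth` levels below the root.

    The root element itself is at depth 0.
    """
    stack = []  # currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()
        # After the pop, len(stack) equals the depth of the element just closed.
        if len(stack) == emit_depth:
            yield elem
            if stack:
                # Detach the handled element so it can be garbage collected.
                stack[-1].remove(elem)


# Depth 1: a single, fully built Body element.
print([e.tag for e in stream_at_depth(io.BytesIO(SAMPLE), 1)])
# Depth 4: one Detail element at a time.
print([e.text for e in stream_at_depth(io.BytesIO(SAMPLE), 4)])
```

Detaching each element after it has been handled is what keeps memory roughly constant in this sketch; without that step the full tree would still accumulate under the root.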

Thanks,
Louis

prefiks (Member) commented Jan 31, 2018

No, this is not configurable at the moment.

lpil (Author) commented Jan 31, 2018

Thank you

thbar commented Dec 7, 2024

Hello! I didn't fully realise this at first, but it means that for any XML document with some nesting, like the one below where the actual data sits under the dataObjects node, the whole node is held in memory. In my case that can be more than 100 MB; I've replaced the lengthy content with a placeholder ManyElementsHere element:

<?xml version="1.0" encoding="UTF-8"?>
<PublicationDelivery xmlns="http://www.netex.org.uk/netex" version="1.04:FR1-NETEX-1.6-1.8">
  <PublicationTimestamp>2024-12-06T07:27:50Z</PublicationTimestamp>
  <ParticipantRef>FR1_OFFRE</ParticipantRef>
  <dataObjects>
    <GeneralFrame id="FR1:GeneralFrame:NETEX_CALENDRIER-20241206T072750Z:LOC" version="1.8" dataSourceRef="FR1-OFFRE_AUTO">
      <ManyElementsHere>
        <Element></Element>
        <Element></Element>
        <Element></Element>
        <Element></Element>
        <Element></Element>
        <Element></Element>
      </ManyElementsHere>
    </GeneralFrame>
  </dataObjects>
</PublicationDelivery>

In my current use case (XML files larger than 100 MB), this ends up closer to DOM parsing than to stream parsing, and as a result it puts a hard limit on the maximum file size I can process.

I wonder how complicated it would be to let the user decide programmatically, via a function, whether a message should be emitted, based on business logic. In some cases I'm happy with the upper two or three nodes; in others, diving into some nodes but not into others is what would make the parsing scalable.
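
As a rough illustration of the idea, here is a sketch in which a user-supplied function receives the path of each closing element and decides whether it is emitted. This is not an existing fast_xml feature: the sketch uses Python's standard `xml.etree.ElementTree.iterparse`, and the names `stream_where`, `should_emit`, `wants`, and the trimmed `NETEX_SAMPLE` document are all assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for the NeTEx document above (attributes omitted).
NETEX_SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<PublicationDelivery xmlns="http://www.netex.org.uk/netex">
  <PublicationTimestamp>2024-12-06T07:27:50Z</PublicationTimestamp>
  <dataObjects>
    <GeneralFrame>
      <ManyElementsHere>
        <Element></Element>
        <Element></Element>
        <Element></Element>
      </ManyElementsHere>
    </GeneralFrame>
  </dataObjects>
</PublicationDelivery>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def stream_where(source, should_emit):
    """Yield (path, element) for each closed element the predicate accepts.

    `path` is a tuple of local tag names from the root down to the element.
    """
    path, stack = [], []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local_name(elem.tag))
            stack.append(elem)
            continue
        if should_emit(tuple(path)):
            yield tuple(path), elem
            if len(stack) > 1:
                # Detach the emitted element from its parent to free memory.
                stack[-2].remove(elem)
        path.pop()
        stack.pop()


def wants(path):
    """Business rule for this example: shallow metadata plus each deep Element."""
    return (
        path[-1] == "PublicationTimestamp"
        or path[-2:] == ("ManyElementsHere", "Element")
    )


for path, elem in stream_where(io.BytesIO(NETEX_SAMPLE), wants):
    print("/".join(path), repr(elem.text))
```

Elements the predicate rejects are still built and kept under their parents in this sketch, so it only bounds memory when the bulk of the document sits in the emitted elements.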

Thanks!
