Implement a pull parser #10

Closed
hadley opened this issue Feb 20, 2015 · 7 comments

Comments

@hadley
Member

hadley commented Feb 20, 2015

A pull parser would be useful for large XML files, or when you just want to extract a little bit of data without loading the entire file into memory. See, e.g., lxml's incremental event parsing: http://lxml.de/parsing.html#incremental-event-parsing

If you need this, please 👍 this issue with a brief use case
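
For readers unfamiliar with the idea, here is a minimal sketch of incremental parsing in Python. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml or xml2, and the `iter_text` helper and sample document are made up for illustration; it is not a proposal for how xml2 would expose this.

```python
import io
import xml.etree.ElementTree as ET


def iter_text(source, tag):
    """Yield the text of each <tag> element as the parser reaches it.

    `iter_text` is a hypothetical helper, not part of any library.
    """
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
            # Drop the element's text, attributes and children. The parent
            # still keeps a reference to the now-empty element.
            elem.clear()


# Tiny in-memory stand-in for a large file on disk.
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other>x</other></root>")
items = list(iter_text(sample, "item"))
print(items)
```

The point of the pattern is that each element is handled and released as it streams past, so peak memory depends on the largest element of interest rather than on the whole file.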

@cboettig

👍 This would be awesome (if I've understood it correctly). I often work with some large dump of XML files, like this: https://github.com/TreeBASE/supertreebase/tree/master/data/treebase

The standard use case is that I want to filter the collection down to only those files that have certain attributes, e.g. identifying the documents that include my species of interest as one of the taxa (i.e. the document contains a matching <otu label="species name"> element).

Loading all the documents into memory first is often prohibitive, which forces the user to parse, search, and then remove each object one at a time, and at least with XML this can all be very slow.
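
A rough sketch of that filtering use case with a streaming parser, again in Python's standard library for illustration only. The `has_otu_label` helper, the file names, and the sample documents are hypothetical; real TreeBASE files are namespaced, which is why the tag's namespace prefix is stripped before comparing.

```python
import io
import xml.etree.ElementTree as ET


def has_otu_label(source, label):
    """Return True as soon as an <otu> element with the given label is seen.

    Hypothetical helper: parsing stops at the first match, so the rest of
    the document is never read.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace}otu" when the document is namespaced.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "otu" and elem.get("label") == label:
            return True
        # Release the contents of elements that did not match.
        elem.clear()
    return False


# Made-up stand-ins for files in a large collection.
docs = {
    "a.xml": b'<nexml><otus><otu id="o1" label="Homo sapiens"/>'
             b'<otu id="o2" label="Pan troglodytes"/></otus></nexml>',
    "b.xml": b'<nexml><otus><otu id="o3" label="Mus musculus"/></otus></nexml>',
}
matches = [
    name for name, doc in docs.items()
    if has_otu_label(io.BytesIO(doc), "Homo sapiens")
]
print(matches)
```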

@hadley hadley mentioned this issue May 4, 2016
@jimhester
Member

The libxml2 SAX interface is documented at http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html

The relevant Cython code from lxml is https://github.com/lxml/lxml/blob/572e10843774a5d6300125d89bdc423d53c92971/src/lxml/saxparser.pxi

Implementing this is clearly non-trivial and has entirely different semantics from what we currently use (SAX callbacks vs. a tree / DOM-based API).

This feature is a long way away and may be better implemented in a separate package entirely.
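
To illustrate how different the callback style is from a tree-based API, here is a small sketch using Python's standard `xml.sax` module. It is not libxml2's C interface or anything from xml2; the `OtuLabelCollector` class and the sample input are invented for the example.

```python
import xml.sax


class OtuLabelCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the label of every <otu> element."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def startElement(self, name, attrs):
        # Called by the parser for each opening tag. No tree is built, so
        # any state we need has to be kept on the handler itself.
        if name == "otu" and "label" in attrs:
            self.labels.append(attrs["label"])


handler = OtuLabelCollector()
xml.sax.parseString(
    b'<otus><otu label="A"/><otu label="B"/><otu/></otus>', handler
)
print(handler.labels)
```

With a tree API the caller asks questions of a finished document; here the parser drives and the caller only reacts to events, which is why it would not slot neatly into an existing DOM-style interface.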

@hadley
Member Author

hadley commented Dec 22, 2016

Let's close for now.

@hadley hadley closed this as completed Dec 22, 2016
@randomgambit

Hey mister @jimhester, any news on this? Ultra-large XML files are becoming the norm today... SAX looks like a necessity, even though I don't like saxophones (haha)

Thanks!

@jimhester
Member

This feature is a long way away and may be better implemented in a separate package entirely.

@randomgambit

Dude, it's been more than a year!!! :) I'm kidding. Thanks anyway for the heads-up.

@wkumler
Contributor

wkumler commented Sep 22, 2024

Upvoting this issue! I know it's been a long while, but a new instrument we've got in the lab produces 1-10 GB XML files that need to be parsed. While I can currently use read_xml without any major problems, trying to find specific nodes with xml_find_all or xml_find_first rapidly consumes all available memory and errors out. I'm currently switching to the XML library for this, but I've found the syntax is a lot messier there.


5 participants