Implement a pull parser #10
Comments
👍 This would be awesome (if I've understood it correctly). I often work with large dumps of XML files, like this one: https://github.com/TreeBASE/supertreebase/tree/master/data/treebase The standard use case is that I want to filter the collection to isolate only those files that have certain attributes, e.g. identifying the documents that include my species of interest as one of the taxa (e.g. matches the […]). Loading all the documents into memory first is often prohibitive in terms of memory, forcing the user to parse, search, and then remove each object, and at least with […]
The libxml2 SAX interface is documented at http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html The relevant Cython code from lxml is https://github.com/lxml/lxml/blob/572e10843774a5d6300125d89bdc423d53c92971/src/lxml/saxparser.pxi Implementing this is clearly non-trivial and has entirely different semantics from what we currently use (SAX callbacks vs. tree/DOM-based parsing). This feature is a long way off and may be better implemented in a separate package entirely.
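To illustrate the difference in semantics mentioned above, here is a minimal sketch in Python using only the standard library (`xml.sax` and `xml.etree.ElementTree`). It is not this package's API and not the lxml code linked above; the element name `otu`, the attribute `label`, and the helper names are assumptions made up for the example.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical sample document; real files would be far larger.
XML = b"<taxa><otu label='Homo sapiens'/><otu label='Pan troglodytes'/></taxa>"


class LabelCollector(xml.sax.ContentHandler):
    """SAX style: the parser calls us back for each start tag as it streams."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def startElement(self, name, attrs):
        # No tree exists here; we only see the current element's name and attributes.
        if name == "otu":
            self.labels.append(attrs.get("label"))


def labels_via_sax(data):
    handler = LabelCollector()
    xml.sax.parseString(data, handler)
    return handler.labels


def labels_via_tree(data):
    """Tree/DOM style: the whole document is built in memory, then queried."""
    root = ET.fromstring(data)
    return [otu.get("label") for otu in root.iter("otu")]


print(labels_via_sax(XML))
print(labels_via_tree(XML))
```

Both functions return the same labels, but the SAX version keeps state in a handler object and never holds the full document, which is why it does not map directly onto a tree-based interface.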
Let's close for now.
Hey @jimhester, any news on this? Ultra-large XML files are becoming the norm today... SAX looks like a necessity, even though I don't like saxophones (haha). Thanks!
Dude, it's been more than a year!!! :) I'm kidding. Thanks anyway for the heads-up.
Upvoting this issue! I know it's been a long while, but a new instrument we've got in the lab produces 1-10 GB XML files that need to be parsed, and while I can currently use […]
For large XML files, or when you just want to extract a little bit of data without loading the entire file into memory, e.g. lxml's incremental event parsing: http://lxml.de/parsing.html#incremental-event-parsing
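As a rough sketch of what incremental, pull-style parsing looks like, the following uses `iterparse` from Python's standard `xml.etree.ElementTree` module (lxml offers a similar function). It is only an illustration of the idea, not a proposed API for this package; the function name `contains_label` and the `otu`/`label` names are hypothetical, and namespaces are left out for brevity.

```python
import io
import xml.etree.ElementTree as ET


def contains_label(source, tag, wanted):
    """Return True as soon as an element `tag` with label == `wanted` ends.

    `source` may be a filename or a binary file object. Elements are cleared
    after they are inspected so their content does not pile up in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag and elem.get("label") == wanted:
            return True  # stop early; the rest of the file is never parsed
        elem.clear()  # drop text, attributes, and children we no longer need
    return False


# Hypothetical in-memory document standing in for a large file on disk.
doc = b"<taxa><otu label='Homo sapiens'/><otu label='Pan troglodytes'/></taxa>"
print(contains_label(io.BytesIO(doc), "otu", "Pan troglodytes"))
```

The caller pulls events one at a time and can stop as soon as the answer is known, which suits the "extract a little bit of data" case. Note that cleared elements still leave empty stubs attached to the root, so very long documents may need additional cleanup.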
If you need this, please 👍 this issue with a brief use case