
Use generator in streaming mode #88

Open
sirex opened this issue Mar 25, 2015 · 18 comments

Comments

@sirex

sirex commented Mar 25, 2015

Callbacks are not a very convenient way to handle streaming. For example:

from gzip import GzipFile

import xmltodict

def handle_artist(_, artist):
    print(artist['name'])
    return True

xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
                item_depth=2, item_callback=handle_artist)

Could look like this:

data = GzipFile('discogs_artists.xml.gz')
for _, artist in xmltodict.parse(data, item_depth=2):
    print(artist['name'])

This looks much more pythonic.

@sirex sirex changed the title from "Use generator for streaming mode" to "Use generator in streaming mode" on Mar 25, 2015
@martinblech
Owner

I agree it's much more pythonic. If someone figures out how to turn the SAX parser into a generator, I'd be glad to review and merge the pull request.

sirex added a commit to sirex/xmltodict that referenced this issue Mar 25, 2015
The only way I could implement the generator is by using the `threading` module. I tried playing with `generator.send`, but without luck; I could not find any other way to hand control out of the callback.

With this `threading`-based implementation, the only issue I see is incomplete parsing. For example:

    data = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
    next(parse(data, item_depth=2))

Here we take only a single item, and a thread is left waiting on the queue forever. Since the threads are daemonic they will not block process termination, but in cases where many daemonic threads are left running, memory will leak.
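The thread-plus-queue approach described in this commit message can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`callbacks_to_generator`, `fake_parse`), not xmltodict's actual code:

```python
import queue
import threading

_SENTINEL = object()

def callbacks_to_generator(run_with_callback):
    """Run a callback-based producer in a daemon thread and yield its items.

    `run_with_callback` is any function that takes a single callback and
    invokes it once per item (e.g. a SAX-driven parser).
    """
    q = queue.Queue(maxsize=1)

    def producer():
        try:
            run_with_callback(q.put)
        finally:
            q.put(_SENTINEL)  # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

# A toy callback-based "parser" standing in for the SAX machinery:
def fake_parse(callback):
    for n in (1, 2, 3):
        callback(n)

print(list(callbacks_to_generator(fake_parse)))  # [1, 2, 3]
```

If the consumer stops early (e.g. after a bare `next()`), the producer thread stays blocked on `q.put()`, which is exactly the incomplete-parsing leak described above.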
@bzamecnik
Contributor

+1 for this problem.

I've come across this while using the results of a streaming parse() in a generator. In my case I'd like to feed it into unparse() and transform one XML into another in a streaming way.

Feeding unparse() with lazily generated input should be solved in pull request #92 (for issue #91). What remains is connecting the parse() callback to an unparse() generator. I've tried to use a single-element coroutine-based queue (asyncio.Queue) for this purpose. My code is not complete yet, but I have a working prototype of the queue usage on a simple producer-consumer problem. I hope to integrate it with xmltodict in a way similar to sirex/xmltodict@2c8002a, without the need for threads.

@sirex
Author

sirex commented Apr 13, 2015

@bzamecnik since you started working on a different approach, I assume you saw some issues with my thread-based solution? What issues do you see?

Threads will not be visible to the end user, and there should be no thread-safety issues, because the separate thread that handles the SAX callbacks is completely isolated and is not exposed via the API in any way.

By the way, the first thing I tried was actually asyncio, but unfortunately I could not find a solution with it that avoids threads. I could not find a way to turn a callback function into a generator that can be registered with the event loop. If you succeed in doing this, I would be really interested in seeing it... :)

@bzamecnik
Contributor

So I delved deeper into asyncio, coroutines and generators with the goal of turning callbacks into generators, unfortunately without success. The problem is that on one end I need a generator (pulling data from the source) and on the other end I have a coroutine (pushing data from the source). I need the asyncio event loop to start both the producer and the consumer, but it doesn't let me drive the computation with a for loop. To synchronize the two I tried to use a Queue, but then the problem is that putting into the queue is a coroutine which has to be wrapped into a task, which gets executed later in the loop, so the computation is not lazy at all. What I'd like is for putting an item into the queue inside the callback to block and switch execution to the consumer.

After many tries I've created a thread on Stack Overflow, so possibly someone more experienced might help us: http://stackoverflow.com/questions/29724273/transform-callbacks-to-generator-in-python
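For context, the coroutine-queue handoff itself works fine when both ends are coroutines; the obstacle described in this comment is that a synchronous SAX callback cannot `await q.put(...)`. A minimal producer-consumer sketch with `asyncio.Queue` (written here with `asyncio.run`, which is Python 3.7+ and did not exist at the time of this discussion):

```python
import asyncio

async def producer(q):
    for i in range(3):
        await q.put(i)     # suspends whenever the queue is full
    await q.put(None)      # sentinel: end of stream

async def consumer(q):
    out = []
    while True:
        item = await q.get()
        if item is None:
            return out
        out.append(item)

async def main():
    q = asyncio.Queue(maxsize=1)  # single-element queue, as in the comment
    results, _ = await asyncio.gather(consumer(q), producer(q))
    return results

print(asyncio.run(main()))  # [0, 1, 2]
```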

@bzamecnik
Contributor

At least I got the code with threads working. After a lot of fiddling with the producer-consumer pattern it works. The producer can be notified when the generator is closed and finish its thread, so it is possible to break out of the for loop and have the producer thread finish correctly. The trick to coordinating the consumer and producer is to use two single-element queues (one for requests and the other for responses).

A generator can either finish normally or be closed, e.g. when a break occurs before the generator completes. In that case we can signal the producer that it should finish. The failing test case was a bit of a strange usage, since it neither fully iterated the generator nor closed it. It would correspond to a situation where processing one item hangs. A possible behavior is to finish the producer thread after some period of consumer inactivity.

The new code seems to work OK. I'll try to clean it up a bit and commit it tomorrow.
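The two-queue request/response coordination described in this comment might look roughly like this. This is a hypothetical sketch of the pattern, not the code from the actual branch:

```python
import queue
import threading

_DONE = object()

def two_queue_generator(items):
    requests = queue.Queue(maxsize=1)   # consumer -> producer: "next" / "stop"
    responses = queue.Queue(maxsize=1)  # producer -> consumer: item / _DONE

    def producer():
        for item in items:
            if requests.get() == "stop":  # wait for the consumer's request
                return                    # generator was closed: finish thread
            responses.put(item)
        requests.get()
        responses.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    try:
        while True:
            requests.put("next")
            item = responses.get()
            if item is _DONE:
                return
            yield item
    finally:
        requests.put("stop")  # runs on close()/break too, so the thread exits
```

A bare break out of a loop over this generator leaves it suspended; an explicit `gen.close()` (or garbage collection in CPython) triggers the `finally` block and releases the producer thread deterministically.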

@martinblech
Owner

That's excellent news! I think that this can be a great contribution to xmltodict, as generators are much more pythonic than callbacks.
Creating threads under the hood can be a very unexpected side-effect for the user, so I think we must first make sure that they will be destroyed every time they're not being used anymore, no matter how the iteration was stopped.

@sirex
Author

sirex commented Apr 20, 2015

I didn't try it, but I guess it would work if the generator had a destructor; destructors are called automatically by the garbage collector, so it could take care of the threads.
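This is indeed roughly how CPython behaves: when a generator is garbage collected, its close() method is invoked, which raises GeneratorExit at the suspended yield, so a try/finally inside the generator can release the thread. A toy sketch (names hypothetical):

```python
def resource_generator(cleanup):
    # `cleanup` stands in for stopping a producer thread
    try:
        yield from range(3)
    finally:
        cleanup()  # runs on exhaustion, on close(), and on garbage collection

released = []
gen = resource_generator(lambda: released.append(True))
next(gen)     # consume a single item, then abandon the generator
gen.close()   # explicit close; CPython's GC would eventually do the same
print(released)  # [True]
```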

bzamecnik added a commit to bzamecnik/xmltodict that referenced this issue Apr 20, 2015
…roducer thread after a timeout.

Add another test case - close the generator by a break in the for loop.
@bzamecnik
Contributor

So I committed the code I have. It seems to work in all Python environments defined in tox: https://github.com/bzamecnik/xmltodict/tree/%2388-streaming-parse-with-generator

I'm afraid that if we want to get rid of the thread and use asyncio, it couldn't be used from Python 2.x.

@sirex
Author

sirex commented Apr 21, 2015

@bzamecnik what would happen in this case:

for _, item in xmltodict.parse(data, item_depth=2):
    process_item(item)  # takes 2 seconds to process

?

If I understood correctly, the generator will terminate? Maybe it is possible to clean up the threads using a destructor: if the generator gets garbage collected, it is completely safe to free the thread, because no references to the generator remain.

@bzamecnik
Contributor

@sirex The timeout is configurable, so as a user you can increase it if you expect item processing to take long. But anyway, the important point is that the proper usage of a generator is to either fully consume it or close it. If the user does not close the generator, that is incorrect usage, and the user cannot be surprised that some resource gets leaked.

Your suggestion of automatically cleaning up the resource after use seems promising. Since we're holding a resource in the generator, a proper pattern might be to wrap the whole generator usage in a with statement which takes care of resource allocation and closing (just an idea). I'm not a Python expert, but I'd be worried that the destructor doesn't get called, or gets called too late (compared to 'with'); also, the 'with' statement clearly indicates that there's a kind of resource allocation inside. This way we could completely eliminate the hack with the timeout.

with xmltodict.parse(data, item_depth=2) as gen:
    for _, item in gen:
        process_item(item)  # takes 2 seconds to process
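This pattern could be prototyped for any generator with contextlib; `streaming` here is a hypothetical wrapper, not part of xmltodict:

```python
import contextlib

@contextlib.contextmanager
def streaming(gen):
    try:
        yield gen
    finally:
        gen.close()  # guaranteed cleanup, even on break or exception

def items():
    yield from [1, 2, 3]

with streaming(items()) as gen:
    for item in gen:
        if item == 2:
            break  # the with block still closes the generator deterministically
```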

Btw: it seems that the timeout doesn't work on pypy. https://travis-ci.org/martinblech/xmltodict/builds/59315990

bzamecnik added a commit to bzamecnik/xmltodict that referenced this issue Apr 21, 2015
@kmonsoor

kmonsoor commented May 4, 2015

@bzamecnik I spent quite a bit of time trying to use your code to consume the input file as a stream handled by a generator, but couldn't get my head around it; I got stuck at something like a "Producer started" message.

Failing that, I had to use lxml.objectify.parse as an intermediary. After initializing, I used root.iterchildren like below:

from lxml import etree, objectify
import xmltodict as x2d

tree = objectify.parse(file_path)
root = tree.getroot()
generator = root.iterchildren()
...
data = next(generator)
data_dict = x2d.parse(etree.tostring(data))  # parse() needs a string, not an lxml element

Of course, instead of objectify I could use the typical lxml.etree.parse(); but objectify helped me do some cleanup.

@bzamecnik
Contributor

Hi @kmonsoor, this approach seems interesting, I should try it as well. As for that debug message, it must have been from some old commit. Please have a look at the fixed code (in the #99 pull request from the #88-streaming-parse-with-generator branch). The usage would be like:

>>> with xmltodict.parse(file_path, item_depth=2) as gen:
...     for path, item in gen:
...         print(path, item) # do whatever...

@jonlooney

Pull request #104 has support for an iterator/generator. It lets you do things like this:

>>> for (path, item) in xmltodict.parse(fileObj, item_depth=2, generator=True):
...     print("%r, %r" % (path, item))
...     # Or, whatever other code you want.

In all but Jython, it does this through incremental reads. On Jython, it appears that the entire document is read first. (At least, the Travis CI engine shows it is failing my unit test that checks for this. It may be a problem with my unit test, or this may legitimately not work quite as expected on Jython.)
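As a thread-free illustration of the incremental approach, Python's xml.sax parsers implement the IncrementalParser interface (feed()/close()), so a generator can drive the parse chunk by chunk. This sketch collects the text of elements at a given depth; `ItemHandler` and `iterparse` are hypothetical names, not the code from #104:

```python
import io
import xml.sax
import xml.sax.handler

class ItemHandler(xml.sax.handler.ContentHandler):
    """Collects (name, text) of each element at depth `item_depth`."""
    def __init__(self, item_depth):
        super().__init__()
        self.item_depth = item_depth
        self.depth = 0
        self.text = []
        self.items = []  # drained by the generator after each feed()

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == self.item_depth:
            self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if self.depth == self.item_depth:
            self.items.append((name, "".join(self.text)))
        self.depth -= 1

def iterparse(stream, item_depth=2, chunk_size=1024):
    parser = xml.sax.make_parser()
    handler = ItemHandler(item_depth)
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)        # incremental parse: no threads involved
        yield from handler.items  # hand over whatever completed in this chunk
        handler.items = []
    parser.close()
    yield from handler.items

doc = io.BytesIO(b'<a x="y"><b>1</b><b>2</b><b>3</b></a>')
print(list(iterparse(doc)))  # [('b', '1'), ('b', '2'), ('b', '3')]
```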

@mcrowson

mcrowson commented Dec 7, 2016

It looks like #104 has the generator support, and it hasn't been touched since May. If I pulled that out and made it a standalone PR, would you accept it @martinblech?

@martinblech
Owner

@mcrowson absolutely! If you pull the generator stuff to a standalone PR I'll take a look.

@harryjubb

Any news on this?

@mcrowson

mcrowson commented Jun 21, 2017 via email

@geekscrapy

Was this ever implemented?


8 participants