
Use generator in streaming mode #88

Open
sirex opened this issue Mar 25, 2015 · 18 comments

Comments

@sirex

sirex commented Mar 25, 2015

Callbacks are not a very convenient way to handle streaming. For example:

from gzip import GzipFile

import xmltodict

def handle_artist(_, artist):
    print(artist['name'])
    return True

xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
                item_depth=2, item_callback=handle_artist)

Could look like this:

data = GzipFile('discogs_artists.xml.gz')
for _, artist in xmltodict.parse(data, item_depth=2):
    print(artist['name'])

This looks much more pythonic.

@sirex sirex changed the title from "Use generator for streaming mode" to "Use generator in streaming mode" on Mar 25, 2015
@martinblech
Owner

I agree it's much more pythonic. If someone figures out how to turn the SAX parser into a generator, I'd be glad to review and merge the pull request.

sirex added a commit to sirex/xmltodict that referenced this issue Mar 25, 2015
The only way I could implement the generator is by using the `threading` module. I tried playing with `generator.send`, but without luck; I could not find any other way to hand control out of the callback.

With this `threading`-based implementation, the only issue I see is incomplete parsing. For example:

    data = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
    next(parse(data, item_depth=2))

Here we take only a single item, and a thread is left waiting on the queue forever. Since the threads are daemonic they will not block process termination, but in cases where many daemonic threads are left running, memory will leak.
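The thread-plus-queue approach described in this commit message can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`callbacks_to_generator`, `fake_parse`), not xmltodict's actual code:

```python
import queue
import threading

_SENTINEL = object()

def callbacks_to_generator(run_with_callback):
    """Run a callback-based producer in a daemon thread and yield its items.

    `run_with_callback` is any function that takes a single callback and
    invokes it once per item (e.g. a SAX-driven parser).
    """
    q = queue.Queue(maxsize=1)

    def producer():
        try:
            run_with_callback(q.put)
        finally:
            q.put(_SENTINEL)  # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

# A toy callback-based "parser" standing in for the SAX machinery:
def fake_parse(callback):
    for n in (1, 2, 3):
        callback(n)

print(list(callbacks_to_generator(fake_parse)))  # [1, 2, 3]
```

If the consumer stops early (e.g. after a bare `next()`), the producer thread stays blocked on `q.put()`, which is exactly the incomplete-parsing leak described above.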
@bzamecnik
Contributor

+1 for this problem.

I've come across this while using the results of a streaming parse() in a generator. In my case I'd like to feed it into unparse() and transform one XML into another in a streaming way.

Feeding unparse() with lazily generated input should be solved in pull request #92 (for issue #91). What remains is connecting the parse() callback to an unparse() generator. I've tried to use a single-element coroutine-based queue (asyncio.Queue) for this purpose. My code is not complete yet, but I have a working prototype of the queue usage on a simple producer-consumer problem. I hope to integrate it with xmltodict in a way similar to sirex/xmltodict@2c8002a, without the need for threads.

@sirex
Author

sirex commented Apr 13, 2015

@bzamecnik since you started working on a different approach, I assume you saw some issues with my thread-based solution? What issues do you see?

Threads will not be visible to the end user, and there should be no thread-safety issues, because the separate thread that handles the SAX callbacks is completely isolated and is not exposed via the API in any way.

By the way, the first thing I tried was actually asyncio, but unfortunately I could not find a solution with it that avoids threads. I could not find a way to turn a callback function into a generator that can be registered with the event loop. If you succeed in doing this, I would be really interested in seeing it... :)

@bzamecnik
Contributor

So I delved deeper into asyncio, coroutines and generators with the goal of turning callbacks into generators, unfortunately without success. The problem is that on one end I need a generator (pulling data from the source) and on the other end I have a coroutine (pushing data from the source). I need the asyncio event loop to start both the producer and the consumer, but it doesn't let me drive the computation with a for loop. To synchronize the two I tried to use a Queue, but then the problem is that putting into the queue is a coroutine which has to be wrapped into a task, which gets executed later in the loop, so the computation is not lazy at all. What I'd like is for putting an item into the queue inside the callback to block and switch execution to the consumer.

After many tries I've created a thread on Stack Overflow, so possibly someone more experienced might help us: http://stackoverflow.com/questions/29724273/transform-callbacks-to-generator-in-python
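For context, the coroutine-queue handoff itself works fine when both ends are coroutines; the obstacle described in this comment is that a synchronous SAX callback cannot `await q.put(...)`. A minimal producer-consumer sketch with `asyncio.Queue` (written here with `asyncio.run`, which is Python 3.7+ and did not exist at the time of this discussion):

```python
import asyncio

async def producer(q):
    for i in range(3):
        await q.put(i)     # suspends whenever the queue is full
    await q.put(None)      # sentinel: end of stream

async def consumer(q):
    out = []
    while True:
        item = await q.get()
        if item is None:
            return out
        out.append(item)

async def main():
    q = asyncio.Queue(maxsize=1)  # single-element queue, as in the comment
    results, _ = await asyncio.gather(consumer(q), producer(q))
    return results

print(asyncio.run(main()))  # [0, 1, 2]
```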

@bzamecnik
Contributor

At least I got the code with threads working. After a lot of fiddling with the producer-consumer pattern it works. The producer can be notified when the generator is closed and finish its thread, so it is possible to break out of the for loop and have the producer thread finish correctly. The trick to coordinating the consumer and producer is to use two single-element queues (one for requests and the other for responses).

A generator can either finish normally or be closed, e.g. when a break occurs before the generator completes. In that case we can signal the producer that it should finish. The failing test case was a bit of a strange usage, since it neither fully iterated the generator nor closed it. It would correspond to a situation where processing one item hangs. A possible behavior is to finish the producer thread after some period of consumer inactivity.

The new code seems to work OK. I'll try to clean it up a bit and commit it tomorrow.
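The two-queue request/response coordination described in this comment might look roughly like this. This is a hypothetical sketch of the pattern, not the code from the actual branch:

```python
import queue
import threading

_DONE = object()

def two_queue_generator(items):
    requests = queue.Queue(maxsize=1)   # consumer -> producer: "next" / "stop"
    responses = queue.Queue(maxsize=1)  # producer -> consumer: item / _DONE

    def producer():
        for item in items:
            if requests.get() == "stop":  # wait for the consumer's request
                return                    # generator was closed: finish thread
            responses.put(item)
        requests.get()
        responses.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    try:
        while True:
            requests.put("next")
            item = responses.get()
            if item is _DONE:
                return
            yield item
    finally:
        requests.put("stop")  # runs on close()/break too, so the thread exits
```

A bare break out of a loop over this generator leaves it suspended; an explicit `gen.close()` (or garbage collection in CPython) triggers the `finally` block and releases the producer thread deterministically.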

@martinblech
Owner

That's excellent news! I think that this can be a great contribution to xmltodict, as generators are much more pythonic than callbacks.
Creating threads under the hood can be a very unexpected side-effect for the user, so I think we must first make sure that they will be destroyed every time they're not being used anymore, no matter how the iteration was stopped.

@sirex
Author

sirex commented Apr 20, 2015

I didn't try it, but I guess it would work if the generator had a destructor; destructors are called automatically by the garbage collector, so it could take care of the threads.
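This is indeed roughly how CPython behaves: when a generator is garbage collected, its close() method is invoked, which raises GeneratorExit at the suspended yield, so a try/finally inside the generator can release the thread. A toy sketch (names hypothetical):

```python
def resource_generator(cleanup):
    # `cleanup` stands in for stopping a producer thread
    try:
        yield from range(3)
    finally:
        cleanup()  # runs on exhaustion, on close(), and on garbage collection

released = []
gen = resource_generator(lambda: released.append(True))
next(gen)     # consume a single item, then abandon the generator
gen.close()   # explicit close; CPython's GC would eventually do the same
print(released)  # [True]
```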

bzamecnik added a commit to bzamecnik/xmltodict that referenced this issue Apr 20, 2015
…roducer thread after a timeout.

Add another test case - close the generator by a break in the for loop.
@bzamecnik
Contributor

So I committed the code I have. It seems to work in all Python environments defined in tox: https://github.com/bzamecnik/xmltodict/tree/%2388-streaming-parse-with-generator

I'm afraid that if we want to get rid of the thread and use asyncio, it couldn't be used from Python 2.x.

@sirex
Author

sirex commented Apr 21, 2015

@bzamecnik what would happen in this case:

for _, item in xmltodict.parse(data, item_depth=2):
    process_item(item)  # takes 2 seconds to process

?

If I understood correctly, the generator will terminate? Maybe it is possible to clean up the threads using a destructor: if the generator gets garbage collected, it is completely safe to free the thread, because no references to the generator remain.

@bzamecnik
Contributor

@sirex The timeout is configurable, so as a user you can increase it if you expect item processing to take long. But anyway, the important point is that the proper usage of a generator is to either fully consume it or close it. If the user does not close the generator, that is incorrect usage, and the user cannot be surprised that some resource gets leaked.

Your suggestion of automatically cleaning up the resource after use seems promising. Since we're holding a resource in the generator, a proper pattern might be to wrap the whole generator usage in a with statement which takes care of resource allocation and closing (just an idea). I'm not a Python expert, but I'd be worried that the destructor doesn't get called, or gets called too late (compared to 'with'); also, the 'with' statement clearly indicates that there's a kind of resource allocation inside. This way we could completely eliminate the hack with the timeout.

with xmltodict.parse(data, item_depth=2) as gen:
    for _, item in gen:
        process_item(item)  # takes 2 seconds to process
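This pattern could be prototyped for any generator with contextlib; `streaming` here is a hypothetical wrapper, not part of xmltodict:

```python
import contextlib

@contextlib.contextmanager
def streaming(gen):
    try:
        yield gen
    finally:
        gen.close()  # guaranteed cleanup, even on break or exception

def items():
    yield from [1, 2, 3]

with streaming(items()) as gen:
    for item in gen:
        if item == 2:
            break  # the with block still closes the generator deterministically
```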

Btw: it seems that the timeout doesn't work on pypy. https://travis-ci.org/martinblech/xmltodict/builds/59315990

bzamecnik added a commit to bzamecnik/xmltodict that referenced this issue Apr 21, 2015
@kmonsoor

kmonsoor commented May 4, 2015

@bzamecnik I spent quite a bit of time trying to use your code to consume the input file as a stream handled by a generator, but couldn't get my head around it; I got stuck at something like a "Producer started" message.

Failing that, I had to use lxml.objectify.parse as an intermediary. After initializing, I used root.iterchildren like below:

from lxml import etree, objectify
import xmltodict as x2d

tree = objectify.parse(file_path)
root = tree.getroot()
generator = root.iterchildren()
...
data = next(generator)
data_dict = x2d.parse(etree.tostring(data))  # parse() needs a string, not an lxml element

Of course, instead of objectify I could use the typical lxml.etree.parse(); but objectify helped me do some cleanup.

@bzamecnik
Contributor

Hi @kmonsoor, this approach seems interesting, I should try it as well. As for that debug message, it must have been from some old commit. Please have a look at the fixed code (in the #99 pull request from the #88-streaming-parse-with-generator branch). The usage would be like:

>>> with xmltodict.parse(file_path, item_depth=2) as gen:
...     for path, item in gen:
...         print(path, item) # do whatever...

@jonlooney

Pull request #104 has support for an iterator/generator. It lets you do things like this:

>>> for (path, item) in xmltodict.parse(fileObj, item_depth=2, generator=True):
...     print("%r, %r" % (path, item))
...     # Or, whatever other code you want.

In all but Jython, it does this through incremental reads. On Jython, it appears that the entire document is read first. (At least, the Travis CI engine shows it is failing my unit test that checks for this. It may be a problem with my unit test, or this may legitimately not work quite as expected on Jython.)
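As a thread-free illustration of the incremental approach, Python's xml.sax parsers implement the IncrementalParser interface (feed()/close()), so a generator can drive the parse chunk by chunk. This sketch collects the text of elements at a given depth; `ItemHandler` and `iterparse` are hypothetical names, not the code from #104:

```python
import io
import xml.sax
import xml.sax.handler

class ItemHandler(xml.sax.handler.ContentHandler):
    """Collects (name, text) of each element at depth `item_depth`."""
    def __init__(self, item_depth):
        super().__init__()
        self.item_depth = item_depth
        self.depth = 0
        self.text = []
        self.items = []  # drained by the generator after each feed()

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == self.item_depth:
            self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if self.depth == self.item_depth:
            self.items.append((name, "".join(self.text)))
        self.depth -= 1

def iterparse(stream, item_depth=2, chunk_size=1024):
    parser = xml.sax.make_parser()
    handler = ItemHandler(item_depth)
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)        # incremental parse: no threads involved
        yield from handler.items  # hand over whatever completed in this chunk
        handler.items = []
    parser.close()
    yield from handler.items

doc = io.BytesIO(b'<a x="y"><b>1</b><b>2</b><b>3</b></a>')
print(list(iterparse(doc)))  # [('b', '1'), ('b', '2'), ('b', '3')]
```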

@mcrowson

mcrowson commented Dec 7, 2016

It looks like #104 has the generator support, and it hasn't been touched since May. If I pulled that out and made it a standalone PR, would you accept it @martinblech?

@martinblech
Owner

@mcrowson absolutely! If you pull the generator stuff to a standalone PR I'll take a look.

@harryjubb

Any news on this?

@mcrowson

mcrowson commented Jun 21, 2017 via email

@geekscrapy

Was this ever implemented?


8 participants