Use generator in streaming mode #88
I agree it's much more pythonic. If someone figures out how to turn the SAX parser into a generator, I'd be glad to review and merge the pull request.
The only way I could implement the generator was by using the `threading` module. I tried to play with `generator.send`, but without luck; I could not find any other way to yield control from inside the callback. With this `threading`-based implementation, the only issue I see is incomplete parsing. For example:

```python
data = '<a x="y"><b>1</b><b>2</b><b>3</b></a>'
next(parse(data, item_depth=2))
```

Here we take only a single item, and a thread is left waiting on the queue forever. Since the threads are daemonic they will not block process termination, but in cases where many daemonic threads are left running, memory will leak.
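A minimal sketch of the thread-based approach described above (illustrative only, not xmltodict's actual implementation): a producer thread pushes parsed items onto a queue, and a generator pulls them off. It also shows the leak: if the consumer stops early, the daemon producer blocks on `q.put` forever.

```python
import queue
import threading

_DONE = object()  # sentinel marking the normal end of the stream


def parse_stream(items):
    # maxsize=1 keeps the producer roughly in lock-step with the consumer
    q = queue.Queue(maxsize=1)

    def producer():
        for item in items:        # stands in for SAX callbacks firing
            q.put(item)           # blocks forever if the consumer is gone
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item
```

Consuming the generator fully works; calling `next()` once and abandoning it leaves the daemon thread parked on `q.put`, which is exactly the leak described above.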
+1 for this problem. I've come across it while using the results of streaming parsing.
@bzamecnik since you started to work on a different approach, I assume you saw some issues with my solution involving threads? What are the issues you see? To the end user the threads will not be visible, and there should be no thread-safety issues, because the separate thread that handles the SAX callbacks is completely isolated and is not exposed via the API in any way. By the way, the first thing I tried was actually asyncio, but unfortunately I could not find a solution with it without using threads. I could not find a way to turn a callback function into a generator that can be registered with the event loop. If you succeed in doing this, I would be really interested in seeing it... :)
So I delved deeper into asyncio, coroutines, and generators with the goal of turning callbacks into generators, unfortunately without success. The problem is that on one end I need a generator (pulling data from the source) and on the other end I have a coroutine (pushing data from the source). I need to use the asyncio event loop to start both the producer and the consumer, but it doesn't allow me to use a for loop to drive the computation. To synchronize the two I tried to use a Queue, but the problem is that putting into the queue is a coroutine and has to be wrapped into a task, which gets executed later in the loop, so the computation is not lazy at all. What I'd like is for putting an item into the queue in the callback to block and switch execution to the consumer. After many tries I've created a question on Stack Overflow, so possibly someone more experienced might help us: http://stackoverflow.com/questions/29724273/transform-callbacks-to-generator-in-python
At least I got the code with threads working. After a lot of fiddling with the producer-consumer pattern it works. The producer can be notified when the generator is closed and finish its thread, so it is possible to break out of the for loop and have the producer thread finish correctly. The trick to coordinating the consumer and producer is to use two singleton queues (one for requests and the other for responses). A generator can either finish correctly or be closed, e.g. when a break occurs before the generator completes. In that case we can signal to the producer that it should finish. The failing test case was a bit strange usage, since it neither completely iterated the generator nor closed it. It would correspond to a situation where processing one item hangs. A possible behavior is to finish the producer thread after some period of consumer inactivity. The new code seems to work OK. I'll try to clean it up a bit and commit it tomorrow.
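The two-queue handshake described above can be sketched like this (illustrative names, not the actual branch code): the consumer sends a request before each item, so the producer learns about generator close and terminates instead of hanging.

```python
import queue
import threading

GET, CLOSE, DONE = 'get', 'close', 'done'


def coordinated(items):
    requests, responses = queue.Queue(), queue.Queue()

    def producer():
        for item in items:
            if requests.get() == CLOSE:
                return                # consumer closed the generator; stop early
            responses.put(item)
        requests.get()
        responses.put(DONE)           # signal normal end of stream

    threading.Thread(target=producer, daemon=True).start()
    try:
        while True:
            requests.put(GET)         # ask the producer for the next item
            item = responses.get()
            if item == DONE:
                return
            yield item
    finally:
        requests.put(CLOSE)           # runs on close()/break too, waking the producer
```

Because the producer waits on the request queue between items, closing the generator (or breaking out of the loop) lets the thread finish instead of blocking forever.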
That's excellent news! I think this can be a great contribution to the library.
I didn't try it, but I guess it would work if the generator had a destructor; destructors are called automatically by the garbage collector, so it could take care of the threads.
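One way to realize this idea is `weakref.finalize`, which is more reliable than `__del__`: tie the producer thread's shutdown to garbage collection of the generator. A sketch under those assumptions (names are illustrative, not xmltodict's API):

```python
import queue
import threading
import weakref


def stream(items, poll=0.05):
    out_q = queue.Queue(maxsize=1)
    stop = threading.Event()

    def producer():
        for item in items:
            while not stop.is_set():
                try:
                    out_q.put(item, timeout=poll)
                    break
                except queue.Full:
                    pass              # consumer slow or gone; re-check stop flag
            else:
                return                # generator was collected; give up
        out_q.put(None)               # sentinel: normal end of stream

    threading.Thread(target=producer, daemon=True).start()

    def gen():
        while True:
            item = out_q.get()
            if item is None:
                return
            yield item

    g = gen()
    # When g is garbage collected, wake the producer so the thread exits.
    weakref.finalize(g, stop.set)
    return g
```

The producer never blocks indefinitely: every `put` has a timeout, and after each timeout it re-checks whether the consumer generator has been collected.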
…roducer thread after a timeout. Add another test case - close the generator by a break in the for loop.
So I committed the code I have. It seems to work in all Python environments defined in tox: https://github.com/bzamecnik/xmltodict/tree/%2388-streaming-parse-with-generator I'm afraid that if we wanted to get rid of the thread and use asyncio, it couldn't be used from 2.x Python versions.
@bzamecnik what would happen in this case:

```python
for _, item in xmltodict.parse(data, item_depth=2):
    process_item(item)  # takes 2 seconds to process
```

If I understood correctly, the generator would terminate? Maybe it is possible to clean up the threads by using a destructor: if the generator gets garbage collected, it is completely safe to free the thread, because no references to the generator remain.
@sirex The timeout is configurable, so as a user you can increase it if you expect item processing to take long. But anyway, the important point is that the proper usage of a generator is either to consume it fully or to close it. If the user does not close the generator, that is a usage error, and the user cannot be surprised that some resource might leak. Your suggestion of automatically cleaning up the resource after use seems promising. Since we're holding a resource inside the generator, a proper pattern might be to wrap the whole generator usage in a `with` statement that takes care of resource allocation and closing (just an idea). I'm not a Python expert, but I'd be worried that the destructor doesn't get called, or gets called too late (compared to `with`); also, the `with` statement clearly indicates that there's some kind of resource allocation inside. This way we could eliminate the timeout hack completely:

```python
with xmltodict.parse(data, item_depth=2) as gen:
    for _, item in gen:
        process_item(item)  # takes 2 seconds to process
```

Btw: it seems that the timeout doesn't work on PyPy. https://travis-ci.org/martinblech/xmltodict/builds/59315990
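The `with`-statement idea above can be sketched with `contextlib.contextmanager` (a minimal sketch, not xmltodict's actual API): the context manager guarantees `gen.close()` runs, which raises `GeneratorExit` inside the generator so its `finally` block can release whatever resource it holds.

```python
from contextlib import contextmanager


@contextmanager
def streaming(gen):
    try:
        yield gen
    finally:
        gen.close()  # raises GeneratorExit inside gen, releasing its resources


cleaned = []  # records that cleanup ran (stands in for freeing a thread)


def numbers():
    try:
        for i in range(10):
            yield i
    finally:
        cleaned.append(True)
```

Even if the consumer takes only a few items and leaves the block early, the generator's cleanup code runs deterministically when the `with` block exits.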
…streaming-parse-with-generator
@bzamecnik I spent quite a bit of time trying to use your code to handle an input file as a stream via the generator, but I couldn't get my head around it; I got stuck around the "Producer started" message. Failing that, I have to use `lxml.objectify.parse` as an intermediary. After initializing, I used `root.iterchildren()` like below:

```python
from lxml import objectify
import xmltodict as x2d

tree = objectify.parse(file_path)
root = tree.getroot()
generator = (x for x in root.iterchildren())
...
data = generator.next()
data_dict = x2d.parse(data)
```

Of course, instead of `objectify` I could use the typical `lxml.parse()`, but it helped me do some cleanup.
Hi @kmonsoor, this approach seems interesting; I should try it as well. As for that debug message, it must have been from some old commit. Please have a look at the fixed code (in the #99 pull request from the #88-streaming-parse-with-generator branch). The usage would be like:

```python
>>> with xmltodict.parse(file_path, item_depth=2) as gen:
...     for path, item in gen:
...         print(path, item)  # do whatever...
```
Pull request #104 has support for an iterator/generator. It lets you do things like this:

```python
>>> for (path, item) in xmltodict.parse(fileObj, item_depth=2, generator=True):
...     print("%r, %r" % (path, item))
...     # Or, whatever other code you want.
```

In all but Jython, it does this through incremental reads. On Jython, it appears that the entire document is read first. (At least, the Travis CI engine shows it failing my unit test that checks for this. It may be a problem with my unit test, or this may legitimately not work quite as expected on Jython.)
It looks like #104 has the generator stuff, and it hasn't been touched since May. If I pulled that out and made it a standalone PR, would you accept it, @martinblech?
@mcrowson absolutely! If you pull the generator stuff into a standalone PR, I'll take a look.
Any news on this?
I ended up using the standard XML lib to yield dictionaries. Not in progress on my end. That original PR looked good though if you want to harvest its generator.
Was this ever implemented?
Callbacks are not a very convenient way to handle streaming. For example:
Could look like this:
This looks much more pythonic.
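The original snippets did not survive extraction, but the contrast can be illustrated with the standard library (a sketch of the two styles, not xmltodict's API): callback-based SAX handling versus a lazy generator built on `iterparse`.

```python
import xml.sax
from io import BytesIO
from xml.etree.ElementTree import iterparse

data = b'<a><b>1</b><b>2</b><b>3</b></a>'


# Callback style: control flow is inverted; you react to parser events.
class Handler(xml.sax.ContentHandler):
    def __init__(self):
        self.items = []
        self.text = ''

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        if name == 'b':
            self.items.append(self.text)
        self.text = ''


handler = Handler()
xml.sax.parse(BytesIO(data), handler)


# Generator style: a plain for loop drives the parse lazily.
def items(stream):
    for event, elem in iterparse(stream):  # yields ('end', element) pairs
        if elem.tag == 'b':
            yield elem.text


lazy = items(BytesIO(data))
```

The generator version lets the caller decide when (and whether) to pull the next item, which is what makes it feel more pythonic.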