
Proper stream support #15

Open · rufuspollock opened this issue Jul 6, 2013 · 8 comments
@rufuspollock (Member)

At the moment, operations such as head read the whole file even if we just want the first part.

Instead we should allow cut-off in a sensible way. For an example of how to hack this up with the csv module, see https://github.com/okfn/data.okfn.org/blob/78dc75bad035c36744b46badb258fbf63ed8d016/routes/index.js#L84

@andylolz (Collaborator)

I haven’t quite figured out how to accomplish this with transforms. It appears that whenever any transform (e.g. head) has finished, it can push(null), which will trigger the 'end' event to bubble upstream. So then we just need an 'end' event handler that will halt the flow from the file stream.

@rufuspollock (Member, Author)

@andylolz that sounds about right. We basically need a sensible way to say EOF (or rather EOS = End of Stream). The complication used to be that we returned null to indicate "drop this row", so we couldn't use null as an indicator for end of file - but maybe that is now possible with the latest refactors :-)

@andylolz (Collaborator)

Yep – so now we have to explicitly push rows downstream – not pushing the row indicates “drop this row”.

@rufuspollock (Member, Author)

@andylolz can we close this issue now? My understanding from your comments is that we've fixed the issue this was addressing.

@andylolz one thing it would be good to clarify, though: do we still end up processing all rows in a file even when we are finished? E.g. suppose we have head as an op - then it should basically stop parsing any new rows after the first X - how does this work at the moment? How does a transform tell things upstream to stop processing? This is important for large files - e.g. if you give me a 1GB CSV and I do head on it, I don't want to keep streaming the whole CSV!

@andylolz (Collaborator)

> we've fixed the issue this was addressing

Yep – agreed. I think this can be closed.

> do we still end up processing all rows in a file even when we are finished.

Nope. null gets pushed downstream to indicate the input has ended, and then the finish event bubbles back upstream, eventually telling the file stream to stop reading. I think that’s right?

Anyway, you can actually test this out and see. Here’s a 20MB file… Let’s delete rows 500-1000, and then take the first 20 rows.

http://datapipes.okfnlabs.org/csv/delete%20500:1000/head%20-n%2020/?url=http://cl.ly/0o190O3y1f1Z/DfTRoadSafety_Accidents_2009.csv

In practice, we don’t ever perform the delete, because after taking the first 20 rows, head sends the signal to stop streaming.

Pretty cool.

@rufuspollock (Member, Author)

@andylolz somehow I missed this comment first time round. This is super cool. So, to confirm: we're sure the example you gave only ends up processing a tiny fraction of the 20MB file as a whole (is there a way for us to check this)?

If so, can we close this :-) - and again: add a nice comment to the front page about this awesome feature, with your example (btw, if we did go with your example, I think it would be even nicer to have you delete the first 500 rows - it makes more sense given the head 20).

@andylolz (Collaborator)

andylolz commented Dec 1, 2013

Aha… adding some logging shows this is not quite working yet :\ Although the response completes, we currently continue processing the whole file. I’ll have another look at this one.

@rufuspollock (Member, Author)

@andylolz cool - I'll be on #okfn much of the day if you want to chat!

@andylolz andylolz mentioned this issue Dec 3, 2013