Pipeline failure #119
So just from a quick look, the …
When I run the command I receive this error: http://pastebin.ca/2306364
When you write …
Ah, silly mistake. Sorry about that. Can you confirm that … is the correct command? I'm running it, and it returns (with no errors), but no output is produced and no processes are created.
You're passing "dd" as the first argument to …
Why that doesn't raise an exception, though, is something to look at...
Sorry to keep poking you, but is … the correct form? I'm still seeing a lot of memory/CPU usage inside Python (> 1 GB of RAM that isn't freed after the command returns).
If you tell me what you're looking to do specifically, I could probably advise you better.
I have a series of essentially bash pipelines I use to process data. At the moment I manage these pipelines using a simple wrapper around Popen, but it's ugly and a pain to use. I'd much prefer to use sh; however, I'm having performance problems. The test case I wrote was a simplified one to work out what the problem was. If I can find a way to run something similar to the dd example above (i.e. move large quantities of data quickly through a pipeline) using sh, then all is well.
Apologies for the delay on this. There are a couple of things to take note of. One is that the memory used by a Command object will not be released until that object is garbage collected; you can take a look at some of the past memory issues in #99. The other issue is that Command objects internally buffer some stdout. This can be limited with the … argument. Let me know if that works for you. I personally haven't used sh for very high-bandwidth processes, so the practical limits haven't been pushed as far as you may be pushing them.
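A minimal sketch of that approach, assuming sh's documented `_out` callback and `_internal_bufsize` special kwargs (which may or may not be the exact argument meant above, since its name was lost in the quote):

```python
import sh

def handle_chunk(chunk):
    # Process (or discard) each chunk as it arrives, so nothing accumulates.
    pass

# _out with a callable streams stdout to us incrementally; _internal_bufsize
# caps how much output sh keeps buffered internally. Both are assumptions
# based on sh's documented special kwargs, not confirmed by this thread.
sh.dd("if=/dev/zero", "bs=1M", "count=64",
      _out=handle_chunk,
      _internal_bufsize=1024)
```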
Thanks for the argument; I'll try that out shortly. The problem that I see is that sh is copying data between the processes in a pipeline rather than just handing it off to the OS to do. From what I can see we don't gain anything from it; it simply burns massive amounts of CPU time and slows progress to a crawl (and unexpectedly OOMs the box if you're running on a system with memory overcommit enabled). That being said, it should be a really simple change; I just can't seem to work out where exactly to put it, nor what sort of interface it should have to keep in line with the rest of the project. I'm more than happy to help though.
A process launched from sh doesn't know whether it's going to be piped, so it has to internally buffer its output. That's really the crux of the problem. For example:

```python
import sh

print(sh.wc(sh.ls()))
```
Don't be afraid to dig around deeply in the source either... I tried to comment the non-obvious things pretty well. You have a Command object, which represents a program to be run. It spawns a RunningCommand object when the Command object is called. The RunningCommand object spawns an OProc object to do the low-level process interaction, and the OProc object spawns a few StreamReader and StreamWriter objects to read from and write to stdin/stdout.
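As a quick illustration of the user-visible half of that hierarchy (OProc and the stream readers/writers are internal and not shown here):

```python
import sh

ls = sh.Command("ls")     # Command: a program to be run
running = ls("-l")        # calling it produces a RunningCommand
print(running.exit_code)  # the RunningCommand holds the finished result
```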
Ah, yeah. That makes sense then. Thanks for the explanation :) Just thinking out loud here: …

This then gives us a bit more leeway: since the inner sh.ls returns essentially a future object, the outer sh.wc can work out what's going on and a more efficient pipeline can be built. By also running on __del__, a single standalone command like sh.rm(...) would still run instantly (after that line the reference drops, the refcount of the sh.rm "future object" drops to zero, and __del__ is called). What are your thoughts?
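A rough sketch of what such a deferred "future object" might look like; the class name LazyCommand and the mechanics here are purely illustrative, not anything sh actually does:

```python
import subprocess

class LazyCommand:
    """Hypothetical future object: records a command and only runs it
    when its output is consumed, or when it is garbage collected."""

    def __init__(self, *argv):
        self.argv = argv
        self._result = None

    def _run(self):
        if self._result is None:
            self._result = subprocess.run(self.argv, capture_output=True, text=True)
        return self._result

    def __str__(self):
        # Consuming the output forces execution; an outer command could
        # instead inspect self.argv and build a direct OS pipeline.
        return self._run().stdout

    def __del__(self):
        # A bare statement like LazyCommand("rm", "somefile") drops its last
        # reference at the end of the line, so it still runs promptly.
        self._run()
```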
Could you provide an example of when that wouldn't work?
```python
p1 = sh.cat("first", _out=someFile)
p2 = sh.cat("second", _out=someFile)
p2.stdout
p1.stdout
```

The file contains "second" then "first": under deferred execution, the commands would run in the order their output is consumed, not the order they were written.
…between 2 processes directly using OS pipes and connecting them with dup2, instead of buffering the output of process 1 into Python memory before sending it to process 2. This is in response to issue amoffat#119. The effect can be enabled by setting the call arg _piping="direct" on the inner process. I had issues with getting 'Input/Output error reading stdin' from dd until I set _tty_out=False, so in the code I have set _tty_out to False when _piping == "direct". This has caused 3 of the unit tests to fail. test2.py on my PC shows a 13s vs 0.5s performance difference.
I've taken a look at this issue. I managed to get asharp's nested dd call to go from 12 seconds to 0.5 seconds by connecting the pipes directly. However, it caused a few unit-test failures, so I separated the change out behind an opt-in parameter (_piping="direct"). My changes are at https://github.com/mcclymont/sh/tree/direct-piping. More details are available in the commit message.
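For readers unfamiliar with the technique: "connecting the pipes directly" means wiring one child's stdout onto the other child's stdin at the file-descriptor level, so the kernel moves the data and Python never touches it. A standard-library sketch of the idea (not sh's actual implementation):

```python
import os
import subprocess

# Create an OS pipe; the kernel moves the data between the two processes.
read_fd, write_fd = os.pipe()

# The producer's stdout is dup2()ed onto the pipe's write end in the child.
producer = subprocess.Popen(
    ["dd", "if=/dev/zero", "bs=1M", "count=1024"], stdout=write_fd)
os.close(write_fd)  # drop the parent's copy so the reader can see EOF

# The consumer's stdin is dup2()ed onto the pipe's read end.
consumer = subprocess.Popen(["dd", "of=/dev/null"], stdin=read_fd)
os.close(read_fd)

producer.wait()
consumer.wait()
```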
Looks good so far @mcclymont... I will take a closer look and merge this weekend.
Closing, since @mcclymont's direct piping changes are on the release-1.10 branch.
Hi,
Given the code:

```python
import sh

sh.dd(sh.dd("if=/dev/zero", "bs=1M", "count=1024"), "of=/dev/null")
```
sh dies with a MemoryError, as seen at http://pastebin.ca/2306288.
It looks as though sh is trying to buffer the input rather than connecting the stdout of the inner dd process to the stdin of the outer dd process, and as such it blows up.
To give you an example of the behaviour I was expecting, see the example below.
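(The example originally attached here appears to have been lost; what follows is a plausible reconstruction of the expected behaviour, using plain subprocess so the OS connects the two dd processes directly:)

```python
import subprocess

# Expected behaviour: the OS pipe connects the two dd processes, so the
# transfer runs at kernel speed and Python's memory use stays flat.
inner = subprocess.Popen(
    ["dd", "if=/dev/zero", "bs=1M", "count=1024"], stdout=subprocess.PIPE)
outer = subprocess.Popen(["dd", "of=/dev/null"], stdin=inner.stdout)
inner.stdout.close()  # let the inner dd get SIGPIPE if the outer exits early
outer.wait()
```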
I've taken a look through sh.py but I can't seem to work out what exactly I'd need to do in order to patch this. Would someone mind lending a hand?