Pipeline failure #119

asharp · 2013-01-23T07:06:52Z

Hi,

Given the code
import sh
sh.dd(sh.dd('if=/dev/zero', 'bs=1M', "count=1024"), 'of=/dev/null')

sh dies with a MemoryError as seen in http://pastebin.ca/2306288

It looks as though sh is trying to buffer the input rather than connecting the stdout of the inner dd process to the stdin of the outer dd process, and is such blowing up.

To give you an example of the behaviour I was expecting, see the example below.

import subprocess

s = subprocess.Popen(['dd', 'if=/dev/zero', 'bs=1M', 'count=10240'], stdout=subprocess.PIPE)
r = subprocess.Popen(['dd', 'of=/dev/null', 'bs=1M'], stdin=s.stdout)

r.wait()

I've taken a look through sh.py but I can't seem to work out what exactly I'd need to do in order to patch this. Would someone mind lending a hand?

The text was updated successfully, but these errors were encountered:

amoffat · 2013-01-23T07:15:35Z

So just from a quick look, the _piped=True option is not being used on the inner sh.dd command. Without it, the outer command waits for the inner command to complete, which would be never :) Try re-running with _piped=True

asharp · 2013-01-23T07:25:35Z

When I run the command
sh.dd(sh.dd('if=/dev/zero', 'bs=1M', 'count=1', _piped=True), of='/dev/null')

I receive the error http://pastebin.ca/2306364

amoffat · 2013-01-23T07:39:12Z

  RAN: '/bin/dd --of=/dev/null'

  STDOUT:


  STDERR:
/bin/dd: invalid option -- '-'
Try `/bin/dd --help' for more information.

When you write of='/dev/null' it expands to --of=/dev/null, so you'll probably want to just pass the argument "of=/dev/null"

asharp · 2013-01-23T08:49:32Z

Ah. Silly mistake. Sorry about that. Can you check with me that
sh.dd(sh.dd("dd", "if=/dev/zero", "bs=1M", _piped=True), "of=/dev/null")

is the correct command? I'm running it, and it's returning (no errors), but no output is being returned and no processes are being made.

amoffat · 2013-01-23T23:34:19Z

sh.dd(sh.dd("dd", "if=/dev/zero", "bs=1M", _piped=True), "of=/dev/null")

You're passing "dd" as the first argument to sh.dd

amoffat · 2013-01-23T23:35:00Z

Why that doesn't raise an exception though is something to look at...

asharp · 2013-01-24T01:32:35Z

Sorry to keep poking you, but is
sh.dd(sh.dd("if=/dev/zero", "bs=1M", 'count=512', _piped=True), "of=/dev/null")

The correct form?

I'm still seeing a lot of memory/cpu usage inside python. (> 1 gig of ram, that isn't freed after the command returns)

amoffat · 2013-01-24T02:05:18Z

If you tell me what you're looking to do specifically, I could probably advise you better.

asharp · 2013-01-24T10:43:46Z

I have a series of essentially bash pipelines I use to process data. At the moment I manage these pipelines using a simple wrapper around Popen, however it's ugly and a pain to use. I'd much prefer to use sh, however i'm having performance problems. The test case I wrote was a simplified test case to work out what the problem was. If I can find a way to run a test case similar to the dd example above (ie. move large quantities of data quickly through a pipeline) using sh, then all is well.

amoffat · 2013-01-29T06:30:00Z

Apologies for the delay in this. There's a couple of things to take note of. One is that the memory used by a Command object will not be released until that object is garbage collected. You can take a look at some of the past memory issues in #99. The other issue is that Command objects internally buffer some stdout. This can be limited with the _internal_bufsize argument on http://amoffat.github.com/sh/special_arguments.html#special-arguments

Let me know if that works on you. I personally haven't used sh for very high bandwidth processes, so the practical limits haven't been pushed as far as you may be pushing them

asharp · 2013-01-30T06:49:19Z

Thanks for the argument, and i'll try that out shortly.

The problem that I see is that sh is copying data between the processes in a pipeline rather then just handing it off to the OS to do. From what I can see we don't gain anything from it, it simply burns massive amounts of CPU time and slows progress to a crawl (...And unexpectedly OOMing the box, if you're running this on a system with memory overcommit enabled...)

That being said, it should be a real simple change, I just can't seem to work out where exactly to put it., nor what sort of interface it should have to keep in line with the rest of the project.

I'm more than happy to help though.

amoffat · 2013-01-31T04:09:17Z

The problem that I see is that sh is copying data between the processes in a pipeline rather then just handing it off to the OS to do. From what I can see we don't gain anything from it, it simply burns massive amounts of CPU time and slows progress to a crawl

A process launched from sh doesn't know if it's going to be piped, so it has to internally buffer output. That's really the crux of the problem. For example:

import sh
print sh.wc(sh.ls())

ls() doesn't know about wc() so it needs to store its own output. When wc() sees that an argument to it is another Command object, it knows to look in its pipe buffer for data to pipe in.

Don't be afraid to dig around deeply in the source either...I tried to comment the non-obvious things pretty well. You have a Command object, which represents a program to be run. It spawns a RunningCommand object when the Command object is called. The RunningCommand object spawns a OProc object to do the low-level process interaction. The OProc object spawns a few StreamReader and StreamWriter objects to read and write to stdin/stdout.

asharp · 2013-01-31T06:15:10Z

Ah, yeah. That makes sense then. Thanks for the explanation :)

Just thinking out loud here

Could we specify an option to, say the Command to let it know that it should expose its stdout pipe (rather than buffering it)? I'm, thinking that the next Command could then check to see if that pipe exists on creation so it can then completely join on.
Could we defer the actual execution until somebody actually tries to read out of it (or it gets destroyed)? As an example

x = sh.ls() #Wouldn't actually run ls yet
x.stdout #Now it would run ls and produce output

This then gives us a bit more leeway as the inner sh.ls returns essentially a future object the outer sh.wc can work out what's going on and a more efficient pipeline can be built.

By also running on del, it means that a single standalone command like

sh.rm('/tmp/somefile')

would still run instantly. (After that line the reference drops, the refcount of the sh.rm "future object" drops to zero and del is called).

What are your thoughts?

amoffat · 2013-02-06T01:53:48Z

This would defy expectations and introduce problems with order of operations. Say you wanted to do process1, then process2, and finanly process3, and all 3 mutate some resource. You would expect that calling them in order would work.
It may be possible, I haven't really had a performance problem for my use cases to warrant doing this. Work up a proof of concept and I can dig deeper.

asharp · 2013-02-28T15:42:37Z

Should work fine, I'm just not entirely sure where I need to poke. Also, although you might not have a pressing need for it, it is a fairly crippling bug in sh that renders it essentially useless for a decent number of common tasks.

2 Could you provide an example of when that wouldn't work?

amoffat · 2013-02-28T17:41:32Z

p1 = sh.cat("first", _out=someFile)
p2 = sh.cat("second", _out=someFile)

p2.stdout
p1.stdout

The file contains "second" then "first"

I'll be more useful answering specific questions about how things work. Your starting place will probably be class OProc

…ween 2 processes directly using OS pipes and connecting them with dup2, instead of buffering the output of process 1 into python memory before sending to process 2. This is in response to issue amoffat#119 amoffat#119 The effect can be enabled by setting call arg _piping="direct" on the inner process. I had issues with getting 'Input/Output error reading stdin' from dd, until I set _tty_out=False. In the code I have therefore set _tty_out to False when _piping == "direct". This has caused 3 of the unit tests to fail. test2.py on my PC shows 13s vs 0.5s performance difference

mcclymont · 2013-10-09T22:28:29Z

I've taken a look at this issue. I managed to get asharp's nested dd call to go from 12 seconds to 0.5 seconds by connecting the pipes directly. However it caused a few unit test problems, so I separated the change out into a dependent parameter (_piping="direct"). My changes are at https://github.com/mcclymont/sh/tree/direct-piping. More details available in the commit message

amoffat · 2013-10-10T04:28:48Z

Looks good so far @mcclymont ..I will take a closer look and merge this weekend

amoffat · 2014-12-30T00:19:02Z

closing, since @mcclymont 's direct piping are on release-1.10 branch.

mcclymont mentioned this issue Oct 10, 2013

Direct piping #157

Closed

amoffat closed this as completed Dec 30, 2014

pyup-bot mentioned this issue Jun 21, 2016

Initial Update chrisdev/wagtail-cookiecutter-foundation#292

Merged

This was referenced Oct 6, 2016

Initial Update Dallinger/psiTurk#1

Closed

Update sh to 1.11 Dallinger/psiTurk#13

Closed

This was referenced Nov 27, 2016

Update sh to 1.12.0 Dallinger/psiTurk#35

Closed

Update sh to 1.12.1 Dallinger/psiTurk#36

Closed

Update sh to 1.12.2 Dallinger/psiTurk#37

Closed

Update sh to 1.12.3 Dallinger/psiTurk#38

Closed

This was referenced Nov 30, 2016

Initial Update Sanji-IO/sanji-bundle-cellular#92

Merged

Update sh to 1.12.4 Dallinger/psiTurk#40

Closed

Update sh to 1.12.5 Dallinger/psiTurk#41

Closed

Update sh to 1.12.6 Dallinger/psiTurk#43

Closed

pyup-bot mentioned this issue Dec 7, 2016

Update sh to 1.12.7 Dallinger/psiTurk#44

Closed

pyup-bot mentioned this issue Dec 16, 2016

Update sh to 1.12.8 Dallinger/psiTurk#49

Closed

pyup-bot mentioned this issue Jan 4, 2017

Update sh to 1.12.9 Dallinger/psiTurk#54

Open

pyup-bot mentioned this issue Feb 13, 2017

Update sh to 1.12.9 dimagi/commcare-hq#14923

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Pipeline failure #119

Pipeline failure #119

asharp commented Jan 23, 2013

amoffat commented Jan 23, 2013

asharp commented Jan 23, 2013

amoffat commented Jan 23, 2013

asharp commented Jan 23, 2013

amoffat commented Jan 23, 2013

amoffat commented Jan 23, 2013

asharp commented Jan 24, 2013

amoffat commented Jan 24, 2013

asharp commented Jan 24, 2013

amoffat commented Jan 29, 2013

asharp commented Jan 30, 2013

amoffat commented Jan 31, 2013

asharp commented Jan 31, 2013

amoffat commented Feb 6, 2013

asharp commented Feb 28, 2013

amoffat commented Feb 28, 2013

mcclymont commented Oct 9, 2013

amoffat commented Oct 10, 2013

amoffat commented Dec 30, 2014

Pipeline failure #119

Pipeline failure #119

Comments

asharp commented Jan 23, 2013

amoffat commented Jan 23, 2013

asharp commented Jan 23, 2013

amoffat commented Jan 23, 2013

asharp commented Jan 23, 2013

amoffat commented Jan 23, 2013

amoffat commented Jan 23, 2013

asharp commented Jan 24, 2013

amoffat commented Jan 24, 2013

asharp commented Jan 24, 2013

amoffat commented Jan 29, 2013

asharp commented Jan 30, 2013

amoffat commented Jan 31, 2013

asharp commented Jan 31, 2013

amoffat commented Feb 6, 2013

asharp commented Feb 28, 2013

amoffat commented Feb 28, 2013

mcclymont commented Oct 9, 2013

amoffat commented Oct 10, 2013

amoffat commented Dec 30, 2014