
Stress test notes #139

Closed
buchanae opened this issue May 30, 2017 · 24 comments

Comments

@buchanae
Contributor

I'll keep some notes about stress test results here and then come up with a more concrete list of issues later.

@buchanae
Contributor Author

buchanae commented May 30, 2017

for m in {1..500}; do funnel run --cmd 'echo $msg' -e msg="$m"; done

is a decent place to start. That found a couple of issues:

  • Unhandled panic #138: panic from NewDockerClient
    • Possibly we are creating lots of Docker clients when we only need one
  • docker ps is locked up after submitting those tasks
    • likely this has something to do with trying to start 500 containers at once
    • is the local worker backend not limiting its resources appropriately? The local scheduler backend does not match task resource requests against available resources; perhaps it should (see the sketch after this list).
      • I also forgot to set a CPU request
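
A minimal sketch of what "matching resources" could mean here, with made-up types rather than Funnel's actual scheduler code: only start a task when its requested CPU/RAM fits in what the worker has free, and otherwise defer it, so 500 containers can't all start at once.

// Hypothetical sketch of resource matching in a local scheduler backend.
// The types and numbers are illustrative, not Funnel's actual code.
package main

import "fmt"

type Resources struct {
	CPUCores int
	RAMGB    float64
}

type Task struct {
	ID        string
	Requested Resources
}

// fits reports whether a task's requested resources fit in what's available.
func fits(avail, req Resources) bool {
	return req.CPUCores <= avail.CPUCores && req.RAMGB <= avail.RAMGB
}

func main() {
	avail := Resources{CPUCores: 4, RAMGB: 8}
	t := Task{ID: "task-1", Requested: Resources{CPUCores: 1, RAMGB: 1}}

	if fits(avail, t.Requested) {
		// Reserve the resources before starting the container.
		avail.CPUCores -= t.Requested.CPUCores
		avail.RAMGB -= t.Requested.RAMGB
		fmt.Println("start", t.ID, "remaining:", avail)
	} else {
		fmt.Println("defer", t.ID, "until resources free up")
	}
}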

@buchanae
Contributor Author

The terminal dashboard truncates the task IDs, which makes a long series of IDs look identical:
[screenshot: terminal dashboard listing a column of identical truncated task IDs]

@buchanae
Contributor Author

Improved:

time for m in {1..500}; do funnel run --cmd 'echo $msg' --cpu 1 -e msg="$m"; done

Takes about 10 seconds to submit 500 tasks on my laptop. No hangups when I have things set up right:

  • using `--cpu 1`
  • using the manual scheduler backend

@buchanae
Contributor Author

Submitted 5000 tasks without trouble on my laptop. Took 1m32.777s

@buchanae
Contributor Author

buchanae commented May 30, 2017

1000 tasks on buchanan01 in exastack: 2m9.332s

Not sure what the difference is here. File system access maybe? More CPUs mean more concurrent workers writing to the DB? There are a lot of possible factors.

Theoretically, since boltdb allows only a single write transaction at a time (the whole database is locked for writes), the more clients writing to the database, the lower the throughput. A database with row-level locking, a write-ahead log, compare-and-set, etc. would have much higher write concurrency.

@buchanae
Contributor Author

#55 seems important, since I've repeatedly forgotten to use funnel run --cpu 1 today, which results in a locked up docker/VM/etc.

@buchanae
Contributor Author

Another test command uses ping -w 30 to ping for 30 seconds, which puts some traffic on the task log streaming.

time for i in {1..100}; do funnel run -S http://localhost:8070 --cpu 1 --cmd 'ping -w 30 google.com'; done

@buchanae
Contributor Author

buchanae commented May 30, 2017

Testing with:
1 server, no workers on buchanan01
1 worker on each tes-master, tes-worker-2, tes-worker-3 (12, 12, 4 CPUs)
Running funnel run on my laptop, to buchanan01 over ssh tunnel + http.
Running at home over a relatively slow cell network (LTE).

100 tasks took 32 seconds
1000 tasks took 5m20.806s

@buchanae
Contributor Author

At some point, the terminal dashboard starts lagging and becomes unusable. Probably because A) it's listing everything, and B) the server is busy communicating with workers (~20-30 tasks)

@buchanae
Contributor Author

The web dashboard seems to hold up fine in terms of interactivity. Of course, it's not nice to sort through 50 pages of completed tasks :)

@buchanae
Contributor Author

buchanae commented May 31, 2017

Ideas for improvement:

  • Badger (no transaction rollback though)
    • RocksDB
    • etcd
  • RPC based funnel client (i.e. reduce network traffic during task creation)
  • batch task endpoint (see the sketch at the end of this comment)
  • don't write task logs to the database, but write them to a file.
    • and/or substantially reduce the task streaming update rate
  • remove/reduce writes during worker sync loop
  • remove/reduce writes during task assignment

Ideas for more stress:

  • larger cluster of workers
  • worker file IO
  • introduce network latency/interruptions
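
To make the "batch task endpoint" idea concrete, here's a hedged sketch of what submission could look like from the client side: one POST carrying many tasks instead of one HTTP request per task. The endpoint path ("/v1/tasks:batch") and payload shape are hypothetical; no such endpoint exists yet.

// Sketch of batched task submission against a hypothetical batch endpoint.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type Task struct {
	Name string   `json:"name"`
	Cmd  []string `json:"cmd"`
}

func main() {
	// Build 5000 tasks client-side, then submit them in a single request.
	tasks := make([]Task, 0, 5000)
	for i := 0; i < 5000; i++ {
		tasks = append(tasks, Task{
			Name: fmt.Sprintf("ping-%d", i),
			Cmd:  []string{"ping", "-w", "30", "google.com"},
		})
	}
	body, _ := json.Marshal(map[string][]Task{"tasks": tasks})

	// Hypothetical endpoint; one round trip instead of 5000.
	resp, err := http.Post("http://localhost:8000/v1/tasks:batch",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}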

@buchanae
Contributor Author

I have some boltDB test code here: https://gist.github.com/buchanae/38cc3c0ccb0a092417a14e7abdb4a0f8, which helped me figure out that only a single update transaction can exist at a time, regardless of the bucket.
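
A minimal sketch of that experiment (not the gist itself): two goroutines write to different buckets, yet the Update transactions still run one at a time, so the total elapsed time is roughly the sum of both writes.

// Sketch showing bolt's single-writer behavior across buckets.
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/boltdb/bolt"
)

func slowWrite(db *bolt.DB, bucket string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(bucket))
		if err != nil {
			return err
		}
		time.Sleep(500 * time.Millisecond) // pretend this is a slow write
		return b.Put([]byte("key"), []byte("value"))
	})
}

func main() {
	db, err := bolt.Open("test.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Now()
	var wg sync.WaitGroup
	for _, bucket := range []string{"tasks", "logs"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := slowWrite(db, name); err != nil {
				log.Println(err)
			}
		}(bucket)
	}
	wg.Wait()
	// Expect ~1s, not ~500ms: update transactions are serialized database-wide.
	fmt.Println("elapsed:", time.Since(start))
}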

@buchanae
Contributor Author

Testing with:
1 server, no workers on buchanan01
1 worker on each tes-master, tes-worker-2, tes-worker-3 (12, 12, 4 CPUs)
Submitting from buchanan01 via funnel run

1000 tasks: 1m 36s
5000 tasks: 6m 38.167s

@buchanae
Contributor Author

Running funnel tasks get on a single task ID while the server is busy with the >5000 tasks submitted (previous comment)

1000 iterations: ~15s
5000 iterations: ~1m 11s

@buchanae
Contributor Author

buchanae commented May 31, 2017

Amazingly, funnel tasks list -v FULL runs in ~1 second and returns 6100 tasks

~3.5 MB of data

@buchanae
Contributor Author

1 server, no workers on buchanan01
1 worker on each tes-master, tes-worker-2, tes-worker-3 (12, 12, 4 CPUs)
Submitting from buchanan01 via funnel run

time for i in {1..1000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'sleep 30'; done

This command doesn't output any logs, reducing the write traffic to the database.

1000 tasks: 50s (~30-40s less than tasks with log traffic)
5000 tasks: 4m 12.402s (~2.5 minutes less)

@buchanae
Contributor Author

Another idea for improvement:
Keep a separate bolt database for task stdout/err logs.
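
Sketching that idea: if stdout/err chunks go to a second .db file, heavy log traffic no longer competes for the same single writer lock as task state updates. File and bucket names here are illustrative, not Funnel's actual layout.

// Illustrative sketch: task state and task logs in separate bolt databases,
// so each has its own independent writer lock.
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	tasksDB, err := bolt.Open("funnel-tasks.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tasksDB.Close()

	logsDB, err := bolt.Open("funnel-logs.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer logsDB.Close()

	// A task state update locks only funnel-tasks.db...
	err = tasksDB.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("tasks"))
		if err != nil {
			return err
		}
		return b.Put([]byte("task-1"), []byte(`{"state":"RUNNING"}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// ...while appending stdout locks only funnel-logs.db, independently.
	err = logsDB.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("stdout"))
		if err != nil {
			return err
		}
		return b.Put([]byte("task-1"), []byte("PING google.com ...\n"))
	})
	if err != nil {
		log.Fatal(err)
	}
}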

@buchanae
Contributor Author

Trying out what happens when I set both config.Worker.UpdateRate and config.Worker.LogUpdateRate to 30 seconds.

time for i in {1..1000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'ping -w 30 google.com' > /dev/null; done

real	1m4.622s
user	0m5.536s
sys	0m3.684s
buchanan01 ~
time for i in {1..5000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'ping -w 30 google.com' > /dev/null; done

real	4m37.338s
user	0m27.284s
sys	0m18.432s
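
For context on what these knobs presumably do: the update rates throttle how often the worker's sync loop pushes task state and buffered logs to the server, so a longer interval means fewer writes reaching the database. A minimal sketch of that kind of ticker-driven loop (illustrative, not Funnel's actual worker code):

// Illustrative ticker-driven update loop of the sort UpdateRate/LogUpdateRate
// would throttle. A longer interval means fewer writes hitting the server.
package main

import (
	"fmt"
	"time"
)

func main() {
	updateRate := 30 * time.Second // e.g. config.Worker.UpdateRate
	ticker := time.NewTicker(updateRate)
	defer ticker.Stop()

	done := time.After(2 * time.Minute) // pretend the task runs for 2 minutes
	for {
		select {
		case <-ticker.C:
			// In a real worker this would send task state and buffered
			// stdout/stderr to the server.
			fmt.Println("flushing task state and logs to server")
		case <-done:
			fmt.Println("task finished; sending final update")
			return
		}
	}
}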

@buchanae
Contributor Author

I implemented a quick and dirty badger database backend, which looks 4x faster for creating 5000 tasks:

time for i in {1..5000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'ping -w 30 google.com' > /dev/null; done

real	1m17.590s
user	0m30.504s
sys	0m19.080s
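
For anyone curious what the badger backend means concretely, here's a minimal sketch of storing and reading a task as JSON using badger's key/value API (as found in recent badger releases). It is illustrative only, not the actual Funnel backend.

// Sketch: store a task as JSON under its ID in badger, then read it back.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

type Task struct {
	ID   string   `json:"id"`
	Name string   `json:"name"`
	Cmd  []string `json:"cmd"`
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/funnel-badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	t := Task{ID: "task-1", Name: "ping", Cmd: []string{"ping", "-w", "30", "google.com"}}
	val, _ := json.Marshal(t)

	// Write the task under a key derived from its ID.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("tasks/"+t.ID), val)
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back in a read-only transaction.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("tasks/" + t.ID))
		if err != nil {
			return err
		}
		return item.Value(func(v []byte) error {
			fmt.Println(string(v))
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}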

@buchanae
Contributor Author

This task in JSON, repeated 5000 times, is ~1.8 MB:

{
	"name": "Funnel run: ping -w 30 google.com",
	"resources": {
		"cpuCores": 1
	},
	"executors": [
		{
			"imageName": "alpine",
			"cmd": [
				"ping",
				"-w",
				"30",
				"google.com"
			],
			"workdir": "/opt/funnel",
			"stdout": "/opt/funnel/outputs/stdout-0",
			"stderr": "/opt/funnel/outputs/stderr-0",
			"environ": {
			}
		}
	],
	"tags": {
	}
}

@kellrott
Contributor

kellrott commented Jun 1, 2017

So we're switching to badger?

@buchanae
Contributor Author

buchanae commented Jun 1, 2017

I think so. I don't see a substantial downside yet.

@kellrott
Contributor

kellrott commented Jun 2, 2017

Hopefully doing the new db driver will let you identify the common elements with the original boltdb, so doing more db plugins will be easier.

@buchanae
Contributor Author

buchanae commented Jun 7, 2017

Summary:

The notes below are for BoltDB:

  • SSD is faster, no surprise there.
  • Exastack is super slow.
  • A batch endpoint would be great.
  • We might consider tuning the worker's LogUpdateRate to be less frequent, to reduce the number of writes going to the server.
  • Benchmarks are written in tests/e2e/perf and runnable with go test -bench=. ./tests/e2e/perf
  • Default minimum resources are needed to avoid overwhelming workers: Default resource requirements #55
  • The dashboards (term and web) need pagination: Termdash not interactive when thousands of tasks exist #149
  • As far as I can tell, the performance doesn't degrade as the database gets bigger.

Benchmark output:

OpenStack, 12 CPUs

go test ./tests/e2e/perf/ -bench=. -benchtime 30s

BenchmarkRunSerialNoWorkers-12                  	    1000	  42497604 ns/op
BenchmarkRun5000ConcurrentNoWorkers-12          	    1000	  46668322 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-12    	    1000	 252317192 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	366.745s

macbook

go test ./tests/e2e/perf/ -bench=. -benchtime 30s

BenchmarkRunSerialNoWorkers-8             	  100000	    607494 ns/op
BenchmarkRunConcurrentNoWorkers-8         	  100000	    669869 ns/op
BenchmarkRunConcurrentWithFakeWorkers-8   	   10000	  12054479 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	271.135s

Google cloud VM, n1-standard-8 (8 vCPUs, 30 GB memory) with non-SSD

buchanae@release-testing-ssd-8:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test ./tests/e2e/perf/ -bench=. -benchtime 30s
BenchmarkRunSerialNoWorkers-8                 	   10000	   3993544 ns/op
BenchmarkRun5000ConcurrentNoWorkers-8         	   10000	   3814512 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-8   	   10000	  14873398 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	237.094s

Google cloud VM, n1-standard-8 (8 vCPUs, 30 GB memory) with SSD

buchanae@release-testing-ssd-8-again-again:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test ./tests/e2e/perf/ -bench=. -benchtime 30s

BenchmarkRunSerialNoWorkers-8                 	   20000	   2745559 ns/op
BenchmarkRun5000ConcurrentNoWorkers-8         	   20000	   2979645 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-8   	   10000	  10270777 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	278.110s

The tests below failed with -benchtime 30s, so I reduced it to 10s. It seems like there is some limit to the number of concurrent workers writing to the DB (these are 2-CPU machines).

buchanae@release-testing-ssd:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test -bench=. ./tests/e2e/perf/ -benchtime 10s
BenchmarkRunSerialNoWorkers-2                 	    5000	   2686676 ns/op
BenchmarkRun5000ConcurrentNoWorkers-2         	   10000	   2983078 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-2   	    3000	  66853308 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	253.948s
buchanae@release-testing:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test -bench=. ./tests/e2e/perf/ -benchtime 10s
funnel               grpc
time                 2017-06-07T22:45:48Z
msg                  [connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:14931: getsockopt: connection refused" {localhost:14931 <nil>}]

BenchmarkRunSerialNoWorkers-2                 	    5000	   3378909 ns/op
BenchmarkRun5000ConcurrentNoWorkers-2         	    5000	   3689531 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-2   	    3000	  43641943 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	176.501s

@buchanae buchanae removed the feature label Jul 2, 2017
@mrsleclerc mrsleclerc modified the milestone: 2017.07.26 Jul 27, 2017