
Stress test notes #139

Closed
buchanae opened this issue May 30, 2017 · 24 comments

Comments

@buchanae
Contributor

I'll keep some notes about stress test results here and then come up with a more concrete list of issues later.

@buchanae
Contributor Author

buchanae commented May 30, 2017

for m in {1..500}; do funnel run --cmd 'echo $msg' -e msg="$m"; done

is a decent place to start. That found a couple of issues:

  • Unhandled panic #138: panic from NewDockerClient
    • Possibly we are creating lots of Docker clients when we only need one
  • docker ps is locked up after submitting those tasks
    • likely this has something to do with trying to start 500 containers at once
    • is the local worker backend not limiting its resources appropriately? The local scheduler backend does not match task resource requests against available resources; perhaps it should (see the sketch after this list).
      • I also forgot to set a CPU request
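
A minimal sketch of what "matching resources" could mean here, with made-up types rather than Funnel's actual scheduler code: only start a task when its requested CPU/RAM fits in what the worker has free, and otherwise defer it, so 500 containers can't all start at once.

// Hypothetical sketch of resource matching in a local scheduler backend.
// The types and numbers are illustrative, not Funnel's actual code.
package main

import "fmt"

type Resources struct {
	CPUCores int
	RAMGB    float64
}

type Task struct {
	ID        string
	Requested Resources
}

// fits reports whether a task's requested resources fit in what's available.
func fits(avail, req Resources) bool {
	return req.CPUCores <= avail.CPUCores && req.RAMGB <= avail.RAMGB
}

func main() {
	avail := Resources{CPUCores: 4, RAMGB: 8}
	t := Task{ID: "task-1", Requested: Resources{CPUCores: 1, RAMGB: 1}}

	if fits(avail, t.Requested) {
		// Reserve the resources before starting the container.
		avail.CPUCores -= t.Requested.CPUCores
		avail.RAMGB -= t.Requested.RAMGB
		fmt.Println("start", t.ID, "remaining:", avail)
	} else {
		fmt.Println("defer", t.ID, "until resources free up")
	}
}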

@buchanae
Contributor Author

The terminal dashboard truncates the task IDs, which makes a long series of IDs look identical:
[screenshot: terminal dashboard listing a column of identical truncated task IDs]

@buchanae
Contributor Author

Improved:

time for m in {1..500}; do funnel run --cmd 'echo $msg' --cpu 1 -e msg="$m"; done

Takes about 10 seconds to submit 500 tasks on my laptop. No hangups when I have things set up right:

  • using `--cpu 1`
  • using the manual scheduler backend

@buchanae
Contributor Author

Submitted 5000 tasks without trouble on my laptop. Took 1m32.777s

@buchanae
Contributor Author

buchanae commented May 30, 2017

1000 tasks on buchanan01 in exastack: 2m9.332s

Not sure what the difference is here. File system access maybe? More CPUs mean more concurrent workers writing to the DB? There are a lot of possible factors.

Theoretically, since boltdb allows only a single write transaction at a time (the whole database is locked for writes), the more clients writing to the database, the lower the throughput. A database with row-level locking, a write-ahead log, compare-and-set, etc. would have much higher write concurrency.

@buchanae
Contributor Author

#55 seems important, since I've repeatedly forgotten to use funnel run --cpu 1 today, which results in a locked up docker/VM/etc.

@buchanae
Contributor Author

Another test command uses ping -w 30 to ping for 30 seconds, which puts some traffic on the task log streaming.

time for i in {1..100}; do funnel run -S http://localhost:8070 --cpu 1 --cmd 'ping -w 30 google.com'; done

@buchanae
Contributor Author

buchanae commented May 30, 2017

Testing with:
1 server, no workers on buchanan01
1 worker on each tes-master, tes-worker-2, tes-worker-3 (12, 12, 4 CPUs)
Running funnel run on my laptop, to buchanan01 over ssh tunnel + http.
Running at home over a relatively slow cell network (LTE).

100 tasks took 32 seconds
1000 tasks took 5m20.806s

@buchanae
Contributor Author

At some point, the terminal dashboard starts lagging and becomes unusable. Probably because A) it's listing everything, and B) the server is busy communicating with workers (~20-30 tasks)

@buchanae
Contributor Author

The web dashboard seems to hold up fine in terms of interactivity. Of course, it's not nice to sort through 50 pages of completed tasks :)

@buchanae
Contributor Author

buchanae commented May 31, 2017

Ideas for improvement:

  • Badger (no transaction rollback though)
    • RocksDB
    • etcd
  • RPC based funnel client (i.e. reduce network traffic during task creation)
  • batch task endpoint (see the sketch at the end of this comment)
  • don't write task logs to the database, but write them to a file.
    • and/or substantially reduce the task streaming update rate
  • remove/reduce writes during worker sync loop
  • remove/reduce writes during task assignment

Ideas for more stress:

  • larger cluster of workers
  • worker file IO
  • introduce network latency/interruptions
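
To make the "batch task endpoint" idea concrete, here's a hedged sketch of what submission could look like from the client side: one POST carrying many tasks instead of one HTTP request per task. The endpoint path ("/v1/tasks:batch") and payload shape are hypothetical; no such endpoint exists yet.

// Sketch of batched task submission against a hypothetical batch endpoint.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type Task struct {
	Name string   `json:"name"`
	Cmd  []string `json:"cmd"`
}

func main() {
	// Build 5000 tasks client-side, then submit them in a single request.
	tasks := make([]Task, 0, 5000)
	for i := 0; i < 5000; i++ {
		tasks = append(tasks, Task{
			Name: fmt.Sprintf("ping-%d", i),
			Cmd:  []string{"ping", "-w", "30", "google.com"},
		})
	}
	body, _ := json.Marshal(map[string][]Task{"tasks": tasks})

	// Hypothetical endpoint; one round trip instead of 5000.
	resp, err := http.Post("http://localhost:8000/v1/tasks:batch",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}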

@buchanae
Contributor Author

I have some boltDB test code here: https://gist.github.com/buchanae/38cc3c0ccb0a092417a14e7abdb4a0f8, which helped me figure out that only a single update transaction can exist at a time, regardless of the bucket.
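
A minimal sketch of that experiment (not the gist itself): two goroutines write to different buckets, yet the Update transactions still run one at a time, so the total elapsed time is roughly the sum of both writes.

// Sketch showing bolt's single-writer behavior across buckets.
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/boltdb/bolt"
)

func slowWrite(db *bolt.DB, bucket string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(bucket))
		if err != nil {
			return err
		}
		time.Sleep(500 * time.Millisecond) // pretend this is a slow write
		return b.Put([]byte("key"), []byte("value"))
	})
}

func main() {
	db, err := bolt.Open("test.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Now()
	var wg sync.WaitGroup
	for _, bucket := range []string{"tasks", "logs"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := slowWrite(db, name); err != nil {
				log.Println(err)
			}
		}(bucket)
	}
	wg.Wait()
	// Expect ~1s, not ~500ms: update transactions are serialized database-wide.
	fmt.Println("elapsed:", time.Since(start))
}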

@buchanae
Contributor Author

Testing with:
1 server, no workers on buchanan01
1 worker on each tes-master, tes-worker-2, tes-worker-3 (12, 12, 4 CPUs)
Submitting from buchanan01 via funnel run

1000 tasks: 1m 36s
5000 tasks: 6m 38.167s

@buchanae
Contributor Author

Running funnel tasks get on a single task ID while the server is busy with the >5000 tasks submitted (previous comment)

1000 iterations: ~15s
5000 iterations: ~1m 11s

@buchanae
Contributor Author

buchanae commented May 31, 2017

Amazingly, funnel tasks list -v FULL runs in ~1 second and returns 6100 tasks

~3.5 MB of data

@buchanae
Contributor Author

1 server, no workers on buchanan01
1 worker on each tes-master, tes-worker-2, tes-worker-3 (12, 12, 4 CPUs)
Submitting from buchanan01 via funnel run

time for i in {1..1000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'sleep 30'; done

This command doesn't output any logs, reducing the write traffic to the database.

1000 tasks: 50s (~30-40s less than tasks with log traffic)
5000 tasks: 4m 12.402s (~2.5 minutes less)

@buchanae
Contributor Author

Another idea for improvement:
Keep a separate bolt database for task stdout/err logs.
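
Sketching that idea: if stdout/err chunks go to a second .db file, heavy log traffic no longer competes for the same single writer lock as task state updates. File and bucket names here are illustrative, not Funnel's actual layout.

// Illustrative sketch: task state and task logs in separate bolt databases,
// so each has its own independent writer lock.
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	tasksDB, err := bolt.Open("funnel-tasks.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tasksDB.Close()

	logsDB, err := bolt.Open("funnel-logs.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer logsDB.Close()

	// A task state update locks only funnel-tasks.db...
	err = tasksDB.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("tasks"))
		if err != nil {
			return err
		}
		return b.Put([]byte("task-1"), []byte(`{"state":"RUNNING"}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// ...while appending stdout locks only funnel-logs.db, independently.
	err = logsDB.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("stdout"))
		if err != nil {
			return err
		}
		return b.Put([]byte("task-1"), []byte("PING google.com ...\n"))
	})
	if err != nil {
		log.Fatal(err)
	}
}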

@buchanae
Contributor Author

Trying out what happens when I set both config.Worker.UpdateRate and config.Worker.LogUpdateRate to 30 seconds.

time for i in {1..1000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'ping -w 30 google.com' > /dev/null; done

real	1m4.622s
user	0m5.536s
sys	0m3.684s
buchanan01 ~
time for i in {1..5000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'ping -w 30 google.com' > /dev/null; done

real	4m37.338s
user	0m27.284s
sys	0m18.432s
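
For context on what these knobs presumably do: the update rates throttle how often the worker's sync loop pushes task state and buffered logs to the server, so a longer interval means fewer writes reaching the database. A minimal sketch of that kind of ticker-driven loop (illustrative, not Funnel's actual worker code):

// Illustrative ticker-driven update loop of the sort UpdateRate/LogUpdateRate
// would throttle. A longer interval means fewer writes hitting the server.
package main

import (
	"fmt"
	"time"
)

func main() {
	updateRate := 30 * time.Second // e.g. config.Worker.UpdateRate
	ticker := time.NewTicker(updateRate)
	defer ticker.Stop()

	done := time.After(2 * time.Minute) // pretend the task runs for 2 minutes
	for {
		select {
		case <-ticker.C:
			// In a real worker this would send task state and buffered
			// stdout/stderr to the server.
			fmt.Println("flushing task state and logs to server")
		case <-done:
			fmt.Println("task finished; sending final update")
			return
		}
	}
}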

@buchanae
Contributor Author

I implemented a quick and dirty badger database backend, which looks 4x faster for creating 5000 tasks:

time for i in {1..5000}; do ./funnel run -S http://localhost:8000 --cpu 1 --cmd 'ping -w 30 google.com' > /dev/null; done

real	1m17.590s
user	0m30.504s
sys	0m19.080s
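
For anyone curious what the badger backend means concretely, here's a minimal sketch of storing and reading a task as JSON using badger's key/value API (as found in recent badger releases). It is illustrative only, not the actual Funnel backend.

// Sketch: store a task as JSON under its ID in badger, then read it back.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

type Task struct {
	ID   string   `json:"id"`
	Name string   `json:"name"`
	Cmd  []string `json:"cmd"`
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/funnel-badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	t := Task{ID: "task-1", Name: "ping", Cmd: []string{"ping", "-w", "30", "google.com"}}
	val, _ := json.Marshal(t)

	// Write the task under a key derived from its ID.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("tasks/"+t.ID), val)
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back in a read-only transaction.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("tasks/" + t.ID))
		if err != nil {
			return err
		}
		return item.Value(func(v []byte) error {
			fmt.Println(string(v))
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}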

@buchanae
Contributor Author

This task in JSON, repeated 5000 times, is ~1.8 MB:

{
	"name": "Funnel run: ping -w 30 google.com",
	"resources": {
		"cpuCores": 1
	},
	"executors": [
		{
			"imageName": "alpine",
			"cmd": [
				"ping",
				"-w",
				"30",
				"google.com"
			],
			"workdir": "/opt/funnel",
			"stdout": "/opt/funnel/outputs/stdout-0",
			"stderr": "/opt/funnel/outputs/stderr-0",
			"environ": {
			}
		}
	],
	"tags": {
	}
}

@kellrott
Contributor

kellrott commented Jun 1, 2017

So we're switching to badger?

@buchanae
Contributor Author

buchanae commented Jun 1, 2017

I think so. I don't see a substantial downside yet.

@kellrott
Contributor

kellrott commented Jun 2, 2017

Hopefully doing the new db driver will let you identify the common elements with the original boltdb, so doing more db plugins will be easier.

@buchanae
Contributor Author

buchanae commented Jun 7, 2017

Summary:

The notes below are for BoltDB:

  • SSD is faster, no surprise there.
  • Exastack is super slow.
  • A batch endpoint would be great.
  • We might consider tuning the worker's LogUpdateRate to be less frequent, to reduce the number of writes going to the server.
  • Benchmarks are written in tests/e2e/perf and runnable with go test -bench=. ./tests/e2e/perf
  • Default minimum resources are needed to avoid overwhelming workers: Default resource requirements #55
  • The dashboards (term and web) need pagination: Termdash not interactive when thousands of tasks exist #149
  • As far as I can tell, the performance doesn't degrade as the database gets bigger.

Benchmark output:

OpenStack, 12 CPUs

go test ./tests/e2e/perf/ -bench=. -benchtime 30s

BenchmarkRunSerialNoWorkers-12                  	    1000	  42497604 ns/op
BenchmarkRun5000ConcurrentNoWorkers-12          	    1000	  46668322 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-12    	    1000	 252317192 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	366.745s

macbook

go test ./tests/e2e/perf/ -bench=. -benchtime 30s

BenchmarkRunSerialNoWorkers-8             	  100000	    607494 ns/op
BenchmarkRunConcurrentNoWorkers-8         	  100000	    669869 ns/op
BenchmarkRunConcurrentWithFakeWorkers-8   	   10000	  12054479 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	271.135s

Google cloud VM, n1-standard-8 (8 vCPUs, 30 GB memory) with non-SSD

buchanae@release-testing-ssd-8:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test ./tests/e2e/perf/ -bench=. -benchtime 30s
BenchmarkRunSerialNoWorkers-8                 	   10000	   3993544 ns/op
BenchmarkRun5000ConcurrentNoWorkers-8         	   10000	   3814512 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-8   	   10000	  14873398 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	237.094s

Google cloud VM, n1-standard-8 (8 vCPUs, 30 GB memory) with SSD

buchanae@release-testing-ssd-8-again-again:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test ./tests/e2e/perf/ -bench=. -benchtime 30s

BenchmarkRunSerialNoWorkers-8                 	   20000	   2745559 ns/op
BenchmarkRun5000ConcurrentNoWorkers-8         	   20000	   2979645 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-8   	   10000	  10270777 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	278.110s

The tests below failed with -benchtime 30s, so I reduced it to 10s. It seems like there is some limit to the number of concurrent workers writing to the DB (these are 2-CPU machines).

buchanae@release-testing-ssd:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test -bench=. ./tests/e2e/perf/ -benchtime 10s
BenchmarkRunSerialNoWorkers-2                 	    5000	   2686676 ns/op
BenchmarkRun5000ConcurrentNoWorkers-2         	   10000	   2983078 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-2   	    3000	  66853308 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	253.948s
buchanae@release-testing:~/gopath/src/github.com/ohsu-comp-bio/funnel$ go test -bench=. ./tests/e2e/perf/ -benchtime 10s
funnel               grpc
time                 2017-06-07T22:45:48Z
msg                  [connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:14931: getsockopt: connection refused" {localhost:14931 <nil>}]

BenchmarkRunSerialNoWorkers-2                 	    5000	   3378909 ns/op
BenchmarkRun5000ConcurrentNoWorkers-2         	    5000	   3689531 ns/op
BenchmarkRun5000ConcurrentWithFakeWorkers-2   	    3000	  43641943 ns/op
PASS
ok  	github.com/ohsu-comp-bio/funnel/tests/e2e/perf	176.501s

@buchanae buchanae removed the feature label Jul 2, 2017
@mrsleclerc mrsleclerc modified the milestone: 2017.07.26 Jul 27, 2017