This repository has been archived by the owner on Jun 28, 2022. It is now read-only.

High cpu usage #49

Open
diadistis opened this issue Aug 10, 2015 · 1 comment

Comments

@diadistis

Setup

  1. Latest stream2es (20150720170522978252e) on a server (6 cores / 64 GB RAM) separate from the Elasticsearch cluster
  2. A big (~65 GB) file containing one large JSON object per line. There are about 15 million lines/documents, and the average line size is ~4.3k characters

Problem

I'm running:

cat bigfile | stream2es stdin --target http://server:9200/index/type --log debug -w 12

I have tried several different values for --bulk-bytes, -w, -d and -q, but the result is always the same: a constant indexing speed of ~5 MB/s, which translates to about 4 hours to import the file. While indexing, the Elasticsearch cluster is heavily under-utilized and the stream2es server has a single core at 100%. I have tested extensively to rule out network and Elasticsearch performance issues.
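As a sanity check on these figures, the arithmetic can be sketched in a few lines of Python. This assumes decimal units (1 GB = 1000 MB) and roughly one byte per character; the function names are illustrative and not part of stream2es.

```python
def import_hours(size_gb: float, mb_per_s: float) -> float:
    """Hours needed to push size_gb through a pipeline running at mb_per_s."""
    seconds = size_gb * 1000 / mb_per_s  # assumption: 1 GB = 1000 MB
    return seconds / 3600


def estimated_size_gb(docs: int, avg_chars: float) -> float:
    """Approximate file size, assuming about 1 byte per character."""
    return docs * avg_chars / 1e9


if __name__ == "__main__":
    # Figures from the report: ~65 GB file, ~5 MB/s, ~15M docs of ~4.3k chars.
    print(f"{import_hours(65, 5):.1f} h")                    # about 3.6 h
    print(f"{estimated_size_gb(15_000_000, 4300):.1f} GB")   # about 64.5 GB
```

Both numbers are consistent with the report: ~5 MB/s over ~65 GB is a little under 4 hours, and 15 million lines of ~4.3k characters come to roughly 65 GB.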

Workaround

As a workaround, I ran multiple stream2es processes in parallel (instead of relying on -w) to see if that would help:

cat bigfile | parallel -j12 -L5000 --pipe "stream2es stdin --target http://server:9200/index/type"

That helped a lot. All 6 cores (12 threads) now run at 100% and the indexing time fell from 4 hours to 35 minutes, but the Elasticsearch cluster is still mostly idle. It seems to me that something in stream2es uses far more CPU than it should.
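The idea behind the workaround, splitting the input into line-based chunks and handing each chunk to its own process, can be sketched in Python. This is only a rough analogue and not how `parallel` itself is implemented; `chunk_lines`, `fan_out`, and the default chunk size and job count are illustrative, and the command to run is supplied by the caller.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def chunk_lines(lines: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield consecutive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fan_out(cmd: Sequence[str], lines: Iterable[bytes],
            size: int = 5000, jobs: int = 12) -> List[int]:
    """Run `cmd` once per chunk, at most `jobs` at a time; return exit codes."""
    slots = threading.BoundedSemaphore(jobs)

    def work(chunk: List[bytes]) -> int:
        try:
            # One fresh process per chunk, with the chunk fed on stdin.
            return subprocess.run(cmd, input=b"".join(chunk)).returncode
        finally:
            slots.release()

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = []
        for chunk in chunk_lines(lines, size):
            slots.acquire()  # wait for a free worker so chunks do not pile up in memory
            futures.append(pool.submit(work, chunk))
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Demo with a harmless stand-in command that just drains stdin.
    demo_input = [b"{}\n"] * 12
    drain = [sys.executable, "-c", "import sys; sys.stdin.read()"]
    print(fan_out(drain, demo_input, size=5, jobs=2))
```

To mirror the command above, one could pass the stream2es command line as `cmd` and `sys.stdin.buffer` as `lines`. Each chunk pays the cost of starting a new process, so this sketch illustrates the approach and is not tuned for speed.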

@drewr
Contributor

drewr commented May 9, 2016

Thanks for reporting this, @diadistis, and sorry for the terrible response time. I've noticed similar behavior and have used similar workarounds. I haven't had a chance to profile the internal design to isolate the bottleneck, but I suspect that, at the very least, the single LinkedBlockingQueue that feeds the pipeline is part of it.

I did just push a fix for some extraneous string copying, but it won't speed anything up 8x. If you still have this environment available, I'd love to know what effect the fix has.
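For context on the 8x figure, the numbers in the original report (roughly 4 hours down to 35 minutes for a ~65 GB file) work out as follows. This is a minimal sketch assuming decimal units (1 GB = 1000 MB); the helper names are illustrative.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Ratio of the old run time to the new run time."""
    return before_minutes / after_minutes


def throughput_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


if __name__ == "__main__":
    print(round(speedup(4 * 60, 35), 1))           # 6.9
    print(round(throughput_mb_per_s(65, 35), 1))   # 31.0
```

So the parallel workaround gave close to a 7x improvement, or roughly 31 MB/s on average compared with the original ~5 MB/s.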
