[0.9.4.2] CPU 100% #4639
How many batches are you writing per second?
I'm running no queries. Only writing. This is what "show stats" is showing:
If I can provide any more data, please let me know.
@clanstyles What volume are you using to back the m4.large? An SSD?
I removed the point writes from the system, and response time went down from 1-2 s to 60 ms. The Influx servers are all still sitting at 100% CPU.
@clanstyles If you are writing a million points per day in batches of 1 or 2 points each, that's on the order of a million HTTP requests per day hitting InfluxDB. That's too much HTTP overhead for the server. Batch the writes into 100+ points per batch, ideally around 5k points, for best performance. Alternatively, use UDP for the writes. The high CPU load is likely because the system is trying to drain the WAL (write-ahead log), which can accept points faster than the database can persist them to the permanent storage engine.
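A minimal sketch of what that batching looks like with the 0.9-era Go client (github.com/influxdb/influxdb/client); the database name, measurement, tags, and batch size below are placeholder assumptions, and struct fields may differ slightly across 0.9.x client releases:

```go
package main

import (
	"log"
	"net/url"
	"time"

	"github.com/influxdb/influxdb/client"
)

func main() {
	u, err := url.Parse("http://localhost:8086")
	if err != nil {
		log.Fatal(err)
	}
	c, err := client.NewClient(client.Config{URL: *u})
	if err != nil {
		log.Fatal(err)
	}

	// Accumulate points in memory and send them as ONE HTTP request,
	// instead of one request per point.
	pts := make([]client.Point, 0, 5000)
	for i := 0; i < 5000; i++ {
		pts = append(pts, client.Point{
			Measurement: "requests",                         // placeholder measurement
			Tags:        map[string]string{"host": "web01"}, // placeholder tag
			Fields:      map[string]interface{}{"value": int64(i)},
			Time:        time.Now(),
		})
	}

	bp := client.BatchPoints{
		Database:        "mydb", // placeholder database name
		RetentionPolicy: "default",
		Points:          pts,
	}
	if _, err := c.Write(bp); err != nil {
		log.Fatal(err)
	}
}
```

The same idea applies to the UDP path: the point is to amortize per-request overhead across thousands of points.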
@beckettsean At this moment there are no writes happening, and the CPU is still pegged at 100%. The current issue with batching points is the need to write a batch writer. Since Influx's Go client doesn't support use from multiple goroutines, this instantly limits us to a "per context" write. And when the server is shutting down, you need to gracefully write any remaining points. There are various steps that have to be implemented that I feel the Influx client should handle, but doesn't. I monitored Influx's disk I/O on the drives, and it's next to nothing. Watching the logs, I don't understand how this is even possible:
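Since the missing batch writer keeps coming up in this thread, here is a minimal sketch of the kind of goroutine-safe writer described above, one that flushes on batch size or a timer and drains on shutdown. The sizes, the interval, and the flush callback are all assumptions, not part of the InfluxDB client:

```go
package main

import (
	"sync"
	"time"
)

// BatchWriter accepts points from many goroutines (e.g. per-request
// handlers) and hands them to flush in batches, either when the batch
// fills up or when the ticker fires. flush is whatever actually writes
// to InfluxDB, e.g. a closure around client.Write.
type BatchWriter struct {
	ch chan interface{}
	wg sync.WaitGroup
}

func NewBatchWriter(size int, interval time.Duration, flush func([]interface{})) *BatchWriter {
	w := &BatchWriter{ch: make(chan interface{}, size*2)}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		buf := make([]interface{}, 0, size)
		tick := time.NewTicker(interval)
		defer tick.Stop()
		emit := func() {
			if len(buf) > 0 {
				flush(buf)
				buf = make([]interface{}, 0, size)
			}
		}
		for {
			select {
			case p, ok := <-w.ch:
				if !ok {
					emit() // channel closed: drain remaining points
					return
				}
				buf = append(buf, p)
				if len(buf) >= size {
					emit()
				}
			case <-tick.C:
				emit()
			}
		}
	}()
	return w
}

// Add is safe from any goroutine, but must not be called after Close.
func (w *BatchWriter) Add(p interface{}) { w.ch <- p }

// Close flushes any buffered points and waits for the writer goroutine
// to exit; call it during graceful server shutdown.
func (w *BatchWriter) Close() {
	close(w.ch)
	w.wg.Wait()
}

func main() {
	w := NewBatchWriter(5000, time.Second, func(pts []interface{}) {
		// e.g. build client.BatchPoints from pts and call client.Write
	})
	w.Add("point") // placeholder point
	w.Close()
}
```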
That looks like the internal stats being written to storage. We need to suppress those log messages, as they are too verbose in default mode. Are you running InfluxDB as a service?
Yes, on the standard Amazon AMI, following your guide. After restarting all 3 nodes, the CPU went down on all but the 3rd (and last) node. It's still at 100%.
At this stage it's pretty difficult to work out why, so a restart is probably best. However, we can run with profiling enabled to see if it re-occurs: https://github.com/influxdb/influxdb/blob/master/CONTRIBUTING.md#profiling. Add the profiling flags described there when starting influxd.
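For reference, per that CONTRIBUTING.md section the startup looks roughly like the sketch below; the exact flag names and file names are taken from that document as I recall it and should be verified against your version:

```sh
# start influxd with CPU and memory profiling turned on
./influxd -cpuprofile influxdcpu.prof -memprof influxdmem.prof
# reproduce the load, then shut influxd down cleanly so the
# profile files are written out
```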
I'll do that. Just to update: we're seeing about 1,200 requests a minute that we want to record stats for.
Any continuous queries in effect?
No. At the time it started writing, I didn't see more than 6 requests come through the ELB before the system went to 100% CPU and died.
So after the restart, it seems like Influx isn't spiking anymore on any of the 3 nodes.
OK, thanks @clanstyles -- if you are up for it, we can run with profiling enabled in case it re-occurs.
Ah, I think I figured out a trigger: try to delete a large amount of data -> CPU spikes -> HTTP times out.
Can you show us your query?
DELETE FROM foo WHERE time > '2014-06-30'
I'm trying to truncate data from the table.
Now the daemon is crashing. How do I upload the prof file?
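As an aside on the prof file: it is a plain file that can be attached to the issue, and it can also be inspected locally with the standard Go toolchain before sharing. The binary and profile paths below are assumptions:

```sh
# load the CPU profile against the influxd binary that produced it
go tool pprof ./influxd influxdcpu.prof
# inside pprof: top10, list <funcname>, web, etc.
```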
I will escalate #4404 for the next release; I honestly forgot we hadn't fixed that.
Closing this since it's related to the DELETE issue tracked in #4404.
I have a 3-node cluster. Each node is an m4.large on Amazon. The CPU is at 100%.
On each request, I batch-write all analytics when the request closes. There are usually 2-3 types of analytics I log.
Query times are ~1 second (through the web interface). To load-balance the requests, I've hooked up an Elastic Load Balancer to all the nodes and let it rotate across them (not sure if this causes issues?).
I'm running Influx 0.9.4.