[0.9.4.2] CPU 100% #4639

Closed
clanstyles opened this issue Nov 3, 2015 · 20 comments

Comments

@clanstyles

I have a 3-node cluster. Each node is an m4.large on Amazon, and the CPU is at 100%.

On each request, I batch-write all the analytics when the request closes. There are usually 2-3 types of analytics I log.

tags := map[string]string{
    "id":        imp.ID,
    "country":   imp.IPAddress.Country,
    "ipaddress": imp.IPAddress.Address,
    "os":        imp.OS,
}

fields := map[string]interface{}{
    "value": 1,
}

batchPoints.AddPoint(client.NewPoint(
    "impressions",
    tags,
    fields,
    time.Now().UTC(),
))

Query times are ~1 second (through the web interface). To load balance the requests, I've hooked up an Elastic Load Balancer to all the nodes and let requests rotate across them that way (not sure if this causes issues?).

I'm running influx 0.9.4.

@clanstyles changed the title from "CPU 100%" to "CPU 100% with 0.9.4.2" on Nov 3, 2015
@beckettsean changed the title from "CPU 100% with 0.9.4.2" to "[0.9.4.2] CPU 100%" on Nov 3, 2015
@beckettsean
Contributor

How many batches are you writing per second?
How many points per batch?
What queries are running and how many?

@clanstyles
Author

I'm running no queries, only writing.
A batch consists of either one or two points; initially a flood of two points.
I can't tell how many I'm writing per second. We see an average of 2M requests a day, spread out pretty evenly (from the CDN), though this does NOT mean all of them hit the web server. At the moment I can't measure it, because the latency is causing timeouts. I'll remove the Influx stats recording to see if I can track requests via the ELB and go from there.

(screenshot attached in the original issue)

This is what "show stats" is showing:

{"results":[{"series":[{"name":"engine","tags":{"path":"/var/opt/influxdb/data/_internal/monitor/3","version":"bz1"},"columns":["blks_write","blks_write_bytes","blks_write_bytes_c","points_write","points_write_dedupe"],"values":[[17762,1331034,1169989,17762,17762]]},{"name":"engine","tags":{"path":"/var/opt/influxdb/data/_internal/monitor/7","version":"bz1"},"columns":["blks_write","blks_write_bytes","blks_write_bytes_c","points_write","points_write_dedupe"],"values":[[1200,95652,83881,1228,1224]]},{"name":"engine","tags":{"path":"/var/opt/influxdb/data/adserver/default/5","version":"bz1"},"columns":["blks_write","blks_write_bytes","blks_write_bytes_c","points_write","points_write_dedupe"],"values":[[20826,642075,682501,88818,30575]]},{"name":"httpd","tags":{"bind":":8086"},"columns":["points_written_ok","query_req","query_resp_bytes","req","write_req","write_req_bytes"],"values":[[60894,27,672898,705,678,7768479]]},{"name":"shard","tags":{"engine":"bz1","id":"3","path":"/var/opt/influxdb/data/_internal/monitor/3"},"columns":["fields_create","series_create","write_points_ok","write_req"],"values":[[14,4,17762,8639]]},{"name":"shard","tags":{"engine":"bz1","id":"7","path":"/var/opt/influxdb/data/_internal/monitor/7"},"columns":["fields_create","series_create","write_points_ok","write_req"],"values":[[28,5,1233,247]]},{"name":"shard","tags":{"engine":"bz1","id":"5","path":"/var/opt/influxdb/data/adserver/default/5"},"columns":["fields_create","series_create","write_points_ok","write_req"],"values":[[3,447,92980,1137]]},{"name":"wal","tags":{"path":"/var/opt/influxdb/wal/_internal/monitor/3"},"columns":["auto_flush","flush_duration","idle_flush","mem_size","meta_flush","points_flush","points_write","points_write_req","series_flush"],"values":[[88039,33.85889806100007,8639,0,148,17762,17762,8639,17762]]},{"name":"wal","tags":{"path":"/var/opt/influxdb/wal/_internal/monitor/7"},"columns":["auto_flush","flush_duration","idle_flush","mem_size","meta_flush","points_flush","points_write","points_write_req","series_flush"],"values":[[1560,185.93142649100008,240,373,4,1228,1233,247,1200]]},{"name":"wal","tags":{"path":"/var/opt/influxdb/wal/adserver/default/5"},"columns":["auto_flush","flush_duration","idle_flush","mem_size","meta_flush","points_flush","points_write","points_write_req","series_flush"],"values":[[3846,246.15171807599987,389,22627,8,88818,92980,1137,20826]]},{"name":"write","columns":["point_req","point_req_hh","point_req_local","point_req_remote","req","write_ok"],"values":[[60894,1695,60894,121788,678,674]]},{"name":"runtime","columns":["Alloc","Frees","HeapAlloc","HeapIdle","HeapInUse","HeapObjects","HeapReleased","HeapSys","Lookups","Mallocs","NumGC","NumGoroutine","PauseTotalNs","Sys","TotalAlloc"],"values":[[4506208,76268800,4506208,203055104,6594560,18098,136126464,209649664,38720,76286898,6472,58,3051072666,566470904,21025083896]]}]}]}

If I can provide any more data, please let me know.

@jackzampolin
Contributor

@clanstyles What volume are you using to back the m4.large? An SSD?

@clanstyles
Author

I removed the point writes from the system and response time went down from 1-2s to 60ms. The Influx servers are all still sitting at 100% CPU.

@beckettsean
Contributor

@clanstyles if you are writing a couple of million points per day in batches of 1 or 2 points each, that's on the order of a million HTTP requests per day to InfluxDB. That's a lot of per-request HTTP overhead for the server.

Batch the writes into 100+ points per batch, ideally 5k points or so, for best performance. Alternatively, use UDP for the writes.

The high CPU load is likely because the system is trying to drain the WAL (write ahead log), which can accept points faster than the database can persist them to the permanent storage engine.
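
As a rough sketch of that advice with the client/v2 Go package from the snippet earlier (editorial illustration, not from this thread: the package name, the writeInBatches helper, the chunk size, and the era-specific import path are all assumptions):

package analytics

import (
    client "github.com/influxdb/influxdb/client/v2" // import path of this era; it has since moved
)

// writeInBatches splits pts into chunks of batchSize and issues one HTTP
// write per chunk instead of one write per point or per request.
func writeInBatches(c client.Client, db string, pts []*client.Point, batchSize int) error {
    for start := 0; start < len(pts); start += batchSize {
        end := start + batchSize
        if end > len(pts) {
            end = len(pts)
        }
        bp, err := client.NewBatchPoints(client.BatchPointsConfig{
            Database:  db,
            Precision: "s",
        })
        if err != nil {
            return err
        }
        bp.AddPoints(pts[start:end])
        if err := c.Write(bp); err != nil {
            return err
        }
    }
    return nil
}

Accumulating the per-request points into a slice like pts is the missing piece; a goroutine-safe variant with a graceful final flush is sketched a few comments below.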

@clanstyles
Author

@beckettsean at this moment there are no writes happening, yet the CPU is still pegged at 100%. The current issue with batching points is the need to write a batch writer.

Since Influx's client doesn't support multiple goroutines, this instantly limits us to a "per context" write. When the server is shutting down, you also need to gracefully write any remaining logs. There are various steps that have to be implemented which I feel the Influx client should handle, but doesn't.

I monitored the influx disk IO on the drives, and it's next to nothing.
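
A goroutine-safe wrapper of the kind being described could look roughly like the sketch below (editorial illustration, not code from this thread; the batchWriter name, buffer size, thresholds, and the era-specific import path are all assumptions). One background goroutine owns the client, request handlers only send on a channel, and Close drains whatever is left at shutdown:

package analytics

import (
    "log"
    "time"

    client "github.com/influxdb/influxdb/client/v2" // import path of this era; it has since moved
)

// batchWriter serializes all writes through one background goroutine, so
// request handlers never share the client directly.
type batchWriter struct {
    points chan *client.Point
    done   chan struct{}
}

func newBatchWriter(c client.Client, db string, batchSize int, flushEvery time.Duration) *batchWriter {
    w := &batchWriter{
        points: make(chan *client.Point, 10000), // buffered to absorb request bursts
        done:   make(chan struct{}),
    }
    go w.loop(c, db, batchSize, flushEvery)
    return w
}

// Add is safe to call from any request goroutine.
func (w *batchWriter) Add(p *client.Point) { w.points <- p }

// Close stops accepting points and blocks until the final batch is written,
// covering the graceful-shutdown case mentioned above.
func (w *batchWriter) Close() {
    close(w.points)
    <-w.done
}

func (w *batchWriter) loop(c client.Client, db string, batchSize int, flushEvery time.Duration) {
    defer close(w.done)
    newBatch := func() client.BatchPoints {
        // "s" is a valid precision, so the error is ignored in this sketch
        bp, _ := client.NewBatchPoints(client.BatchPointsConfig{Database: db, Precision: "s"})
        return bp
    }
    bp := newBatch()
    flush := func() {
        if len(bp.Points()) == 0 {
            return
        }
        if err := c.Write(bp); err != nil {
            log.Printf("influx write failed: %v", err) // real code might retry or buffer
        }
        bp = newBatch()
    }
    ticker := time.NewTicker(flushEvery)
    defer ticker.Stop()
    for {
        select {
        case p, ok := <-w.points:
            if !ok { // channel closed by Close: final flush, then exit
                flush()
                return
            }
            bp.AddPoint(p)
            if len(bp.Points()) >= batchSize {
                flush()
            }
        case <-ticker.C:
            flush() // the size threshold may never be hit at low traffic, so flush on a timer too
        }
    }
}

Request handlers then call Add once per point, and the server calls Close before exiting so pending analytics are not dropped.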

Watching the logs, I don't understand how this is even possible:

[monitor] 2015/11/03 01:33:59 failed to store statistics: timeout
[wal] 2015/11/03 01:33:59 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:02 write to index of partition 1 took 3.197536498s
[wal] 2015/11/03 01:34:10 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:10 write to index of partition 1 took 5.159944ms
[wal] 2015/11/03 01:34:19 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:19 write to index of partition 1 took 3.162219ms
[wal] 2015/11/03 01:34:27 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:27 write to index of partition 1 took 2.518651ms
[wal] 2015/11/03 01:34:39 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:40 write to index of partition 1 took 1.605196271s
[wal] 2015/11/03 01:34:54 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:56 write to index of partition 1 took 747.790113ms
[monitor] 2015/11/03 01:35:01 failed to store statistics: timeout
[wal] 2015/11/03 01:35:02 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:04 write to index of partition 1 took 1.579186636s
[wal] 2015/11/03 01:35:09 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:10 write to index of partition 1 took 1.566484512s
[wal] 2015/11/03 01:35:19 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:22 write to index of partition 1 took 1.638281477s
[wal] 2015/11/03 01:35:30 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:30 write to index of partition 1 took 2.431985ms
[wal] 2015/11/03 01:35:41 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:41 write to index of partition 1 took 2.17642ms
[wal] 2015/11/03 01:35:48 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:51 write to index of partition 1 took 2.540024959s
[wal] 2015/11/03 01:35:59 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:01 write to index of partition 1 took 1.58264594s
[wal] 2015/11/03 01:36:06 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:11 write to index of partition 1 took 5.009611341s
[wal] 2015/11/03 01:36:18 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:19 write to index of partition 1 took 1.557241405s
[wal] 2015/11/03 01:36:30 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:30 write to index of partition 1 took 2.475459ms
[wal] 2015/11/03 01:36:36 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:40 write to index of partition 1 took 3.262602201s
[wal] 2015/11/03 01:36:48 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:53 write to index of partition 1 took 5.099431167s
[wal] 2015/11/03 01:36:59 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:37:02 write to index of partition 1 took 3.122957098s
[monitor] 2015/11/03 01:37:07 failed to store statistics: timeout
[wal] 2015/11/03 01:37:07 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1

@otoolep
Contributor

otoolep commented Nov 3, 2015

That looks like the internal stats being written to storage.

We need to suppress those log messages, as they are too verbose in default mode.

Are you running InfluxDB as a service?

@clanstyles
Author

Yes, on the standard Amazon AMI, following your guide.

After restarting all 3 nodes, the CPU went down on all but the 3rd (and last) node. It's still at 100%.

@otoolep
Contributor

otoolep commented Nov 3, 2015

At this stage it's pretty difficult to work out why, so a restart is probably best. However, we can run with profiling enabled to see if it re-occurs.

https://github.com/influxdb/influxdb/blob/master/CONTRIBUTING.md#profiling

Add -cpuprofile /tmp/influxd.prof to /etc/default/influxdb, which will write profiling data to that file. If it hits 100% again, we'd like that file, plus the exact version you are running.

@clanstyles
Author

I'll do that.

Just to update, we're seeing about 1,200 requests a minute that we want to record stats for.

@otoolep
Contributor

otoolep commented Nov 3, 2015

Any continuous queries in effect?

@clanstyles
Author

No. When it started writing, I didn't see more than 6 requests come through the ELB before the system went to 100% CPU and died.

@clanstyles
Author

So after the restart, it seems like influx isn't spiking anymore on any of the 3 nodes.

@otoolep
Contributor

otoolep commented Nov 3, 2015

OK, thanks @clanstyles -- if you are up for it, we can run with profiling enabled in case it re-occurs.

@clanstyles
Author

Ah, I think I figured out a trigger: delete a large amount of data -> CPU spikes -> HTTP times out.

@otoolep
Contributor

otoolep commented Nov 3, 2015

Can you show us your query?

@clanstyles
Author

DELETE FROM foo WHERE time > '2014-06-30'
DROP SERIES FROM impressions -- causes high load on 1 server, 0 on the others.

I'm trying to truncate data from the table.

@clanstyles
Author

Now the daemon is crashing. How do I upload the prof file?

@beckettsean
Contributor

DELETE is not a valid query. You're hitting a combination of that not being valid #1647 and the error not being thrown #4404.

I will escalate #4404 for the next release; I honestly forgot we hadn't fixed that.

DROP SERIES is a valid command and will cause high load for a while. InfluxDB assumes deletes are infrequent and usually in bulk, e.g. DROP MEASUREMENT or retention policy expiration. DROP SERIES is not as performant.
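
For illustration only (not from the thread), issuing the cheaper bulk form through the same client/v2 Go package could look like the following; the database and measurement names are taken from earlier comments, and dropMeasurement is a hypothetical helper:

package analytics

import (
    "fmt"

    client "github.com/influxdb/influxdb/client/v2" // import path of this era; it has since moved
)

// dropMeasurement issues DROP MEASUREMENT, the bulk-delete form recommended
// above in place of DROP SERIES (and of DELETE, which is not valid at all).
func dropMeasurement(c client.Client, db, measurement string) error {
    q := client.Query{
        Command:  fmt.Sprintf("DROP MEASUREMENT %q", measurement),
        Database: db,
    }
    resp, err := c.Query(q)
    if err != nil {
        return err
    }
    return resp.Error()
}

For example, dropMeasurement(c, "adserver", "impressions") for the measurement discussed earlier; time-bounded cleanup is what retention policy expiration is meant for.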

@jwilder
Contributor

jwilder commented Mar 18, 2016

Closing this since it's related to the b1 storage engine which has been replaced by the tsm engine in 0.10.

@jwilder jwilder closed this as completed Mar 18, 2016