[0.9.4.2] CPU 100% #4639

Closed
clanstyles opened this issue Nov 3, 2015 · 20 comments

Comments

@clanstyles

I have a 3-node cluster. Each node is an m4.large on Amazon, and the CPU is at 100%.

On each request, I batch-write all the analytics when the request closes. There are usually 2-3 types of analytics I log.

tags := map[string]string{
    "id":        imp.ID,
    "country":   imp.IPAddress.Country,
    "ipaddress": imp.IPAddress.Address,
    "os":        imp.OS,
}

fields := map[string]interface{}{
    "value": 1,
}

batchPoints.AddPoint(client.NewPoint(
    "impressions",
    tags,
    fields,
    time.Now().UTC(),
))

Query times are ~1 second (through the web interface). To load balance the requests, I've hooked up an Elastic Load Balancer to all the nodes and let requests rotate across them that way (not sure if this causes issues?).

I'm running influx 0.9.4.

@clanstyles changed the title from "CPU 100%" to "CPU 100% with 0.9.4.2" on Nov 3, 2015
@beckettsean changed the title from "CPU 100% with 0.9.4.2" to "[0.9.4.2] CPU 100%" on Nov 3, 2015
@beckettsean
Contributor

How many batches are you writing per second?
How many points per batch?
What queries are running and how many?

@clanstyles
Author

I'm running no queries, only writing.
A batch consists of either one or two points; initially a flood of two points.
I can't tell how many I'm writing per second. We see an average of 2M requests a day, spread out pretty evenly (from the CDN), though this does NOT mean all of them hit the web server. At the moment I can't measure it, because the latency is causing timeouts. I'll remove the Influx stats recording to see if I can track requests via the ELB and go from there.

(screenshot attached in the original issue)

This is what "show stats" is showing:

{"results":[{"series":[{"name":"engine","tags":{"path":"/var/opt/influxdb/data/_internal/monitor/3","version":"bz1"},"columns":["blks_write","blks_write_bytes","blks_write_bytes_c","points_write","points_write_dedupe"],"values":[[17762,1331034,1169989,17762,17762]]},{"name":"engine","tags":{"path":"/var/opt/influxdb/data/_internal/monitor/7","version":"bz1"},"columns":["blks_write","blks_write_bytes","blks_write_bytes_c","points_write","points_write_dedupe"],"values":[[1200,95652,83881,1228,1224]]},{"name":"engine","tags":{"path":"/var/opt/influxdb/data/adserver/default/5","version":"bz1"},"columns":["blks_write","blks_write_bytes","blks_write_bytes_c","points_write","points_write_dedupe"],"values":[[20826,642075,682501,88818,30575]]},{"name":"httpd","tags":{"bind":":8086"},"columns":["points_written_ok","query_req","query_resp_bytes","req","write_req","write_req_bytes"],"values":[[60894,27,672898,705,678,7768479]]},{"name":"shard","tags":{"engine":"bz1","id":"3","path":"/var/opt/influxdb/data/_internal/monitor/3"},"columns":["fields_create","series_create","write_points_ok","write_req"],"values":[[14,4,17762,8639]]},{"name":"shard","tags":{"engine":"bz1","id":"7","path":"/var/opt/influxdb/data/_internal/monitor/7"},"columns":["fields_create","series_create","write_points_ok","write_req"],"values":[[28,5,1233,247]]},{"name":"shard","tags":{"engine":"bz1","id":"5","path":"/var/opt/influxdb/data/adserver/default/5"},"columns":["fields_create","series_create","write_points_ok","write_req"],"values":[[3,447,92980,1137]]},{"name":"wal","tags":{"path":"/var/opt/influxdb/wal/_internal/monitor/3"},"columns":["auto_flush","flush_duration","idle_flush","mem_size","meta_flush","points_flush","points_write","points_write_req","series_flush"],"values":[[88039,33.85889806100007,8639,0,148,17762,17762,8639,17762]]},{"name":"wal","tags":{"path":"/var/opt/influxdb/wal/_internal/monitor/7"},"columns":["auto_flush","flush_duration","idle_flush","mem_size","meta_flush","points_flush","points_write","points_write_req","series_flush"],"values":[[1560,185.93142649100008,240,373,4,1228,1233,247,1200]]},{"name":"wal","tags":{"path":"/var/opt/influxdb/wal/adserver/default/5"},"columns":["auto_flush","flush_duration","idle_flush","mem_size","meta_flush","points_flush","points_write","points_write_req","series_flush"],"values":[[3846,246.15171807599987,389,22627,8,88818,92980,1137,20826]]},{"name":"write","columns":["point_req","point_req_hh","point_req_local","point_req_remote","req","write_ok"],"values":[[60894,1695,60894,121788,678,674]]},{"name":"runtime","columns":["Alloc","Frees","HeapAlloc","HeapIdle","HeapInUse","HeapObjects","HeapReleased","HeapSys","Lookups","Mallocs","NumGC","NumGoroutine","PauseTotalNs","Sys","TotalAlloc"],"values":[[4506208,76268800,4506208,203055104,6594560,18098,136126464,209649664,38720,76286898,6472,58,3051072666,566470904,21025083896]]}]}]}

If I can provide any more data, please let me know.

@jackzampolin
Contributor

@clanstyles What volume are you using to back the m4.large? An SSD?

@clanstyles
Author

I removed the point writes from the system and response time went down from 1-2s to 60ms. The Influx servers are all still sitting at 100% CPU.

@beckettsean
Contributor

@clanstyles if you are writing a couple of million points per day in batches of 1 or 2 points each, that's on the order of a million HTTP requests per day to InfluxDB. That's a lot of per-request HTTP overhead for the server.

Batch the writes into 100+ points per batch, ideally 5k points or so, for best performance. Alternatively, use UDP for the writes.

The high CPU load is likely because the system is trying to drain the WAL (write ahead log), which can accept points faster than the database can persist them to the permanent storage engine.
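
As a rough sketch of that advice with the client/v2 Go package from the snippet earlier (editorial illustration, not from this thread: the package name, the writeInBatches helper, the chunk size, and the era-specific import path are all assumptions):

package analytics

import (
    client "github.com/influxdb/influxdb/client/v2" // import path of this era; it has since moved
)

// writeInBatches splits pts into chunks of batchSize and issues one HTTP
// write per chunk instead of one write per point or per request.
func writeInBatches(c client.Client, db string, pts []*client.Point, batchSize int) error {
    for start := 0; start < len(pts); start += batchSize {
        end := start + batchSize
        if end > len(pts) {
            end = len(pts)
        }
        bp, err := client.NewBatchPoints(client.BatchPointsConfig{
            Database:  db,
            Precision: "s",
        })
        if err != nil {
            return err
        }
        bp.AddPoints(pts[start:end])
        if err := c.Write(bp); err != nil {
            return err
        }
    }
    return nil
}

Accumulating the per-request points into a slice like pts is the missing piece; a goroutine-safe variant with a graceful final flush is sketched a few comments below.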

@clanstyles
Author

@beckettsean at this moment there are no writes happening, yet the CPU is still pegged at 100%. The current issue with batching points is the need to write a batch writer.

Since Influx's client doesn't support multiple goroutines, this instantly limits us to a "per context" write. When the server is shutting down, you also need to gracefully write any remaining logs. There are various steps that have to be implemented which I feel the Influx client should handle, but doesn't.

I monitored the influx disk IO on the drives, and it's next to nothing.
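
A goroutine-safe wrapper of the kind being described could look roughly like the sketch below (editorial illustration, not code from this thread; the batchWriter name, buffer size, thresholds, and the era-specific import path are all assumptions). One background goroutine owns the client, request handlers only send on a channel, and Close drains whatever is left at shutdown:

package analytics

import (
    "log"
    "time"

    client "github.com/influxdb/influxdb/client/v2" // import path of this era; it has since moved
)

// batchWriter serializes all writes through one background goroutine, so
// request handlers never share the client directly.
type batchWriter struct {
    points chan *client.Point
    done   chan struct{}
}

func newBatchWriter(c client.Client, db string, batchSize int, flushEvery time.Duration) *batchWriter {
    w := &batchWriter{
        points: make(chan *client.Point, 10000), // buffered to absorb request bursts
        done:   make(chan struct{}),
    }
    go w.loop(c, db, batchSize, flushEvery)
    return w
}

// Add is safe to call from any request goroutine.
func (w *batchWriter) Add(p *client.Point) { w.points <- p }

// Close stops accepting points and blocks until the final batch is written,
// covering the graceful-shutdown case mentioned above.
func (w *batchWriter) Close() {
    close(w.points)
    <-w.done
}

func (w *batchWriter) loop(c client.Client, db string, batchSize int, flushEvery time.Duration) {
    defer close(w.done)
    newBatch := func() client.BatchPoints {
        // "s" is a valid precision, so the error is ignored in this sketch
        bp, _ := client.NewBatchPoints(client.BatchPointsConfig{Database: db, Precision: "s"})
        return bp
    }
    bp := newBatch()
    flush := func() {
        if len(bp.Points()) == 0 {
            return
        }
        if err := c.Write(bp); err != nil {
            log.Printf("influx write failed: %v", err) // real code might retry or buffer
        }
        bp = newBatch()
    }
    ticker := time.NewTicker(flushEvery)
    defer ticker.Stop()
    for {
        select {
        case p, ok := <-w.points:
            if !ok { // channel closed by Close: final flush, then exit
                flush()
                return
            }
            bp.AddPoint(p)
            if len(bp.Points()) >= batchSize {
                flush()
            }
        case <-ticker.C:
            flush() // the size threshold may never be hit at low traffic, so flush on a timer too
        }
    }
}

Request handlers then call Add once per point, and the server calls Close before exiting so pending analytics are not dropped.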

Watching the logs, I don't understand how this is even possible:

[monitor] 2015/11/03 01:33:59 failed to store statistics: timeout
[wal] 2015/11/03 01:33:59 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:02 write to index of partition 1 took 3.197536498s
[wal] 2015/11/03 01:34:10 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:10 write to index of partition 1 took 5.159944ms
[wal] 2015/11/03 01:34:19 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:19 write to index of partition 1 took 3.162219ms
[wal] 2015/11/03 01:34:27 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:27 write to index of partition 1 took 2.518651ms
[wal] 2015/11/03 01:34:39 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:40 write to index of partition 1 took 1.605196271s
[wal] 2015/11/03 01:34:54 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:34:56 write to index of partition 1 took 747.790113ms
[monitor] 2015/11/03 01:35:01 failed to store statistics: timeout
[wal] 2015/11/03 01:35:02 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:04 write to index of partition 1 took 1.579186636s
[wal] 2015/11/03 01:35:09 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:10 write to index of partition 1 took 1.566484512s
[wal] 2015/11/03 01:35:19 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:22 write to index of partition 1 took 1.638281477s
[wal] 2015/11/03 01:35:30 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:30 write to index of partition 1 took 2.431985ms
[wal] 2015/11/03 01:35:41 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:41 write to index of partition 1 took 2.17642ms
[wal] 2015/11/03 01:35:48 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:35:51 write to index of partition 1 took 2.540024959s
[wal] 2015/11/03 01:35:59 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:01 write to index of partition 1 took 1.58264594s
[wal] 2015/11/03 01:36:06 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:11 write to index of partition 1 took 5.009611341s
[wal] 2015/11/03 01:36:18 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:19 write to index of partition 1 took 1.557241405s
[wal] 2015/11/03 01:36:30 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:30 write to index of partition 1 took 2.475459ms
[wal] 2015/11/03 01:36:36 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:40 write to index of partition 1 took 3.262602201s
[wal] 2015/11/03 01:36:48 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:36:53 write to index of partition 1 took 5.099431167s
[wal] 2015/11/03 01:36:59 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1
[wal] 2015/11/03 01:37:02 write to index of partition 1 took 3.122957098s
[monitor] 2015/11/03 01:37:07 failed to store statistics: timeout
[wal] 2015/11/03 01:37:07 Flush due to idle. Flushing 3 series with 3 points and 195 bytes from partition 1

@otoolep
Contributor

otoolep commented Nov 3, 2015

That looks like the internal stats being written to storage.

We need to suppress those log messages, as they are too verbose in default mode.

Are you running InfluxDB as a service?

@clanstyles
Author

Yes, on the standard Amazon AMI, following your guide.

After restarting all 3 nodes, the CPU went down on all but the 3rd (and last) node. It's still at 100%.

@otoolep
Contributor

otoolep commented Nov 3, 2015

At this stage it's pretty difficult to work out why, so a restart is probably best. However, we can run with profiling enabled to see if it re-occurs.

https://github.com/influxdb/influxdb/blob/master/CONTRIBUTING.md#profiling

Add -cpuprofile /tmp/influxd.prof to /etc/default/influxdb, which will write profiling data to that file. If it hits 100% again, we'd like that file, plus the exact version you are running.

@clanstyles
Author

I'll do that.

Just to update, we're seeing about 1,200 requests a minute that we want to record stats for.

@otoolep
Contributor

otoolep commented Nov 3, 2015

Any continuous queries in effect?

@clanstyles
Author

No. When it started writing, I didn't see more than 6 requests come through the ELB before the system went to 100% CPU and died.

@clanstyles
Author

So after the restart, it seems like influx isn't spiking anymore on any of the 3 nodes.

@otoolep
Contributor

otoolep commented Nov 3, 2015

OK, thanks @clanstyles -- if you are up for it, we can run with profiling enabled in case it re-occurs.

@clanstyles
Author

Ah, I think I figured out a trigger: delete a large amount of data -> CPU spikes -> HTTP times out.

@otoolep
Contributor

otoolep commented Nov 3, 2015

Can you show us your query?

@clanstyles
Author

DELETE FROM foo WHERE time > '2014-06-30'
DROP SERIES FROM impressions -- causes high load on 1 server, 0 on the others.

I'm trying to truncate data from the table.

@clanstyles
Author

Now the daemon is crashing. How do I upload the prof file?

@beckettsean
Contributor

DELETE is not a valid query. You're hitting a combination of that not being valid #1647 and the error not being thrown #4404.

I will escalate #4404 for the next release; I honestly forgot we hadn't fixed that.

DROP SERIES is a valid command and will cause high load for a while. InfluxDB assumes deletes are infrequent and usually in bulk, e.g. DROP MEASUREMENT or retention policy expiration. DROP SERIES is not as performant.
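
For illustration only (not from the thread), issuing the cheaper bulk form through the same client/v2 Go package could look like the following; the database and measurement names are taken from earlier comments, and dropMeasurement is a hypothetical helper:

package analytics

import (
    "fmt"

    client "github.com/influxdb/influxdb/client/v2" // import path of this era; it has since moved
)

// dropMeasurement issues DROP MEASUREMENT, the bulk-delete form recommended
// above in place of DROP SERIES (and of DELETE, which is not valid at all).
func dropMeasurement(c client.Client, db, measurement string) error {
    q := client.Query{
        Command:  fmt.Sprintf("DROP MEASUREMENT %q", measurement),
        Database: db,
    }
    resp, err := c.Query(q)
    if err != nil {
        return err
    }
    return resp.Error()
}

For example, dropMeasurement(c, "adserver", "impressions") for the measurement discussed earlier; time-bounded cleanup is what retention policy expiration is meant for.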

@jwilder
Contributor

jwilder commented Mar 18, 2016

Closing this since it's related to the b1 storage engine which has been replaced by the tsm engine in 0.10.

@jwilder jwilder closed this as completed Mar 18, 2016