etcd high cpu usage #1742

TheHippo · 2014-11-18T14:59:37Z

I'm currently trying out CoreOS with 3 virtual machines via Virtual Box and found out, that the cpu usage for etcd grows over time. (From 35% to 90% over the course of half an hour.) This seem to be happening regardless of the CoreOS update channel I use.

I'm not an expert with etcd, but I'm willing to provide addional data if this helps solve this issue.

The text was updated successfully, but these errors were encountered:

TheHippo · 2014-11-18T15:01:46Z

Output from curl http://127.0.0.1:7001/v2/admin/machines over a span of 2 minutes:

[
  {
    "name": "508913e0694b4354be7ae5fd0c044e14",
    "state": "follower",
    "clientURL": "http://172.17.8.101:4001",
    "peerURL": "http://172.17.8.101:7001"
  },
  {
    "name": "7762f6e808934d57b0092100cf600743",
    "state": "follower",
    "clientURL": "http://172.17.8.160:4001",
    "peerURL": "http://172.17.8.160:7001"
  },
  {
    "name": "9632e08e866d4a769436ab7c2fc70e6b",
    "state": "leader",
    "clientURL": "http://172.17.8.150:4001",
    "peerURL": "http://172.17.8.150:7001"
  }
]

[
  {
    "name": "508913e0694b4354be7ae5fd0c044e14",
    "state": "follower",
    "clientURL": "http://172.17.8.101:4001",
    "peerURL": "http://172.17.8.101:7001"
  },
  {
    "name": "7762f6e808934d57b0092100cf600743",
    "state": "leader",
    "clientURL": "http://172.17.8.160:4001",
    "peerURL": "http://172.17.8.160:7001"
  },
  {
    "name": "9632e08e866d4a769436ab7c2fc70e6b",
    "state": "follower",
    "clientURL": "http://172.17.8.150:4001",
    "peerURL": "http://172.17.8.150:7001"
  }
]

[
  {
    "name": "508913e0694b4354be7ae5fd0c044e14",
    "state": "leader",
    "clientURL": "http://172.17.8.101:4001",
    "peerURL": "http://172.17.8.101:7001"
  },
  {
    "name": "7762f6e808934d57b0092100cf600743",
    "state": "follower",
    "clientURL": "http://172.17.8.160:4001",
    "peerURL": "http://172.17.8.160:7001"
  },
  {
    "name": "9632e08e866d4a769436ab7c2fc70e6b",
    "state": "follower",
    "clientURL": "http://172.17.8.150:4001",
    "peerURL": "http://172.17.8.150:7001"
  }
]

TheHippo · 2014-11-18T15:03:36Z

Ok, this might be related to: #868

philips · 2014-11-19T05:00:51Z

@TheHippo Are you seeing thrashing?

TheHippo · 2014-11-19T13:23:45Z

@philips What exactly do you mean with trashing? Are there any more information I could provide to help your solve this issue?

TheHippo · 2014-11-19T14:20:33Z

I uploaded a log file here.

Last 40 lines:

Nov 19 14:15:07 helper-01 etcd[1030]: [etcd] Nov 19 14:15:07.098 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #492 started.
Nov 19 14:15:07 helper-01 etcd[1030]: [etcd] Nov 19 14:15:07.099 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '02bd3563e2eb4b589f88125d0b959616' to ''.
Nov 19 14:15:07 helper-01 etcd[1030]: [etcd] Nov 19 14:15:07.840 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 214.670341ms
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.954 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.955 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.962 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.962 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.
Nov 19 14:15:11 helper-01 etcd[1030]: [etcd] Nov 19 14:15:11.651 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'leader' to 'follower'.
Nov 19 14:15:11 helper-01 etcd[1030]: [etcd] Nov 19 14:15:11.651 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #494 started.
Nov 19 14:15:11 helper-01 etcd[1030]: [etcd] Nov 19 14:15:11.651 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '7d0a4c9cf4f646b4b94af9e7932b4f15' to ''.
Nov 19 14:15:12 helper-01 etcd[1030]: [etcd] Nov 19 14:15:12.842 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 182.529672ms
Nov 19 14:15:14 helper-01 etcd[1030]: [etcd] Nov 19 14:15:14.954 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #495 started.
Nov 19 14:15:14 helper-01 etcd[1030]: [etcd] Nov 19 14:15:14.955 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:16 helper-01 etcd[1030]: [etcd] Nov 19 14:15:16.442 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #496 started.
Nov 19 14:15:16 helper-01 etcd[1030]: [etcd] Nov 19 14:15:16.442 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '02bd3563e2eb4b589f88125d0b959616' to ''.
Nov 19 14:15:17 helper-01 etcd[1030]: [etcd] Nov 19 14:15:17.839 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 190.602625ms
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.610 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.611 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.642 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.642 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.
Nov 19 14:15:26 helper-01 etcd[1030]: [etcd] Nov 19 14:15:26.893 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'leader' to 'follower'.
Nov 19 14:15:26 helper-01 etcd[1030]: [etcd] Nov 19 14:15:26.893 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #498 started.
Nov 19 14:15:26 helper-01 etcd[1030]: [etcd] Nov 19 14:15:26.893 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '7d0a4c9cf4f646b4b94af9e7932b4f15' to ''.
Nov 19 14:15:27 helper-01 etcd[1030]: [etcd] Nov 19 14:15:27.154 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:27 helper-01 etcd[1030]: [etcd] Nov 19 14:15:27.163 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:27 helper-01 etcd[1030]: [etcd] Nov 19 14:15:27.163 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.115 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'leader' to 'follower'.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.115 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #500 started.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.115 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '7d0a4c9cf4f646b4b94af9e7932b4f15' to ''.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.879 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 203.840044ms
Nov 19 14:15:32 helper-01 etcd[1030]: [etcd] Nov 19 14:15:32.888 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 162.945705ms
Nov 19 14:15:33 helper-01 etcd[1030]: [etcd] Nov 19 14:15:33.883 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 205.580783ms
Nov 19 14:15:36 helper-01 etcd[1030]: [etcd] Nov 19 14:15:36.605 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #501 started.
Nov 19 14:15:36 helper-01 etcd[1030]: [etcd] Nov 19 14:15:36.605 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:36 helper-01 etcd[1030]: [etcd] Nov 19 14:15:36.827 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #502 started.
Nov 19 14:15:37 helper-01 etcd[1030]: [etcd] Nov 19 14:15:37.842 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 194.766931ms
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.974 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.974 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.984 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.984 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.

TheHippo · 2014-11-19T15:06:04Z

I found a way to "fix" this issue.

I added the following lines to my cloud-config, which makes the elections more stable.

write_files:
- path: /run/systemd/system/etcd.service.d/30-timing.conf
  permissions: 420
  content: |
    [Service]
    # Peer Env Vars
    Environment=ETCD_PEER_HEARTBEAT_INTERVAL=150
    Environment=ETCD_PEER_ELECTION_TIMEOUT=750

nakker · 2014-11-24T18:41:49Z

I'm also seeing similar behavior on 3 test nodes. Happy to provide details if needed.

yichengq · 2014-11-25T22:11:42Z

@TheHippo @nakker

it is reasonable to set longer heartbeat interval if it takes that much time for a heartbeat round to finish.
it may happen that some non-cluster member is running around and keeps staring new leader election.

It is good to have more details.

@nakker Open new issues with details if you want to report it. :)

yichengq · 2014-12-15T04:38:08Z

Feel free to reopen it if this is still a problem.

mlazarov · 2015-03-18T21:17:19Z

I do have the same problem. The "fix" just makes the coreos cluster more stable (i.e. not electing new master every second), but still the etcd daemon takes 80-90% of one core

Currently the server is brand new - without any docker instances:

core@core-01 ~ $ fleetctl list-units
UNIT    MACHINE ACTIVE  SUB
core@core-01 ~ $ fleetctl list-unit-files
UNIT    HASH    DSTATE  STATE   TARGET
core@core-01 ~ $

yichengq · 2015-03-18T21:22:50Z

@mlazarov We have moved to 2.0.5, and gonna release it on CoreOS very soon. Please give it a try, and it should be much stable.

oguzbilgic · 2015-04-24T17:20:52Z

Same issue on coreos-stable-633-1-0-v20150414

TheHippo · 2015-04-27T14:15:26Z

@oguzbilgic Open an new issue if this problem still exists.

dminkovsky · 2015-08-21T15:18:35Z

I'm experiencing this problem. Had lots of logging about elections, so I wiped the data dir, thinking perhaps etcd keeps trying to reelect with old, non-extant peers. Now the logging has stopped, but etcd is still at 95% CPU.

Not sure why a new issue needs to be opened when this issue and #868 were closed without any concrete resolution steps.

philips added this to the 0.4.7 milestone Nov 19, 2014

philips added the type/question label Nov 19, 2014

yichengq closed this as completed Dec 15, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd high cpu usage #1742

etcd high cpu usage #1742

TheHippo commented Nov 18, 2014

TheHippo commented Nov 18, 2014

TheHippo commented Nov 18, 2014

philips commented Nov 19, 2014

TheHippo commented Nov 19, 2014

TheHippo commented Nov 19, 2014

TheHippo commented Nov 19, 2014

nakker commented Nov 24, 2014

yichengq commented Nov 25, 2014

yichengq commented Dec 15, 2014

mlazarov commented Mar 18, 2015

yichengq commented Mar 18, 2015

oguzbilgic commented Apr 24, 2015

TheHippo commented Apr 27, 2015

dminkovsky commented Aug 21, 2015

etcd high cpu usage #1742

etcd high cpu usage #1742

Comments

TheHippo commented Nov 18, 2014

TheHippo commented Nov 18, 2014

TheHippo commented Nov 18, 2014

philips commented Nov 19, 2014

TheHippo commented Nov 19, 2014

TheHippo commented Nov 19, 2014

TheHippo commented Nov 19, 2014

nakker commented Nov 24, 2014

yichengq commented Nov 25, 2014

yichengq commented Dec 15, 2014

mlazarov commented Mar 18, 2015

yichengq commented Mar 18, 2015

oguzbilgic commented Apr 24, 2015

TheHippo commented Apr 27, 2015

dminkovsky commented Aug 21, 2015