Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd high cpu usage #1742

Closed
TheHippo opened this issue Nov 18, 2014 · 14 comments
Closed

etcd high cpu usage #1742

TheHippo opened this issue Nov 18, 2014 · 14 comments
Milestone

Comments

@TheHippo
Copy link
Contributor

I'm currently trying out CoreOS with 3 virtual machines via Virtual Box and found out, that the cpu usage for etcd grows over time. (From 35% to 90% over the course of half an hour.) This seem to be happening regardless of the CoreOS update channel I use.

I'm not an expert with etcd, but I'm willing to provide addional data if this helps solve this issue.

@TheHippo
Copy link
Contributor Author

Output from curl http://127.0.0.1:7001/v2/admin/machines over a span of 2 minutes:

[
  {
    "name": "508913e0694b4354be7ae5fd0c044e14",
    "state": "follower",
    "clientURL": "http://172.17.8.101:4001",
    "peerURL": "http://172.17.8.101:7001"
  },
  {
    "name": "7762f6e808934d57b0092100cf600743",
    "state": "follower",
    "clientURL": "http://172.17.8.160:4001",
    "peerURL": "http://172.17.8.160:7001"
  },
  {
    "name": "9632e08e866d4a769436ab7c2fc70e6b",
    "state": "leader",
    "clientURL": "http://172.17.8.150:4001",
    "peerURL": "http://172.17.8.150:7001"
  }
]
[
  {
    "name": "508913e0694b4354be7ae5fd0c044e14",
    "state": "follower",
    "clientURL": "http://172.17.8.101:4001",
    "peerURL": "http://172.17.8.101:7001"
  },
  {
    "name": "7762f6e808934d57b0092100cf600743",
    "state": "leader",
    "clientURL": "http://172.17.8.160:4001",
    "peerURL": "http://172.17.8.160:7001"
  },
  {
    "name": "9632e08e866d4a769436ab7c2fc70e6b",
    "state": "follower",
    "clientURL": "http://172.17.8.150:4001",
    "peerURL": "http://172.17.8.150:7001"
  }
]
[
  {
    "name": "508913e0694b4354be7ae5fd0c044e14",
    "state": "leader",
    "clientURL": "http://172.17.8.101:4001",
    "peerURL": "http://172.17.8.101:7001"
  },
  {
    "name": "7762f6e808934d57b0092100cf600743",
    "state": "follower",
    "clientURL": "http://172.17.8.160:4001",
    "peerURL": "http://172.17.8.160:7001"
  },
  {
    "name": "9632e08e866d4a769436ab7c2fc70e6b",
    "state": "follower",
    "clientURL": "http://172.17.8.150:4001",
    "peerURL": "http://172.17.8.150:7001"
  }
]

@TheHippo
Copy link
Contributor Author

Ok, this might be related to: #868

@philips philips added this to the 0.4.7 milestone Nov 19, 2014
@philips
Copy link
Contributor

philips commented Nov 19, 2014

@TheHippo Are you seeing thrashing?

@TheHippo
Copy link
Contributor Author

@philips What exactly do you mean with trashing? Are there any more information I could provide to help your solve this issue?

@TheHippo
Copy link
Contributor Author

I uploaded a log file here.

Last 40 lines:

Nov 19 14:15:07 helper-01 etcd[1030]: [etcd] Nov 19 14:15:07.098 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #492 started.
Nov 19 14:15:07 helper-01 etcd[1030]: [etcd] Nov 19 14:15:07.099 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '02bd3563e2eb4b589f88125d0b959616' to ''.
Nov 19 14:15:07 helper-01 etcd[1030]: [etcd] Nov 19 14:15:07.840 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 214.670341ms
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.954 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.955 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.962 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:08 helper-01 etcd[1030]: [etcd] Nov 19 14:15:08.962 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.
Nov 19 14:15:11 helper-01 etcd[1030]: [etcd] Nov 19 14:15:11.651 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'leader' to 'follower'.
Nov 19 14:15:11 helper-01 etcd[1030]: [etcd] Nov 19 14:15:11.651 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #494 started.
Nov 19 14:15:11 helper-01 etcd[1030]: [etcd] Nov 19 14:15:11.651 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '7d0a4c9cf4f646b4b94af9e7932b4f15' to ''.
Nov 19 14:15:12 helper-01 etcd[1030]: [etcd] Nov 19 14:15:12.842 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 182.529672ms
Nov 19 14:15:14 helper-01 etcd[1030]: [etcd] Nov 19 14:15:14.954 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #495 started.
Nov 19 14:15:14 helper-01 etcd[1030]: [etcd] Nov 19 14:15:14.955 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:16 helper-01 etcd[1030]: [etcd] Nov 19 14:15:16.442 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #496 started.
Nov 19 14:15:16 helper-01 etcd[1030]: [etcd] Nov 19 14:15:16.442 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '02bd3563e2eb4b589f88125d0b959616' to ''.
Nov 19 14:15:17 helper-01 etcd[1030]: [etcd] Nov 19 14:15:17.839 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 190.602625ms
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.610 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.611 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.642 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:18 helper-01 etcd[1030]: [etcd] Nov 19 14:15:18.642 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.
Nov 19 14:15:26 helper-01 etcd[1030]: [etcd] Nov 19 14:15:26.893 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'leader' to 'follower'.
Nov 19 14:15:26 helper-01 etcd[1030]: [etcd] Nov 19 14:15:26.893 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #498 started.
Nov 19 14:15:26 helper-01 etcd[1030]: [etcd] Nov 19 14:15:26.893 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '7d0a4c9cf4f646b4b94af9e7932b4f15' to ''.
Nov 19 14:15:27 helper-01 etcd[1030]: [etcd] Nov 19 14:15:27.154 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:27 helper-01 etcd[1030]: [etcd] Nov 19 14:15:27.163 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:27 helper-01 etcd[1030]: [etcd] Nov 19 14:15:27.163 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.115 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'leader' to 'follower'.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.115 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #500 started.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.115 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '7d0a4c9cf4f646b4b94af9e7932b4f15' to ''.
Nov 19 14:15:31 helper-01 etcd[1030]: [etcd] Nov 19 14:15:31.879 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 203.840044ms
Nov 19 14:15:32 helper-01 etcd[1030]: [etcd] Nov 19 14:15:32.888 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 162.945705ms
Nov 19 14:15:33 helper-01 etcd[1030]: [etcd] Nov 19 14:15:33.883 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 205.580783ms
Nov 19 14:15:36 helper-01 etcd[1030]: [etcd] Nov 19 14:15:36.605 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #501 started.
Nov 19 14:15:36 helper-01 etcd[1030]: [etcd] Nov 19 14:15:36.605 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:36 helper-01 etcd[1030]: [etcd] Nov 19 14:15:36.827 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: term #502 started.
Nov 19 14:15:37 helper-01 etcd[1030]: [etcd] Nov 19 14:15:37.842 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: warning: heartbeat near election timeout: 194.766931ms
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.974 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'follower' to 'candidate'.
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.974 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '8b35c85a473d4b6585fddaf3cd3b025b' to ''.
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.984 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: state changed from 'candidate' to 'leader'.
Nov 19 14:15:38 helper-01 etcd[1030]: [etcd] Nov 19 14:15:38.984 INFO      | 7d0a4c9cf4f646b4b94af9e7932b4f15: leader changed from '' to '7d0a4c9cf4f646b4b94af9e7932b4f15'.

@TheHippo
Copy link
Contributor Author

I found a way to "fix" this issue.

I added the following lines to my cloud-config, which makes the elections more stable.

write_files:
- path: /run/systemd/system/etcd.service.d/30-timing.conf
  permissions: 420
  content: |
    [Service]
    # Peer Env Vars
    Environment=ETCD_PEER_HEARTBEAT_INTERVAL=150
    Environment=ETCD_PEER_ELECTION_TIMEOUT=750

@nakker
Copy link

nakker commented Nov 24, 2014

I'm also seeing similar behavior on 3 test nodes. Happy to provide details if needed.

@yichengq
Copy link
Contributor

@TheHippo @nakker

  1. it is reasonable to set longer heartbeat interval if it takes that much time for a heartbeat round to finish.
  2. it may happen that some non-cluster member is running around and keeps staring new leader election.

It is good to have more details.

@nakker Open new issues with details if you want to report it. :)

@yichengq
Copy link
Contributor

Feel free to reopen it if this is still a problem.

@mlazarov
Copy link

I do have the same problem. The "fix" just makes the coreos cluster more stable (i.e. not electing new master every second), but still the etcd daemon takes 80-90% of one core

screen shot 2015-03-18 at 23 14 12

Currently the server is brand new - without any docker instances:

core@core-01 ~ $ fleetctl list-units
UNIT    MACHINE ACTIVE  SUB
core@core-01 ~ $ fleetctl list-unit-files
UNIT    HASH    DSTATE  STATE   TARGET
core@core-01 ~ $

@yichengq
Copy link
Contributor

@mlazarov We have moved to 2.0.5, and gonna release it on CoreOS very soon. Please give it a try, and it should be much stable.

@oguzbilgic
Copy link

Same issue on coreos-stable-633-1-0-v20150414

@TheHippo
Copy link
Contributor Author

@oguzbilgic Open an new issue if this problem still exists.

@dminkovsky
Copy link

I'm experiencing this problem. Had lots of logging about elections, so I wiped the data dir, thinking perhaps etcd keeps trying to reelect with old, non-extant peers. Now the logging has stopped, but etcd is still at 95% CPU.

Not sure why a new issue needs to be opened when this issue and #868 were closed without any concrete resolution steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants