
integration: make read/write io timeout longer #5350

Closed
xiang90 wants to merge 1 commit

Conversation

xiang90
Contributor

@xiang90 xiang90 commented May 13, 2016

Fix #5022

@gyuho
Contributor

gyuho commented May 13, 2016

LGTM. Thanks.

@xiang90
Contributor Author

xiang90 commented May 13, 2016

@AkihiroSuda I am not exactly sure this is the right fix. Can you try to reproduce the issue with this patch? Thanks!

@AkihiroSuda
Contributor

Hi @xiang90, sorry for the late reply.

I tested your PR (77fde6a) with a printf-debug patch (#5022 (comment)), and I'm still getting an error (note that the error string seems to have changed!):

suda@suda-PC2:~/WORK/etcd⟫ GOPATH=$(pwd)/gopath go test -c -race github.com/coreos/etcd/integration
suda@suda-PC2:~/WORK/etcd⟫ ./integration.test -test.run TestIssue2746 -test.count 100
--- FAIL: TestIssue2746 (6.59s)
        cluster_test.go:356: create on http://127.0.0.1:21254 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: client: etcd member http://127.0.0.1:21254 has no leader
                )
FAIL

When I hit the issue on my 4-core Xeon (E3-1220 v3 @ 3.10GHz) machine, I was not doing anything special or running heavy applications (although some daemons and X11 were running).
Before running the test, the load average was under 1.0.
So I think the issue is not specific to slow machines.

I'll try to add more debug prints and investigate the cause.

@mitake
Contributor

mitake commented May 17, 2016

@AkihiroSuda Interesting. Do you have the logs of the etcd servers?

AkihiroSuda added a commit to AkihiroSuda/etcd that referenced this pull request May 17, 2016
I'm not sure how to get server logs for the integration test suite (`go test -v` doesn't show any logs):

--- FAIL: TestIssue2746 (6.67s)
        cluster_test.go:356: create on http://127.0.0.1:21564 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: client: etcd member http://127.0.0.1:21564 has no leader: {"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out, possibly due to previous leader failure","index":0}

                )
@AkihiroSuda
Contributor

More detailed error message (ErrTimeoutDueToLeaderFail):

--- FAIL: TestIssue2746 (6.67s)
        cluster_test.go:356: create on http://127.0.0.1:21564 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: client: etcd member http://127.0.0.1:21564 has no leader: {"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out, possibly due to previous leader failure","index":0}

                )

Printf-debug patch: AkihiroSuda@546fed2

@mitake
I'm still not sure how to get logs for the integration tests. `go test -v` didn't work for me.

@heyitsanthony
Contributor

@AkihiroSuda @mitake I'm getting full logs by changing init() in integration/v2_http_kv_test.go to another log level. The failed runs I'm seeing (ex. https://gist.github.com/c34360590438f87cd054003e7c2bdda2) look unremarkable compared to the passing runs.
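For anyone else trying this, a minimal sketch of what such an init() tweak might look like, assuming the integration test package controls server logging through capnslog's global log level (the actual contents of integration/v2_http_kv_test.go may differ):

    package integration

    import "github.com/coreos/pkg/capnslog"

    func init() {
        // The integration suite normally keeps server output quiet; raising the
        // global level to DEBUG makes the raft/v2http log lines visible when the
        // test binary is run directly.
        capnslog.SetGlobalLogLevel(capnslog.DEBUG)
    }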

@AkihiroSuda
Contributor

AkihiroSuda commented May 18, 2016

@heyitsanthony

This log line seems to appear almost exclusively in failed runs:

 raft: found conflict at index %d [existing term: %d, conflicting term: %d]

https://github.com/coreos/etcd/blob/master/raft/log.go#L120
test.txt (contains 100 runs, 5 failures, and 6 "found conflict" messages. All 5 failed runs contain the message, but 1 successful run contains it as well)
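For context, that message is produced by the raft log's conflict check when a node appends entries whose term disagrees with what it already has stored at the same index, which is what happens right after a re-election. A rough illustrative sketch of that kind of check (not etcd's actual implementation; see raft/log.go for the real one), assuming a log that starts at index 1 with no gaps:

    // Entry is a simplified raft log entry (hypothetical; etcd's real type
    // lives in raftpb).
    type Entry struct {
        Index uint64
        Term  uint64
    }

    // findConflict returns the index of the first incoming entry whose term
    // differs from the entry already stored at that index, or 0 if there is
    // no such entry. A non-zero result is the situation in which a
    // "found conflict at index ..." style message would be logged.
    func findConflict(existing, incoming []Entry) uint64 {
        for _, e := range incoming {
            if e.Index > uint64(len(existing)) {
                return 0 // past the end of the stored log: only new entries
            }
            if stored := existing[e.Index-1]; stored.Term != e.Term {
                // the existing term and the conflicting term would be logged here
                return e.Index
            }
        }
        return 0
    }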

@heyitsanthony
Contributor

@AkihiroSuda I'm seeing the same thing. I think the conflict is from an election and the key create call is failing because of a lost leader:

failure1:

2016-05-17 18:01:35.537971 D | v2http: [PUT] /v2/keys/foo905924239627700637?prevExist=false remote:127.0.0.1:47451
2016-05-17 18:01:35.577044 I | raft: 31d48e392eff524d is starting a new election at term 2
2016-05-17 18:01:35.577227 I | raft: 31d48e392eff524d became candidate at term 3
2016-05-17 18:01:35.577305 I | raft: 31d48e392eff524d received vote from 31d48e392eff524d at term 3
2016-05-17 18:01:35.577384 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to 689c78810a053338 at term 3
2016-05-17 18:01:35.577456 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to e100ba9816f5bc05 at term 3
2016-05-17 18:01:35.577527 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to 130e2a9f96ae6923 at term 3
2016-05-17 18:01:35.577597 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to 3f29caeae3aeba7b at term 3
2016-05-17 18:01:35.577661 I | raft: raft.node: 31d48e392eff524d lost leader 130e2a9f96ae6923 at term 3

failure2:

2016-05-17 18:01:15.672599 D | v2http: [PUT] /v2/keys/foo1710423896842187924?prevExist=false remote:127.0.0.1:59201
2016-05-17 18:01:15.714356 I | raft: 52c5a45f267b1cda [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.714575 I | raft: 52c5a45f267b1cda became follower at term 4
2016-05-17 18:01:15.714837 I | raft: 52c5a45f267b1cda [logterm: 3, index: 13, vote: 0] voted for 9668a13b909c1ad5 [logterm: 3, index: 13] at term 4
2016-05-17 18:01:15.714955 I | raft: raft.node: 52c5a45f267b1cda lost leader c65f9846abf9c609 at term 4
2016-05-17 18:01:15.715125 I | raft: 74c456ab7366def4 [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.715320 I | raft: 74c456ab7366def4 became follower at term 4
2016-05-17 18:01:15.715451 I | raft: 74c456ab7366def4 [logterm: 3, index: 13, vote: 0] voted for 9668a13b909c1ad5 [logterm: 3, index: 13] at term 4
2016-05-17 18:01:15.715529 I | raft: a7250e99bf59f0f3 [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.715803 I | raft: a7250e99bf59f0f3 became follower at term 4
2016-05-17 18:01:15.715925 I | raft: c65f9846abf9c609 [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.716122 I | raft: c65f9846abf9c609 became follower at term 4
2016-05-17 18:01:15.716229 I | raft: c65f9846abf9c609 [logterm: 3, index: 14, vote: 0] rejected vote from 9668a13b909c1ad5 [logterm: 3, index: 13] at term 4
2016-05-17 18:01:15.716317 I | raft: raft.node: c65f9846abf9c609 lost leader c65f9846abf9c609 at term 4

@mitake
Contributor

mitake commented May 19, 2016

Is returning an error while there is no leader actually invalid? I don't think the test failure indicates a bug in etcd: Raft is a leader-based consensus algorithm, so being unavailable while there is no leader seems like valid behavior. How about just retrying requests from the client to handle this case (maybe a more detailed error code is required)?
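A minimal sketch of what such a client-side retry could look like, assuming the caller simply wraps the failing create call and backs off while the cluster reports no leader (the function name and the string-based error check below are illustrative, not etcd's client API):

    import (
        "strings"
        "time"
    )

    // retryOnNoLeader retries op a few times with a growing backoff, on the
    // assumption that "no leader" / leader-failure errors are transient and
    // resolve once a new election completes.
    func retryOnNoLeader(op func() error) error {
        var err error
        for attempt := 0; attempt < 5; attempt++ {
            if err = op(); err == nil {
                return nil
            }
            // A real client would inspect a typed error (e.g. the
            // ErrTimeoutDueToLeaderFail case above) instead of matching strings.
            if !strings.Contains(err.Error(), "no leader") &&
                !strings.Contains(err.Error(), "request timed out") {
                return err
            }
            time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond)
        }
        return err
    }

The test (or a helper in cluster_test.go) would then wrap the key-create call in retryOnNoLeader instead of failing on the first no-leader error.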

@mitake
Contributor

mitake commented May 19, 2016

I created PR #5395. It reduces the test failures according to @AkihiroSuda's test, but the fix is only partial for the reason I described above. To remove the failure completely, I think a client-side retry would be needed.

AkihiroSuda added a commit to AkihiroSuda/etcd that referenced this pull request May 19, 2016
…rorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out","index":0}
@xiang90 xiang90 closed this Jun 1, 2016