
integration: make read/write io timeout longer #5350

Closed
xiang90 wants to merge 1 commit

Conversation

xiang90
Contributor

@xiang90 xiang90 commented May 13, 2016

Fix #5022

@gyuho
Contributor

gyuho commented May 13, 2016

LGTM. Thanks.

@xiang90
Contributor Author

xiang90 commented May 13, 2016

@AkihiroSuda I am not exactly sure this is the right fix. Can you try to reproduce the issue with this patch? Thanks!

@AkihiroSuda
Contributor

Hi @xiang90, sorry for the late reply.

I tested your PR (77fde6a) with a printf-debug patch (#5022 (comment)), and I'm still getting an error (note that the error string seems to have changed!):

suda@suda-PC2:~/WORK/etcd⟫ GOPATH=$(pwd)/gopath go test -c -race github.com/coreos/etcd/integration
suda@suda-PC2:~/WORK/etcd⟫ ./integration.test -test.run TestIssue2746 -test.count 100
--- FAIL: TestIssue2746 (6.59s)
        cluster_test.go:356: create on http://127.0.0.1:21254 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: client: etcd member http://127.0.0.1:21254 has no leader
                )
FAIL

When I hit the issue on my 4-core Xeon (E3-1220 v3 @ 3.10GHz) machine, I was not doing anything special or running heavy applications (although some daemons and X11 were running).
Before running the test, the load average was under 1.0.
So I think the issue is not specific to slow machines.

I'll try to add more debug prints and investigate the cause.

@mitake
Contributor

mitake commented May 17, 2016

@AkihiroSuda Interesting. Do you have the logs of the etcd servers?

AkihiroSuda added a commit to AkihiroSuda/etcd that referenced this pull request May 17, 2016
I'm not sure how to get server logs for the integration test suite (`go test -v` doesn't show any logs):

--- FAIL: TestIssue2746 (6.67s)
        cluster_test.go:356: create on http://127.0.0.1:21564 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: client: etcd member http://127.0.0.1:21564 has no leader: {"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out, possibly due to previous leader failure","index":0}

                )
@AkihiroSuda
Contributor

More detailed error message (ErrTimeoutDueToLeaderFail):

--- FAIL: TestIssue2746 (6.67s)
        cluster_test.go:356: create on http://127.0.0.1:21564 error: client: etcd cluster is unavailable or misconfigured(detail: error #0: client: etcd member http://127.0.0.1:21564 has no leader: {"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out, possibly due to previous leader failure","index":0}

                )

Printf-debug patch: AkihiroSuda@546fed2

@mitake
I'm still not sure how to get logs for the integration tests. `go test -v` didn't work for me.

@heyitsanthony
Contributor

@AkihiroSuda @mitake I'm getting full logs by changing init() in integration/v2_http_kv_test.go to another log level. The failed runs I'm seeing (ex. https://gist.github.com/c34360590438f87cd054003e7c2bdda2) look unremarkable compared to the passing runs.
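For anyone else trying this, a minimal sketch of what such an init() tweak might look like, assuming the integration test package controls server logging through capnslog's global log level (the actual contents of integration/v2_http_kv_test.go may differ):

    package integration

    import "github.com/coreos/pkg/capnslog"

    func init() {
        // The integration suite normally keeps server output quiet; raising the
        // global level to DEBUG makes the raft/v2http log lines visible when the
        // test binary is run directly.
        capnslog.SetGlobalLogLevel(capnslog.DEBUG)
    }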

@AkihiroSuda
Contributor

AkihiroSuda commented May 18, 2016

@heyitsanthony

This log line seems to appear almost exclusively in failed runs:

 raft: found conflict at index %d [existing term: %d, conflicting term: %d]

https://github.com/coreos/etcd/blob/master/raft/log.go#L120
test.txt (contains 100 runs, 5 failures, and 6 "found conflict" messages. All 5 failed runs contain the message, but 1 successful run contains it as well)
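For context, that message is produced by the raft log's conflict check when a node appends entries whose term disagrees with what it already has stored at the same index, which is what happens right after a re-election. A rough illustrative sketch of that kind of check (not etcd's actual implementation; see raft/log.go for the real one), assuming a log that starts at index 1 with no gaps:

    // Entry is a simplified raft log entry (hypothetical; etcd's real type
    // lives in raftpb).
    type Entry struct {
        Index uint64
        Term  uint64
    }

    // findConflict returns the index of the first incoming entry whose term
    // differs from the entry already stored at that index, or 0 if there is
    // no such entry. A non-zero result is the situation in which a
    // "found conflict at index ..." style message would be logged.
    func findConflict(existing, incoming []Entry) uint64 {
        for _, e := range incoming {
            if e.Index > uint64(len(existing)) {
                return 0 // past the end of the stored log: only new entries
            }
            if stored := existing[e.Index-1]; stored.Term != e.Term {
                // the existing term and the conflicting term would be logged here
                return e.Index
            }
        }
        return 0
    }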

@heyitsanthony
Contributor

@AkihiroSuda I'm seeing the same thing. I think the conflict is from an election and the key create call is failing because of a lost leader:

failure1:

2016-05-17 18:01:35.537971 D | v2http: [PUT] /v2/keys/foo905924239627700637?prevExist=false remote:127.0.0.1:47451
2016-05-17 18:01:35.577044 I | raft: 31d48e392eff524d is starting a new election at term 2
2016-05-17 18:01:35.577227 I | raft: 31d48e392eff524d became candidate at term 3
2016-05-17 18:01:35.577305 I | raft: 31d48e392eff524d received vote from 31d48e392eff524d at term 3
2016-05-17 18:01:35.577384 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to 689c78810a053338 at term 3
2016-05-17 18:01:35.577456 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to e100ba9816f5bc05 at term 3
2016-05-17 18:01:35.577527 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to 130e2a9f96ae6923 at term 3
2016-05-17 18:01:35.577597 I | raft: 31d48e392eff524d [logterm: 2, index: 12] sent vote request to 3f29caeae3aeba7b at term 3
2016-05-17 18:01:35.577661 I | raft: raft.node: 31d48e392eff524d lost leader 130e2a9f96ae6923 at term 3

failure2:

2016-05-17 18:01:15.672599 D | v2http: [PUT] /v2/keys/foo1710423896842187924?prevExist=false remote:127.0.0.1:59201
2016-05-17 18:01:15.714356 I | raft: 52c5a45f267b1cda [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.714575 I | raft: 52c5a45f267b1cda became follower at term 4
2016-05-17 18:01:15.714837 I | raft: 52c5a45f267b1cda [logterm: 3, index: 13, vote: 0] voted for 9668a13b909c1ad5 [logterm: 3, index: 13] at term 4
2016-05-17 18:01:15.714955 I | raft: raft.node: 52c5a45f267b1cda lost leader c65f9846abf9c609 at term 4
2016-05-17 18:01:15.715125 I | raft: 74c456ab7366def4 [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.715320 I | raft: 74c456ab7366def4 became follower at term 4
2016-05-17 18:01:15.715451 I | raft: 74c456ab7366def4 [logterm: 3, index: 13, vote: 0] voted for 9668a13b909c1ad5 [logterm: 3, index: 13] at term 4
2016-05-17 18:01:15.715529 I | raft: a7250e99bf59f0f3 [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.715803 I | raft: a7250e99bf59f0f3 became follower at term 4
2016-05-17 18:01:15.715925 I | raft: c65f9846abf9c609 [term: 3] received a MsgVote message with higher term from 9668a13b909c1ad5 [term: 4]
2016-05-17 18:01:15.716122 I | raft: c65f9846abf9c609 became follower at term 4
2016-05-17 18:01:15.716229 I | raft: c65f9846abf9c609 [logterm: 3, index: 14, vote: 0] rejected vote from 9668a13b909c1ad5 [logterm: 3, index: 13] at term 4
2016-05-17 18:01:15.716317 I | raft: raft.node: c65f9846abf9c609 lost leader c65f9846abf9c609 at term 4

@mitake
Contributor

mitake commented May 19, 2016

Is returning an error while there is no leader actually invalid? I don't think the test failure indicates a bug in etcd: Raft is a leader-based consensus algorithm, so being unavailable while there is no leader seems like valid behavior. How about just retrying requests from the client to handle this case (maybe a more detailed error code is required)?
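A minimal sketch of what such a client-side retry could look like, assuming the caller simply wraps the failing create call and backs off while the cluster reports no leader (the function name and the string-based error check below are illustrative, not etcd's client API):

    import (
        "strings"
        "time"
    )

    // retryOnNoLeader retries op a few times with a growing backoff, on the
    // assumption that "no leader" / leader-failure errors are transient and
    // resolve once a new election completes.
    func retryOnNoLeader(op func() error) error {
        var err error
        for attempt := 0; attempt < 5; attempt++ {
            if err = op(); err == nil {
                return nil
            }
            // A real client would inspect a typed error (e.g. the
            // ErrTimeoutDueToLeaderFail case above) instead of matching strings.
            if !strings.Contains(err.Error(), "no leader") &&
                !strings.Contains(err.Error(), "request timed out") {
                return err
            }
            time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond)
        }
        return err
    }

The test (or a helper in cluster_test.go) would then wrap the key-create call in retryOnNoLeader instead of failing on the first no-leader error.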

@mitake
Contributor

mitake commented May 19, 2016

I created PR #5395. It reduces the test failures according to @AkihiroSuda's test, but the fix is only partial for the reason I described above. To remove the failure completely, I think a client-side retry would be needed.

AkihiroSuda added a commit to AkihiroSuda/etcd that referenced this pull request May 19, 2016
…rorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out","index":0}
@xiang90 xiang90 closed this Jun 1, 2016