race condition during blocking query? #300

imkira · 2014-08-22T10:35:40Z

I am running a cluster but I am able to reproduce (using just a single node) what appears to be a race condition during a blocking query.

the setup

So, in that node I run a consul agent in client mode, and in the same node I run 2 processes:

process1
- last_index = 0
- call http://127.0.0.1:8500/v1/health/service/myservice?consistent=&index=<last_index>&wait=1ms
- if response.index != last_index { last_index = response.index; process(response) }
- sleep 1ms
- repeat
process2
- call /v1/agent/service/register
- repeat until the app receives a signal to quit
  - call /v1/agent/check/pass/
  - sleep 1000ms
- call /v1/agent/service/deregister/

First I run process1 and leave it running. I never stop it.
Then, I run process2. Then I kill it (gracefully by sending it a signal to quit). Then run it again. I repeat this process until I reproduce the bug which happens after a few runs (few < 10).

the bug

It happens on process1.
Sometimes index doesn't change between responses (yes this is expected) but the response's Checks structure changes. I am observing 2 changes that I get in the response's Checks, specifically the addition of a service check (with the same checkID I am calling on process2 and status="unknown") and the second change being status changing to "passing". I am logging all responses I get from the API so I was able to notice this. Even after these 2 changes, there are times where response.index does not change.

Is this intended? Is this some race condition?
Thank you in advance and sorry for the long problem report.

notes

Note1: If I remove "if response.index != last_index" all works perfectly (but since process(response) is a heavy task I don't want to do it).
Note2: I am using very tiny wait and sleep times in process1 because I am stress testing the system and checking for any inconsistencies.
Note3: Until I reproduce the bug, I kill process2 after up to 2 or 3 seconds have elapsed.
Note4: I never call two or more process2 simultaneously and I never skip service/register nor service/deregister APIs.
Note5: Does this has anything to do with UI - Number of passing checks is off. #251 ?

The text was updated successfully, but these errors were encountered:

imkira · 2014-08-22T10:59:29Z

I just noticed the bug is reproduced even more easily if I set longer wait times (like 5 seconds).

armon · 2014-08-22T18:30:29Z

This may be a dup of #279. There does seem to be some race with the index calculation it seems.

armon · 2014-08-22T18:54:43Z

Can you provide any scripts you use to trigger this?

armon · 2014-08-22T19:44:56Z

Fixed by 3330956!

imkira · 2014-08-25T06:50:35Z

Hey @armon thank you for the prompt support!
I got the latest source and ran consul but it doesn't seem to address the bug I am reporting.

Sorry for not putting it together as a test.go file, but here is a gist that reproduces the bug.

https://gist.github.com/imkira/f40e64d00583160c0c9f

As initially reported, I get two bug patterns:

len(Checks) between calls to health/service differs (0 vs 1, 1 vs 2, ...)
number of Checks matches but newly registered service status doesn't (unknown vs passing)

Let me know if you can fix this.
Thank you in advance.

armon · 2014-08-25T18:03:00Z

@imkira I am unable to reproduce the error with you test code running against master. Can you verify you re-built against master? Also a gist of the error output could be useful.

imkira · 2014-08-26T03:01:36Z

Yes, I am running the latest version of consul.

~GOPATH/src/github.com/hashicorp/consul git(master) $ git log | head
commit e08e66367c20e829afbcde1f0210d3fe82ece1a7
Author: Armon Dadgar <armon.dadgar@gmail.com>
Date:   Fri Aug 22 15:54:48 2014 -0700

    website: Document node lifecycle. Fixes #282.

commit e0a815772a08148a7b1e89fd183c00ea9b986f14
Author: Armon Dadgar <armon.dadgar@gmail.com>
Date:   Fri Aug 22 15:08:15 2014 -0700

~GOPATH/src/github.com/hashicorp/consul git(master) $ rm $GOPATH/bin/consul
~GOPATH/src/github.com/hashicorp/consul git(master) $ go install
~GOPATH/src/github.com/hashicorp/consul git(master) $ go build -o consul2

I built the app as shown above and I ran it with $GOPATH/bin/consul -data-dir /var/consul/tmp -ui-dir /var/consul/ui as well as with ./consul2 agent -data-dir /var/consul/tmp -ui-dir /var/consul/ui in order to eliminate any mistake of mine of running an old version of consul.
The bug still manifests itself and without a change in results.
There is one thing I forgot to say. Before I ran the the gist, I joined this consul with a remote consul running in server mode. So yes, the local agent is running in client mode.
Would that cause an inconsistency?

As for the output, I ran the gist program 10 times and posted the results as a comment to the gist

https://gist.github.com/imkira/f40e64d00583160c0c9f

Let me know of any other way I can help you track down this bug.

imkira · 2014-08-26T03:07:59Z

Oh my oh my! My bad!
The devil was in the details. The node I was joining to was using 0.3.1 so yes, it was not fixed!
When I ran the local agent in server mode and stopped joining, the bug you said did not manifest at all!
I believe this is fixed and that it is not just because the local agent is not forwarding the request to another node in the cluster, though I am eager to reconfirm these results again as soon as the next release gets out (for the peace of mind :). I will let you know again when I do it!
Thank you a lot and sorry for the confusion.

armon · 2014-08-26T22:36:55Z

No worries! Glad it is working.

armon closed this as completed Aug 22, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

race condition during blocking query? #300

race condition during blocking query? #300

imkira commented Aug 22, 2014

imkira commented Aug 22, 2014

armon commented Aug 22, 2014

armon commented Aug 22, 2014

armon commented Aug 22, 2014

imkira commented Aug 25, 2014

armon commented Aug 25, 2014

imkira commented Aug 26, 2014

imkira commented Aug 26, 2014

armon commented Aug 26, 2014

race condition during blocking query? #300

race condition during blocking query? #300

Comments

imkira commented Aug 22, 2014

the setup

the bug

notes

imkira commented Aug 22, 2014

armon commented Aug 22, 2014

armon commented Aug 22, 2014

armon commented Aug 22, 2014

imkira commented Aug 25, 2014

armon commented Aug 25, 2014

imkira commented Aug 26, 2014

imkira commented Aug 26, 2014

armon commented Aug 26, 2014