Upgrade boltdb dependency to fix consul not starting on windows 2012 R2 #2203
Comments
Hey @slackpad, I added a very reproducible test and a very simple soak test for this here: I uploaded my logs for when it occurs. Also, I cloned Consul and, starting from the v0.6.4 tag, upgraded only BoltDB. When I reran the soak test (overnight), the resize locking error did not show up again. You don't need to restart Windows or any other funny business; just soak test for a while and the error eventually shows up.
This fixes #2203, which was a consistency problem on Windows.
Hi @FrankHassanabad, thanks for the detailed report and the soak test! Do you mind running it again with master as a quick cross-check? Thanks.
Yeah, as a matter of fact @Tzinov15 is setting up a Gatling stress test here right now: Using Gatling, he was able to reduce the time to failure from hours to minutes. Pretty cool stuff. As soon as that's up, we will make a Windows build of Consul from master and give it a whirl.
@slackpad Thank you for the quick response! A quick question regarding the release: this issue represents a significant bug in our deployed software today. Do you guys intend to release a patch (0.6.5), or do you have an ETA for the next minor release (0.7.0)? If not, we will need to produce and deploy an in-house build with the BoltDB upgrade. Thanks!
Hi @autoric, we are working on getting a 0.7.0 release candidate out over the next few weeks, so this likely won't go out in a patch. If you can build locally, that's probably the best option in the very near term.
Good to know, thank you!
Hey @slackpad, I code-reviewed what you checked in and then did a build of Test run was from @Tzinov15 here: Here are the metrics from SlamKeyValue if you're curious (everything is in ms; OK means the REST response was 200, KO means a non-200 response was returned, and a
The other 2 stress tests look to be passing as well. The only artifact from upgrading BoltDB I noticed was that
@FrankHassanabad thank you for the follow-up, and I super appreciate the extra stress testing!
For those who are willing to trust my exe files ;-) and need this fix, here is my release of v0.6.4: When you run
consul version for both Client and Server
0.6.4
0.6.4
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
Windows 2012 R2
Description of the Issue (and unexpected/desired result)
On Windows 2012 R2 we consistently see this error in the log files after a varying period of time:
The impact is that Consul will no longer start until you manually log in and delete the raft.db file.
The root of the issue seems to be in BoltDB, which fixed it in v1.2.1: boltdb/bolt#504
Consul is using BoltDB v1.2.0, which does not have the fix. Updating the Godeps here https://github.com/hashicorp/consul/blob/master/Godeps/Godeps.json#L37 will upgrade BoltDB and should solve this issue.
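Before and after bumping the dependency, it can help to confirm which BoltDB release is actually pinned. The Go sketch below assumes Godeps.json keeps godep's usual layout (a top-level Deps array of ImportPath/Comment/Rev entries) and that the bolt entry's Comment field holds the release tag; it only checks the version and does not perform the upgrade. The sample JSON and its Rev value are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// godepsFile mirrors the subset of the Godeps.json layout used here
// (assumed: a "Deps" array with ImportPath, Comment and Rev per entry).
type godepsFile struct {
	Deps []struct {
		ImportPath string
		Comment    string
		Rev        string
	}
}

// parseVersion turns "v1.2.1" or a git-describe tag like "v1.2.1-3-gabcdef"
// into [major, minor, patch]; ok is false if the tag has another shape.
func parseVersion(tag string) (v [3]int, ok bool) {
	tag = strings.TrimPrefix(tag, "v")
	// Drop any "-N-g<sha>" suffix produced by git describe.
	if i := strings.IndexByte(tag, '-'); i >= 0 {
		tag = tag[:i]
	}
	parts := strings.Split(tag, ".")
	if len(parts) != 3 {
		return v, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, false
		}
		v[i] = n
	}
	return v, true
}

// boltHasFix reports whether the github.com/boltdb/bolt entry in a
// Godeps.json document is tagged v1.2.1 (the release with the fix) or later.
func boltHasFix(data []byte) (bool, error) {
	var g godepsFile
	if err := json.Unmarshal(data, &g); err != nil {
		return false, err
	}
	for _, d := range g.Deps {
		if d.ImportPath != "github.com/boltdb/bolt" {
			continue
		}
		v, ok := parseVersion(d.Comment)
		if !ok {
			return false, fmt.Errorf("unrecognized bolt version %q", d.Comment)
		}
		want := [3]int{1, 2, 1}
		// Compare major, then minor, then patch.
		for i := range v {
			if v[i] != want[i] {
				return v[i] > want[i], nil
			}
		}
		return true, nil
	}
	return false, fmt.Errorf("github.com/boltdb/bolt not found in Deps")
}

func main() {
	// Hypothetical sample entry; "placeholder" is not a real commit.
	sample := []byte(`{"Deps":[{"ImportPath":"github.com/boltdb/bolt","Comment":"v1.2.0","Rev":"placeholder"}]}`)
	ok, err := boltHasFix(sample)
	fmt.Println(ok, err)
}
```

Comparing the tag rather than the Rev hash keeps the check readable, but it relies on the Comment field being populated, which is an assumption about how the dependency was saved.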
Reproduction steps
Start Consul on (any) Windows system and begin pushing K/V data into it. Periodically reboot the system and eventually you will get the above BoltDB error. See the linked BoltDB ticket for more reproducible steps.