Upgrade boltdb dependency to fix consul not starting on windows 2012 R2 #2203

Closed
FrankHassanabad opened this issue Jul 21, 2016 · 9 comments · Fixed by #2211

@FrankHassanabad

FrankHassanabad commented Jul 21, 2016

consul version for both Client and Server

0.6.4
0.6.4

consul info for both Client and Server

Client:

Can't give it since I can't start consul on my server

Server:

Can't give it since I can't start consul on my server

Operating system and Environment details

Windows 2012 R2

Description of the Issue (and unexpected/desired result)

On Windows 2012 R2 we consistently see this error in the log files after a varying period of time:

2016/07/21 12:18:16 [ERR] raft: Failed to commit logs: file resize error: truncate C:\consul\data\raft\raft.db: The requested operation cannot be performed on a file with a user-mapped section open.

The impact is that consul will no longer start until you log in manually and delete your raft.db file.

The root cause seems to be in boltdb, and they fixed it in v1.2.1:
boltdb/bolt#504

Consul is using BoltDB v1.2.0, which does not have the fix. Updating the Godeps entry here will also upgrade BoltDB and should solve this issue:
https://github.com/hashicorp/consul/blob/master/Godeps/Godeps.json#L37

Reproduction steps

Start Consul on any Windows system and begin pushing K/V data into it. Periodically reboot the system and eventually you will get the above BoltDB error. See the linked BoltDB ticket for more detailed reproduction steps.
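
For anyone who wants to try the "push K/V data" step, here is a minimal sketch in Go. It is not the test used for this report. It assumes a local agent listening on Consul's default HTTP address (127.0.0.1:8500) and writes through the KV endpoint (PUT /v1/kv/<key>). The key prefix soak/, the key count, the value size, and the helper names kvURL and putKeys are all hypothetical choices made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// kvURL builds the Consul KV endpoint URL for a key (PUT /v1/kv/<key>).
func kvURL(base, key string) string {
	return strings.TrimRight(base, "/") + "/v1/kv/" + strings.TrimLeft(key, "/")
}

// putKeys writes n keys of valueSize bytes each under the hypothetical
// prefix "soak/". It returns how many writes got HTTP 200 and how many got
// any other status. It stops at the first transport error, for example
// when the agent is not running.
func putKeys(client *http.Client, base string, n, valueSize int) (int, int, error) {
	ok, ko := 0, 0
	payload := bytes.Repeat([]byte("x"), valueSize)
	for i := 0; i < n; i++ {
		url := kvURL(base, fmt.Sprintf("soak/key-%d", i))
		req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload))
		if err != nil {
			return ok, ko, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return ok, ko, err
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		} else {
			ko++
		}
	}
	return ok, ko, nil
}

func main() {
	// Assumed address: Consul's default HTTP port on the local machine.
	client := &http.Client{Timeout: 5 * time.Second}
	ok, ko, err := putKeys(client, "http://127.0.0.1:8500", 1000, 4096)
	fmt.Printf("OK=%d KO=%d err=%v\n", ok, ko, err)
}
```

Run it in a loop (or raise the count) while watching the agent log for the raft "file resize error" line above. How long it takes to trigger, if at all, will vary by machine.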

@slackpad slackpad added the type/bug Feature does not function as expected label Jul 21, 2016
@slackpad slackpad self-assigned this Jul 21, 2016
@FrankHassanabad
Author

FrankHassanabad commented Jul 25, 2016

Hey @slackpad, I added a very simple soak test that reproduces this reliably here:
https://github.com/FrankHassanabad/consul-e2e-tests-crashes

I uploaded my logs for when it occurs.

Also, I cloned consul, and from the v0.6.4 tag, I did only an upgrade of boltdb. Then when I reran the soak test again (overnight), it looked like the resize locking error did not show up again.

You don't need to restart windows or any other funny business. Just soak test for a while and the error shows up eventually.

slackpad pushed a commit that referenced this issue Jul 25, 2016
This fixes #2203 which was a consistency problem on Windows.
@slackpad
Contributor

Hi @FrankHassanabad thanks for the detailed report and the soak test! Do you mind running again with master just as a quick cross-check? Thanks.

@FrankHassanabad
Author

FrankHassanabad commented Jul 26, 2016

Yea, as a matter of fact @Tzinov15 is setting up a gatling stress test here right now:
https://github.com/Tzinov15/GatlingConsul

Using Gatling he was able to stress test and reduce the time to failure from hours to minutes. Pretty cool stuff.

As soon as that's up, we will make a windows build of consul from master and give it a whirl.

@erin-noe-payne

@slackpad Thank you for the quick response! Quick question regarding release:

This issue represents a significant bug in our deployed software today. Do you guys have any intention of releasing a patch (0.6.5) or an ETA on the next minor release (0.7.0)? If not, we will need to produce and deploy an in-house build with the bolt db upgrade. Thanks!

@slackpad
Contributor

Hi @autoric we are working on getting a 0.7.0 release candidate out over the next few weeks so this likely won't go out in a patch. If you can build locally that's probably the best option in the very near term.

@erin-noe-payne

Good to know, thank you!

@FrankHassanabad
Author

FrankHassanabad commented Jul 27, 2016

Hey @slackpad, I code reviewed what you checked in, then did a build of master on Windows 2012 R2 using the TDM-GCC toolchain and ran Gatling against it with the boltdb upgrade. raft.db was able to expand beyond 65 MB to 256 MB without any issues.

Test run was from @Tzinov15 here:
SlamKeyValue

Here are the metrics from SlamKeyValue if you're curious (all times are in ms; OK means the REST response was 200, KO means a non-200 response was returned, and a - means there were no KOs):

---- Global Information --------------------------------------------------------
> request count                                     100000 (OK=100000 KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    242 (OK=242    KO=-     )
> mean response time                                    50 (OK=50     KO=-     )
> std deviation                                         32 (OK=32     KO=-     )
> response time 50th percentile                         47 (OK=47     KO=-     )
> response time 75th percentile                         57 (OK=56     KO=-     )
> response time 95th percentile                         83 (OK=83     KO=-     )
> response time 99th percentile                        197 (OK=197    KO=-     )
> mean requests/sec                               7142.857 (OK=7142.857 KO=-   )

The other 2 stress tests look to be passing as well.

The only artifact I noticed from upgrading boltdb is that raft.db.lock is a new file which shows up when boltdb opens the memory map. Once it releases the memory map, the raft.db.lock file goes away.

@slackpad
Contributor

@FrankHassanabad thank you for the follow up and I super appreciate the extra stress testing!

@FrankHassanabad
Author

For those who are willing to trust my exe files ;-) and need this fix, here is my release of v0.6.4:
https://github.com/FrankHassanabad/consul/releases/tag/v0.6.4-boltdb-upgrade

When you run consul.exe version you should see:

Consul v0.6.4-1-g7e58d58
Consul Protocol: 3 (Understands back to: 1)
