Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul Snapshot error with serf and raft (windows) #332

Closed
failattu opened this issue Sep 10, 2014 · 5 comments
Closed

Consul Snapshot error with serf and raft (windows) #332

failattu opened this issue Sep 10, 2014 · 5 comments
Assignees
Labels
type/bug Feature does not function as expected

Comments

@failattu
Copy link

Hello,

We are currently running consul on windows server 2008 r2, windows server 2012, windows server 2012 r2 and windows 8.1 and the non crashing errors seem to happen with all the computers. We are currently running Consul 0.4.0 and the panic attack came with previous version. However the other problems have been duplicated with Consul 0.4.0

So there's actually two errors with raft and one error with serf.

Raft creates a panic error when it hits this (or at least consul restarts and tries again this has happened multiple times):

2014/09/05 03:36:35 [ERR] raft: Failed to get log at 802841: log not found

(I have lots of error log if need be ask it's sadly not time stamped)

Then we have two other errors with serf and raft. Serf error when it happens starts immediately and keeps returning the following:

2014/09/05 03:26:01 [ERR] serf: Failed to update snapshot: failed to install new snapshot: rename C:\Tools\Consul\consul.d\serf\local.snapshot.compact C:\Tools\Consul\consul.d\serf\local.snapshot: The process cannot access the file because it is being used by another process.

We did research and the file is used like the following:

Output of Process Monitor with Include Filter 'begins with' : "C:\Tools\Consul\consul.d\serf\local.snapshot"
8:08:08.4575976 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot SUCCESS Offset: -1, Length: 23
8:08:08.4577681 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact NOT A DIRECTORY Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:08.4578792 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Desired Access: Generic Read/Write, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
8:08:08.4582376 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Offset: 0, Length: 1,113, Priority: Normal
8:08:08.4583731 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SHARING VIOLATION Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:08.4583951 AM consul.exe 3708 CloseFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS
8:08:15.4239806 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot SUCCESS Offset: -1, Length: 25
8:08:15.4241110 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact NOT A DIRECTORY Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:15.4242185 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Desired Access: Generic Read/Write, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
8:08:15.4244814 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Offset: 0, Length: 1,072, Priority: Normal
8:08:15.4246155 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SHARING VIOLATION Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:15.4246365 AM consul.exe 3708 CloseFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS
8:08:26.2424648 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot SUCCESS Offset: -1, Length: 23
8:08:26.2425880 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact NOT A DIRECTORY Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:26.2426949 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Desired Access: Generic Read/Write, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
8:08:26.2430108 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Offset: 0, Length: 1,033, Priority: Normal
8:08:26.2431488 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SHARING VIOLATION Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:26.2431698 AM consul.exe 3708 CloseFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS
8:08:27.5246221 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot SUCCESS Offset: -1, Length: 39
8:08:27.5247663 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact NOT A DIRECTORY Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:27.5248715 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Desired Access: Generic Read/Write, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
8:08:27.5251922 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Offset: 0, Length: 1,072, Priority: Normal
8:08:27.5253253 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SHARING VIOLATION Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:27.5253467 AM consul.exe 3708 CloseFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS
8:08:33.3256316 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot SUCCESS Offset: -1, Length: 41
8:08:33.3257716 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact NOT A DIRECTORY Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:33.3258822 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Desired Access: Generic Read/Write, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
8:08:33.3260084 AM consul.exe 3708 WriteFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS Offset: 0, Length: 1,113, Priority: Normal
8:08:33.3261463 AM consul.exe 3708 CreateFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SHARING VIOLATION Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a
8:08:33.3261689 AM consul.exe 3708 CloseFile C:\Tools\Consul\consul.d\serf\local.snapshot.compact SUCCESS

And then we have the raft error number two which starts appearing after a while:

2014/08/12 04:38:04 [ERR] snapshot: Failed to make snapshot directory: mkdir C:\Tools\Consul\consul.d\raft\snapshots\11-8195-2014-08-12T04:38:04.5605161Z.tmp: The filename, directory name, or volume label syntax is incorrect.
2014/08/12 04:38:04 [ERR] raft: Failed to take snapshot: failed to create snapshot: mkdir C:\Tools\Consul\consul.d\raft\snapshots\11-8195-2014-08-12T04:38:04.5605161Z.tmp: The filename, directory name, or volume label syntax is incorrect.

I would assume the last one comes from the fact that you are using : as delimiters.

If there's something I did not attach that you need please ask I have quite a lot more log files.

EDIT: added Consul version.

@failattu
Copy link
Author

Might have found a fix for the file name problem at least in raft.. Testing atm.
Row 119 in file_snapshot.go in raft
Changed the formating to following:
func snapshotName(term, index uint64) string {
return fmt.Sprintf("%d-%d-%s", term, index, time.Now().Format("2006-01-02-15-04-05-0700"))
}
So far at least once worked. Will update after longer testing. (My format might be bit wrong first time writing Go.)
Cheers!

EDIT: With this fix I started to get the following errors:

2014/09/12 16:22:04 [WARN] snapshot: Found temporary snapshot: 102747-19558-2014-09-12-16-20-42+0200.tmp
2014/09/12 16:22:04 [INFO] snapshot: reaping snapshot C:\shared\consul.d\raft\snapshots\102747-19558-2014-09-12-16-21-02+0200
2014/09/12 16:21:02 [ERR] raft: Failed to restore snapshot: failed to restore snapshot 102747-19558-2014-09-12-16-21-02+0200: Exec format error

The sending computer also says it couldn't restore snapshot. Will try to figure this out. Just for heads up.

@cosborne83
Copy link

The secondary problem you're experiencing is due to the fact that when restoring the database, it attempts to map a second 512MB memory block for the in-memory database, and there appears to be a low likelihood (zero?) that a contiguous free memory region of sufficient size is available within the process' address space.

Adjusting the dbMaxMapSize32bit constant in state_store.go to 256MB seems to resolve the issue.

@armon
Copy link
Member

armon commented Sep 30, 2014

Awesome thanks for hunting this down! Definitely 32bit presents some challenges with the mmap()'s. I think we can probably safely tune down the size of both the state store and raft log on 32bit systems.

@archgl
Copy link

archgl commented Oct 2, 2014

With dbMaxMapSize32bit set to 256MB the tests still fail. They pass setting it to 128 MB.
A version with the snapshotName fix and the 128MB has been running with out problems
for a couple of days on our servers.

On the serf snapshot compaction issue "The process cannot access the file because it is being used by another process" unlike posix systems windows doesn't allow to rename open files. See:
https://codereview.appspot.com/8203043/
http://msdn.microsoft.com/en-us/library/aa363858%28v=vs.85%29.aspx

I tried setting the FILE_SHARE_DELETE as suggested below but still kept getting the error:
http://stackoverflow.com/questions/23897174/is-it-possible-to-open-a-file-allowing-another-processes-to-delete-this-file

For now we implemented a workaround of the rename based on a step by step close/rename/delete/reopen process.

@armon armon added the type/bug Feature does not function as expected label Oct 14, 2014
@armon armon self-assigned this Oct 15, 2014
@armon
Copy link
Member

armon commented Oct 15, 2014

@armon armon closed this as completed Oct 15, 2014
duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
* ServiceRouter support

* controllers => controller for logger name
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

No branches or pull requests

4 participants