Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Win_x64. Error: sync data\raft\snapshots: The handle is invalid. And 30k snapshots in folder(. #3409

Closed
Lexus-3141 opened this issue Aug 22, 2017 · 11 comments
Labels
type/bug Feature does not function as expected
Milestone

Comments

@Lexus-3141
Copy link

Lexus-3141 commented Aug 22, 2017

consul version for both Client and Server

Client: 0.9.2 upd from 0.8x
Server: 0.9.2 upd from 0.8x

consul info for both Client and Server

Server: ACL enbled, but allow all.

Operating system and Environment details

Win server 2012r2
Consul started as service by nssm demon. Account - SYSTEM for both processes.

Description of the Issue (and unexpected/desired result)

After updating consul to 0.9.2 and fixing troubles with acl, file system trigger was discovered.
folder 'Consul\data\raft\snapshots' ate 40gb.

Reproduction steps

unknown...( At the moment it's stable. Begin at some times after restarting

Log Fragments (TRACE level):

    2017/08/22 13:50:14 [INFO] raft: Starting snapshot up to 75495602
    2017/08/22 13:50:14 [INFO] snapshot: Creating new snapshot at data\raft\snapshots\71-75495602-1503399014216.tmp
    2017/08/22 13:50:14 [ERR] snapshot: Failed syncing parent directory data\raft\snapshots, error: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:14 [ERR] raft: Failed to take snapshot: failed to close snapshot: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:19 [INFO] consul.fsm: snapshot created in 0s
    2017/08/22 13:50:19 [INFO] raft: Starting snapshot up to 75495650
    2017/08/22 13:50:19 [INFO] snapshot: Creating new snapshot at data\raft\snapshots\71-75495650-1503399019803.tmp
    2017/08/22 13:50:19 [ERR] snapshot: Failed syncing parent directory data\raft\snapshots, error: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:19 [ERR] raft: Failed to take snapshot: failed to close snapshot: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:28 [INFO] consul.fsm: snapshot created in 0s
    2017/08/22 13:50:28 [INFO] raft: Starting snapshot up to 75495748
    2017/08/22 13:50:28 [INFO] snapshot: Creating new snapshot at data\raft\snapshots\71-75495748-1503399028720.tmp
    2017/08/22 13:50:28 [ERR] snapshot: Failed syncing parent directory data\raft\snapshots, error: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:28 [ERR] raft: Failed to take snapshot: failed to close snapshot: sync data\raft\snapshots: The handle is invalid.
@preetapan
Copy link
Contributor

@Lexus-3141 Thanks for the report, this looks related to hashicorp/raft#232

Saving snapshots now calls the "sync" syscall to ensure that data is actually persisted to disk correctly. Sync's implementation is different for Windows vs *nix. For windows, this calls FlushFileBuffers. As far as I can tell this the right implementation. Do you see this autorecover after it emits the above error after restarting, or is it never able to save a snapshot successfully?

@Lexus-3141
Copy link
Author

Yes, autorecovery works successfully. Log after restart:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.9.2'
           Node ID: '41335b46-fbb3-8555-ea7f-719d12246f1b'
         Node name: 'wn-hv1'
        Datacenter: 'wz'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 192.168.12.93 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/08/22 16:33:52 [INFO] raft: Restored from snapshot 71-75581359-1503408826419
    2017/08/22 16:33:52 [INFO] raft: Initial configuration (index=75347702): [{Suffrage:Voter ID:192.168.12.4:8300 Address:192.168.12.4:8300} {Suffrage:Voter ID:192.168.12.1:8300 Address:192.168.12.1:8300} {Suffrage:Voter ID:192.168.12.2:8300 Address:192.168.12.2:8300} {Suffrage:Voter ID:192.168.12.3:8300 Address:192.168.12.3:8300} {Suffrage:Voter ID:192.168.13.212:8300 Address:192.168.13.212:8300} {Suffrage:Voter ID:192.168.12.93:8300 Address:192.168.12.93:8300} {Suffrage:Voter ID:192.168.12.94:8300 Address:192.168.12.94:8300} {Suffrage:Voter ID:192.168.13.226:8300 Address:192.168.13.226:8300}]
    2017/08/22 16:33:52 [INFO] raft: Node at 192.168.12.93:8300 [Follower] entering Follower state (Leader: "")

But i'm not sure about '(Leader: "")', does it a correct way of leader discovery?

@preetapan
Copy link
Contributor

Yes the Leader:"" log message is expected to be seen when it starts up as a follower.

@preetapan
Copy link
Contributor

@Lexus-3141 according to the documentation for FlushFileBuffers:

To flush all open files on a volume, call FlushFileBuffers with a handle to the volume. The caller must have administrative privileges. For more information, see Running with Special Privileges.

The code changes for hashicorp/raft#232 perform a fsync on the parent directory, which is failing according to the logs you added above. This could be because the consul agent is not running with admin privileges, can you double check that?

@Lexus-3141
Copy link
Author

It's do it. And it has an elevated token.
image

Once I found in windows 'System' principal has less rights then 'Administrator'. But I think it's enough for interact to file system.

@preetapan
Copy link
Contributor

@Lexus-3141 can you see if the error goes away with Administrator instead of System as the principal? Asking because while having write permissions allows you to fsync and saving the snapshot file, looks like fsyncing the snapshot directory requires administrator privileges according to that doc page. This is a difference in behavior between *nix and Windows. On *nix systems, write permissions are sufficient to do fsyncs.

@Lexus-3141
Copy link
Author

I got same result again.

@Lexus-3141
Copy link
Author

I try to auditing file system access and found nothing. All operation has success result.
Looks like you do it wrong. ( WinAPI is very dense.

@Lexus-3141
Copy link
Author

And at local administrator too.

@slackpad
Copy link
Contributor

I was able to get this to happen in a Windows 10 x64 VM, so it doesn't appear to be specific to the Windows Server 2012r2 version reported here. We might need to do some kind of alternate thing for Windows for the directory sync.

@slackpad slackpad added this to the 0.9.2 milestone Aug 22, 2017
@slackpad slackpad added the type/bug Feature does not function as expected label Aug 22, 2017
preetapan pushed a commit that referenced this issue Aug 24, 2017
Update raft library for windows snapshot fsync fixes. 

This fixes #3409
@s-vitaliy
Copy link

Hello, I have a similar issue with nomad (windows server 2016, nomad version 0.6.2). Is this issue fixed in nomad as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

No branches or pull requests

4 participants