Skip to content

Win_x64. Error: sync data\raft\snapshots: The handle is invalid. And 30k snapshots in folder(. #3409

Closed
@Lexus-3141

Description

@Lexus-3141

consul version for both Client and Server

Client: 0.9.2 upd from 0.8x
Server: 0.9.2 upd from 0.8x

consul info for both Client and Server

Server: ACL enbled, but allow all.

Operating system and Environment details

Win server 2012r2
Consul started as service by nssm demon. Account - SYSTEM for both processes.

Description of the Issue (and unexpected/desired result)

After updating consul to 0.9.2 and fixing troubles with acl, file system trigger was discovered.
folder 'Consul\data\raft\snapshots' ate 40gb.

Reproduction steps

unknown...( At the moment it's stable. Begin at some times after restarting

Log Fragments (TRACE level):

    2017/08/22 13:50:14 [INFO] raft: Starting snapshot up to 75495602
    2017/08/22 13:50:14 [INFO] snapshot: Creating new snapshot at data\raft\snapshots\71-75495602-1503399014216.tmp
    2017/08/22 13:50:14 [ERR] snapshot: Failed syncing parent directory data\raft\snapshots, error: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:14 [ERR] raft: Failed to take snapshot: failed to close snapshot: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:19 [INFO] consul.fsm: snapshot created in 0s
    2017/08/22 13:50:19 [INFO] raft: Starting snapshot up to 75495650
    2017/08/22 13:50:19 [INFO] snapshot: Creating new snapshot at data\raft\snapshots\71-75495650-1503399019803.tmp
    2017/08/22 13:50:19 [ERR] snapshot: Failed syncing parent directory data\raft\snapshots, error: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:19 [ERR] raft: Failed to take snapshot: failed to close snapshot: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:28 [INFO] consul.fsm: snapshot created in 0s
    2017/08/22 13:50:28 [INFO] raft: Starting snapshot up to 75495748
    2017/08/22 13:50:28 [INFO] snapshot: Creating new snapshot at data\raft\snapshots\71-75495748-1503399028720.tmp
    2017/08/22 13:50:28 [ERR] snapshot: Failed syncing parent directory data\raft\snapshots, error: sync data\raft\snapshots: The handle is invalid.
    2017/08/22 13:50:28 [ERR] raft: Failed to take snapshot: failed to close snapshot: sync data\raft\snapshots: The handle is invalid.

Activity

preetapan

preetapan commented on Aug 22, 2017

@preetapan
Contributor

@Lexus-3141 Thanks for the report, this looks related to hashicorp/raft#232

Saving snapshots now calls the "sync" syscall to ensure that data is actually persisted to disk correctly. Sync's implementation is different for Windows vs *nix. For windows, this calls FlushFileBuffers. As far as I can tell this the right implementation. Do you see this autorecover after it emits the above error after restarting, or is it never able to save a snapshot successfully?

Lexus-3141

Lexus-3141 commented on Aug 22, 2017

@Lexus-3141
Author

Yes, autorecovery works successfully. Log after restart:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.9.2'
           Node ID: '41335b46-fbb3-8555-ea7f-719d12246f1b'
         Node name: 'wn-hv1'
        Datacenter: 'wz'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 192.168.12.93 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/08/22 16:33:52 [INFO] raft: Restored from snapshot 71-75581359-1503408826419
    2017/08/22 16:33:52 [INFO] raft: Initial configuration (index=75347702): [{Suffrage:Voter ID:192.168.12.4:8300 Address:192.168.12.4:8300} {Suffrage:Voter ID:192.168.12.1:8300 Address:192.168.12.1:8300} {Suffrage:Voter ID:192.168.12.2:8300 Address:192.168.12.2:8300} {Suffrage:Voter ID:192.168.12.3:8300 Address:192.168.12.3:8300} {Suffrage:Voter ID:192.168.13.212:8300 Address:192.168.13.212:8300} {Suffrage:Voter ID:192.168.12.93:8300 Address:192.168.12.93:8300} {Suffrage:Voter ID:192.168.12.94:8300 Address:192.168.12.94:8300} {Suffrage:Voter ID:192.168.13.226:8300 Address:192.168.13.226:8300}]
    2017/08/22 16:33:52 [INFO] raft: Node at 192.168.12.93:8300 [Follower] entering Follower state (Leader: "")

But i'm not sure about '(Leader: "")', does it a correct way of leader discovery?

preetapan

preetapan commented on Aug 22, 2017

@preetapan
Contributor

Yes the Leader:"" log message is expected to be seen when it starts up as a follower.

preetapan

preetapan commented on Aug 22, 2017

@preetapan
Contributor

@Lexus-3141 according to the documentation for FlushFileBuffers:

To flush all open files on a volume, call FlushFileBuffers with a handle to the volume. The caller must have administrative privileges. For more information, see Running with Special Privileges.

The code changes for hashicorp/raft#232 perform a fsync on the parent directory, which is failing according to the logs you added above. This could be because the consul agent is not running with admin privileges, can you double check that?

Lexus-3141

Lexus-3141 commented on Aug 22, 2017

@Lexus-3141
Author

It's do it. And it has an elevated token.
image

Once I found in windows 'System' principal has less rights then 'Administrator'. But I think it's enough for interact to file system.

preetapan

preetapan commented on Aug 22, 2017

@preetapan
Contributor

@Lexus-3141 can you see if the error goes away with Administrator instead of System as the principal? Asking because while having write permissions allows you to fsync and saving the snapshot file, looks like fsyncing the snapshot directory requires administrator privileges according to that doc page. This is a difference in behavior between *nix and Windows. On *nix systems, write permissions are sufficient to do fsyncs.

Lexus-3141

Lexus-3141 commented on Aug 22, 2017

@Lexus-3141
Author

I got same result again.

Lexus-3141

Lexus-3141 commented on Aug 22, 2017

@Lexus-3141
Author

I try to auditing file system access and found nothing. All operation has success result.
Looks like you do it wrong. ( WinAPI is very dense.

Lexus-3141

Lexus-3141 commented on Aug 22, 2017

@Lexus-3141
Author

And at local administrator too.

slackpad

slackpad commented on Aug 22, 2017

@slackpad
Contributor

I was able to get this to happen in a Windows 10 x64 VM, so it doesn't appear to be specific to the Windows Server 2012r2 version reported here. We might need to do some kind of alternate thing for Windows for the directory sync.

added this to the 0.9.2 milestone on Aug 22, 2017
added a commit that references this issue on Aug 24, 2017
30fd0d2
added a commit that references this issue on Aug 24, 2017

Merge pull request #3416 from hashicorp/issue_3409

bf54195
s-vitaliy

s-vitaliy commented on Feb 15, 2018

@s-vitaliy

Hello, I have a similar issue with nomad (windows server 2016, nomad version 0.6.2). Is this issue fixed in nomad as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/bugFeature does not function as expected

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @preetapan@slackpad@s-vitaliy@Lexus-3141

        Issue actions

          Win_x64. Error: sync data\raft\snapshots: The handle is invalid. And 30k snapshots in folder(. · Issue #3409 · hashicorp/consul