Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "Access is Denied" error on windows (with lock file) #4725

Closed
wants to merge 6 commits into from

Conversation

djs55
Copy link
Contributor

@djs55 djs55 commented Dec 18, 2023

- What I did

Fix a bug where, if one process is reading a meta.json on Windows, another CreateOrUpdate will fail with Access is Denied.

Example error:

rename C:\Users\docker\.docker\contexts\meta\<id>\.tmp-meta.json2945721760 C:\Users\docker\.docker\contexts\meta\<id>\meta.json: Access is denied.

- How I did it

Previously the code attempts to use moby's ioutils.AtomicWriteFile (called from here) to update meta.json but this fails on Windows calling os.Rename if another process has the meta.json open for reading (unlike on Unix OSes).

Avoid this problem by serialising accesses to the context filesystem db using a lock file via github.com/gofrs/flock which uses LockFileEx on Windows and syscall.Flock on Unix.

The lock is associated with a file descriptor / handle and released by calling Close() or when the process exits. The Microsoft docs have:

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system.

and man 2 flock has:

Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.

Note the Unlock() function returns an error which means that functions which could previously fail for one reason can now fail for two (the original + a lock/unlock failure). Use github.com/hashicorp/go-multierror to return all the errors. When we upgrade to go 1.20 we can use errors.Join.

- How to verify it

I saw this in CI where one process was running a docker command which read a meta.json while another was trying to update the meta.json. I've just seen it in a Docker Desktop bug report this morning.

- Description for the changelog

  • Fix race on Windows which could cause concurrent context access to fail with a transient Access is Denied.

- A picture of a cute animal (not mandatory but encouraged)

lots of cute animals

@codecov-commenter
Copy link

codecov-commenter commented Dec 18, 2023

Codecov Report

Merging #4725 (ab429fe) into master (70d01b9) will decrease coverage by 0.12%.
Report is 10 commits behind head on master.
The diff coverage is 29.76%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4725      +/-   ##
==========================================
- Coverage   59.66%   59.55%   -0.12%     
==========================================
  Files         287      287              
  Lines       24760    24831      +71     
==========================================
+ Hits        14774    14788      +14     
- Misses       9100     9138      +38     
- Partials      886      905      +19     

```
make dev
make vendor
```

Signed-off-by: David Scott <dave@recoil.org>
Signed-off-by: David Scott <dave@recoil.org>
Previously the code attempts to use ioutils.AtomicWriteFile to update
meta.json but this fails on Windows if another process has the meta.json
open for reading (unlike on Unix OSes).

Example error:
```
rename C:\Users\docker\.docker\contexts\meta\<id>\.tmp-meta.json2945721760 C:\Users\docker\.docker\contexts\meta\<id>\meta.json: Access is denied.
```

So if one `docker context inspect` is running, then another `docker context`
command can fail transiently in `CreateOrUpdate`.

Avoid this problem by serialising accesses to the context filesystem db
using a lock file via github.com/gofrs/flock which uses LockFileEx
on Windows and syscall.Flock on Unix. The lock is associated with a
file descriptor / handle and released by calling Close() or when the
process exits.

Note this means that functions which could previously fail for one reason
can now fail for two (the original + a lock/unlock failure). Use
go-multierror to return all the errors. When we upgrade to go 1.20 we can
use errors.Join.

Signed-off-by: David Scott <dave@recoil.org>
@thaJeztah
Copy link
Member

Previously the code attempts to use moby's ioutils.AtomicWriteFile (called from here) to update meta.json but this fails on Windows calling os.Rename if another process has the meta.json open for reading (unlike on Unix OSes).

Didn't look at the code yet, but do you think this is something we should have as part of that package? (i.e., would other uses of the package have the same issue?)

"path"
"path/filepath"
"regexp"
"strings"

"github.com/docker/docker/errdefs"
"github.com/gofrs/flock"
"github.com/hashicorp/go-multierror"
Copy link
Member

@thaJeztah thaJeztah Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use Go's native multi-errors for this? (errors.Join() (https://pkg.go.dev/errors#Join)), or would we loose too much functionality?

I think the general trend is to (try) moving away from hashicorp's packages as licensing for some of them becomes more risky (plus trying to use stdlib where possible in general)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can switch to errors.Join if you don't mind requiring Go 1.20. The vendor.mod requires Go 1.19 at the moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a few patches to switch to the Go 1.20 errors package to see how it looks.

Signed-off-by: David Scott <dave@recoil.org>
https://github.com/pkg/errors has been archived since Dec 2021

Replace
- errors.Errorf and errors.Wrap with fmt.Errorf("%w")
- multierror.Append with errors.Join

Signed-off-by: David Scott <dave@recoil.org>
@djs55
Copy link
Contributor Author

djs55 commented Dec 18, 2023

Previously the code attempts to use moby's ioutils.AtomicWriteFile (called from here) to update meta.json but this fails on Windows calling os.Rename if another process has the meta.json open for reading (unlike on Unix OSes).

Didn't look at the code yet, but do you think this is something we should have as part of that package? (i.e., would other uses of the package have the same issue?)

I think other users of the package on Windows will definitely have the same issue. A colleague (@fredericdalleau ) separately suggested a possibility which is to use a different file sharing mode when opening the file on Windows, to permit the rename on top. This is the approach we use in Docker Desktop's "grpcfuse" filesharing to make Windows file I/O more like Unix file I/O. I'll take a look at that possibility too.

@thaJeztah
Copy link
Member

A colleague (@fredericdalleau ) separately suggested a possibility which is to use a different file sharing mode when opening the file on Windows, to permit the rename on top.

For others; this touched on the FILE_SHARE_DELETE discussion (again); which was a long discussion with the Go maintainers where they disregarded the Windows kernel team recommendations, and went their own way, and suggested “Just fork the code in your project” 😞 golang/go#32088 golang/go#32088 (comment) (also golang/go#34681)

Which ... is what we ended up doing in the docker daemon 😞 😞 😞 😞 ; https://github.com/moby/moby/blob/199793350807908c04b56741b11e6b943158ac51/daemon/logger/loggerutils/file_windows.go#L25-L72

@djs55 djs55 changed the title Fix "Access is Denied" error on windows Fix "Access is Denied" error on windows (option 1: lock file) Dec 18, 2023
@djs55 djs55 changed the title Fix "Access is Denied" error on windows (option 1: lock file) Fix "Access is Denied" error on windows (with lock file) Dec 18, 2023
@djs55
Copy link
Contributor Author

djs55 commented Dec 18, 2023

Thanks for the FILE_SHARE_DELETE links. To my surprise it doesn't seem to be enough for this case. In this branch I open the file with the delete flag, but os.Rename still fails. The only new thing I can do is delete the file and then write a new one, but this won't be atomic so any readers will see a transient "file not found" exception.

Have a look at this unit test in that other branch: djs55@81b67b2

I think the only way to have clients see the Unix behaviour on Windows is to either serialise calls with a lock(file) or handle the transient errors with a sleep/retry loop. In the current code clients will see transient "Access is Denied" if there is a concurrent read; if we use FILE_SHARE_DELETE + os.Unlink + os.Rename instead then clients will see transient "File not found".

@thaJeztah
Copy link
Member

IIUC, the general issue is (potentially silly comments ahead);

  • this problem occurred when concurrent processes manipulate the context storage
    • ☝️ this definitely is an issue in general; much of this code was written with the assumption that there's only a single, short-lived "cli" process, but it does not take concurrent instances into account
    • ☝️ adding to this is the unfortunate situation where CLI plugins may also manipulate (or read) contexts. In this case "both" the parent process (the "CLI") and a plugin (say "buildx") could try to manipulate (or read) these locations on their own
    • ☝️ for the above situation, we have some work to do, not just for that case, but also to prevent all plugins to have to re-implement the context-management logic; we've been contemplating the "main" CLI to act as a short-lived "daemon" (for the duration of the CLI), and expose a socket that plugins can connect to; this socket could provide an API (plugin asks CLI for context information, or to update/create/manipulate; same for things like config etc); Initial (very early/partial) building-blocks for this were added in @laurazard's PR for handling signals/termination; cli-plugins: terminate plugin when CLI exits #4599, but no further work has been done (yet).
  • I'm curious if this issue only happens when manipulating contexts, or also on concurrent read operations (two CLI instances running docker context ls in parallel)?
  • Having a "lock" file could definitely help here, but I also have some initial concerns;
    • Should the CLI be able to run on a read-only filesystem? (I somewhat expect that to be an option); in that case we may need handling to prevent "failing to write lock-file" from causing a failiure (we could fallback to old behavior, and just go ahead)
    • What's the risk of lock-files being left behind, and rendering the CLI broken?
    • As we now need to write a lock-file for any operation (including "show context", "list contexts"); what's the additional overhead? (Assuming read operations to be a lot more common than write)
  • Could we use the metadata-files themselves as "lock"? i.e. keep a lock on the file when starting a write-operation? (I guess this may be partially your comment on "retry")

@thaJeztah thaJeztah added this to the 25.0.0 milestone Dec 19, 2023
@thaJeztah thaJeztah modified the milestones: 25.0.0, 26.0.0 Jan 19, 2024
@vvoland vvoland modified the milestones: 26.0.0, 27.0.0 Mar 14, 2024
Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placeholder -1; we should be able to keep using the atomic writer, which we should change in Moby to make use of our FILE_SHARE_DELETE I/O helper.

@vvoland
Copy link
Collaborator

vvoland commented Jun 20, 2024

Let me close this one as stale. Feel free to reopen if we should still reconsider this one.

@vvoland vvoland closed this Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants