Figuring out our relationship with the filesystem #12
Experimentally, it looks like on Windows:
> Windows behaviour varies by version, so you should test on the oldest version you want to support, for drawing inferences about locking etc. behaviour.
I've pretty much been assuming that we don't need to target anything older
than evergreen Win 10, since everything older is already eol, and will be
even more so by the time we need to worry about long-tail users. If there's
some depressing reason that makes that wrong, please let me know :-)
Of course within Win 10 there's still a diversity of versions, but
hopefully as long as we avoid the most-cutting-edge stuff we should be
fine? And I'm not sure what else we can realistically do, since afaik
there's no way to get older win 10 releases for testing.
On Sun, Feb 12, 2023, Robert Collins wrote:

> Windows behaviour varies by version, so you should test on the oldest version you want to support, for drawing inferences about locking etc. behaviour.
Windows 10: yeah, sadly many enterprises won't be on evergreen Windows 10. There is a commonly documented oldest version to support; I suggest you don't support anything older than the Rust minimum version, [which is still Windows 7](https://doc.rust-lang.org/nightly/rustc/platform-support.html), but something newer than that, though perhaps not "latest Windows 10". I've had surprises where folk have a bug and the cause has been "they work at a company". Some more thoughts:
- File system atomicity is hard. We spent a lot of time on this in bzr.
- I think a useful thing to think about is your threat model. Cache poisoning is a classic way to make everyone's life hard. How will you deal with a hostile attacker who inserts malicious content at a known key in your cache?
- OS caching. Hah! Windows doesn't cache like you might think. Assume that every read will be actual IO, except for directory metadata. (This is down to the very different structure for IO: rather than the page cache owning IO, processes own IO. Then when the process is killed, the IO can all get unwound. This is why virus scanners that are scanning files written by a process can cause the process to suffer IO latency even when there is tonnes of memory etc. to buffer the files.) I [did a talk on this](https://www.youtube.com/watch?v=qbKGw8MQ0i8).
- Fsync: for …
Now, less doom and gloom, some suggestions:
You probably wouldn't want such a daemon to be a pseudo-init, though that's a possible version of the design.

(images) The free dev images are certainly just evergreen. Possibly an MSDN membership would get you older versions to test with. Guido or Steve Dower might be able to arrange something?
Currently we handle all our on-disk storage through the `KVFileStore` and `KVDirStore` abstractions. They're both basically key->value mappings, where the value is either a file or a directory, respectively. They have per-key locking, and try to implement atomic updates when possible.

But these aren't necessarily the best abstractions for what we need, because I wrote them before knowing how we were going to use them. And also, while I thought they should work on Windows, it turns out there were a few details I was missing (see #4) that make them pretty fragile, esp. in the presence of operations like filesystem indexing or AV scanning that can randomly open files. And they don't currently have any support for garbage-collecting old data.
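For concreteness, here's a toy sketch of the key->file mapping idea. The struct name and on-disk layout are illustrative, not posy's actual implementation; the real abstractions also do per-key locking and atomic updates, which this omits:

```rust
use std::fs;
use std::path::PathBuf;

/// Minimal key -> file store: one file per key under a root directory.
/// (Sketch only: no locking, no atomic replacement, no GC.)
struct KvFileStore {
    root: PathBuf,
}

impl KvFileStore {
    fn new(root: impl Into<PathBuf>) -> std::io::Result<Self> {
        let root = root.into();
        fs::create_dir_all(&root)?;
        Ok(KvFileStore { root })
    }

    fn path_for(&self, key: &str) -> PathBuf {
        // Assumes keys are filesystem-safe (e.g. hex-encoded hashes).
        self.root.join(key)
    }

    /// Store a value. A robust version would write to a temp file,
    /// fsync, and rename into place instead of writing directly.
    fn set(&self, key: &str, value: &[u8]) -> std::io::Result<()> {
        fs::write(self.path_for(key), value)
    }

    fn get(&self, key: &str) -> std::io::Result<Vec<u8>> {
        fs::read(self.path_for(key))
    }
}
```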
So here's a brain dump about what actual KV stores we've ended up with and what properties each one needs.
- `hash_cache`: maps artifact hash -> artifact (i.e. wheel/sdist/pybi)
  - needs locking in case multiple `posy` invocations are running simultaneously
- `metadata_cache`: maps artifact hash -> core METADATA for that artifact (useful to skip the dance required to pull it out of a remote zip file, and saves locally built metadata from sdists)
  - similar to `hash_cache`, except that we always slurp in the whole file in one shot
- `wheel_cache`: maps sdist hash -> directory of wheels that we've built from it
  - needs locking in case multiple `posy` invocations are running simultaneously
- `http_cache`: maps request info -> request metadata + previous response
  - e.g. for `somepkg @ https://.../somepkg.whl`. Must not have partial writes.
- `EnvForest`: maps wheel/pybi hash to unpacked tree, or sdist hash to a directory containing unpacked trees
- `build_store`: maps sdist hash -> build scratch space

https://stackoverflow.com/a/57358387/ has some very smart-sounding comments about what actually works on Windows, and one claim is that deleting a large file on NTFS can itself be non-atomic, and a badly timed crash could leave the file truncated instead. There's also the newer and undocumented (but public) `FILE_RENAME_FLAG_POSIX_SEMANTICS`, which... might do something useful? Might need to experiment to figure out what this thing actually does. [Edit: Turns out you need to read the kernel-level documentation. The answer is that it lets you overwrite a destination file that still has open handles. It doesn't, AFAIK, do anything to help with the case where the source file has open handles.]

I'm worried about data integrity; specifically, we currently trust that `hash_cache`/`metadata_cache`/`wheel_cache`/`http_cache`/`EnvForest` are normative, so if a truncated entry ended up there then everything could become wedged permanently until someone manually clears out the caches. That would suck. (On the other hand, lack of durability is fine -- if a value gets lost, or gets truncated but we can detect that it's truncated and discard it, then that's OK; these things can all be reconstructed if needed.)

Blob storage
For the ones that store blobs, we might just want to use a full-fledged transactional store, like sqlite or bdb. Trade-offs:
- Large values are awkward to keep in sqlite (`hash_cache`, `wheel_cache`). For `metadata_cache` and `http_cache`, sqlite should handle them fine, possibly as BLOBs.

The main alternative is to do something like we're doing, with one on-disk file per value, which has a few challenges.
For integrity, files require either `fsync` and then atomic `rename`, or else some sort of checksum verification so we can detect and discard corrupted values. Neither is super attractive... `fsync` can make writes pretty expensive, and on Windows atomic `rename` can be thwarted by open handles. (I guess you can sleep and retry?) Though Windows does have `CreateHardLink`, so you could at least link the file into place and then worry about deleting the original tmp file later opportunistically. (And this can even overwrite an existing file if you use `NtSetInformationFile` + `FILE_LINK_INFORMATION` + `FILE_LINK_REPLACE_IF_EXISTS` + `FILE_LINK_POSIX_SEMANTICS`.) Checksums make writes easy and fast, but then when you open the file again you have to read through the whole thing to validate the checksum, before you know whether you have a file at all. Fast checksums can be very fast (on my laptop even sha256 goes at ~1.6 GB/s according to `openssl speed`, and I assume crc32c would be even faster), but that's still an extra human-perceptible lag for multi-hundred-megabyte GPU wheels, and extra I/O. (Which might get hidden by caching, if the whole file fits in cache and the OS doesn't activate dropbehind logic for sequential scans and if we're going to read the whole file anyway, like we usually will for artifacts.)

I guess another option to avoid `rename` on Windows would be: write the file directly to its final name, `fsync`, and then put a marker file next to it to record that the main file is complete and trustworthy.

Or, we can combine them: write the file to disk, `fsync`, and then commit a record in sqlite saying that it's there and valid.

Finally, there's the question of garbage collection: for files, I think this is actually pretty easy? Unix and Windows do both support deleting a file while letting current readers continue (for Windows you need a magic `POSIX_SEMANTICS` flag but it's there).

Tentatively, it seems like we might want to use sqlite for `metadata_cache` and `http_cache`, and files for `hash_cache` and `wheel_cache`.

Directories
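A minimal sketch of the fsync-then-rename approach, for reference (the `.partial` tmp-file naming scheme is assumed; on Windows the final rename can still be thwarted by open handles, as discussed above, and a fully durable version would also fsync the containing directory on Unix):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::{Path, PathBuf};

/// Write `data` to `dest` as atomically as the platform allows:
/// write to a temp file in the same directory, fsync it, then
/// rename over the target. Readers of `dest` never see a partial
/// write; a crash leaves either the old file or the new one.
fn atomic_write(dest: &Path, data: &[u8]) -> std::io::Result<()> {
    // Temp file must live in the same directory so the rename
    // can't cross a filesystem boundary.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".partial"); // assumed naming convention
    let tmp = PathBuf::from(tmp_name);
    {
        let mut f = File::create(&tmp)?;
        f.write_all(data)?;
        f.sync_all()?; // fsync: flush data before the rename commits it
    }
    // Atomic on POSIX; on Windows, std uses MOVEFILE_REPLACE_EXISTING,
    // but the rename can fail if `dest` has open handles.
    fs::rename(&tmp, dest)
}
```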
`build_store` isn't a big deal, because usage is restricted to a single process. We can just create directories and write to them whenever we like. We might specifically want to avoid renaming the directory into place, to avoid Windows issues. If we add multi-threading in the future then we'll probably want some kind of locking. But that's about it.

`EnvForest` OTOH is... idk, maybe intractable, in two different ways:

When unpacking wheels/pybis into it, the only way to guarantee integrity is to `fsync` every file and directory. Ugh! (Well, I guess on Unix we could also call `sync(2)`, but that's gross too.) ...I guess an alternate Overly Clever solution would be to do direct I/O and regain the performance by aggressively dispatching the IO through io_uring/IOCP/threads. Maybe if you use threads to aggressively call `fsync` on lots of files in parallel it ends up being not so bad? Are modern FS's all clever enough to batch the journal transactions? idk. Maybe we should just give up on integrity here.

For GC'ing stuff: we have no reliable way to know whether an entry is in use. I guess when `python` starts up we could have it take a read-lock on all the entries it's using? But an entry can still be in use even if no `python` process is running, because it could be lurking in an environment variable in a non-`python` process (e.g. a shell), and then later it spawns a `python` that will expect all the entries to be there. I guess in this case the python process could at least detect at startup if anything is missing, and fail noisily? Or if we want to be ridiculously clever, it could invoke `posy` to fill the cache again before continuing...
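The detect-at-startup-and-fail-noisily idea could look something like this (a sketch; it assumes an `EnvForest` layout with one directory per entry hash, and the function name is hypothetical):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical startup check: given the env-forest root and the
/// entry hashes this environment expects, return the entries whose
/// directories are missing (e.g. because GC removed them), so the
/// caller can fail noisily or re-invoke posy to repopulate.
fn missing_entries(root: &Path, expected: &[&str]) -> Vec<PathBuf> {
    expected
        .iter()
        .map(|hash| root.join(hash)) // one directory per entry hash
        .filter(|path| !path.is_dir())
        .collect()
}
```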