Deterministic/Reproducible Directory Listing #12
Casefold?

I don't think a case-folded comparison is necessary for deterministic sort order, as only ancient filesystems do not preserve case. It theoretically reduces entropy in some silly edge cases, but it is probably unhelpful in practice. I would only add a casefolding comparison level as an option if someone from a reproducible-build community says it would be helpful, or if we have to store the NFKC_Casefolded form anyway.

NFKC_Casefold

Edit: The Unicode standard pushes for the use of NFKC_Casefold for identifiers, going so far as to pre-compute an optimized character mapping in the UCD and to provide dedicated NFKC_Casefold functions in ICU. It performs NFKC normalization, removes "default ignorables" (meta characters used for formatting), and has the same stability guarantees as the regular normalization forms. The fact that it casefolds doesn't really matter because of the NFC-level tie-breaker. This may represent the greatest ROI for the least amount of effort and bike-shedding. The only thing that gives me serious pause about using NFKC_Casefold for the top-level sort is that it doesn't remove control characters.
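For illustration, a minimal sketch of an NFKC_Casefold approximation, assuming Python's stdlib unicodedata (Python exposes neither NFKC_Casefold nor the Default_Ignorable_Code_Point property, so the ignorable set below is a deliberately incomplete sample):

```python
import unicodedata

# Incomplete sample of Default_Ignorable_Code_Point members; a real
# implementation would use the full UCD property.
SAMPLE_DEFAULT_IGNORABLES = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

def nfkc_casefold_approx(s: str) -> str:
    """Approximate NFKC_Casefold: NFKC, casefold, re-NFKC, drop ignorables."""
    s = unicodedata.normalize("NFKC", s)
    s = unicodedata.normalize("NFKC", s.casefold())
    return "".join(ch for ch in s if ch not in SAMPLE_DEFAULT_IGNORABLES)
```

Note that, matching the concern above, nothing here strips control characters: "\x00" passes through unchanged.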
NFD vs NFC

The choice between Composed (NFC) and Decomposed (NFD) forms: some 99.98% of text is in NFC form anyway, so NFC is faster and less disruptive than NFD.

NFD is the most commonly used form in filesystems, but I cannot find any good sources for the rationale. This i18n filesystem thread speculates that HFS+ chose (a variant of) NFD either because it was originally going to be the only stable normal form or because it is theoretically faster. The Linux implementation of case-insensitivity uses NFDi (NFD minus default ignorables, but not control codes!) because they thought NFD would be faster than NFC, which is a common misconception (most text is already in NFC form, and the fastest thing is to do nothing).
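A quick demonstration of the same text in both forms, assuming Python's stdlib unicodedata (the example string is arbitrary):

```python
import unicodedata

composed = "\u00E9"     # "é" as a single code point (NFC)
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT (NFD)

assert composed != decomposed  # raw code-point sequences differ
assert unicodedata.normalize("NFC", decomposed) == composed
# Python 3.8+ can cheaply verify text is already normalized (the common case):
assert unicodedata.is_normalized("NFC", composed)
```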
NFKC

I was hoping to skip the NFKx level and define a single ~NFC-superset comparison level that would paper over the differences in normal forms found in popular filesystems. However, this turned out to be fairly brittle, as each filesystem applies slightly different non-standard normalizations for slightly different reasons. I also found out that different input methods may insert strings that won't be identical even in NFx form (e.g. Korean: "same keypress will result in U+3131 in Windows but U+1100 in Mac"; see the sketch below). When you factor in files that use compatibility code points to maintain round-trip-ability with old code pages, NFKC seems necessary.
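A minimal check of that Korean example, assuming Python's stdlib unicodedata:

```python
import unicodedata

windows_jamo = "\u3131"  # HANGUL LETTER KIYEOK (compatibility jamo)
mac_jamo = "\u1100"      # HANGUL CHOSEONG KIYEOK (conjoining jamo)

# Canonical normalization alone does not unify the two spellings...
assert unicodedata.normalize("NFC", windows_jamo) != unicodedata.normalize("NFC", mac_jamo)
# ...but compatibility normalization maps the compatibility jamo onto U+1100.
assert unicodedata.normalize("NFKC", windows_jamo) == mac_jamo
```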
Lowest Common Denominator

This requires implementer feedback. The simplest to implement is .

We could do NFKC - default ignorables - C1 - C0 (ASCII control codes), sketched below. This would be the least destructive, but also a snowflake.

We could also use the PRECIS NFKC Freeform profile. If we use PRECIS in other parts of the standard, then it would make sense to use it here. This has the benefit of additional controls around ZWJ and bidirectional text. It would, however, remove unassigned code points.

The advice from the Unicode mailing list was to leave unassigned code points in place, as new characters with non-zero ccc values are rarely assigned, and as files from newer systems might use characters from minority scripts. Instead, an implementation should emit a warning when encountering unassigned code points, and error depending on a configuration option.
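A rough sketch of the "NFKC - default ignorables - C1 - C0" option, assuming Python's stdlib unicodedata; the general category Cc covers exactly the C0 and C1 control blocks, while Cf (format) is used here only as a stand-in for default ignorables, which it approximates but does not match:

```python
import unicodedata

def lcd_normalize(name: str) -> str:
    """NFKC, then drop control (Cc = C0 + C1 + DEL) and format (Cf) characters.

    Cf only approximates Default_Ignorable_Code_Point: e.g. U+0600 ARABIC
    NUMBER SIGN is Cf but not default-ignorable, and some default ignorables
    fall outside Cf. A real implementation would use the UCD property.
    """
    normalized = unicodedata.normalize("NFKC", name)
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch) not in ("Cc", "Cf")
    )
```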
It would seem that anything that incorporates NFC, NFD, or Casefold at any part of the ordering would depend on a version of Unicode, and thus be nondeterministic. Also, as I detailed here, I expect WASI-filesystem will contain a significant amount of nondeterminism in practice, so making this one aspect deterministic doesn't seem worth the significant complexity.
Reproducibility is the best any platform can hope to accomplish, as any Turing-complete program with access to non-deterministic inputs can be made to run non-deterministically. Sorting by code point would not be version-dependent, would cover ~95% of use cases, and would be a LOT less work than implementing a VFS or ad-hoc patching individual programs; see the sketch below.

Point taken WRT complexity, performance, and implementation costs. I was hoping to propose an Evergreen Unicode release strategy and contribute a sorting implementation, but medical issues have delayed this work. Ideally, things would be architected to allow someone with the budget/time to contribute a maximally normalizing sort algorithm later on. But feel free to close and tag with
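To illustrate why code-point order is version-independent, a tiny example assuming Python (whose default str comparison is already by code point and consults no Unicode tables):

```python
# The three Å spellings sort deterministically, but not adjacently:
# code-point order is stable across Unicode versions precisely because
# it ignores normalization and collation data entirely.
names = ["\u212B.txt", "A\u030A.txt", "\u00C5.txt", "zebra.txt"]
assert sorted(names) == ["A\u030A.txt", "zebra.txt", "\u00C5.txt", "\u212B.txt"]
```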
Yeah, let's kill this. Without the FS maintaining the index, we will get weird performance edge cases. It's just the wrong level of abstraction to be doing this on.
The order in which files in a directory are listed is usually determined by when they were added to the filesystem b-tree and is a common source of non-determinism. A deterministic filesystem API would sort directory listings and iteration based on filename.
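As a strawman, the fix at the API boundary could be as small as this Python sketch (deterministic_listdir is a hypothetical helper, not part of any WASI API):

```python
import os

def deterministic_listdir(path: str) -> list[str]:
    # os.listdir yields entries in on-disk (insertion/b-tree) order, which
    # varies across hosts and runs; sorting by name makes iteration
    # reproducible, up to the filename-encoding caveats discussed below.
    return sorted(os.listdir(path))
```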
Unicode provides a locale-invariant default sorting algorithm (DUCET), which can be made deterministic, but it is not suitable for use in WASI for several reasons.
The consensus on the ICU support mailing list was that code-point order is the only sane option: given the high bar of a portable deterministic sort order, anything more complex is an implementation hazard.
Plain code-point order would be simplest to implement and the only truly version-less sort order. However, subtle platform differences may alter the raw bag-of-bytes filename encoding: Å could be encoded as the Nordic letter U+00C5, the symbol U+212B (ANGSTROM SIGN), or the ASCII letter A/U+0041 followed by COMBINING RING ABOVE °/U+030A.

Non-exhaustive list of things that can alter filename encoding:

The "solution" is to perform the initial sort on some maximally compatible canonical normalization, using progressively less extreme normal forms (and eventually the raw byte-string) as tie-breakers. As the file store ultimately decides whether two strings are equivalent in a given namespace, this would result in either an identical sort or an error; a sketch follows.
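A minimal sketch of that tiered comparison, assuming Python's stdlib unicodedata (NFKC plus str.casefold stands in for NFKC_Casefold, which the stdlib lacks, and the WTF-8/UTF-8 byte-string levels collapse into one because Python strings are already code-point sequences):

```python
import unicodedata

def tiered_sort_key(name: str) -> tuple[str, str, str, str]:
    """Most-compatible form first; stricter forms, then raw code points,
    break ties so distinct byte-strings never compare equal."""
    nfkc = unicodedata.normalize("NFKC", name)
    return (
        unicodedata.normalize("NFKC", nfkc.casefold()),  # ~NFKC_Casefold
        nfkc,                                            # NFKC
        unicodedata.normalize("NFC", name),              # NFC
        name,                                            # raw code points
    )

# All three Å spellings now sort adjacently instead of scattering:
entries = ["\u212B", "zebra", "A\u030A", "\u00C5"]
assert sorted(entries, key=tiered_sort_key) == ["zebra", "A\u030A", "\u00C5", "\u212B"]
```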
However, the idiot-in-a-hurry implementation of this "solution" (i.e. NFKx_casefold -> NFKx -> NFx -> WTF-8 byte-string -> UTF-8 byte-string) introduces its own set of stability, complexity, and performance issues.

TL;DR
I'm hesitant to suggest any further refinements without implementer feedback.