
Deterministic/Reproducible Directory Listing #12

Closed
@indolering

Description


The order in which files in a directory are listed is usually determined by when they were added to the filesystem b-tree and is a common source of non-determinism. A deterministic filesystem API would sort directory listings and iteration based on filename.

Unicode provides a locale-invariant default collation algorithm (DUCET), which can be made deterministic. But it is not suitable for use in WASI for several reasons:

  • It's a complex algorithm and ripe for subtle bugs.
  • Sort order is unstable across versions of Unicode, even for already-assigned code points. So we would have to pin a single version and turn WASI into a snowflake standard.
  • Unlike with casefolding, Unicode collation is heavily locale-dependent. Any presentation to the user would require recalculating the sort keys, so it's not worth trying to amortize the cost of DUCET by computing the keys up front and storing them with the filename.

The consensus on the ICU support mailing list was that code-point order is the only sane option: given the high bar of a portable, deterministic sort order, anything more complex is an implementation hazard.
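A minimal sketch of what code-point order means in practice: it is what a plain string sort already gives in most languages, and for well-formed UTF-8 it coincides with a bytewise sort of the encoded names (UTF-8 preserves code-point order by construction). The filenames below are illustrative, not from the issue.

```python
# Plain code-point sort vs. bytewise sort of the UTF-8 encodings.
names = ["b.txt", "a\u0300.txt", "\u00e0.txt", "A.txt"]

# Python's default str comparison is code-point order.
by_code_point = sorted(names)

# Bytewise sort of well-formed UTF-8 yields the same order.
by_utf8_bytes = [b.decode("utf-8")
                 for b in sorted(n.encode("utf-8") for n in names)]

assert by_code_point == by_utf8_bytes
```

Note that "a\u0300.txt" (combining grave) and "\u00e0.txt" (precomposed à) land in different positions even though a user would consider them the same name, which is exactly the problem the rest of this issue is about.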

Plain code-point order would be the simplest to implement and the only truly version-less sort order. However, subtle platform differences may alter the raw bag-of-bytes filename encoding: Å could be encoded as the Nordic letter U+00C5, the symbol U+212B (ANGSTROM SIGN), or the ASCII letter A (U+0041) followed by U+030A (COMBINING RING ABOVE). A non-exhaustive list of things that can alter filename encoding:

  • Differences in normalization handling in the file store (HFS+, SMB, Git, etc.).
  • Differences in input methods, user locale, and available Unicode version.
  • Differences between programming languages.
  • Complications arising from ensuring round-trip encoding between legacy code-pages and Unicode.
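To make the Å example concrete, here is a sketch using Python's `unicodedata`: the three spellings above are distinct code-point sequences, yet canonical normalization (NFC) collapses all of them, because U+212B canonically decomposes to U+00C5.

```python
import unicodedata

# Three valid spellings of "Å" from the issue text.
nordic    = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom  = "\u212B"   # ANGSTROM SIGN (canonical singleton decomposition to U+00C5)
combining = "A\u030A"  # 'A' + COMBINING RING ABOVE

# Distinct as raw code-point sequences...
assert len({nordic, angstrom, combining}) == 3

# ...but identical after canonical normalization.
assert {unicodedata.normalize("NFC", s)
        for s in (nordic, angstrom, combining)} == {nordic}
```

A plain code-point sort sees three different strings here; which one a given platform hands back depends on the file store, the input method, and the language runtime.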

The "solution" is to perform the initial sort on some maximally compatible canonical normalization, using progressively less aggressive normal forms (and eventually the raw byte string) as tie-breakers. Since the file store ultimately decides whether two strings are equivalent in a given namespace, this would result in either an identical sort or raising an error. However, the idiot-in-a-hurry implementation of this "solution" (i.e. NFKx_casefold → NFKx → NFx → WTF-8 byte string → UTF-8 byte string) introduces its own set of stability, complexity, and performance issues.

TL;DR

  1. Top-level code-point sort based on a maximally compatible canonical normalization (a lowest-common-denominator normal form) to remove platform-dependent differences.
  2. Straight NFC tie-breaker. This is roughly the level of normalization most file stores use, except for edge cases like default ignorables and fractions.
  3. Raw binary tie-breaker. Again, the design intentionally minimizes the need for this, as differences due to UCS-2/WTF-8 make these comparisons nonsensical/non-portable.
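The three tiers above can be sketched as a composite sort key. This is a hypothetical illustration, not a proposed WASI implementation: `sort_key` and the sample filenames are mine, and `NFKC` plus `str.casefold()` only approximates the NFKC_Casefold mapping (real NFKC_Casefold also removes default ignorables).

```python
import unicodedata

def sort_key(name: str) -> tuple[str, str, str]:
    # Tier 1: maximally folded form (approximates NFKC_Casefold).
    folded = unicodedata.normalize("NFKC", name.casefold())
    # Tier 2: NFC tie-breaker; tier 3: the raw code-point string.
    return (folded, unicodedata.normalize("NFC", name), name)

names = ["A\u030A.txt", "\u212B.txt", "\u00C5.txt", "a.txt"]
listing = sorted(names, key=sort_key)
# All three Å spellings tie at tiers 1 and 2 (NFC collapses them),
# so only the raw tier-3 comparison decides their relative order.
```

Note how rarely tier 3 fires: it is needed here only because the directory contains three spellings that the file store presumably should have unified in the first place.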

I'm hesitant to suggest any further refinements without implementer feedback.
