Deterministic/Reproducible Directory Listing #12
Casefold?

I don't think a case-folded comparison is necessary for deterministic sort order, as only ancient filesystems do not preserve case. It theoretically reduces entropy in some silly edge cases, but it is probably unhelpful in practice. I would only add a casefolding comparison level as an option if someone from a reproducible-build community says it would be helpful, or if we have to store the NFKC_Casefolded form anyway.

NFKC_Casefold

Edit: The Unicode standard pushes for the use of NFKC_Casefold for identifiers, going so far as to pre-compute an optimized character mapping in the UCD and to provide dedicated NFKC_Casefold functions in ICU. It performs NFKC normalization, removes "default ignorables" (meta characters used for formatting), and has the same stability guarantees as the regular normalization forms. The fact that it casefolds doesn't really matter because of the NFC-level tie-breaker. This may represent the greatest ROI for the least amount of effort and bike-shedding. The only thing that gives me serious pause about using NFKC_Casefold for the top-level sort is that it doesn't remove control characters.
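For illustration, a minimal sketch of an NFKC_Casefold approximation, assuming Python's stdlib unicodedata (Python exposes neither NFKC_Casefold nor the Default_Ignorable_Code_Point property, so the ignorable set below is a deliberately incomplete sample):

```python
import unicodedata

# Incomplete sample of Default_Ignorable_Code_Point members; a real
# implementation would use the full UCD property.
SAMPLE_DEFAULT_IGNORABLES = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

def nfkc_casefold_approx(s: str) -> str:
    """Approximate NFKC_Casefold: NFKC, casefold, re-NFKC, drop ignorables."""
    s = unicodedata.normalize("NFKC", s)
    s = unicodedata.normalize("NFKC", s.casefold())
    return "".join(ch for ch in s if ch not in SAMPLE_DEFAULT_IGNORABLES)
```

Note that, matching the concern above, nothing here strips control characters: "\x00" passes through unchanged.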
NFD vs NFC

The choice between Composed (NFC) and Decomposed (NFD) forms: some 99.98% of text is in NFC form anyway, so NFC is faster and less disruptive than NFD.

NFD is the most commonly used form in filesystems, but I cannot find any good sources for the rationale. This i18n filesystem thread speculates that HFS+ chose (a variant of) NFD either because it was originally going to be the only stable normal form or because it is theoretically faster. The Linux implementation of case-insensitivity uses NFDi (NFD minus default ignorables, but not control codes!) because they thought NFD would be faster than NFC, which is a common misconception (most text is already in NFC form, and the fastest thing is to do nothing).
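A quick demonstration of the same text in both forms, assuming Python's stdlib unicodedata (the example string is arbitrary):

```python
import unicodedata

composed = "\u00E9"     # "é" as a single code point (NFC)
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT (NFD)

assert composed != decomposed  # raw code-point sequences differ
assert unicodedata.normalize("NFC", decomposed) == composed
# Python 3.8+ can cheaply verify text is already normalized (the common case):
assert unicodedata.is_normalized("NFC", composed)
```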
NFKC

I was hoping to skip the NFKx level and define a single ~NFC-superset comparison level that would paper over the differences in normal forms found in popular filesystems. However, this turned out to be fairly brittle, as each filesystem applies slightly different non-standard normalizations for slightly different reasons. I also found out that different input methods may insert strings that won't be identical even in NFx form (e.g. Korean: "same keypress will result in U+3131 in Windows but U+1100 in Mac"; see the sketch below). When you factor in files that use compatibility code points to maintain round-trip-ability with old code pages, NFKC seems necessary.
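A minimal check of that Korean example, assuming Python's stdlib unicodedata:

```python
import unicodedata

windows_jamo = "\u3131"  # HANGUL LETTER KIYEOK (compatibility jamo)
mac_jamo = "\u1100"      # HANGUL CHOSEONG KIYEOK (conjoining jamo)

# Canonical normalization alone does not unify the two spellings...
assert unicodedata.normalize("NFC", windows_jamo) != unicodedata.normalize("NFC", mac_jamo)
# ...but compatibility normalization maps the compatibility jamo onto U+1100.
assert unicodedata.normalize("NFKC", windows_jamo) == mac_jamo
```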
Lowest Common Denominator

This requires implementer feedback. The simplest to implement is .

We could do NFKC - default ignorables - C1 - C0 (ASCII control codes), sketched below. This would be the least destructive, but also a snowflake.

We could also use the PRECIS NFKC Freeform profile. If we use PRECIS in other parts of the standard, then it would make sense to use it here. This has the benefit of additional controls around ZWJ and bidirectional text. It would, however, remove unassigned code points.

The advice from the Unicode mailing list was to leave unassigned code points in place, as new characters with non-zero ccc values are rarely assigned, and as files from newer systems might use characters from minority scripts. Instead, an implementation should emit a warning when encountering unassigned code points, and error depending on a configuration option.
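A rough sketch of the "NFKC - default ignorables - C1 - C0" option, assuming Python's stdlib unicodedata; the general category Cc covers exactly the C0 and C1 control blocks, while Cf (format) is used here only as a stand-in for default ignorables, which it approximates but does not match:

```python
import unicodedata

def lcd_normalize(name: str) -> str:
    """NFKC, then drop control (Cc = C0 + C1 + DEL) and format (Cf) characters.

    Cf only approximates Default_Ignorable_Code_Point: e.g. U+0600 ARABIC
    NUMBER SIGN is Cf but not default-ignorable, and some default ignorables
    fall outside Cf. A real implementation would use the UCD property.
    """
    normalized = unicodedata.normalize("NFKC", name)
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch) not in ("Cc", "Cf")
    )
```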
It would seem that anything that incorporates NFC, NFD, or Casefold at any part of the ordering would depend on a version of Unicode, and thus be nondeterministic. Also, as I detailed here, I expect WASI-filesystem will contain a significant amount of nondeterminism in practice, so making this one aspect deterministic doesn't seem worth the significant complexity.
Reproducibility is the best any platform can hope to accomplish, as any Turing-complete program with access to non-deterministic inputs can be made to run non-deterministically. Sorting by code point would not be version-dependent, would cover ~95% of use cases, and would be a LOT less work than implementing a VFS or ad-hoc patching individual programs; see the sketch below.

Point taken WRT complexity, performance, and implementation costs. I was hoping to propose an Evergreen Unicode release strategy and contribute a sorting implementation, but medical issues have delayed this work. Ideally, things would be architected to allow someone with the budget/time to contribute a maximally normalizing sort algorithm later on. But feel free to close and tag with
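To illustrate why code-point order is version-independent, a tiny example assuming Python (whose default str comparison is already by code point and consults no Unicode tables):

```python
# The three Å spellings sort deterministically, but not adjacently:
# code-point order is stable across Unicode versions precisely because
# it ignores normalization and collation data entirely.
names = ["\u212B.txt", "A\u030A.txt", "\u00C5.txt", "zebra.txt"]
assert sorted(names) == ["A\u030A.txt", "zebra.txt", "\u00C5.txt", "\u212B.txt"]
```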
Yeah, let's kill this. Without the FS maintaining the index, we will get weird performance edge cases. It's just the wrong level of abstraction to be doing this on.
The order in which files in a directory are listed is usually determined by when they were added to the filesystem b-tree and is a common source of non-determinism. A deterministic filesystem API would sort directory listings and iteration based on filename.
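As a strawman, the fix at the API boundary could be as small as this Python sketch (deterministic_listdir is a hypothetical helper, not part of any WASI API):

```python
import os

def deterministic_listdir(path: str) -> list[str]:
    # os.listdir yields entries in on-disk (insertion/b-tree) order, which
    # varies across hosts and runs; sorting by name makes iteration
    # reproducible, up to the filename-encoding caveats discussed below.
    return sorted(os.listdir(path))
```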
Unicode provides a locale-invariant default sorting algorithm (DUCET), which can be made deterministic, but it is not suitable for use in WASI for several reasons.
The consensus on the ICU support mailing list was that code-point order is the only sane option: given the high bar of a portable deterministic sort order, anything more complex is an implementation hazard.
Plain code-point order would be simplest to implement and the only truly version-less sort order. However, subtle platform differences may alter the raw bag-of-bytes filename encoding: Å could be encoded as the Nordic letter U+00C5, the symbol U+212B (ANGSTROM SIGN), or the ASCII letter A/U+0041 followed by COMBINING RING ABOVE °/U+030A.

Non-exhaustive list of things that can alter filename encoding:

The "solution" is to perform the initial sort on some maximally compatible canonical normalization, using progressively less extreme normal forms (and eventually the raw byte-string) as tie-breakers. As the file store ultimately decides whether two strings are equivalent in a given namespace, this would result in either an identical sort or an error; a sketch follows.
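A minimal sketch of that tiered comparison, assuming Python's stdlib unicodedata (NFKC plus str.casefold stands in for NFKC_Casefold, which the stdlib lacks, and the WTF-8/UTF-8 byte-string levels collapse into one because Python strings are already code-point sequences):

```python
import unicodedata

def tiered_sort_key(name: str) -> tuple[str, str, str, str]:
    """Most-compatible form first; stricter forms, then raw code points,
    break ties so distinct byte-strings never compare equal."""
    nfkc = unicodedata.normalize("NFKC", name)
    return (
        unicodedata.normalize("NFKC", nfkc.casefold()),  # ~NFKC_Casefold
        nfkc,                                            # NFKC
        unicodedata.normalize("NFC", name),              # NFC
        name,                                            # raw code points
    )

# All three Å spellings now sort adjacently instead of scattering:
entries = ["\u212B", "zebra", "A\u030A", "\u00C5"]
assert sorted(entries, key=tiered_sort_key) == ["zebra", "A\u030A", "\u00C5", "\u212B"]
```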
However, the idiot-in-a-hurry implementation of this "solution" (i.e. NFKx_casefold -> NFKx -> NFx -> WTF-8 byte-string -> UTF-8 byte-string) introduces its own set of stability, complexity, and performance issues.

TL;DR
I'm hesitant to suggest any further refinements without implementer feedback.