Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define string sorting by "code unit order" #55

Closed
domenic opened this issue Jan 13, 2017 · 11 comments · Fixed by #248
Closed

Define string sorting by "code unit order" #55

domenic opened this issue Jan 13, 2017 · 11 comments · Fixed by #248

Comments

@domenic
Copy link
Member

domenic commented Jan 13, 2017

Background: whatwg/url#199

We want to update specs to be unambiguous about this, so I think infra is a good place to define it. It should include some examples (similar to the URLSearchParams WPTs).

@aphillips
Copy link
Contributor

I don't want to add a circular reference here: I commented on url#199 to the effect that 'code unit' requires that the character encoding (probably UTF-16 if working in terms of the DOM) be specified. If we're working in terms of UTF-8 byte strings, code point order would be better. Otherwise sort won't do anything reasonable.

@annevk
Copy link
Member

annevk commented Mar 17, 2017

We should settle this in a similar way to how we settle sizing of strings I think, as discussed in #74. And this would be easy to fix once #73 is done.

@annevk
Copy link
Member

annevk commented Mar 20, 2017

We should also fix byte sorting for whatwg/fetch#454.

@annevk
Copy link
Member

annevk commented Mar 27, 2017

This should probably do something like https://en.wikipedia.org/wiki/Comparison_sort, but not sure yet. See also advice from @domenic in #74.

@annevk
Copy link
Member

annevk commented Mar 27, 2017

I'm thinking it should be something like this:

To sort JavaScript strings perform a comparison sort using the following three-way comparison operation, given A and B:

  • If A is B, then return 0.
  • If A is a prefix of B, then return -1.
  • If B is a prefix of A, then return 1.
  • For each code unit a and b in A and B:
    • If a is b, then continue.
    • If a is less than b, then return -1.
    • Return 1.

This will get a little repetitive though to do for byte sequences, JavaScript strings, and strings.

@aphillips
Copy link
Contributor

For the W3C spec IndexedDb, I suggested some text that may come in handy in annotating this:

This matches the Array.prototype.sort on an Array of Strings. This ordering compares the 16-bit code units in each string, producing a highly efficient, consistent, and deterministic sort order. The resulting list will not match any particular alphabet or lexicographical order, particularly for code points represented by a surrogate pair.

@domenic
Copy link
Member Author

domenic commented Mar 28, 2017

Personally I think we should be able to just define a "less than" operation on two strings. We don't need the 0/1/-1. Mathematically at least only less than is necessary (plus an implicit "equals" concept).

I would also define "code unit comparison", and then say that a "sort" operation takes a comparison. I would not tie together the definition of sort and the specific comparison we use.

We should probably also define ascending vs. descending sort (and say ascending is the default), although we could wait until we have a definite consumer for that.

I like @aphillips's text particularly for pointing out the advantages and caveats.

As for the repetition, this kind of thing makes me wonder about our implicit concept of "sequence" that we are using for both strings + byte sequences. It's kind of a list even, but I'm not sure I'd want to go that far in the layering... Still, it would be nice to generalize the sorting to work on all three sequence types.

@annevk
Copy link
Member

annevk commented Mar 28, 2017

Perhaps they can be lists with immutable size/length. Even if you just define "less than", you'd still have most of the steps I outlined above, no? Apart from the first step that is.

@domenic
Copy link
Member Author

domenic commented Mar 28, 2017

Hmmm, yeah, I guess that's true.

@annevk
Copy link
Member

annevk commented Mar 28, 2017

Note that if we settle #91 on mutable treating these as list-likes seems likely. We already have places that use "append" on strings, although they sometimes get passed a code point and sometimes a string (which would more argue for "extend"/"concat" if we're being picky).

@annevk
Copy link
Member

annevk commented Feb 22, 2018

(CSS Typed OM uses "code point order" for some things.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants