fix(marshal)!: compare strings by codepoint #2008

Draft: wants to merge 2 commits into base: master
8 changes: 8 additions & 0 deletions packages/marshal/NEWS.md
@@ -1,5 +1,13 @@
User-visible changes in `@endo/marshal`:

# Next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF-16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode code point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode code point does not fit in 16 bits***.
- This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
Contributor Author

@gibson042 is this true? It was true for my small test case, which proves very little. Will the same property also be true for compactOrdered? For either, does restricting these strings to well-ordered have any effect on whether their encoding is order preserving?

Contributor

It is true now, but I think that's a mistake... recordNames and any similar function that .sort()s an array of strings in marshal or a related package should probably be updated to .sort(compareByCodePoints) so the encoding of Copy{Record,Set,Bag,Map}s and their own comparison is consistent with that of their constituent strings.

Which unfortunately complicates adoption if we have existing use of any such strings.
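
To make the concern concrete, here is a minimal sketch contrasting a bare `.sort()` with `.sort(compareByCodePoints)`. The comparator is re-declared (without `harden`) so the snippet runs outside a SES environment, and the two sample names are made up for illustration; they do not come from `recordNames` or any Endo package.

```javascript
// Standalone copy of the PR's comparator, minus `harden`.
const compareByCodePoints = (left, right) => {
  const leftIter = left[Symbol.iterator]();
  const rightIter = right[Symbol.iterator]();
  for (;;) {
    const { value: leftChar } = leftIter.next();
    const { value: rightChar } = rightIter.next();
    if (leftChar === undefined && rightChar === undefined) return 0;
    if (leftChar === undefined) return -1; // left is a prefix of right
    if (rightChar === undefined) return 1; // right is a prefix of left
    const leftCodePoint = leftChar.codePointAt(0);
    const rightCodePoint = rightChar.codePointAt(0);
    if (leftCodePoint < rightCodePoint) return -1;
    if (leftCodePoint > rightCodePoint) return 1;
  }
};

// Hypothetical names: a BMP character above the surrogate range (U+FF61)
// and a supplementary character (U+1F600, UTF-16 units 0xD83D 0xDE00).
const names = ['\u{ff61}', '\u{1f600}'];

// A bare .sort() compares UTF-16 code units: 0xD83D < 0xFF61.
const nativeSorted = [...names].sort();
// Code point comparison: 0xFF61 < 0x1F600.
const codePointSorted = [...names].sort(compareByCodePoints);

console.log(nativeSorted[0] === '\u{1f600}'); // true
console.log(codePointSorted[0] === '\u{ff61}'); // true
```

Any collection whose members are ordered with the first form and whose elements are compared with the second would disagree on inputs like these.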

Contributor Author

Good point, and bad news!

Grepping for .sort() specifically with nothing between the parens, I see 96 occurrences in agoric-sdk and 26 in endo. Some may not be or contain strings. But still, fixing all that do will be disruptive. And the longer we wait, the more disruptive it'll be.

I'm putting this back into Draft until we decide what our plan is. Attn @ivanlei

Contributor Author

Is there any practical way to scan a recent snapshot of our chain and somehow see how many persistent strings

  • are non-ASCII,
  • are non-well-formed, or
  • have supplementary characters (those whose code point does not fit in 16 bits)

?
How hard would it be?
Attn @mhofman

NOT URGENT.
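
As a rough sketch of the per-string check such a scan would need, the snippet below classifies strings into the three categories above. How the strings would be extracted from a chain snapshot is not addressed; `classifyString`, `tallyStrings`, and the sample data are hypothetical names introduced only for this illustration.

```javascript
// Classify one string by walking its code points. The string iterator
// yields a surrogate pair as a single two-unit string and a lone
// surrogate as a one-unit string.
const classifyString = str => {
  let nonAscii = false;
  let nonWellFormed = false;
  let supplementary = false;
  for (const ch of str) {
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0x7f) nonAscii = true;
    if (codePoint > 0xffff) supplementary = true;
    // A surrogate code point can only show up here if it was unpaired.
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) nonWellFormed = true;
  }
  return { nonAscii, nonWellFormed, supplementary };
};

// Tally the three categories over any iterable of strings.
const tallyStrings = strings => {
  const tally = { total: 0, nonAscii: 0, nonWellFormed: 0, supplementary: 0 };
  for (const str of strings) {
    const { nonAscii, nonWellFormed, supplementary } = classifyString(str);
    tally.total += 1;
    if (nonAscii) tally.nonAscii += 1;
    if (nonWellFormed) tally.nonWellFormed += 1;
    if (supplementary) tally.supplementary += 1;
  }
  return tally;
};

// Made-up sample: ASCII, Latin-1, a lone surrogate, and an emoji.
const sample = ['plain', 'caf\u{e9}', '\u{d800}X', '\u{1f600}'];
console.log(tallyStrings(sample));
// { total: 4, nonAscii: 3, nonWellFormed: 1, supplementary: 1 }
```

Only the `supplementary` count bears directly on this PR's ordering change; the `nonWellFormed` count is relevant to the well-formedness restriction discussed elsewhere in this thread.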

- The key order of strings defined by the @endo/patterns module is still defined to be the same as the rank ordering of those strings. So this release changes key order among strings to also be lexicographic comparison of Unicode Code Points. To accommodate this change, you may need to adapt applications that relied on key-order being the same as JS native order. This could include the use of any patterns expressing key inequality tests, like `M.gte(string)`.
- These string ordering changes bring Endo into conformance with any string ordering components of the OCapN standard.
- To accommodate these changes, you may need to adapt applications that relied on rank-order or key-order being the same as JS native order. You may need to re-sort any data that had previously been rank sorted using the prior `compareRank` function. You may need to revisit any use of patterns like `M.gte(string)` expressing inequalities over strings.
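
One way to find out whether stored data needs re-sorting is to check whether it is still sorted under the new order. The following is only a sketch: it uses plain stand-in comparators rather than the real `compareRank`, and `isSortedBy` and the sample array are hypothetical names for illustration.

```javascript
// Compact stand-in for code point comparison (not the package's code).
const compareByCodePoints = (left, right) => {
  const l = Array.from(left, ch => ch.codePointAt(0));
  const r = Array.from(right, ch => ch.codePointAt(0));
  const shared = Math.min(l.length, r.length);
  for (let i = 0; i < shared; i += 1) {
    if (l[i] !== r[i]) return l[i] < r[i] ? -1 : 1;
  }
  // One string is a prefix of the other, or they are equal.
  return Math.sign(l.length - r.length);
};

// The prior behavior for strings: JavaScript's native code unit order.
const compareNative = (left, right) =>
  // eslint-disable-next-line no-nested-ternary
  left < right ? -1 : left > right ? 1 : 0;

// True when each element is <= its successor under `compare`.
const isSortedBy = (strings, compare) =>
  strings.every((str, i) => i === 0 || compare(strings[i - 1], str) <= 0);

// Made-up data that was sorted under the old (native) order.
const stored = ['\u{1f600}', '\u{ff61}'];

console.log(isSortedBy(stored, compareNative)); // true
console.log(isSortedBy(stored, compareByCodePoints)); // false

// Re-sort a copy under the new order.
const resorted = [...stored].sort(compareByCodePoints);
console.log(isSortedBy(resorted, compareByCodePoints)); // true
```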

# v1.6.0 (2024-10-22)

- `compareRank` now short-circuits upon encountering remotables to compare,
1 change: 1 addition & 0 deletions packages/marshal/index.js
@@ -16,6 +16,7 @@ export {

export {
trivialComparator,
compareByCodePoints,
assertRankSorted,
compareRank,
isRankSorted,
45 changes: 42 additions & 3 deletions packages/marshal/src/rankOrder.js
@@ -8,7 +8,7 @@ import {

/**
* @import {Passable, PassStyle} from '@endo/pass-style'
- * @import {FullCompare, PartialCompare, PartialComparison, RankCompare, RankCover} from './types.js'
+ * @import {FullCompare, PartialCompare, PartialComparison, RankCompare, RankCover, RankComparison} from './types.js'
*/

const { entries, fromEntries, setPrototypeOf, is } = Object;
@@ -44,9 +44,46 @@ const { entries, fromEntries, setPrototypeOf, is } = Object;
*/
const sameValueZero = (x, y) => x === y || is(x, y);

/**
* @param {any} left
* @param {any} right
* @returns {RankComparison}
*/
export const trivialComparator = (left, right) =>
// eslint-disable-next-line no-nested-ternary, @endo/restrict-comparison-operands
left < right ? -1 : left === right ? 0 : 1;
harden(trivialComparator);

// Apparently eslint is confused about whether the function can ever exit
// without an explicit return.
// eslint-disable-next-line jsdoc/require-returns-check
/**
* @param {string} left
* @param {string} right
* @returns {RankComparison}
*/
export const compareByCodePoints = (left, right) => {
const leftIter = left[Symbol.iterator]();
const rightIter = right[Symbol.iterator]();
for (;;) {
const { value: leftChar } = leftIter.next();
const { value: rightChar } = rightIter.next();
if (leftChar === undefined && rightChar === undefined) {
return 0;
} else if (leftChar === undefined) {
// left is a prefix of right.
return -1;
} else if (rightChar === undefined) {
// right is a prefix of left.
return 1;
}
const leftCodepoint = /** @type {number} */ (leftChar.codePointAt(0));
const rightCodepoint = /** @type {number} */ (rightChar.codePointAt(0));
if (leftCodepoint < rightCodepoint) return -1;
if (leftCodepoint > rightCodepoint) return 1;
}
};
harden(compareByCodePoints);

/**
* @typedef {Record<PassStyle, { index: number, cover: RankCover }>} PassStyleRanksRecord
@@ -139,8 +176,7 @@ export const makeComparatorKit = (compareRemotables = (_x, _y) => NaN) => {
return 0;
}
case 'boolean':
- case 'bigint':
- case 'string': {
+ case 'bigint': {
// Within each of these passStyles, the rank ordering agrees with
// JavaScript's relational operators `<` and `>`.
if (left < right) {
@@ -150,6 +186,9 @@ export const makeComparatorKit = (compareRemotables = (_x, _y) => NaN) => {
return 1;
}
}
case 'string': {
return compareByCodePoints(left, right);
}
case 'symbol': {
return comparator(
nameForPassableSymbol(left),
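
Routing the `'string'` case through `compareByCodePoints` should change results only when a supplementary character is involved. The sketch below spot-checks that over a small, made-up alphabet of well-formed strings. It is a brute-force illustration under those assumptions, not a proof, and it uses a compact stand-in comparator rather than the one in this diff.

```javascript
// Compact stand-in for code point comparison (not the diff's code).
const compareByCodePoints = (left, right) => {
  const l = Array.from(left, ch => ch.codePointAt(0));
  const r = Array.from(right, ch => ch.codePointAt(0));
  const shared = Math.min(l.length, r.length);
  for (let i = 0; i < shared; i += 1) {
    if (l[i] !== r[i]) return l[i] < r[i] ? -1 : 1;
  }
  return Math.sign(l.length - r.length);
};

// JavaScript's native comparison, by UTF-16 code units.
const compareNative = (left, right) =>
  // eslint-disable-next-line no-nested-ternary
  left < right ? -1 : left > right ? 1 : 0;

const hasSupplementary = str =>
  [...str].some(ch => ch.codePointAt(0) > 0xffff);

// Made-up alphabet: BMP characters on both sides of the surrogate
// range, plus two supplementary characters. No lone surrogates.
const alphabet = [
  'A', '\u{e9}', '\u{d7ff}', '\u{e000}', '\u{ff61}', '\u{10002}', '\u{1f600}',
];

// All strings of one or two code points over that alphabet.
const samples = [];
for (const a of alphabet) {
  samples.push(a);
  for (const b of alphabet) samples.push(a + b);
}

let disagreements = 0;
let disagreementsWithoutSupplementary = 0;
for (const left of samples) {
  for (const right of samples) {
    if (compareNative(left, right) !== compareByCodePoints(left, right)) {
      disagreements += 1;
      if (!hasSupplementary(left) && !hasSupplementary(right)) {
        disagreementsWithoutSupplementary += 1;
      }
    }
  }
}

console.log(disagreements > 0); // true
console.log(disagreementsWithoutSupplementary); // 0
```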
87 changes: 87 additions & 0 deletions packages/marshal/test/encodePassable-for-testing.js
@@ -0,0 +1,87 @@
/* eslint-disable no-bitwise, @endo/restrict-comparison-operands */
import { Fail, q } from '@endo/errors';

import {
makeEncodePassable,
makeDecodePassable,
} from '../src/encodePassable.js';
import { compareRank, makeComparatorKit } from '../src/rankOrder.js';

const buffers = {
__proto__: null,
r: [],
'?': [],
'!': [],
};
const resetBuffers = () => {
buffers.r = [];
buffers['?'] = [];
buffers['!'] = [];
};
const cursors = {
__proto__: null,
r: 0,
'?': 0,
'!': 0,
};
const resetCursors = () => {
cursors.r = 0;
cursors['?'] = 0;
cursors['!'] = 0;
};

const encodeThing = (prefix, r) => {
buffers[prefix].push(r);
// With this encoding, all things with the same prefix have the same rank
return prefix;
};

const decodeThing = (prefix, e) => {
prefix === e ||
Fail`expected encoding ${q(e)} to simply be the prefix ${q(prefix)}`;
(cursors[prefix] >= 0 && cursors[prefix] < buffers[prefix].length) ||
Fail`while decoding ${q(e)}, expected cursors[${q(prefix)}], i.e., ${q(
cursors[prefix],
)} < ${q(buffers[prefix].length)}`;
const thing = buffers[prefix][cursors[prefix]];
cursors[prefix] += 1;
return thing;
};

const encodePassableInternal = makeEncodePassable({
encodeRemotable: r => encodeThing('r', r),
encodePromise: p => encodeThing('?', p),
encodeError: er => encodeThing('!', er),
});

export const encodePassableInternal2 = makeEncodePassable({
encodeRemotable: r => encodeThing('r', r),
encodePromise: p => encodeThing('?', p),
encodeError: er => encodeThing('!', er),
format: 'compactOrdered',
});

export const encodePassable = passable => {
resetBuffers();
return encodePassableInternal(passable);
};

export const encodePassable2 = passable => {
resetBuffers();
return encodePassableInternal2(passable);
};
export const decodePassableInternal = makeDecodePassable({
decodeRemotable: e => decodeThing('r', e),
decodePromise: e => decodeThing('?', e),
decodeError: e => decodeThing('!', e),
});

export const decodePassable = encoded => {
resetCursors();
return decodePassableInternal(encoded);
};

const compareRemotables = (x, y) =>
compareRank(encodeThing('r', x), encodeThing('r', y));

export const { comparator: compareFull } = makeComparatorKit(compareRemotables);
64 changes: 64 additions & 0 deletions packages/marshal/test/test-string-rank-order.js
@@ -0,0 +1,64 @@
import test from '@endo/ses-ava/prepare-endo.js';

import { compareRank } from '../src/rankOrder.js';
import { encodePassable } from './encodePassable-for-testing.js';

/**
* Essentially a ponyfill for Array.prototype.toSorted, for use before
* we can always rely on the platform to provide it.
*
* @param {string[]} strings
* @param {(
* left: string,
* right: string
* ) => import('../src/types.js').RankComparison} comp
* @returns {string[]}
*/
const sorted = (strings, comp) => [...strings].sort(comp);

test('unicode code point order', t => {
// Test case from
// https://icu-project.org/docs/papers/utf16_code_point_order.html
const str0 = '\u{ff61}';
const str3 = '\u{d800}\u{dc02}';

// str1 and str2 become impossible examples once we prohibit
// non-well-formed strings.
// See https://github.com/endojs/endo/pull/2002
const str1 = '\u{d800}X';
const str2 = '\u{d800}\u{ff61}';

// harden to ensure it is not sorted in place, just for sanity
const strs = harden([str0, str1, str2, str3]);

/**
* @param {string} left
* @param {string} right
* @returns {import('../src/types.js').RankComparison}
*/
const nativeComp = (left, right) =>
// eslint-disable-next-line no-nested-ternary
left < right ? -1 : left > right ? 1 : 0;

const nativeSorted = sorted(strs, nativeComp);

t.deepEqual(nativeSorted, [str1, str3, str2, str0]);

const rankSorted = sorted(strs, compareRank);

t.deepEqual(rankSorted, [str1, str2, str0, str3]);

const nativeEncComp = (left, right) =>
nativeComp(encodePassable(left), encodePassable(right));

const nativeEncSorted = sorted(strs, nativeEncComp);

t.deepEqual(nativeEncSorted, nativeSorted);

const rankEncComp = (left, right) =>
compareRank(encodePassable(left), encodePassable(right));

const rankEncSorted = sorted(strs, rankEncComp);

t.deepEqual(rankEncSorted, rankSorted);
});
5 changes: 5 additions & 0 deletions packages/patterns/NEWS.md
@@ -1,5 +1,10 @@
User-visible changes in `@endo/patterns`:

# Next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF-16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareKeys` and associated functions compared strings using this JavaScript-native comparison. Now `compareKeys` and associated functions compare strings by lexicographic Unicode code point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode code point does not fit in 16 bits***.
- See the NEWS.md of @endo/marshal for more on this change.

# v1.4.0 (2024-05-06)

- `Passable` is now an accurate type instead of `any`. Downstream type checking may require changes ([example](https://github.com/Agoric/agoric-sdk/pull/8774))
54 changes: 54 additions & 0 deletions packages/patterns/test/test-string-key-order.js
@@ -0,0 +1,54 @@
// modeled on test-string-rank-order.js
import test from '@endo/ses-ava/prepare-endo.js';

import { compareKeys } from '../src/keys/compareKeys.js';

/**
* Essentially a ponyfill for Array.prototype.toSorted, for use before
* we can always rely on the platform to provide it.
*
* @param {string[]} strings
* @param {(
* left: string,
* right: string
* ) => import('@endo/marshal').RankComparison} comp
* @returns {string[]}
*/
const sorted = (strings, comp) => [...strings].sort(comp);

test('unicode code point order', t => {
// Test case from
// https://icu-project.org/docs/papers/utf16_code_point_order.html
const str0 = '\u{ff61}';
const str3 = '\u{d800}\u{dc02}';

// str1 and str2 become impossible examples once we prohibit
// non-well-formed strings.
// See https://github.com/endojs/endo/pull/2002
const str1 = '\u{d800}X';
const str2 = '\u{d800}\u{ff61}';

// harden to ensure it is not sorted in place, just for sanity
const strs = harden([str0, str1, str2, str3]);

/**
* @param {string} left
* @param {string} right
* @returns {import('@endo/marshal').RankComparison}
*/
const nativeComp = (left, right) =>
// eslint-disable-next-line no-nested-ternary
left < right ? -1 : left > right ? 1 : 0;

const nativeSorted = sorted(strs, nativeComp);

t.deepEqual(nativeSorted, [str1, str3, str2, str0]);

// @ts-expect-error We know that for strings, `compareKeys` never returns
// NaN because it never judges strings to be incomparable. Thus, the
// KeyComparison it returns happens to also be a RankComparison we can
// sort with.
const keySorted = sorted(strs, compareKeys);

t.deepEqual(keySorted, [str1, str2, str0, str3]);
});