Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554

npaun · 2025-02-14T17:32:38Z

Adds jsg::USVString, which is guaranteed not to contain invalid characters in UTF8. This type is intended to make it convenient and consistent to implement Web APIs which say they take a USVString without having to deal with this issue every time.
Makes jsg::Dict use a HashMap during construction because it's possible that two JS strings will become the same USVString. Last value wins.

anonrig · 2025-02-14T17:35:47Z

Why can't we change kj::String to use correct utf-8 values? wouldn't it be safer? @kentonv @jasnell

anonrig · 2025-02-14T17:37:17Z

src/workerd/jsg/value.h

+      kj::Maybe<v8::Local<v8::Object>> parentObject) {
+    v8::Local<v8::String> str = check(handle->ToString(context));
+    v8::Isolate* isolate = context->GetIsolate();
+    auto buf = kj::heapArray<char>(str->Utf8LengthV2(isolate) + 1);


technically we can check for str->ContainsOneByteOnly() and omit utf8 calls for one byte strings

npaun · 2025-02-14T17:37:28Z

@anonrig Some APIs specify that a value is a DOMString which allows those invalid characters.

anonrig · 2025-02-14T17:40:14Z

@anonrig Some APIs specify that a value is a DOMString which allows those invalid characters.

I'm thinking out loud:

Maybe we can tackle the problem from a different perspective. Rather than allowing invalid characters for all inputs, we can just disallow by default, and introduce DOMString() rather than USVString().

Allowing invalid utf8 values by default can cause security issues.

kentonv · 2025-02-14T21:26:39Z

Why can't we change kj::String to use correct utf-8 values? wouldn't it be safer?

kj::String accepts whatever you give it. It contains an array of chars. By convention it is supposed to contain UTF-8 chars, but it is not kj::String's job to validate this. Validating in kj::String would be a bad idea because:

It would be incredibly slow as kj::String is used everywhere, so presumably any validation checks would happen everywhere.
Over-validation can lead to serious problems. For example, most systems that are based around UTF-16 do not validate surrogate pairs. This includes Java, JavaScript, Windows NT, etc. On Windows in particular, filenames can contain invalid surrogate pairs. If kj::String aggressively validated surrogate pairs, then people would be unable to use the KJ filesystem APIs to list and manipulate filenames on Windows that contain invalid surrogate pairs. This would be bad! In fact, this could be a DoS attack vector: drop a file with a bad name in the right place and break any KJ program trying to look at that directory. Yes that's right: aggressive validation can create security problems; it is not safer.

kentonv

I feel like this is one of those things that nobody cares about except for some pedantic test-writers. Do we really have to do this? What if we just didn't?

src/workerd/jsg/value.h

kentonv · 2025-02-14T21:29:48Z

src/workerd/api/url-standard.h

-  kj::Array<kj::ArrayPtr<const char>> getAll(kj::String name);
-  bool has(jsg::Lock& js, kj::String name, jsg::Optional<kj::String> value);
-  void set(kj::String name, kj::String value);
+  void append(jsg::USVString name, jsg::USVString value);


Isn't this a breaking change? Don't we need a compat flag?

Maybe we could even have a compat flag that enables the strict checks, and just never even bother turning it on by default... but use it for the tests?

I'm guessing this isn't the only API where we are technically supposed to be using USVString. It'd probably be best if we had a single compat flag that enables the checks everywhere, and we don't consider turning it on by default until we've actually audited all our APIs to make sure they are using USVString if they are supposed to.

If we're going to do that it might be easier to have a compatibility flag that updates the kj::String type mapping in general to always perform the invalid character replacement. I think the only web platform APIs we have that would not be USVString are the Header and we already have jsg::ByteString to cover that case. Either way, I agree that this likely does need a compat flag.

Well, if we change it everywhere, then cases where an API round-trips text back to JavaScript now have a side effect that might be unexpected...

isn't that already the case when unpaired surrogates are used tho? For instance, if I have a test like...

struct StringsContext: public ContextGlobalObject { kj::String str(kj::String value) { return kj::mv(value); } JSG_RESOURCE_TYPE(StringsContext) { JSG_METHOD(str); } }; JSG_DECLARE_ISOLATE_TYPE(StringsIsolate, StringsContext); KJ_TEST("...") { Evaluator<StringsContext, StringsIsolate> e(v8System); e.expectEval("const m = '\\ud83c'; str(m) === m", "boolean", "false"); }

The kj::String is not going to roundtrip successfully since while the unpaired surrogate will not be replaced with the FFFD replacement character on the inbound type mapping, it is replaced on the outbound type mapping.

it is replaced on the outbound type mapping.

WTF!

I totally assumed that if V8 was happy to produce WTF-8, it would also accept it, but apparently not. Apparently V8's UTF-8 parser, when presented with a UTF-8 sequence representing an unpaired surrogate -- which is a three-byte sequence -- just replaces each of the three bytes with a replacement character?!?!

Well I guess that means no one is relying on us round-tripping unpaired surrogates after all. But why the heck does V8 only do this validation in one direction and not the other?

Yikes. The inbound case is a mess.

So how about this as a path forward.... Let's

introduce a compat flag to treat kj::String inbound as USVString (replacing invalid sequences).

introduce jsg::UsvString as a more explicit indicator to use replacement that works with or without the compat flag

introduce jsg::DomString as an explicit indicator to use the current inbound type mapping and we can find a way to ensure that it works as expected on the outbound too (that is, that it produces WTF and round trips correctly... this might require a v8 patch).

Incrementally move our existing direct uses of kj::String over to either one of these explicit string types.

I think most of our APIs are going to handle the switch to USVString semantics easily. It's more likely that we have a far smaller set of APIs that may rely on the current behavior.

IMO feels like a lot of work to solve a problem that no one is actually hitting. But I won't stop you.

npaun · 2025-02-14T21:46:05Z

@kentonv The main motivation I had for fixing this is that the underlying library we use for URL stuff, ada-url, says the caller is responsible for never passing invalid UTF-8 characters.

I feel like this test is pretty pedantic but it does seem a bit unpleasant that we'd parse a URL differently than a browser or Node.

kentonv · 2025-02-14T22:08:09Z

ada-url, says the caller is responsible for never passing invalid UTF-8 characters.

I think we need to distinguish between the specific case of invalid surrogate pairs, vs. lower-level invalid UTF-8 byte sequences.

As it stands today, JavaScript strings are UTF-16 -- or more accurately WTF-16, in that they may have unpaired surrogates. We convert them to UTF-8 ourselves; we don't accept UTF-8 from the application. So the only way that our UTF-8 might be "invalid" is specifically unpaired surrogates. Does ada-url barf on unpaired surrogates? Or does it just treat them like any regular 16-bit code point? I would expect the latter, in which case there's no real problem here?

src/workerd/jsg/value.h

jasnell · 2025-02-14T23:18:16Z

src/workerd/api/url-standard.h

+  kj::Maybe<kj::ArrayPtr<const char>> get(jsg::USVString name);
+  kj::Array<kj::ArrayPtr<const char>> getAll(jsg::USVString name);
+  bool has(jsg::Lock& js, jsg::USVString name, jsg::Optional<jsg::USVString> value);
+  void set(jsg::USVString name, jsg::USVString value);


Note: you'll likely need to also update the type generation code to handle USVString

jasnell · 2025-02-15T00:24:23Z

... kj::String accepts whatever you give it. It contains an array of chars. By convention it is supposed to contain UTF-8 chars, but it is not kj::String's job to validate this

Couldn't we have a variation of the kj::String construction that performs the validation and replacement? It wouldn't incur the additional costs and would be explicitly opt in.

auto wellformed = kj::usvstr("...");

auto notWellformed = kj::str("...");
auto wellformed = kj::usvstr(kj::mv(notWellformed));

(the "WellFormed" terminology is borrowed by the recent API additions in JS to accomplish the same)

jasnell · 2025-02-15T00:29:02Z

... On Windows in particular, filenames can contain invalid surrogate pairs. If kj::String aggressively validated surrogate pairs, then people would be unable to use the KJ filesystem APIs to list and manipulate filenames on Windows that contain invalid surrogate pairs

Side note... stuff like this is why Node.js' file system apis accept string or Buffer for filenames. Years ago I had a consulting client in Japan who was hitting up against the fact that their filesystem contained files and directory names that used multiple text encodings (since on some OS's file paths are just byte sequences). Node.js originally treated all file paths as UTF8. Sometimes we do have to just let invalid sequences be invalid sequences without validation or replacement

npaun · 2025-02-21T23:19:13Z

...

npaun · 2025-02-24T16:04:47Z

Hey guys, here's an update on this PR:

Creates USVString and DOMString to allow code to explicitly specify what behaviour it intends.
Reverted the behaviour where jsg::Dict<K = USVString> would silently ignore duplicate keys. Instead jsg::Dict is always a list of pairs and the caller must decide what to do in this case.

I propose we merge this PR at this point and follow-on with further PRs.

Here's what's left to discuss and implement:

Add a compat flag to make kj::String act as a USVString.
Fix the inbound case where V8 replaces a unpaired surrogate with three replacement chars (this might involve fixing stuff on their side).

npaun requested review from jasnell and anonrig February 14, 2025 17:32

npaun requested review from a team as code owners February 14, 2025 17:32

anonrig requested a review from kentonv February 14, 2025 17:33

anonrig reviewed Feb 14, 2025

View reviewed changes

kentonv requested changes Feb 14, 2025

View reviewed changes

jasnell reviewed Feb 14, 2025

View reviewed changes

src/workerd/jsg/value.h Outdated Show resolved Hide resolved

jasnell reviewed Feb 14, 2025

View reviewed changes

npaun force-pushed the npaun/usvstring branch from d4b07f6 to 1aae6a6 Compare February 21, 2025 23:18

Create USVString for methods that cannot accept invalid UTF8 chars

9fd4c88

npaun force-pushed the npaun/usvstring branch from 1aae6a6 to 9fd4c88 Compare February 21, 2025 23:19

npaun requested review from jasnell, anonrig and kentonv February 24, 2025 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554

Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554

npaun commented Feb 14, 2025

anonrig commented Feb 14, 2025

anonrig Feb 14, 2025

npaun commented Feb 14, 2025

anonrig commented Feb 14, 2025

kentonv commented Feb 14, 2025

kentonv left a comment

kentonv Feb 14, 2025

kentonv Feb 14, 2025

jasnell Feb 14, 2025

kentonv Feb 14, 2025

jasnell Feb 15, 2025 •

edited

Loading

kentonv Feb 15, 2025

kentonv Feb 15, 2025

npaun Feb 15, 2025

jasnell Feb 15, 2025 •

edited

Loading

kentonv Feb 15, 2025

npaun commented Feb 14, 2025

kentonv commented Feb 14, 2025

jasnell Feb 14, 2025

jasnell commented Feb 15, 2025 •

edited

Loading

jasnell commented Feb 15, 2025

npaun commented Feb 21, 2025

npaun commented Feb 24, 2025

Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554

Are you sure you want to change the base?

Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554

Conversation

npaun commented Feb 14, 2025

anonrig commented Feb 14, 2025

Choose a reason for hiding this comment

npaun commented Feb 14, 2025

anonrig commented Feb 14, 2025

kentonv commented Feb 14, 2025

kentonv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jasnell Feb 15, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jasnell Feb 15, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

npaun commented Feb 14, 2025

kentonv commented Feb 14, 2025

Choose a reason for hiding this comment

jasnell commented Feb 15, 2025 • edited Loading

jasnell commented Feb 15, 2025

npaun commented Feb 21, 2025

npaun commented Feb 24, 2025

jasnell Feb 15, 2025 •

edited

Loading

jasnell Feb 15, 2025 •

edited

Loading

jasnell commented Feb 15, 2025 •

edited

Loading