Consistently deal with string encodings over FFI #2520

Manishearth · 2022-09-06T17:03:05Z

Currently we use &str over FFI when we're accepting strings from the user (which is not that common, but crops up in segmenter, collator, etc).

Over FFI, this maps to std::string_view in C++ and a String in JS (that gets autoconverted to UTF8, throwing on surrogate codepoints). string_view is not guaranteed UTF8 in C++.

Eventually I would like Diplomat to allow for:

A config option that forces str to get validated over C++ FFI before being passed in. (This can be done withotu changing APIs)
More string types: one for UTF16, etc
Perhaps a way to differentiate between "assume-validated" strings, "unvalidated" strings, and "diplomat-will-validate-for-you" strings

But for now we shoudl be consistent. The easiest way to do this safely is to always accept &str but to always validate it as valid UTF8 immediately. The alternative is to accept &str and document that it's up to C++ callers to ensure it's UTF-8.

Collator's happening in #2498, but I think we'll need to handle this for bidi and segmenter. It ought to be a matter of simply casting to a byte slice, validating it, and returning an error if not.

cc @sffc @hsivonen

The text was updated successfully, but these errors were encountered:

hsivonen · 2022-09-07T06:17:59Z

I'm not convinced that consistency at the cost of perf is the right way to go. I am in favor of consistency in terms of types, i.e. std::string_view for UTF-8-ish input, but I have doubts about it being good to favor consistency on the point of what memory-safe behavior (early error vs. as-if-replacement-characters) takes places in case of ill-formedness. Notably, an up-front well-formedness check pessimizes the collator more than letting the collator check a string for well-formedness as it goes, because the collator returns early when it finds a primary-level difference early in the string.

Is it schedule-wise infeasible for 1.0 to have a Diplomat directive that would allow a binding to choose from two options for &str-as-argument treatment for a given method?

Check well-formedness on Diplomat layer error out early, otherwise route to &str back end (for segmenter and bidi; I gather they don't have the back end code to deal with potentially-ill-formed UTF-8)
Don't check well-formedness and route to the back end's &[u8] flavor when the back end can deal (for normalizer and collator, which are both designed to deal with the three cases: guaranteed-well-formed UTF-8, potentially-ill-formed UTF-8, and potentially-ill-formed UTF-16).

hsivonen · 2022-09-07T06:20:09Z

Also, would the check go on the C++ layer only or on the C/FFI layer? I.e. would JS go through a conversion from UTF-16 to guaranteed-well-format UTF-8 and still undergo a UTF-8 well-formedness check subsequently.

Manishearth · 2022-09-07T06:31:10Z

At the moment? It would have to go through the FFI layer. In the future we can make this configurable on Diplomat so that it automatically does this for C++.

(It is not a hard change to do and I am open to having that configurability for C++ now as well, however I personally have higher priority things to work on and this can be done post-1.0 without breaking things so I'd rather punt.)

hsivonen · 2022-09-07T06:50:20Z

Considering that the normalizer and the collator have been designed to be able to deal with all three of

guaranteed-well-formed UTF-8
potentially-ill-formed UTF-8
potentially-ill-formed UTF-16

performance-wise optimally, I'm uncomfortable with not letting that show through to the intended beneficiaries. Even though there are actual perf problems in comparison with ICU4C even given perfect FFI, it would be sad for ICU4X normalizer & collator to be perceived as slower than they have to be perceived as due to extra validation that doesn't logically have to be there.

(But then, my current priorities don't allow for committing to implementing the requisite Diplomat configurability at this time.)

Manishearth · 2022-09-07T14:06:25Z

I don't see how this issue precludes any of that.

For APIs that support different kinds of inputs we can have different methods. We can even add the unvalidated one later if we don't want to start with

We can also fix the problem with the validation cost later.

Neither of these things are breaking.

We do not have time to do the ideal FFI for 1.0, this issue is about what we need to do for 1.0. Fortunately this can be fixed shortly afterwards without breaking, provided we prep for it and that's what this is about.

Manishearth · 2022-09-07T18:48:00Z

Okay, so here's a plan, which I discussed a bit with @sffc

The general idea is that for 1.0 we will expose std::string_view APIs but not assume valid UTF8 (except in one case around bidi), plus additional utf16/etc APIs if desired. In 1.x we will have the Diplomat support to avoid any double-validation cost and expose more thoughtful per-language APIs.

The 1.x plans will need rust-diplomat/diplomat#240 (plus more configurability around string types) , rust-diplomat/diplomat#233, and perhaps the renaming part of rust-diplomat/diplomat#234. 1.0 plans are doable today with some minor fixes to ICU4X APIs.

There are three categories of stringy APIs today:

If potentially invalid utf8 is supported as an API

For collator and normalizer, and segmenter. Segmenter doesn't support potentially invalud utf8 now, but it can, by using https://docs.rs/utf8_iter/latest/utf8_iter/)

Using #2498 compare as an example, but would also apply to the normalizer functions:

For 1.0:

foo() takes &str (std::string_view in cpp), use the potentially invalid utf8 path
foo_utf16() takes &[u16] (std::span<uint16_t>), assumes potentially invalid

For 1.x:

foo() takes &[u8] (on the Rust side) but still maps to std::string_view and String (via some annotation) in C++ and JS, uses the potentially invalid utf8 path
- We can use function-renaming so that the JS function can use the valid UTF8 codepath instead (i.e., we'd have foo_valid_utf8() in JS but renamed to foo() and foo() is disabled.
foo_valid_utf8() takes &str, maps to std::string_view in C++. Is disabled or renamed for JS.
foo_utf16() takes &[u16], maps to nice u16 string types for C++ and JS, still assumes potentially invalid since we do not require surrogate pairs
foo_valid_utf16() could be added in the future though it's not something asked for yet and I don't think we need it

APIs that need UTF8

ListFormatter is an example of this, as well as FixedDecimal, Locale, etc. FsDataProvider should move over to bytes. We may for now exempt bidi from this and have it just documented as requiring valid UTF8.

For 1.0:

Accept &str, but revalidate in Rust bridge code, JS eats double validation cost. ListFormatter needs to allocate anyway so this isn't a huge deal.

If possible, use a bytes API instead (e.g. FixedDecimal, Locale, PluralOperands, etc), exposed as &str but not assumed to be valid.

For 1.x:

Potentially two APIs: default API accepts &[u8] but maps to idiomatic string types over FFI, pays validation cost only if necessary (so not in JS). Additional API can exist for C++ that assumes valid UTF8.

Stringy APIs around identifiers

These go to TinyStr, we should just ensure we use TinyStr::from_bytes(), and switch over to std::string_view &[u8]. After 1.x we can have Diplomat generate &[u8] APIs that map to idiomatic string types over FFI to prevent footguns.

hsivonen · 2022-09-08T06:19:16Z

The UTF-8 part of the above makes sense. Thank you.

Why do you say &[u16] needs to be validated? What UTF-16-taking APIs do we have that don't already deal with unpaired surrogates? Both the collator and the normalizer deal with ill-formed UTF-16 by as-iffing unpaired surrogates to U+FFFD, so up-front validation is unwanted. Do callers that have guaranteed-well-formed UTF-16 even exist?

Manishearth · 2022-09-08T14:59:44Z

Why do you say &[u16] needs to be validated? What UTF-16-taking APIs do we have that don't already deal with unpaired surrogates? Both the collator and the normalizer deal with ill-formed UTF-16 by as-iffing unpaired surrogates to U+FFFD, so up-front validation is unwanted. Do callers that have guaranteed-well-formed UTF-16 even exist?

Sorry, I should rephrase, when I say "needs to be validated", I mean "should not be assumed to be valid", which in our case means we don't validate (but we also don't write code assuming validity)

sffc · 2022-09-08T18:15:49Z

@younies - How the Bi-di algorithm works is that no matter the encoding, it gives levels to chunks. So the algorithm seems easy to update.
@Manishearth - The APIs would need to be updated, too. Everything uses &str but we need it to take &[u8].
@younies - That sounds like something doable.
@Manishearth - Yeah, we'd just need to design that layer. It's an API design question. It's not technically challenging; it's challenging from a library design point of view.
@younies - OK; we can file a bug and do it post-1.0.
@robertbastian - I'm wondering what our motivation is to support invalid UTF-8. That seems like it is catering for bad programming practices.
@Manishearth - Invalid UTF-8 exists; in C++ it's not the norm for strings to be guaranteed valid UTF-8. It's a Rust safety thing. The situation is different from UTF-16; languages like Java and JavaScript don't guarantee that the UTF-16 is valid. So this reflects norms over FFI into other languages. I recognize that as ICU4X we should try to push the needle, but we shouldn't sacrifice safety. We deal with untrusted string inputs. It's a matter of norms.
@sffc - It's like GIGO behavior, not dissimilar to data provider. It's better to have GIGO than UB.

Manishearth · 2022-09-09T23:22:06Z

With #2534 and #2542, the 1.0 plan has been fully implemented. This is now a 1.x issue.

Removed all panics #2520

#2520

Manishearth added this to the ICU4X 1.0 (Final) milestone Sep 6, 2022

Manishearth mentioned this issue Sep 7, 2022

Add FFI bindings for Collator #2498

Merged

echeran mentioned this issue Sep 7, 2022

Provide a C++ interface to the collator #2218

Closed

Manishearth added the discuss-priority Discuss at the next ICU4X meeting label Sep 8, 2022

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Sep 8, 2022

Manishearth self-assigned this Sep 8, 2022

Manishearth changed the title ~~Consistently use "revalidated " &strs over FFI~~ Consistently deal with string encodings over FFI Sep 8, 2022

Manishearth mentioned this issue Sep 8, 2022

Sensibly handle invalid strings across FFI #2534

Merged

Manishearth modified the milestones: ICU4X 1.0 (Final), ICU4X 1.x Untriaged Sep 9, 2022

sffc added the C-ffi-infra Component: Diplomat, horizontal FFI label Dec 22, 2022

robertbastian self-assigned this Nov 22, 2023

robertbastian mentioned this issue Nov 22, 2023

Bump Diplomat and use DiplomatChar #4349

Merged

robertbastian added a commit that referenced this issue Nov 22, 2023

Bump Diplomat and use DiplomatChar (#4349)

6f1a04b

Removed all panics #2520

robertbastian mentioned this issue Nov 22, 2023

Bump Diplomat and use DiplomatStr[16] #4353

Merged

robertbastian added a commit that referenced this issue Nov 28, 2023

Bump Diplomat and use DiplomatStr[16] (#4353)

afc612d

#2520

robertbastian closed this as completed Nov 28, 2023

robertbastian added a commit that referenced this issue Dec 12, 2023

Unifying FFI string documentation (#4444)

c37276b

#2520

robertbastian mentioned this issue Apr 23, 2024

Remove last panic with FFI encoding #4837

Merged

robertbastian added a commit that referenced this issue Apr 23, 2024

Remove last panic with FFI encoding (#4837)

4313e85

#2520

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistently deal with string encodings over FFI #2520

Consistently deal with string encodings over FFI #2520

Manishearth commented Sep 6, 2022

hsivonen commented Sep 7, 2022

hsivonen commented Sep 7, 2022

Manishearth commented Sep 7, 2022

hsivonen commented Sep 7, 2022 •

edited

Loading

Manishearth commented Sep 7, 2022

Manishearth commented Sep 7, 2022 •

edited

Loading

hsivonen commented Sep 8, 2022

Manishearth commented Sep 8, 2022

sffc commented Sep 8, 2022

Manishearth commented Sep 9, 2022 •

edited

Loading

Consistently deal with string encodings over FFI #2520

Consistently deal with string encodings over FFI #2520

Comments

Manishearth commented Sep 6, 2022

hsivonen commented Sep 7, 2022

hsivonen commented Sep 7, 2022

Manishearth commented Sep 7, 2022

hsivonen commented Sep 7, 2022 • edited Loading

Manishearth commented Sep 7, 2022

Manishearth commented Sep 7, 2022 • edited Loading

If potentially invalid utf8 is supported as an API

APIs that need UTF8

Stringy APIs around identifiers

hsivonen commented Sep 8, 2022

Manishearth commented Sep 8, 2022

sffc commented Sep 8, 2022

Manishearth commented Sep 9, 2022 • edited Loading

hsivonen commented Sep 7, 2022 •

edited

Loading

Manishearth commented Sep 7, 2022 •

edited

Loading

Manishearth commented Sep 9, 2022 •

edited

Loading