-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistently deal with string encodings over FFI #2520
Comments
I'm not convinced that consistency at the cost of perf is the right way to go. I am in favor of consistency in terms of types, i.e. Is it schedule-wise infeasible for 1.0 to have a Diplomat directive that would allow a binding to choose from two options for
|
Also, would the check go on the C++ layer only or on the C/FFI layer? I.e. would JS go through a conversion from UTF-16 to guaranteed-well-format UTF-8 and still undergo a UTF-8 well-formedness check subsequently. |
At the moment? It would have to go through the FFI layer. In the future we can make this configurable on Diplomat so that it automatically does this for C++. (It is not a hard change to do and I am open to having that configurability for C++ now as well, however I personally have higher priority things to work on and this can be done post-1.0 without breaking things so I'd rather punt.) |
Considering that the normalizer and the collator have been designed to be able to deal with all three of
performance-wise optimally, I'm uncomfortable with not letting that show through to the intended beneficiaries. Even though there are actual perf problems in comparison with ICU4C even given perfect FFI, it would be sad for ICU4X normalizer & collator to be perceived as slower than they have to be perceived as due to extra validation that doesn't logically have to be there. (But then, my current priorities don't allow for committing to implementing the requisite Diplomat configurability at this time.) |
I don't see how this issue precludes any of that. For APIs that support different kinds of inputs we can have different methods. We can even add the unvalidated one later if we don't want to start with We can also fix the problem with the validation cost later. Neither of these things are breaking. We do not have time to do the ideal FFI for 1.0, this issue is about what we need to do for 1.0. Fortunately this can be fixed shortly afterwards without breaking, provided we prep for it and that's what this is about. |
Okay, so here's a plan, which I discussed a bit with @sffc The general idea is that for 1.0 we will expose The 1.x plans will need rust-diplomat/diplomat#240 (plus more configurability around string types) , rust-diplomat/diplomat#233, and perhaps the renaming part of rust-diplomat/diplomat#234. 1.0 plans are doable today with some minor fixes to ICU4X APIs. There are three categories of stringy APIs today: If potentially invalid utf8 is supported as an APIFor collator and normalizer, and segmenter. Segmenter doesn't support potentially invalud utf8 now, but it can, by using https://docs.rs/utf8_iter/latest/utf8_iter/) Using #2498 For 1.0:
For 1.x:
APIs that need UTF8ListFormatter is an example of this, as well as FixedDecimal, Locale, etc. FsDataProvider should move over to bytes. We may for now exempt For 1.0: Accept If possible, use a bytes API instead (e.g. FixedDecimal, Locale, PluralOperands, etc), exposed as For 1.x: Potentially two APIs: default API accepts Stringy APIs around identifiersThese go to TinyStr, we should just ensure we use |
The UTF-8 part of the above makes sense. Thank you. Why do you say |
Sorry, I should rephrase, when I say "needs to be validated", I mean "should not be assumed to be valid", which in our case means we don't validate (but we also don't write code assuming validity) |
|
Currently we use
&str
over FFI when we're accepting strings from the user (which is not that common, but crops up in segmenter, collator, etc).Over FFI, this maps to
std::string_view
in C++ and aString
in JS (that gets autoconverted to UTF8, throwing on surrogate codepoints).string_view
is not guaranteed UTF8 in C++.Eventually I would like Diplomat to allow for:
str
to get validated over C++ FFI before being passed in. (This can be done withotu changing APIs)But for now we shoudl be consistent. The easiest way to do this safely is to always accept
&str
but to always validate it as valid UTF8 immediately. The alternative is to accept&str
and document that it's up to C++ callers to ensure it's UTF-8.Collator's happening in #2498, but I think we'll need to handle this for bidi and segmenter. It ought to be a matter of simply casting to a byte slice, validating it, and returning an error if not.
cc @sffc @hsivonen
The text was updated successfully, but these errors were encountered: