Feature request: Consider an UTF-16 code units length validator #250
Comments
Fixable by #245

I didn't realise it was used for the textarea...

This sounds quite ergonomic to me.
Example inconsistency: a 🖐🏽 = two Unicode code points (two UTF-16 code units per each):

```js
const emoji = "🖐🏽";
console.log(emoji.length); // 4
```

Sending this data as UTF-8 to a Rust backend and validating for the same length fails. For the Rust validation to succeed, the emojis need to be doubled to a count of 10 to satisfy the validator.

UTF-8 uses a varying number (1 to 4) of one-byte code units, depending on the encoded code point's Unicode plane. Emoji use 4 UTF-8 code units (4 bytes) per code point. Therefore, in UTF-8, each 🖐🏽 takes 8 code units.

As it turns out, the current length validator (using https://doc.rust-lang.org/std/str/struct.Chars.html) seems to count code points, not code units, requiring 10 🖐🏽 to satisfy a length of 20. Implementation: https://doc.rust-lang.org/src/core/str/count.rs.html

If it counted code units, not code points (as I would have expected, TBH), then the length of each 🖐🏽 would be 8 in UTF-8. Of course it also does not count graphemes, otherwise 20 🖐🏽 would be needed. I had assumed both counted code units.

Implementation

Nevertheless, https://doc.rust-lang.org/std/primitive.str.html#method.encode_utf16 "returns an iterator of u16 over the string encoded as UTF-16." These u16 are obviously code units, which does match JavaScript's length count. The correct implementation for enforcing string lengths consistent with HTML and JavaScript therefore appears to be str::encode_utf16().count().

Naming

Given that Rust, with glorious superiority, counts chars as code points, I would propose a name that refers to UTF-16 code units rather than to chars.
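For concreteness, the three counts already differ for the single emoji above. A minimal sketch using only the standard library (the asserted values follow from 🖐🏽 being U+1F590 plus U+1F3FD, each a 4-byte, surrogate-pair code point):

```rust
fn main() {
    let emoji = "🖐🏽"; // U+1F590 + U+1F3FD

    // Code points: what the current length validator counts.
    assert_eq!(emoji.chars().count(), 2);
    // UTF-16 code units: what JavaScript's .length reports.
    assert_eq!(emoji.encode_utf16().count(), 4);
    // UTF-8 code units, i.e. bytes.
    assert_eq!(emoji.len(), 8);
}
```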
The length calculations differ in more than just the UTF variant: the length validator counts Unicode code points, while HTML's maxlength counts UTF-16 code units. A "param for a mode: utf-16, utf-8, bytes etc" would need to distinguish between code units and code points. I.e. you would need modes for UTF-8 code units (bytes), UTF-16 code units, code points, and perhaps graphemes; see the sketch below. Not sure if most of them have any common case for usage. Therefore: it might be more straightforward to add a specific validator for validating the length of HTML form input using UTF-16 code units.
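To show how those candidate modes diverge on the same input, a sketch assuming the third-party unicode-segmentation crate for the grapheme count (the other counts are standard library):

```rust
use unicode_segmentation::UnicodeSegmentation; // crate: unicode-segmentation

fn main() {
    let s = "🖐🏽".repeat(5); // five emoji, each two code points

    println!("UTF-8 code units (bytes): {}", s.len());                   // 40
    println!("UTF-16 code units:        {}", s.encode_utf16().count());  // 20
    println!("code points:              {}", s.chars().count());         // 10
    println!("grapheme clusters:        {}", s.graphemes(true).count()); // 5
}
```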
I'm not sure. Having stuff like …

Watchers of this issue might like to learn that the …
The current validator crate provides built-in validators for various use cases, but it lacks a validator for checking the length of a string based on its UTF-16 code units. This feature request proposes the addition of a UTF-16 code units length validator to the crate.
The motivation behind this request stems from the need to match the behavior of the HTML textarea maxlength attribute, which counts UTF-16 code units. To provide better consistency between frontend and backend validation, it would be useful to have a validator that directly checks the length of a string based on its UTF-16 code units. The new validator could be used as follows:
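The original usage snippet did not survive extraction; what follows is a hypothetical sketch of the proposed syntax. The attribute name `length_utf16` is an assumption for illustration only, not an API the crate provides:

```rust
use validator::Validate;

#[derive(Validate)]
struct CommentForm {
    // Hypothetical attribute: rejects strings longer than N UTF-16
    // code units, mirroring <textarea maxlength="N"> on the frontend.
    #[validate(length_utf16(max = N))]
    body: String,
}
```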
Replace N with the desired UTF-16 code unit count. Use the same N for the HTML textarea. N would be max bytes / 2, as UTF-16 code units are 2 bytes long.

This new validator would ensure that the code unit count limits are consistent between the HTML textarea and Rust, despite the different character encodings used, and avoid false negatives.
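Until such a built-in validator exists, the same rule can be enforced with a custom validation function. A minimal sketch, assuming the validator crate's custom-validator support (the exact attribute syntax for wiring a custom function into a derive varies between crate versions) and an assumed limit of 20 for illustration:

```rust
use validator::ValidationError;

// Counts UTF-16 code units, matching JavaScript's String.length and the
// HTML maxlength attribute; pair with <textarea maxlength="20">.
fn validate_utf16_max(value: &str) -> Result<(), ValidationError> {
    const MAX_UTF16_UNITS: usize = 20; // assumed limit for illustration
    if value.encode_utf16().count() > MAX_UTF16_UNITS {
        return Err(ValidationError::new("length_utf16"));
    }
    Ok(())
}
```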