Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

u8ArrayToString creates invalid utf8 strings #78

Closed
austinabell opened this issue Jun 3, 2022 · 8 comments · Fixed by #98
Closed

u8ArrayToString creates invalid utf8 strings #78

austinabell opened this issue Jun 3, 2022 · 8 comments · Fixed by #98
Labels
bug Something isn't working

Comments

@austinabell
Copy link
Contributor

austinabell commented Jun 3, 2022

Since it's unchecked, the strings created by it will be invalid strings. Unsure if this is lossy if JS keeps the internal bytes, but this seems unsafe to do.

The issue seems to be also if you're trying to parse a non unicode/utf8 character for the other side of this conversion that there will be an overflow. Unsure how the jsvm handles this tested and seems this just overflows, so someone could potentially overwrite values they don't have permissions for

@austinabell austinabell added the bug Something isn't working label Jun 3, 2022
@ailisp
Copy link
Member

ailisp commented Jun 6, 2022

It's more like a problem of the documentation. u8ArrayToString is creating bytes, although internally it's still JS String object. So it doesn't guarantee UTF-8 validity, just like you can write any binary sequence with host function write_storage in rust-sdk. Maybe this should be named u8ArrayToBytes and provide a checked version that does u8ArrayToString

@austinabell
Copy link
Contributor Author

No it's not just documentation, you can repro the overflow pretty easily:

let initial = "😇";
console.log(initial);
// 😇
let arr = stringToU8Array(initial);
console.log(arr);
// [61, 7]
let s = u8ArrayToString(arr);
console.log(s);
// =�

@ailisp
Copy link
Member

ailisp commented Jun 7, 2022

If you consider u8ArrayToString is actually u8ArrayToBytes, you know it's the expected behavior

@ailisp
Copy link
Member

ailisp commented Jun 7, 2022

@austinabell Please check the fix, thanks! Also, the new interface will indicate this is no longer a valid input:

let initial = "😇";
console.log(initial);
// 😇
let arr = bytesToU8Array(initial);

Bytes must be a string containing only \x00 to \xff chars.

@austinabell austinabell reopened this Jun 7, 2022
@austinabell
Copy link
Contributor Author

austinabell commented Jun 17, 2022

re-opened because the bug was not fixed:

let initial = "😇";
console.log(initial);
// 😇
let arr = bytesToU8Array(initial);
console.log(arr);
// [61, 7]
let s = u8ArrayToBytes(arr);
console.log(s);
// =�

For more context, JS strings are utf-16 so forcing them to be utf8 by truncation will lead to bugs. Maybe not directly from our tools, but indirectly definitely.

@ailisp
Copy link
Member

ailisp commented Jun 20, 2022

@austinabell
No it's fixed, the snippet above is the unexpected way to use it (I should have this documented in README),

bytesToU8Array expect a bytes, not a JS String. With current API, the safe mode is user should wrap bytes with bytes API: bytesToU8Array(bytes(initial)), and this will fail (bytes will check if every byte in the initial is between --255). For the unsafe use, user need to ensure what passed to bytesToU8Array is really a valid bytes, while the above snippet doesn't hold.

This is like all other JS library which needs to enforce the type. JS doesn't have compile time type check, so either it can be enforced via documentation or runtime checks. For smart contract, runtime check everywhere can be expensive to experienced users, so we provide the both safe and unsafe path.

@ailisp
Copy link
Member

ailisp commented Jun 20, 2022

Also, as you noticed, the function in other way is now called: u8ArrayToBytes instead of u8ArrayToString. It's by design to return bytes instead of string. So the reverse way is also not a bug.

See:
https://github.com/near/near-sdk-js/pull/103/files#diff-3274f1a37032fb0ae4e2823def0007c634e869ae0dfc304ff6a12c36513c3a52
and

- `Bytes` in both arguments and return represent a byte buffer, internally it's a JavaScript String Object. Any binary data `0x00-0xff` is stored as the char '\x00-\xff'. This is because QuickJS doesn't have ArrayBuffer in C API. If the bytes happens to have only 1-byte chars, it happens to be same as the the same content string.

@ailisp
Copy link
Member

ailisp commented Jul 20, 2022

As we intentionally dropped u8ArrayToString, and what bytesToU8Array provided behavior is right, I'm closing this issue. As Austin Abell said, all bytes function can't prevent misuse by passing a string, include bytesToU8Array, which is tracked on #117

@ailisp ailisp closed this as completed Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants