However, encoding binary data as a string is a lossy and incorrect way to pass the request body to WASM. The `setRequestBody` method should accept a `body: string | Uint8Array` argument instead.
Internally, JavaScript strings are stored as something close to UTF-16, but not quite: a JavaScript string can hold sequences that cannot be represented in Unicode. Any sequence of 16-bit code units is a legal string, which works in UCS-2 and mostly works in UTF-16. To support code points beyond U+FFFF, however, UTF-16 reserves the surrogate range, and the code points named by each surrogate half are invalid on their own. So U+D83C U+DC00 is an invalid sequence of code points, but the UTF-16 byte sequence 0xd83cdc00 is valid and decodes to U+1F000. This is why a surrogate pair cannot be split: U+D83C cannot be represented in UTF-8, or even abstractly in Unicode.
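To make the surrogate arithmetic concrete, here is a minimal sketch. The helper name `combineSurrogates` is made up for illustration; it only applies the standard UTF-16 pairing formula and compares the result with the built-in `codePointAt`.

```typescript
// Combine a high and a low surrogate into the code point they encode
// (standard UTF-16 formula; the function name is illustrative only).
function combineSurrogates(high: number, low: number): number {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const pair = '\ud83c\udc00';
const codePoint = combineSurrogates(pair.charCodeAt(0), pair.charCodeAt(1));
// codePoint equals pair.codePointAt(0): both are 0x1f000.

// A lone high surrogate is still a legal JavaScript string of length 1,
// but it is not well-formed UTF-16: codePointAt returns the surrogate itself.
const lone = '\ud83c';
const loneValue = lone.codePointAt(0);
```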
If given such an invalid string, the conversion to UTF-8 is lossy. Invalid code points are converted to U+FFFD, and `Module.lengthBytesUTF8` reports the number of bytes after that conversion. E.g. `a\ud83cb` turns into the string `a�b`, and the length is 1 + 3 + 1 = 5.
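The replacement behavior can be observed with the standard `TextEncoder` and `TextDecoder`. This sketch uses them as a stand-in and makes no claim about how `Module.lengthBytesUTF8` is implemented internally.

```typescript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// A string containing a lone high surrogate between two ASCII letters.
const invalid = 'a\ud83cb';

// TextEncoder replaces the lone surrogate with U+FFFD (bytes ef bf bd).
const bytes = encoder.encode(invalid);

// Decoding the bytes does not give the original string back.
const roundTripped = decoder.decode(bytes);
```

Here `bytes` holds five bytes (`61 ef bf bd 62`), and `roundTripped` is `a\ufffdb` rather than the input, so the original surrogate is gone for good.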
Also, consider a string split in the middle of a surrogate pair:
`'I feel 😊.'.slice(0, 8)`
We might expect this to occupy 8 bytes, because we split the string at 8 characters, or 11 bytes, because we expected the emoji to be the 8th character and it requires four bytes in UTF-8. Instead, we invalidated the string and received `I feel �`, which takes a total of 10 bytes in UTF-8: 7 to encode `I feel ` and then 3 to encode the replacement character U+FFFD.
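The byte counts above can be checked with a short sketch. `utf8Length` is a hypothetical helper built on the standard `TextEncoder`; slicing by code point via `Array.from` is shown only for contrast.

```typescript
// Hypothetical helper: UTF-8 byte length as computed by TextEncoder.
const utf8Length = (s: string): number => new TextEncoder().encode(s).length;

const full = 'I feel 😊.';

// slice() counts UTF-16 code units, so this keeps only the high surrogate
// of the emoji: 7 code units of text plus one lone surrogate.
const cut = full.slice(0, 8);

// Array.from() iterates by code point, so the emoji stays intact.
const safe = Array.from(full).slice(0, 8).join('');

const cutBytes = utf8Length(cut);   // 7 + 3 (U+FFFD)
const safeBytes = utf8Length(safe); // 7 + 4 (the emoji)
```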
There's a lot more, for sure. The ultimate fix is to implement the `Uint8Array` approach. Here's a small diff I started exploring that needs more love and attention:
diff --git a/packages/php-wasm/universal/src/lib/universal-php.ts b/packages/php-wasm/universal/src/lib/universal-php.ts
index 0c06be89a..27d934b16 100644
--- a/packages/php-wasm/universal/src/lib/universal-php.ts
+++ b/packages/php-wasm/universal/src/lib/universal-php.ts
@@ -493,7 +493,7 @@ export interface PHPRequest {
/**
* Request body without the files.
*/
- body?: string;
+ body?: string | Uint8Array;
/**
* Form data. If set, the request body will be ignored and
@@ -531,7 +531,7 @@ export interface PHPRunOptions {
/**
* Request body without the files.
*/
- body?: string;
+ body?: string | Uint8Array;
/**
* Uploaded files.
diff --git a/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts b/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
index 5c8e9b334..0f01b91e8 100644
--- a/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
+++ b/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
@@ -226,6 +226,8 @@ async function rewritePost(request: Request) {
return {
contentType: 'application/x-www-form-urlencoded',
+ // @TODO: Check what happens if the submitted multipart value
+ // consists of arbitrary bytes, not just text.
body: new URLSearchParams(post).toString(),
files,
};
@@ -237,7 +239,7 @@ async function rewritePost(request: Request) {
// Otherwise, grab body as literal text
return {
contentType,
- body: await request.clone().text(),
+ body: await request.clone().arrayBuffer(),
files: {},
};
}
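One detail the diff leaves open: `Request.arrayBuffer()` resolves to an `ArrayBuffer`, not a `Uint8Array`, so the value would still need wrapping to match the `string | Uint8Array` type. A minimal sketch of such a normalization step follows; `toBodyBytes` is a hypothetical name, not part of the php-wasm codebase.

```typescript
type RequestBody = string | Uint8Array;

// Hypothetical helper: normalize any accepted body form to raw bytes.
function toBodyBytes(body: RequestBody | ArrayBuffer): Uint8Array {
  if (typeof body === 'string') {
    // Strings are encoded as UTF-8; lone surrogates become U+FFFD.
    return new TextEncoder().encode(body);
  }
  if (body instanceof Uint8Array) {
    return body;
  }
  // Wrap an ArrayBuffer (e.g. from request.arrayBuffer()) without copying.
  return new Uint8Array(body);
}

// Arbitrary binary data that is not valid UTF-8.
const binary = new Uint8Array([0xff, 0xfe, 0x00, 0x80]);

// Going through text mangles the bytes...
const viaText = toBodyBytes(new TextDecoder().decode(binary));
// ...while passing the buffer through keeps them intact.
const viaBuffer = toBodyBytes(binary.buffer);
```

In this example the four input bytes grow to ten after the text round trip, because each invalid byte is replaced by a three-byte U+FFFD, while the buffer path preserves all four bytes unchanged.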
adamziel changed the title from "PHP: Consume request body as UInt8Array, not as a string" to "PHP: Consume the request body as UInt8Array, not as a string" on Feb 5, 2024
Removes the custom file upload handler and relies on PHP body parsing
to populate the $_FILES array. Instead of encoding the body bytes as
a string, parsing that string, and re-encoding it as bytes, we keep
the body in binary form and pass it directly to PHP HEAP memory.
Closes #997. Closes #1006. Closes #914.
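As a rough illustration of passing bytes directly to heap memory, here is a sketch against a made-up module shape. The `HeapModule` interface, `writeBytesToHeap`, and the toy allocator are all assumptions for illustration and do not reflect the actual php-wasm or Emscripten module API.

```typescript
// Hypothetical minimal view of a WASM module; not the actual php-wasm API.
interface HeapModule {
  HEAPU8: Uint8Array;
  malloc(size: number): number;
}

// Copy raw bytes into the module's heap and return pointer plus length.
// The length is returned explicitly because binary bodies may contain
// NUL bytes, so the receiver cannot rely on a terminator.
function writeBytesToHeap(
  mod: HeapModule,
  bytes: Uint8Array
): { ptr: number; length: number } {
  const ptr = mod.malloc(bytes.length);
  mod.HEAPU8.set(bytes, ptr);
  return { ptr, length: bytes.length };
}

// Toy bump allocator standing in for a real module, for illustration only.
function createFakeModule(size: number): HeapModule {
  const HEAPU8 = new Uint8Array(size);
  let next = 8; // keep address 0 unused, like a null pointer
  return {
    HEAPU8,
    malloc(n: number): number {
      if (next + n > size) throw new Error('out of memory');
      const ptr = next;
      next += n;
      return ptr;
    },
  };
}

const fake = createFakeModule(64);
const payload = new Uint8Array([0x00, 0xff, 0xd8, 0x3c]);
const written = writeBytesToHeap(fake, payload);
```

No string conversion happens anywhere on this path, so bytes such as `0x00` or `0xff` arrive unchanged.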
## Testing instructions
Confirm the CI checks pass (it will take a few iterations to get them right I'm sure :D)
The request body is currently passed to the PHP instance as a string:
wordpress-playground/packages/php-wasm/universal/src/lib/base-php.ts
Lines 519 to 529 in 73745f2
wordpress-playground/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
Lines 227 to 242 in 73745f2