Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP: Consume the request body as UInt8Array, not as a string #997

Closed
adamziel opened this issue Feb 5, 2024 · 0 comments · Fixed by #1018
Closed

PHP: Consume the request body as UInt8Array, not as a string #997

adamziel opened this issue Feb 5, 2024 · 0 comments · Fixed by #1018

Comments

@adamziel
Copy link
Collaborator

adamziel commented Feb 5, 2024

The request body is currently passed to the PHP instance as a string:

const size = this[__private__dont__use].lengthBytesUTF8(body);
const heapBodyPointer = this[__private__dont__use].malloc(size + 1);
if (!heapBodyPointer) {
throw new Error('Could not allocate memory for the request body.');
}
// Write the string to the WASM memory
this[__private__dont__use].stringToUTF8(
body,
heapBodyPointer,
size + 1
);

return {
contentType: 'application/x-www-form-urlencoded',
body: new URLSearchParams(post).toString(),
files,
};
} catch (e) {
// ignore
}
}
// Otherwise, grab body as literal text
return {
contentType,
body: await request.clone().text(),
files: {},
};

However, encoding the binary data as string is a lossy, wrong way of passing the request body to WASM. The setRequestBody method should accept a body: string | Uint8Array argument instead.

Internally, JavaScript strings are stored as something that's like UTF-16, but actually it isn't totally because it allows creating strings that cannot be represented in Unicode. They allow an arbitrary stream of bytes to be a string, which mostly works in UTF-16 and does work in UCS-2, but in order to support extended character ranges in Unicode, UTF-16 said that the code points referenced by each surrogate half are invalid and cannot be represented. So U+D83C U+DC00 is an invalid sequence of code points, but the UTF-16 sequence of bytes 0xd83cdc00 is valid and converts to U+1F000. This is why they cannot be split, because you cannot represent U+D83C in UTF-8 or even abstractly in Unicode.

If provided an invalid string this conversion will be lossy. Invalid code points will be converted to U+FFFD and Module.lengthBytesUTF8 will report the number of bytes after converting those code points. E.g. a\ud83cb turns into the string a�b and the length is 1 + 3 + 1 = 5.

Also, consider a string split inside a surrogate pair boundary.

`'I feel 😊.'.slice(0, 8)`

We might expect this to occupy 8 bytes because we split the string at 8 characters, or to occupy 11 bytes because we expected to get the emoji as the 8th character, and it requires four bytes in UTF8, but instead we invalidated the string and receive I feel �, which takes a total of 10 bytes in UTF8: 7 to encode I feel and then 3 to encode the replacement character U+FFFD.

There's a lot more, for sure. The ultimate fix is to implement the UInt8Array approach. Here's a small diff I started exploring that needs more love and attention:

diff --git a/packages/php-wasm/universal/src/lib/universal-php.ts b/packages/php-wasm/universal/src/lib/universal-php.ts
index 0c06be89a..27d934b16 100644
--- a/packages/php-wasm/universal/src/lib/universal-php.ts
+++ b/packages/php-wasm/universal/src/lib/universal-php.ts
@@ -493,7 +493,7 @@ export interface PHPRequest {
 	/**
 	 * Request body without the files.
 	 */
-	body?: string;
+	body?: string | Uint8Array;
 
 	/**
 	 * Form data. If set, the request body will be ignored and
@@ -531,7 +531,7 @@ export interface PHPRunOptions {
 	/**
 	 * Request body without the files.
 	 */
-	body?: string;
+	body?: string | Uint8Array;
 
 	/**
 	 * Uploaded files.
diff --git a/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts b/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
index 5c8e9b334..0f01b91e8 100644
--- a/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
+++ b/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
@@ -226,6 +226,8 @@ async function rewritePost(request: Request) {
 
 			return {
 				contentType: 'application/x-www-form-urlencoded',
+				// @TODO: Check what happes if the submitted multipart value
+				//        consists of arbitrary bytes, not just text.
 				body: new URLSearchParams(post).toString(),
 				files,
 			};
@@ -237,7 +239,7 @@ async function rewritePost(request: Request) {
 	// Otherwise, grab body as literal text
 	return {
 		contentType,
-		body: await request.clone().text(),
+		body: await request.clone().arrayBuffer(),
 		files: {},
 	};
 }
@adamziel adamziel changed the title PHP: Consume request body as UInt8Array, not as a string PHP: Consume the request body as UInt8Array, not as a string Feb 5, 2024
adamziel added a commit that referenced this issue Feb 8, 2024
Removes the custom file upload handler and rely on PHP body parsing
to populate the $_FILES array. Instead of encoding the body bytes as
a string, parsing that string, and re-encoding it as bytes, we keep
the body in a binary form and pass it directly to PHP HEAP memory.

Closes #997
Closes #1006
Closes #914

 ## Testing instructions

Confirm the CI checks pass (it will take a few iterations to get them right I'm sure :D)
adamziel added a commit that referenced this issue Feb 8, 2024
Removes the custom file upload handler and rely on PHP body parsing to
populate the $_FILES array. Instead of encoding the body bytes as a
string, parsing that string, and re-encoding it as bytes, we keep the
body in a binary form and pass it directly to PHP HEAP memory.

Closes #997
Closes #1006
Closes #914
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant