PHP: Consume the request body as UInt8Array, not as a string #997

adamziel · 2024-02-05T09:32:27Z

The request body is currently passed to the PHP instance as a string:

wordpress-playground/packages/php-wasm/universal/src/lib/base-php.ts

Lines 519 to 529 in 73745f2

    
    const size = this[__private__dont__use].lengthBytesUTF8(body);
    const heapBodyPointer = this[__private__dont__use].malloc(size + 1);
    if (!heapBodyPointer) {
        throw new Error('Could not allocate memory for the request body.');
    }
    // Write the string to the WASM memory
    this[__private__dont__use].stringToUTF8(
        body,
        heapBodyPointer,
        size + 1
    );

wordpress-playground/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts

Lines 227 to 242 in 73745f2

    
            return {
                contentType: 'application/x-www-form-urlencoded',
                body: new URLSearchParams(post).toString(),
                files,
            };
        } catch (e) {
            // ignore
        }
    }
    // Otherwise, grab body as literal text
    return {
        contentType,
        body: await request.clone().text(),
        files: {},
    };

However, encoding binary data as a string is lossy and is the wrong way to pass the request body to WASM. The setRequestBody method should accept a body: string | Uint8Array argument instead.
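
To make the loss concrete, here is a minimal sketch of a binary body going through a string and back. It assumes that `request.text()` decodes the body as UTF-8, which `TextDecoder` mirrors here; the sample bytes are made up for illustration.

```typescript
// Bytes that are not valid UTF-8 on their own (0x89 is a stray continuation byte).
const original = new Uint8Array([0x89, 0x50, 0x4e, 0x47]);

// Decode the bytes as UTF-8 text, the way reading a body as text would.
const asText = new TextDecoder('utf-8').decode(original);

// Encoding the string back to bytes does not restore the original data:
// the invalid byte became U+FFFD, which takes three bytes in UTF-8.
const roundTripped = new TextEncoder().encode(asText);

console.log(Array.from(original)); // [137, 80, 78, 71]
console.log(Array.from(roundTripped)); // [239, 191, 189, 80, 78, 71]
```

The body grows from 4 to 6 bytes and the original first byte cannot be recovered.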

Internally, JavaScript strings are sequences of 16-bit code units that resemble UTF-16 but are not strictly UTF-16, because they can hold sequences that cannot be represented in Unicode. Any sequence of 16-bit code units is allowed as a string. That works in UCS-2 and mostly works in UTF-16, but to support code points beyond U+FFFF, UTF-16 reserves the surrogate range, and the code points referenced by each surrogate half are invalid and cannot be represented on their own. So U+D83C U+DC00 is an invalid sequence of code points, but the UTF-16 code unit sequence 0xD83C 0xDC00 is valid and converts to U+1F000. This is why a surrogate pair cannot be split: you cannot represent U+D83C in UTF-8 or even abstractly in Unicode.

If given an invalid string, the conversion to UTF-8 is lossy. Invalid code points are converted to U+FFFD, and Module.lengthBytesUTF8 reports the number of bytes after that conversion. E.g. a\ud83cb turns into the string a�b and the length is 1 + 3 + 1 = 5 bytes.
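
The same replacement can be observed with the standard `TextEncoder`. This is only a sketch of the behavior described above: it uses the platform encoder, not Emscripten's `lengthBytesUTF8` or `stringToUTF8`, and the variable names are illustrative.

```typescript
const encoder = new TextEncoder();

// A lone high surrogate between two ASCII letters is not valid Unicode text.
const loneSurrogate = encoder.encode('a\ud83cb');

// The same high surrogate followed by a low surrogate forms a valid pair.
const validPair = encoder.encode('\ud83c\udc00');

console.log(Array.from(loneSurrogate)); // [97, 239, 191, 189, 98], i.e. a, U+FFFD, b
console.log(Array.from(validPair)); // [240, 159, 128, 128], i.e. U+1F000
console.log('\ud83c\udc00'.codePointAt(0)?.toString(16)); // "1f000"
```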

Also, consider a string split in the middle of a surrogate pair.

`'I feel 😊.'.slice(0, 8)`

We might expect this to occupy 8 bytes because we split the string at 8 characters, or 11 bytes because we expected the emoji to be the 8th character and it requires four bytes in UTF-8. Instead, we invalidated the string and receive I feel �, which takes a total of 10 bytes in UTF-8: 7 to encode "I feel " and 3 to encode the replacement character U+FFFD.
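
A short sketch of the byte counts above, using the standard `TextEncoder` rather than Emscripten's helpers. The `Array.from` variant is added only for contrast: it iterates by code point, so it keeps the surrogate pair together.

```typescript
const text = 'I feel 😊.';
const encoder = new TextEncoder();

// slice() counts UTF-16 code units, so index 8 falls between the two
// halves of the emoji and leaves a lone high surrogate at the end.
const byCodeUnits = text.slice(0, 8);
const codeUnitBytes = encoder.encode(byCodeUnits).length;

// Array.from() iterates by code point, so the emoji survives intact.
const byCodePoints = Array.from(text).slice(0, 8).join('');
const codePointBytes = encoder.encode(byCodePoints).length;

console.log(codeUnitBytes); // 10: 7 for "I feel " + 3 for U+FFFD
console.log(codePointBytes); // 11: 7 for "I feel " + 4 for the emoji
```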

There's a lot more, for sure. The ultimate fix is to implement the Uint8Array approach. Here's a small diff I started exploring that needs more love and attention:

diff --git a/packages/php-wasm/universal/src/lib/universal-php.ts b/packages/php-wasm/universal/src/lib/universal-php.ts
index 0c06be89a..27d934b16 100644
--- a/packages/php-wasm/universal/src/lib/universal-php.ts
+++ b/packages/php-wasm/universal/src/lib/universal-php.ts
@@ -493,7 +493,7 @@ export interface PHPRequest {
 	/**
 	 * Request body without the files.
 	 */
-	body?: string;
+	body?: string | Uint8Array;
 
 	/**
 	 * Form data. If set, the request body will be ignored and
@@ -531,7 +531,7 @@ export interface PHPRunOptions {
 	/**
 	 * Request body without the files.
 	 */
-	body?: string;
+	body?: string | Uint8Array;
 
 	/**
 	 * Uploaded files.
diff --git a/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts b/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
index 5c8e9b334..0f01b91e8 100644
--- a/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
+++ b/packages/php-wasm/web-service-worker/src/initialize-service-worker.ts
@@ -226,6 +226,8 @@ async function rewritePost(request: Request) {
 
 			return {
 				contentType: 'application/x-www-form-urlencoded',
+				// @TODO: Check what happens if the submitted multipart value
+				//        consists of arbitrary bytes, not just text.
 				body: new URLSearchParams(post).toString(),
 				files,
 			};
@@ -237,7 +239,7 @@ async function rewritePost(request: Request) {
 	// Otherwise, grab body as literal text
 	return {
 		contentType,
-		body: await request.clone().text(),
+		body: new Uint8Array(await request.clone().arrayBuffer()),
 		files: {},
 	};
 }
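
On the PHP side, a binary body can be copied into WASM memory without going through stringToUTF8. The following is a minimal sketch, not the actual implementation: it assumes the Emscripten module exposes `HEAPU8` and a `malloc` like the one used in base-php.ts, and `WasmModuleLike`, `writeBodyToHeap`, and `createFakeModule` are hypothetical names introduced for illustration.

```typescript
// Hypothetical minimal view of the Emscripten module (assumption, not the real type).
interface WasmModuleLike {
	HEAPU8: Uint8Array;
	malloc(size: number): number;
}

// Copies the body into the heap and returns where it lives and how long it is.
function writeBodyToHeap(
	module: WasmModuleLike,
	body: string | Uint8Array
): { pointer: number; length: number } {
	// Strings are still accepted, but are encoded to bytes exactly once.
	const bytes =
		typeof body === 'string' ? new TextEncoder().encode(body) : body;
	const pointer = module.malloc(bytes.length + 1);
	if (!pointer) {
		throw new Error('Could not allocate memory for the request body.');
	}
	module.HEAPU8.set(bytes, pointer);
	// Keep a trailing NUL for parity with the string-based code path.
	module.HEAPU8[pointer + bytes.length] = 0;
	return { pointer, length: bytes.length };
}

// Stand-in module with a bump allocator, for demonstration only.
function createFakeModule(heapSize = 64): WasmModuleLike {
	const HEAPU8 = new Uint8Array(heapSize);
	let next = 8; // start above 0 so that 0 can signal allocation failure
	return {
		HEAPU8,
		malloc(size: number): number {
			if (next + size > heapSize) return 0;
			const pointer = next;
			next += size;
			return pointer;
		},
	};
}

const fakeModule = createFakeModule();
const written = writeBodyToHeap(fakeModule, new Uint8Array([0x89, 0x00, 0xff]));
console.log(written); // { pointer: 8, length: 3 }
```

Because a binary body may contain NUL bytes, the consumer would need the explicit length rather than relying on the terminator to find the end of the body.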

Removes the custom file upload handler and relies on PHP's body parsing to populate the $_FILES array. Instead of encoding the body bytes as a string, parsing that string, and re-encoding it as bytes, we keep the body in binary form and pass it directly to PHP HEAP memory. Closes #997. Closes #1006. Closes #914.

Testing instructions: Confirm the CI checks pass (it will take a few iterations to get them right I'm sure :D)


adamziel added [Type] Enhancement New feature or request [Feature] PHP.wasm [Aspect] Service Worker labels Feb 5, 2024

adamziel changed the title ~~PHP: Consume request body as UInt8Array, not as a string~~ PHP: Consume the request body as UInt8Array, not as a string Feb 5, 2024

adamziel mentioned this issue Feb 8, 2024

PHP: Pass request body as UInt8Array #1018

Merged

adamziel closed this as completed in #1018 Feb 8, 2024
