Headers are unnecessary encoded to `ByteString` if there is no `Transfer-Encoding: chunked` header #42579

AlttiRi · 2022-04-02T18:29:09Z

Version

v17.5.0

Platform

Windows 10

Subsystem

http

What steps will reproduce the bug?

Run this http server:

import http from "http";

const name = `Rock & roll 音楽 («🎵🎶»).txt`; // Also is used as the response body

const defaultCdHeader = getContentDisposition(name);
// console.log(defaultCdHeader === Buffer.from(`inline; filename="${name}"`).toString("binary")); // true

const host = "localhost";
const port = 8000;
const origin = `http://${host}:${port}`;
const server = http.createServer(requestListener);
server.listen(port, host, () => {
    console.log("Server is running on:     " + origin + "       (open the web page)");
    console.log("Downloads the file:       " + origin + "/?dl=1");
    console.log("Remove Transfer-Encoding: " + origin + "/?te=0" + " (open the web page)");
    console.log("Downloads w/o T-E header: " + origin + "/?dl=1&te=0");
});

function requestListener(req, res) {
    let cdHeader = defaultCdHeader;
    const {dl, te} = Object.fromEntries(new URL(origin + req.url).searchParams.entries());
    if (dl === "1") {
        cdHeader = getContentDisposition(name, {type: "attachment"});
    }

    res.setHeader("Content-Disposition", cdHeader);
    // The header must be a `ByteString`. For example, I can't do that:
    // res.setHeader("Header-X", name); // TypeError [ERR_INVALID_CHAR]: Invalid character in header content ["Header-X"]

    if (te === "0") { // Note: "Transfer-Encoding: chunked" is set by default.
        res.removeHeader("Transfer-Encoding"); // Any other TE header's value has the same effect as the header removing
        const byteCount = new TextEncoder().encode(name).length;
        res.setHeader("Content-Length", byteCount);
    }
    res.setHeader("Content-Type", "text/html; charset=utf-8");
    res.writeHead(200);
    res.end(name);
}

// Note: In case of this issue you can ignore the follow function code. It just returns C-D header as `ByteString`.
/**
 * Simple implementation for getting "Content-Disposition" header from filename
 * @example
 * // By default, it produces the same result as it (with replacing all double-quotes by underscore):
 * Buffer.from(`inline; filename="${name}"`).toString("binary")
 *
 * @param {string} name
 * @param {Object} opts
 * @param {"inline"|"attachment"} [opts.type="inline"]
 * @param {Boolean} [opts.encoded=false]
 * @param {Boolean} [opts.filename=true]
 * @param {String} [opts.replacer="_"]
 * @return {string}
 */
function getContentDisposition(name, opts = {}) {
    const {
        type, encoded, filename, replacer
    } = Object.assign({type: "inline", encoded: false, filename: true, replacer: "_"}, opts);
    const fixName = (name) => name.replaceAll(`"`, replacer); // The most trivial fix, since it uses the quoted filename. Prevents from the incorrect header parsing. For, example: `";"` (3 chars in ` quotes).
    const encodeMap = new Map([["'", "%27"], ["(", "%28"], [")", "%29"], ["*", "%30"]]); // Required to escape it for old browsers (For example, a Chromium of 2013)
    const getEncodedName = (name) => encodeURIComponent(name).replaceAll(/['()*]/g, (val) => encodeMap.get(val));
    const encodedStr = encoded ? `filename*=UTF-8''${getEncodedName(name)}` : "";
    const filenameStr = filename ? `filename="${fixName(name)}"` : "";
    const header = [type, filenameStr, encodedStr].filter(x => x).join("; ");
    return Buffer.from(header).toString("binary");
}

Click on the link http://localhost:8000/?dl=1 to download a file with Rock & roll 音楽 («🎵🎶»).txt name.
Click on the link http://localhost:8000/?dl=1&te=0 to download a file with Rock & roll 音楽 («🎵🎶»).txt name (in this case "Transfer-Encoding" header will be removed)

How often does it reproduce? Is there a required condition?

Always.

What is the expected behavior?

Both files are downloaded with Rock & roll 音楽 («🎵🎶»).txt name.

What do you see instead?

The first file has the correct name — Rock & roll 音楽 («🎵🎶»).txt.
The second one has wrong name — Rock & roll é_³æ¥½ (Â«ð__µð__¶Â»).txt

Additional information

TL;DR

If there is "Transfer-Encoding: chunked" header (exactly chunked) setHeader works properly, it sets the input header (ByteString) as is.
(Note: "Transfer-Encoding: chunked" is set by default.)

In any other case it additionally (unnecessary) encodes the header to ByteString.
So, the header is encoded twice, that is wrong.

Additional info

The most of HTTP headers are contains only ASCII characters. But when you need to put in a header (For example, "Content-Disposition", or any custom header) a string that contains non-ASCII* character(s), you can't just put it in as issetHeader.
For example:

// TypeError [ERR_INVALID_CHAR]: Invalid character in header content ["Header-X"]
res.setHeader("Content-Disposition", `attachment; filename="Rock & roll 音楽 («🎵🎶»).txt"`);

A HTTP header is a Binary String (ByteString) — UTF-8 bytes within String object.*

There is no problem with the headers which contain only ASCII characters, since ASCII charset is subset of UTF-8 and Latin 1 encodings, so toByteString(ASCIIString) === ASCIIString.

To get a ByteString from USVString you just need to take UTF-8 bytes from an input string then represent them in Latin 1 (ISO-8859-1) encoding.

For example, in Node.js:

function toByteString(string) {
    return Buffer.from(string).toString("binary"); // or "latin1"
}

*To be honest, the entire quote of [ByteString](https://webidl.spec.whatwg.org/#idl-ByteString:

Such sequences might be interpreted as UTF-8 encoded strings [RFC3629] or strings in some other 8-bit-per-code-unit encoding, although this is not required.

As I can see, a browser also can detect if the string is "just" 8859-1, not UTF-8 bytes encoded in 8859-1.

So, both "Content-Disposition" headers are ~~valid~~ "valid"**:

res.setHeader("Content-Disposition", `attachment; filename="¡«£»÷ÿ.png"`); // Correct ONLY for the some certain browser's languages
res.setHeader("Content-Disposition", `attachment; filename="${toByteString("¡«£»÷ÿ.png")}"`); // Always is correct

The result in ~~both~~ "both"** cases is a file with "¡«£»÷ÿ.png"** name, even while "¡«£»÷ÿ.png" !== toByteString("¡«£»÷ÿ.png").

UPDATE:
**Using non-UTF-8 bytes ("some other 8-bit-per-code-unit encoding") in ByteString is browser/OS language dependent!

For example, in Firefox with non-EN language using of "¡«£»÷ÿ.png" as is (without toByteString()) results to ��.png filename, instead of ¡«£»÷ÿ.png
In Chrome it will be Ў«Ј»чя.png for Cyrillic.

So, I think it (using of 8859-1 in "usual way") should be highly unrecommended.
Headers should always be a ByteString with only UTF-8 bytes represented as 8859-1 (Latin 1).

Problem

The problem is that I can't correctly set a header that is a ByteString (UTF-8 bytes in Latin 1) if the original string contains non-ASCII characters.
Like the other servers do it.

The problem appears only when the Transfer-Encoding: chunked header (which is present by default) is removed (or changed).
In this case setHeader encodes the header to Binary String.

That is unnecessary, since it's already a ByteString.

It's not possible to put in setHeader a USVString, since in this case it will throw TypeError [ERR_INVALID_CHAR]: Invalid character in header content error.

So, the header is encoded to "binary" twice, and browsers download the file with the wrong filenames:
Rock & roll é_³æ¥½ (Â«ð__µð__¶Â»).txt instead of Rock & roll 音楽 («🎵🎶»).txt.

You can open the demo server with disabled Transfer-Encoding: chunked header (http://localhost:8000/?te=0 ) and check it:

function bSrt2Str(bString) {
    return new TextDecoder().decode(binaryStringToArrayBuffer(bString));
}
function binaryStringToArrayBuffer(binaryString) {
    const u8Array = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        u8Array[i] = binaryString.charCodeAt(i);
    }
    return u8Array;
}
let cd = (await fetch("")).headers.get("Content-Disposition");
console.log(cd);
console.log(bSrt2Str(cd));
console.log(bSrt2Str(bSrt2Str(cd)));

The header is encoded twice!

Examples

A lot of forums encodes headers such way for the attached files (XenForo, vBulletin, for example).

The real life examples:

Oh, wait, it requires an account, if you don't have/(want to create an account), just use my demo server.
Anyway, just look at the screenshots below.

In the browser console you can verify that header are ByteString:

function bSrt2Str(bString) {
    return new TextDecoder().decode(binaryStringToArrayBuffer(bString));
}
function binaryStringToArrayBuffer(binaryString) {
    const u8Array = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        u8Array[i] = binaryString.charCodeAt(i);
    }
    return u8Array;
}
let cd = (await fetch("")).headers.get("Content-Disposition");
console.log(cd);
console.log(bSrt2Str(cd));

As a bonus, here is an example of Java server made with ServerSocket which also works properly:

Main.java

package com.company;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        try (ServerSocket serverSocket = new ServerSocket(8000)) {
            System.out.println("Sever run on http://127.0.0.1:8000");
            while (true) {
                Socket socket = serverSocket.accept();
                try (BufferedReader input = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter output = new PrintWriter(socket.getOutputStream())) {
                    while (input.ready()) { // Print headers
                        System.out.println(input.readLine());
                    }
                    String name = "Rock & roll 音楽 («\uD83C\uDFB5\uD83C\uDFB6»).txt";
                    System.out.println(name);

                    output.println("HTTP/1.1 200 OK");
                    output.println("Content-Type: text/html; charset=utf-8");
                    output.println("Content-Disposition: attachment; filename=\"" + name + "\"");
                    output.println();
                    output.println(name);
                    output.flush();
                }
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

The text was updated successfully, but these errors were encountered:

benjamingr · 2022-04-04T08:09:57Z

@nodejs/http

ShogunPanda · 2022-04-06T09:49:21Z

Hello.
I have triaged this issue.

You are correct in saying the that Transfer-Encoding header make node behave differently: when TE is present, the headers are sent with the first chunk and the chunk encoding is binary.

When TE is not present, the headers are sent with the data but there is no encoding specified and Node defaults to UTF-8.
Your file name, which you binary encoded to latin1, is then re-interpreted as utf-8 and that's why the double encoding happened.

However, the problem is not in how Node processes the headers but in the value it self. No spec mandates (at least for what I could find) that non UTF-8 should be converted as latin1/binary and then the client should/might reinterpret them. My suggestion is to encode them using RFC8187 (which you already have in your code when encoded=true but somehow you don't do by default).

I have filed a PR that explicit this in the docs.
Thanks for your report!

xtx1130 · 2022-04-06T10:42:39Z

For RFC8187:

Producers MUST use the "UTF-8" ([RFC3629]) character encoding.

@ShogunPanda we should use encodeURIComponent & decodeURIComponent to deal with these Special characters, right?

AlttiRi · 2022-04-06T10:54:58Z

I see that PHP-based servers work this way — they encode headers (it fact, it's only needed for non-ASCII characters) as UTF-8 bytes within Latin1.

It's not something new.
It works fine with a 2013 browser (The older browsers that I found.)

Why I can't do the same with node.js?
To produce the same header, as it do PHP forum servers, like I do in my Java example above.

encoded=true is just a workaround.

However, it should work even without this.
And it will work if Node.js will not encode the headers itself if TE is missed.
It already expects Latin1 as a input string. A valid header value. But it additionally encodes it.

AlttiRi · 2022-04-06T11:07:11Z

When TE is not present, the headers are sent with the data but there is no encoding specified and Node defaults to UTF-8.

No spec mandates (at least for what I could find) that non UTF-8 should be converted as latin1/binary and then the client should/might reinterpret them.

So, it feels like a bug.
It encodes headers to UTF-8, while it's not needed.

Clients expect headers as is — as Latin1/binary (which may contain either UTF-8 bytes, or some 8-bit encoding), but Node.js (w/o TE header) encodes them in UTF-8.

Is it not a bug?

ShogunPanda · 2022-04-06T11:47:46Z

For RFC8187:

Producers MUST use the "UTF-8" ([RFC3629]) character encoding.

@ShogunPanda we should use encodeURIComponent & decodeURIComponent to deal with these Special characters, right?

Yes, encodeURIComponent on server->client, and in theory decodeURIComponent when receiving a client request.

ShogunPanda · 2022-04-06T11:54:02Z

So, it feels like a bug. It encodes headers to UTF-8, while it's not needed.

Clients expect headers as is — as Latin1/binary (which may contain either UTF-8 bytes, or some 8-bit encoding), but Node.js (w/o TE header) encodes them in UTF-8.

Here's the thing. The client expect ASCII only. Not binary. The big difference is that technically the RFC7230 only allows US-ASCII (so from 0 to 127) while we're using latin1 (0 to 255). The client might not understand the extension at all.
Putting latin1 encoded UTF-8 is arbitrary and needs agreement with the client.

Is it not a bug?
I see that PHP-based servers work this way — they encode headers (it fact, it's only needed for non-ASCII characters) as UTF-8 bytes within Latin1.

It's not something new. It works fine with a 2013 browser (The older browsers that I found.)

In my opinion supporting that is a deviation from the spec.

Why I can't do the same with node.js? To produce the same header, as it do PHP forum servers, like I do in my Java example above.

However, it should work even without this. And it will work if Node.js will not encode the headers itself if TE is missed. It already expects Latin1 as a input string. A valid header value. But it additionally encodes it.

That's another thing.
In node we have a optimization that sends headers along with the first part of the body (or the whole body). Only the headers are in latin1, body is not.
If we try to remove the optimization (and I tried when triaging this bug) we solve the issue but benchmarks show a nearly 100% decrease in speed, which we cannot afford for a edge case like this.

Once again, I'd like to remark that RFC8187 is the newer standard available (2018), built on top of older RFCs. Consider this to be correct case and any other client support as hackish, since it's not regulated anywhere.

PR-URL: nodejs#42624 Fixes: nodejs#42579 Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>

PR-URL: #42624 Fixes: #42579 Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>

PR-URL: nodejs/node#42624 Fixes: nodejs/node#42579 Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>

AlttiRi mentioned this issue Apr 2, 2022

Can't correctly set ByteString headers with non-ASCII chars fastify/fastify#3808

Closed

2 tasks

VoltrexKeyva added the http Issues or PRs related to the http subsystem. label Apr 2, 2022

AlttiRi mentioned this issue Apr 5, 2022

fetch fails on ByteString header with non-ASCII chars nodejs/undici#1317

Closed

ShogunPanda mentioned this issue Apr 6, 2022

doc: add note about header values encoding #42624

Closed

ShogunPanda closed this as completed in dfc2dc8 Apr 8, 2022

This was referenced Jan 28, 2023

Header encoding changes, leading to ERR_INVALID_CHAR fastify/fastify-reply-from#287

Closed

http: unify the header treatment #46395

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Headers are unnecessary encoded to `ByteString` if there is no `Transfer-Encoding: chunked` header #42579

Headers are unnecessary encoded to `ByteString` if there is no `Transfer-Encoding: chunked` header #42579

AlttiRi commented Apr 2, 2022 •

edited

Loading

benjamingr commented Apr 4, 2022

ShogunPanda commented Apr 6, 2022 •

edited

Loading

xtx1130 commented Apr 6, 2022

AlttiRi commented Apr 6, 2022 •

edited

Loading

AlttiRi commented Apr 6, 2022 •

edited

Loading

ShogunPanda commented Apr 6, 2022

ShogunPanda commented Apr 6, 2022 •

edited

Loading

Headers are unnecessary encoded to ByteString if there is no Transfer-Encoding: chunked header #42579

Headers are unnecessary encoded to ByteString if there is no Transfer-Encoding: chunked header #42579

Comments

AlttiRi commented Apr 2, 2022 • edited Loading

Version

Platform

Subsystem

What steps will reproduce the bug?

How often does it reproduce? Is there a required condition?

What is the expected behavior?

What do you see instead?

Additional information

TL;DR

Additional info

Problem

Examples

benjamingr commented Apr 4, 2022

ShogunPanda commented Apr 6, 2022 • edited Loading

xtx1130 commented Apr 6, 2022

AlttiRi commented Apr 6, 2022 • edited Loading

AlttiRi commented Apr 6, 2022 • edited Loading

ShogunPanda commented Apr 6, 2022

ShogunPanda commented Apr 6, 2022 • edited Loading

Headers are unnecessary encoded to `ByteString` if there is no `Transfer-Encoding: chunked` header #42579

Headers are unnecessary encoded to `ByteString` if there is no `Transfer-Encoding: chunked` header #42579

AlttiRi commented Apr 2, 2022 •

edited

Loading

ShogunPanda commented Apr 6, 2022 •

edited

Loading

AlttiRi commented Apr 6, 2022 •

edited

Loading

AlttiRi commented Apr 6, 2022 •

edited

Loading

ShogunPanda commented Apr 6, 2022 •

edited

Loading