
Headers with non-ASCII characters silently disappear #81

Closed
wolfgang42 opened this issue Oct 12, 2024 · 3 comments · Fixed by #83
wolfgang42 commented Oct 12, 2024

Steps to reproduce

Take this example WARC, produced with wget --no-verbose -O /dev/null --max-redirect=0 --warc-file=example 'https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)'. It contains the line:

Location: https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ

Actual behavior

However, trying to read that header produces null:

import {WARCParser} from 'warcio'

for await (const record of new WARCParser(process.stdin)) {
	if (record.warcType !== 'response') continue
	console.log(record.httpHeaders.headers.get('Location'))
}
// => null

Expected behavior

I am unclear on what the “correct” behavior here is. RFC 7230 leads me to believe that the server probably shouldn't have sent a non-ASCII value to begin with, but since it did, this header should arguably have been interpreted as ISO-8859-1 mojibake. I can’t find anything in the Fetch spec that describes these values as anything other than a “byte sequence”, though in practice Chrome, Firefox, and Node all limit values to U+00FF and below. Unfortunately, since this behavior is in the wild, I have to handle it anyway.

However, I do know that the header silently disappearing is very confusing and definitely not what I want to happen; it makes this problem hard to debug. I would have much preferred a crash, which would have let me immediately identify the cause.

Workaround

Setting keepHeadersCase: true when creating the WARCParser makes it use a plain Map instead of a Headers object; the Map doesn’t have the same limitation on values. Unfortunately, this loses the case-insensitive lookup.
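If it helps anyone else hitting this, here is a small sketch of how case-insensitive lookup could be restored on top of the Map. The helper name is mine, not part of warcio; it just scans the Map's keys case-insensitively:

```javascript
// Hypothetical helper: case-insensitive lookup over the plain Map that
// warcio returns when `keepHeadersCase: true` is set.
function getHeaderCaseInsensitive(headers, name) {
  const target = name.toLowerCase();
  for (const [key, value] of headers) {
    if (key.toLowerCase() === target) return value;
  }
  return null;
}

// Example, with a plain Map standing in for the parsed headers:
const headers = new Map([["Location", "https://example.com/"]]);
console.log(getHeaderCaseInsensitive(headers, "location")); // "https://example.com/"
```

This is O(n) per lookup, but header maps are small enough that it shouldn't matter in practice.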

Cause

The silent disappearance appears to be because of this code:

try {
  if (canAppend && name.toLowerCase() === "set-cookie") {
    headers.append(name, value);
  } else {
    headers.set(name, value);
  }
} catch (_e) {
  // ignore
}

...which is swallowing the error (“TypeError: Cannot convert argument to a ByteString because the character at index 204 has a value of 12513 which is greater than 255.”), along with anything else that might go wrong.

This behavior seems to have been there since 4061003:

try {
  headers.set(name, value);
} catch (e) {
  // try to sanitize value, removing newlines
  //headers.set(name, value.replace(/[\r\n]+/g, ', '));
}

...and since that was the initial commit, there's no obvious explanation for why it exists.

Suggested fix

This is enough for my use case:

-try { 
+// If this crashes with an error like "String contains non ISO-8859-1 code point"
+// or "character has a value which is greater than 255", the response contains a
+// header which cannot be represented by the Headers class; enable `keepHeadersCase`
+// to use a Map (which does not have this restriction) instead.
   if (canAppend && name.toLowerCase() === "set-cookie") { 
     headers.append(name, value); 
   } else { 
     headers.set(name, value); 
   } 
-} catch (_e) { 
-  // ignore 
-} 

However, this may be a problem for users whose code depends on the current behavior, who are fine with that, and who don't want to rewrite their code to handle lookups in a Map. For that use case it might be necessary to thread through a flag (call it discardInvalidHeaders, say) that keeps the current behavior. (Either way I think the default should change, which would mean bumping the semver major version.)
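To make the suggestion concrete, here is a rough sketch of how such a flag could gate the existing try/catch. The flag name comes from the suggestion above; the function shape and everything else is illustrative, not warcio's actual code:

```javascript
// Sketch: gate the error-swallowing behind an opt-in flag.
// `discardInvalidHeaders` is the hypothetical flag name from the issue text.
function setHeader(headers, name, value, { discardInvalidHeaders = false } = {}) {
  try {
    if (name.toLowerCase() === "set-cookie") {
      headers.append(name, value); // Set-Cookie may legitimately repeat
    } else {
      headers.set(name, value);
    }
  } catch (e) {
    // Old behavior: silently drop headers that Headers can't represent.
    // New default: surface the error so the caller can see what went wrong.
    if (!discardInvalidHeaders) throw e;
  }
}
```

With the flag off (the proposed default), a non-ASCII value like the one in this issue throws the underlying ByteString TypeError instead of vanishing; with the flag on, the current skip-it behavior is preserved.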

ikreymer (Member) commented:

Thanks for reporting this! Yes, I think the initial approach was to skip invalid headers, but as you mention, that's not a good approach. I think a better approach, and what we've done in the past, is to re-encode the header as a latin1/ASCII-compatible encoding, which should always be possible.

It appears to be what the standard Node fetch implementation does as well:
nodejs/undici#1560

Here's a proposed approach, using a locally defined decodeLatin1 so that it can work in browsers as well:

function decodeLatin1(buf) {
  let str = "";
  for (let i = 0; i < buf.length; i++) {
    str += String.fromCharCode(buf[i]);
  }
  return str;
}

const d = decodeLatin1(new TextEncoder().encode("https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ"));

// this matches the headers from node fetch:
const a = await fetch("https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)", {redirect: "manual"});
assert(a.headers.get("location") === d);

wolfgang42 (Author) commented:

Wow, thanks for the quick response!

Aha, not sure why re-encoding didn’t occur to me but I think it makes perfect sense.

One minor nit: “latin1” is unfortunately an ambiguous encoding name. Node uses it for the 1:1 byte-to-code-point mapping, but modern web browsers follow the WHATWG Encoding spec, which defines it as a synonym for windows-1252 instead. Even when it refers to the ISO 8859-1 standard, it’s sometimes interpreted as an encoding that is technically missing definitions for the C0/C1 control characters (the values for them are simply left undefined); the whole thing is a mess. I usually call what we want here the “codepoint identity mapping”, since there doesn’t seem to be an unambiguous official name for this encoding.
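To illustrate the ambiguity, here's how the different readings of “latin1” disagree on a single byte. This runs in Node, which exposes both the WHATWG-spec TextDecoder and Buffer's identity-mapping “latin1”:

```javascript
// One byte, multiple "latin1"s.
const bytes = new Uint8Array([0x80]);

// WHATWG Encoding (browsers, and Node's TextDecoder): the "latin1" label is
// a synonym for windows-1252, where byte 0x80 maps to U+20AC.
console.log(new TextDecoder("latin1").decode(bytes)); // "€"

// Node's Buffer "latin1" is the code-point identity mapping: 0x80 stays U+0080.
console.log(Buffer.from(bytes).toString("latin1")); // "\u0080"

// The identity mapping spelled out by hand (what decodeLatin1 above does):
console.log(String.fromCharCode(bytes[0]) === "\u0080"); // true
```

Same byte, two different characters, depending on which “latin1” you ask.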

I can submit a PR for this at some point but it may take me a while to get around to it, so feel free to just make the change if you have the time before I do.

ikreymer added a commit that referenced this issue Nov 7, 2024
…with encodeURIComponent-encoded value instead of just skipping the header

fixes #81

ikreymer commented Nov 7, 2024

On second thought, perhaps using encodeURI() is a better option: it's fairly standard for encoding Unicode strings and easy to decode back. (That was probably the intent of the example you provided.)

Added that in #83, hopefully that works.
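Roughly, the idea (a sketch, not the actual code from #83): percent-encode non-ASCII code points as UTF-8 so the stored value is pure ASCII and can be handed to the Headers class.

```javascript
// Sketch of the encodeURI() approach; the value below is a made-up example.
const value = "https://example.com/page#AUR_メタデータ";
const encoded = encodeURI(value);

console.log(/^[\x00-\x7F]*$/.test(encoded)); // true: ASCII-safe for Headers
console.log(decodeURI(encoded) === value);   // true here (value contains no "%")
```

One caveat: encodeURI() leaves existing “%” characters untouched, so if the original value already contains percent-escapes (as the Location header in this issue does), decodeURI() of the re-encoded string is not guaranteed to reproduce the original bytes.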

Edit: back to using encodeLatin1 after further discussion, for compatibility with fetch() and a more accurate representation.
