
Headers with non-ASCII characters silently disappear #81

Closed
wolfgang42 opened this issue Oct 12, 2024 · 3 comments · Fixed by #83
wolfgang42 commented Oct 12, 2024

Steps to reproduce

Take this example WARC, produced with wget --no-verbose -O /dev/null --max-redirect=0 --warc-file=example 'https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)'. It contains the line:

Location: https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ

Actual behavior

However, trying to read that header produces null:

import {WARCParser} from 'warcio'

for await (const record of new WARCParser(process.stdin)) {
	if (record.warcType !== 'response') continue
	console.log(record.httpHeaders.headers.get('Location'))
}
// => null

Expected behavior

I am unclear on what the “correct” behavior here is. RFC 7230 leads me to believe that the server probably shouldn't have sent a non-ASCII value to begin with, but since it did, this header should arguably have been interpreted as ISO-8859-1 mojibake. I can’t find anything in the Fetch spec that describes these values as anything other than a “byte sequence”, though in practice Chrome, Firefox, and Node all limit values to U+00FF and below. Unfortunately, since this behavior is in the wild, I have to handle it anyway.

However, I do know that the header silently disappearing is very confusing and definitely not what I want to happen; it makes this problem hard to debug. I would have much preferred a crash, which would have let me immediately identify the cause.

Workaround

Setting keepHeadersCase: true when creating the WARCParser makes it use a plain Map instead of a Headers object; the Map doesn’t have the same limitation on values. Unfortunately, this loses the case-insensitive lookup.
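If it helps anyone else hitting this, here is a small sketch of how case-insensitive lookup could be restored on top of the Map. The helper name is mine, not part of warcio; it just scans the Map's keys case-insensitively:

```javascript
// Hypothetical helper: case-insensitive lookup over the plain Map that
// warcio returns when `keepHeadersCase: true` is set.
function getHeaderCaseInsensitive(headers, name) {
  const target = name.toLowerCase();
  for (const [key, value] of headers) {
    if (key.toLowerCase() === target) return value;
  }
  return null;
}

// Example, with a plain Map standing in for the parsed headers:
const headers = new Map([["Location", "https://example.com/"]]);
console.log(getHeaderCaseInsensitive(headers, "location")); // "https://example.com/"
```

This is O(n) per lookup, but header maps are small enough that it shouldn't matter in practice.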

Cause

The silent disappearance appears to be because of this code:

try {
  if (canAppend && name.toLowerCase() === "set-cookie") {
    headers.append(name, value);
  } else {
    headers.set(name, value);
  }
} catch (_e) {
  // ignore
}

...which is swallowing the error (“TypeError: Cannot convert argument to a ByteString because the character at index 204 has a value of 12513 which is greater than 255.”), along with anything else that might go wrong.

This behavior seems to have been there since 4061003:

try {
  headers.set(name, value);
} catch (e) {
  // try to sanitize value, removing newlines
  //headers.set(name, value.replace(/[\r\n]+/g, ', '));
}

...and since that was the initial commit, there's no obvious explanation for why it exists.

Suggested fix

This is enough for my use case:

-try { 
+// If this crashes with an error like "String contains non ISO-8859-1 code point"
+// or "character has a value which is greater than 255", the response contains a
+// header which cannot be represented by the Headers class; enable `keepHeadersCase`
+// to use a Map (which does not have this restriction) instead.
   if (canAppend && name.toLowerCase() === "set-cookie") { 
     headers.append(name, value); 
   } else { 
     headers.set(name, value); 
   } 
-} catch (_e) { 
-  // ignore 
-} 

However, this may be a problem for users whose code depends on the current behavior, who are fine with that, and who don't want to rewrite their code to handle lookups in a Map. For that use case it might be necessary to thread through a flag (call it discardInvalidHeaders, say) that keeps the current behavior. (Either way I think the default should change, which would mean bumping the semver major version.)
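To make the suggestion concrete, here is a rough sketch of how such a flag could gate the existing try/catch. The flag name comes from the suggestion above; the function shape and everything else is illustrative, not warcio's actual code:

```javascript
// Sketch: gate the error-swallowing behind an opt-in flag.
// `discardInvalidHeaders` is the hypothetical flag name from the issue text.
function setHeader(headers, name, value, { discardInvalidHeaders = false } = {}) {
  try {
    if (name.toLowerCase() === "set-cookie") {
      headers.append(name, value); // Set-Cookie may legitimately repeat
    } else {
      headers.set(name, value);
    }
  } catch (e) {
    // Old behavior: silently drop headers that Headers can't represent.
    // New default: surface the error so the caller can see what went wrong.
    if (!discardInvalidHeaders) throw e;
  }
}
```

With the flag off (the proposed default), a non-ASCII value like the one in this issue throws the underlying ByteString TypeError instead of vanishing; with the flag on, the current skip-it behavior is preserved.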

ikreymer (Member) commented:

Thanks for reporting this! Yes, I think the initial approach was to skip invalid headers, but as you mention, that's not a good approach. I think a better approach, and what we've done in the past, is to re-encode the header as a latin1/ASCII-compatible encoding, which should always be possible.

It appears to be what the standard Node fetch implementation does as well:
nodejs/undici#1560

Here's a proposed approach, using a locally defined decodeLatin1 so that it can work in browsers as well:

function decodeLatin1(buf) {
  let str = "";
  for (let i = 0; i < buf.length; i++) {
    str += String.fromCharCode(buf[i]);
  }
  return str;
}

const d = decodeLatin1(new TextEncoder().encode("https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ"));

// this matches the headers from node fetch:
const a = await fetch("https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)", {redirect: "manual"});
assert(a.headers.get("location") === d);

wolfgang42 (Author) commented:

Wow, thanks for the quick response!

Aha, not sure why re-encoding didn’t occur to me but I think it makes perfect sense.

One minor nit: “latin1” is unfortunately an ambiguous encoding name. Node uses it for the 1:1 byte-to-code-point mapping, but modern web browsers follow the WHATWG Encoding spec, which defines it as a synonym for windows-1252 instead. Even when it refers to the ISO 8859-1 standard, it’s sometimes interpreted as an encoding that is technically missing definitions for the C0/C1 control characters (the values for them are simply left undefined); the whole thing is a mess. I usually call what we want here the “codepoint identity mapping”, since there doesn’t seem to be an unambiguous official name for this encoding.
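To illustrate the ambiguity, here's how the different readings of “latin1” disagree on a single byte. This runs in Node, which exposes both the WHATWG-spec TextDecoder and Buffer's identity-mapping “latin1”:

```javascript
// One byte, multiple "latin1"s.
const bytes = new Uint8Array([0x80]);

// WHATWG Encoding (browsers, and Node's TextDecoder): the "latin1" label is
// a synonym for windows-1252, where byte 0x80 maps to U+20AC.
console.log(new TextDecoder("latin1").decode(bytes)); // "€"

// Node's Buffer "latin1" is the code-point identity mapping: 0x80 stays U+0080.
console.log(Buffer.from(bytes).toString("latin1")); // "\u0080"

// The identity mapping spelled out by hand (what decodeLatin1 above does):
console.log(String.fromCharCode(bytes[0]) === "\u0080"); // true
```

Same byte, two different characters, depending on which “latin1” you ask.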

I can submit a PR for this at some point but it may take me a while to get around to it, so feel free to just make the change if you have the time before I do.

ikreymer added a commit that referenced this issue Nov 7, 2024
…with encodeURIComponent-encoded value instead of just skipping the header

fixes #81

ikreymer commented Nov 7, 2024

On second thought, perhaps using encodeURI() is a better option: it's fairly standard for encoding Unicode strings and easy to decode back. (That was probably the intent of the example you provided.)

Added that in #83, hopefully that works.
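Roughly, the idea (a sketch, not the actual code from #83): percent-encode non-ASCII code points as UTF-8 so the stored value is pure ASCII and can be handed to the Headers class.

```javascript
// Sketch of the encodeURI() approach; the value below is a made-up example.
const value = "https://example.com/page#AUR_メタデータ";
const encoded = encodeURI(value);

console.log(/^[\x00-\x7F]*$/.test(encoded)); // true: ASCII-safe for Headers
console.log(decodeURI(encoded) === value);   // true here (value contains no "%")
```

One caveat: encodeURI() leaves existing “%” characters untouched, so if the original value already contains percent-escapes (as the Location header in this issue does), decodeURI() of the re-encoded string is not guaranteed to reproduce the original bytes.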

Edit: back to using encodeLatin1 after further discussion, for compatibility with fetch() and a more accurate representation.
