Headers with non-ASCII characters silently disappear #81
Comments
Thanks for reporting this! Yes, I think the initial approach was to skip invalid headers, but that's not a good approach, as you mention. I think a better approach, and what we've done in the past, is to re-encode the header value in a latin1 / ASCII-compatible encoding, which should always be possible. It appears to be what the standard Node fetch implementation does as well. Here's a proposed approach, using a locally defined decodeLatin1 so that it can work in browsers as well:
Wow, thanks for the quick response! Aha, not sure why re-encoding didn’t occur to me but I think it makes perfect sense.
One minor nit: “latin1” is unfortunately an ambiguous encoding name. Node uses it for the 1:1 byte:unicode mapping, but modern web browsers follow the WHATWG spec, which defines it as a synonym for Windows-1252 instead. Even when it refers to the ISO 8859-1 standard it’s sometimes interpreted as an encoding which is technically missing definitions for the C0/C1 control characters (the values for them are simply left undefined); the whole thing is a mess. I usually just call what we want here the “codepoint identity mapping” since there doesn’t seem to be an unambiguous official name for this encoding.
I can submit a PR for this at some point but it may take me a while to get around to it, so feel free to just make the change if you have the time before I do.
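To make the “codepoint identity mapping” concrete, here is a minimal sketch of it in TypeScript. The function names are hypothetical and chosen for illustration; this is not the decodeLatin1 / encodeLatin1 code from the project, only one way such a mapping could be written without relying on any named encoding.

```typescript
// Hypothetical helper names for illustration; not the warcio.js API.

// Map each byte 0x00-0xFF to the code point with the same number.
function decodeCodepointIdentity(bytes: Uint8Array): string {
  let text = "";
  for (let i = 0; i < bytes.length; i++) {
    text += String.fromCharCode(bytes[i]);
  }
  return text;
}

// Inverse mapping; fails loudly instead of dropping data.
function encodeCodepointIdentity(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError(`code unit ${code} at index ${i} does not fit in one byte`);
    }
    bytes[i] = code;
  }
  return bytes;
}

// UTF-8 bytes of U+30E1, as they might appear raw in a header line.
const rawBytes = new Uint8Array([0xe3, 0x83, 0xa1]);
// Three code points: U+00E3, U+0083, U+00A1. All are <= 0xFF.
const headerValue = decodeCodepointIdentity(rawBytes);
// The original bytes are recoverable, so nothing is lost.
const recovered = encodeCodepointIdentity(headerValue);
```

The byte 0x83 is where the naming matters: under the identity mapping it becomes U+0083, whereas Windows-1252 assigns 0x83 to U+0192, so a decoder that treats “latin1” as Windows-1252 would produce a value that a strict header store still rejects.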
…with encodeURIComponent-encoded value instead of just skipping the header fixes #81
Back to using encodeLatin1 after further discussion, for compatibility with fetch() and a more accurate representation.
Steps to reproduce
Take this example WARC, produced with:
wget --no-verbose -O /dev/null --max-redirect=0 --warc-file=example 'https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)'
It contains the line:
Actual behavior
However, trying to read that header produces null:
Expected behavior
I am unclear on what the “correct” behavior here is (RFC 7230 leads me to believe that the server probably shouldn't have sent a non-ASCII value to begin with, but since it did this header should have been interpreted as ISO-8859-1 mojibake; I can’t find anything in the Fetch spec that describes these values as anything other than a “byte sequence” though in practice Chrome, Firefox, and Node all limit values to U+00FF and below). Unfortunately since this behavior is in the wild I have to handle it anyway.
However, I do know that the header silently disappearing is very confusing and definitely not what I want to happen; it makes this problem hard to debug. I would have much preferred a crash, which would have let me immediately identify the cause.
Workaround
Setting keepHeadersCase: true when creating the WARCParser uses a plain Map instead of a Headers, which doesn’t have the same limitation on values. Unfortunately this loses the case-insensitive aspect.
Cause
The silent disappearance appears to be because of this code:
warcio.js/src/lib/statusandheaders.ts
Lines 142 to 150 in fb1ff9c
...which is swallowing the error (“TypeError: Cannot convert argument to a ByteString because the character at index 204 has a value of 12513 which is greater than 255.”), along with anything else that might go wrong.
This behavior seems to have been there since 4061003:
warcio.js/src/statusandheaders.js
Lines 80 to 85 in 4061003
...and since that was the initial commit, there's no obvious explanation for why it exists.
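The failure mode can be reproduced in isolation. The sketch below is not the code from statusandheaders.ts; it uses a plain Map with a hand-written check as a stand-in for the ByteString restriction (no code unit above 255), and the header name and value are invented for illustration. All function names are hypothetical.

```typescript
// Index of the first UTF-16 code unit above 0xFF, or -1 if every unit fits in a byte.
function firstNonByteStringIndex(value: string): number {
  for (let i = 0; i < value.length; i++) {
    if (value.charCodeAt(i) > 0xff) {
      return i;
    }
  }
  return -1;
}

// Stand-in for a strict header store: rejects values the way a ByteString conversion would.
function strictSet(store: Map<string, string>, name: string, value: string): void {
  const index = firstNonByteStringIndex(value);
  if (index !== -1) {
    throw new TypeError(
      `character at index ${index} has a value of ${value.charCodeAt(index)} which is greater than 255`
    );
  }
  store.set(name.toLowerCase(), value);
}

// The pattern under discussion: a catch-all that drops the header without a trace.
function setSwallowingErrors(store: Map<string, string>, name: string, value: string): void {
  try {
    strictSet(store, name, value);
  } catch (e) {
    // Silently ignored: the caller cannot tell that anything went wrong.
  }
}

const store = new Map<string, string>();
// Invented header; "\u30e1" is code unit 12513, the value named in the error message.
setSwallowingErrors(store, "X-Example", "abc\u30e1");
// The header is simply absent, so a lookup yields null with no error to explain why.
const lookedUp = store.get("x-example") ?? null;
```

Calling strictSet directly with the same arguments throws a TypeError naming index 3 and value 12513, which is the kind of immediate signal the report asks for.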
Suggested fix
This is enough for my use case:
However, this may be a problem for users who have code that depends on the current behavior, are fine with that, and don't want to have to rewrite their code to handle lookups in a Map. For that use case it might be necessary to thread through a flag (call it discardInvalidHeaders, say) to keep the current behavior. (Either way I think the defaults should change, which would be a bump to the semver major version.)
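For callers who do end up with a plain Map, the lost case-insensitivity can be recovered with a small lookup helper. This is a rough sketch assuming the Map has string keys and string values; getHeaderIgnoreCase is a hypothetical name, not something warcio.js provides, and the sample headers are invented.

```typescript
// Hypothetical helper: case-insensitive lookup over a case-preserving Map.
// Returns the first matching value in insertion order, or null if none matches.
function getHeaderIgnoreCase(headers: Map<string, string>, name: string): string | null {
  const wanted = name.toLowerCase();
  let found: string | null = null;
  headers.forEach((value, key) => {
    if (found === null && key.toLowerCase() === wanted) {
      found = value;
    }
  });
  return found;
}

// Invented sample data standing in for parsed headers.
const parsedHeaders = new Map<string, string>([
  ["Content-Type", "text/html"],
  ["Content-Length", "42"],
]);

const contentType = getHeaderIgnoreCase(parsedHeaders, "content-type");
const missing = getHeaderIgnoreCase(parsedHeaders, "etag");
```

Unlike Headers, a Map can hold two keys that differ only in case; this helper returns the first one inserted rather than combining them, which may or may not be what a given caller wants.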