Ignore unencoded special characters #130

Closed
ThomasLandauer opened this issue Aug 6, 2020 · 3 comments

@ThomasLandauer
Contributor

Continuing my homework ;-)

If I have this in the email (notice the unencoded tab and ä):

Subject: =?utf-8?Q?f=C3=B6=C3=B6	bär?=

...$message->getHeader('subject')->getValue() just returns the undecoded string:

=?utf-8?Q?f=C3=B6=C3=B6 bär?=

However, php-mime-mail-parser returns föö bär, since it just throws the string into quoted_printable_decode().

I don't know if leaving some characters unencoded is legal or not - didn't look in the RFCs.

But my question is: Why are you doing more work (i.e. somehow "validating" the string), instead of just throwing it into quoted_printable_decode() and taking whatever it returns? Where is this happening in your code (couldn't find it)?
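
For reference, a minimal sketch of what "just throwing it into quoted_printable_decode()" does with the encoded-text from the Subject header above. This only exercises PHP's built-in function; it is not taken from php-mime-mail-parser, and it assumes the file is saved as UTF-8:

```php
<?php
// Encoded-text portion of the Subject header from the example above,
// with a literal tab and a raw UTF-8 "ä" left unencoded.
$encodedText = "f=C3=B6=C3=B6\tbär";

// quoted_printable_decode() rewrites "=XX" escapes and passes every other
// byte through unchanged, so the stray tab and "ä" survive as-is.
$decoded = quoted_printable_decode($encodedText);

var_dump($decoded === "föö\tbär"); // bool(true)
```

Note that the function knows nothing about the `=?charset?Q?...?=` wrapper or the Q-encoding rule that `_` means a space; those would have to be handled separately.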

@ThomasLandauer ThomasLandauer changed the title Ignore unencode special characters Ignore unencoded special characters Aug 6, 2020
@zbateson
Owner

zbateson commented Aug 6, 2020

> Continuing my homework ;-)

Hahaha

> However, php-mime-mail-parser returns föö bär, since it just throws the string into quoted_printable_decode().

Well, that's easy... it's because I'm following the RFC 😄

I wrote a parser to handle as much of the RFC as possible. That means whitespace has to be used as a delimiter, which is why RFC 2047 specifically prohibits whitespace in the 'encoded-word' part.

> An 'encoded-word' is defined by the following ABNF grammar. The
> notation of RFC 822 is used, with the exception that white space
> characters MUST NOT appear between components of an 'encoded-word'.
>
> encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

I'm not specifically prohibiting whitespace, it's just that, in most cases, the 'delimiting' of a header's components happens before the RFC 2047 decoding does (an exception had to be made for 'message-id' to allow decoding to happen first because of #109).
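
The ordering described above (delimit first, decode second) can be sketched roughly as follows. This is a hypothetical illustration, not the library's actual code: `decodeTokens()` is an invented helper, it only handles the Q encoding, and it skips charset conversion by assuming UTF-8.

```php
<?php
// Hypothetical helper, NOT part of zbateson/mail-mime-parser.
function decodeTokens(string $headerValue): array
{
    // Step 1: delimit on whitespace first, as a header tokenizer would.
    $tokens = preg_split('/[ \t\r\n]+/', $headerValue, -1, PREG_SPLIT_NO_EMPTY);

    // Step 2: only a token that is a complete Q-encoded 'encoded-word'
    // ("=?" charset "?" encoding "?" encoded-text "?=") gets decoded.
    $pattern = '/^=\?([^? \t\r\n]+)\?[Qq]\?([^? \t\r\n]+)\?=\z/';

    return array_map(function (string $token) use ($pattern): string {
        if (preg_match($pattern, $token, $m) !== 1) {
            return $token; // not an encoded-word: left untouched
        }
        // In the Q encoding "_" stands for a space. Charset conversion is
        // skipped here (assumption: the charset is UTF-8).
        return quoted_printable_decode(str_replace('_', ' ', $m[2]));
    }, $tokens);
}

// A valid encoded-word is one token and decodes cleanly.
print_r(decodeTokens("=?utf-8?Q?f=C3=B6=C3=B6_b=C3=A4r?="));

// The header from this issue is split at the tab, so neither half is a
// complete encoded-word and both come back undecoded.
print_r(decodeTokens("=?utf-8?Q?f=C3=B6=C3=B6\tbär?="));
```

With this ordering nothing has to reject the whitespace explicitly: the tab has already cut the would-be encoded-word in two by the time decoding is attempted.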

This way of doing things allows me to fully support 'valid' headers... with their comments, weird nested comments, quoted parts, escaped characters, address groups, RFC 2047, RFC 2231, and whatever other weird things are thrown at it.

I don't know specifically which of those are or aren't supported by php-mime-mail-parser, and it probably doesn't matter... the quirkier bits of the standards are so rarely encountered that it makes little difference for that project. My goals are different -- which is why I don't use that project as a gauge myself... but it's also why thinking up random scenarios and testing them might not be useful either... some standards need to be followed (you still put =?utf-8 in your test... what if the header had =&utf-8 instead? Point being, both are equally invalid) 😝

@ThomasLandauer
Contributor Author

Here's the more relevant part of RFC 2047:

> encoded-text = 1*<Any printable ASCII character other than "?" or SPACE> (but see "Use of encoded-words in message headers", section 5)

... and Tab and ä are not "printable ASCII characters".
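
A quick way to check a string against that rule, as a sketch: printable ASCII is taken to be bytes 0x21 to 0x7E, with "?" (0x3F) excluded and SPACE (0x20) already outside the range. `isValidEncodedText()` is a made-up name for illustration, and the section 5 restrictions mentioned in the quote are not checked.

```php
<?php
// Hypothetical helper: true if $text consists only of printable ASCII
// characters other than "?" and SPACE, i.e. bytes 0x21-0x3E or 0x40-0x7E.
function isValidEncodedText(string $text): bool
{
    return preg_match('/^[\x21-\x3E\x40-\x7E]+\z/', $text) === 1;
}

var_dump(isValidEncodedText("f=C3=B6=C3=B6_b=C3=A4r")); // bool(true): fully encoded
var_dump(isValidEncodedText("f=C3=B6=C3=B6\tbär"));     // bool(false): raw tab and "ä"
```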

Still wondering why quoted_printable_decode() does decode it. A possible explanation might be that they don't follow our "email" RFC 2047, but a (maybe) more liberal general-purpose quoted printable specification.

Anyway, you can set this to "wontfix" and close it :-)

@zbateson
Owner

zbateson commented Aug 6, 2020

Yeah, quoted_printable_decode isn't a header-specific function (could be the body of a mime part with Content-Transfer-Encoding set to quoted-printable).
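
As an illustration of that body use case, a small sketch with a made-up body fragment. Raw spaces are fine in a quoted-printable body, and a trailing "=" before the line ending is a soft line break:

```php
<?php
// Made-up quoted-printable body fragment: "=C3=B6" / "=C3=A4" are escaped
// UTF-8 bytes, and the "=" right before CRLF is a soft line break that
// joins the two physical lines.
$body = "f=C3=B6=C3=B6 b=C3=A4r, continued =\r\non the next line";

$decoded = quoted_printable_decode($body);

var_dump($decoded === "föö bär, continued on the next line"); // bool(true)
```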
