Add configuration to set the default charset for content without a specified charset #110
Comments
Hi @johnss -- Is this a mime-encoded quoted printable part, or part of a message body? What encoding is used for the part? Preferably a full example would help me test it/confirm the issue... All the best |
It is part of the message body, with quoted-printable as the content transfer encoding, read via the getHtmlContent() method. The HTML itself is UTF-8 encoded. |
I created it with Chrome on Android by saving a page as MHTML. It was originally a saved page from x.com, but I modified it to reproduce this issue. |
Here is the bin2hex result |
Hi @johnss, The html part of the message in your example doesn't correctly define a charset. You can manually override that if you want by calling setCharsetOverride, for example:
$message->getHtmlPart()->setCharsetOverride('utf-8');
echo $message->getHtmlContent();
All the best. |
Why is this not mentioned in the docs? Please add it to the documentation. |
setCharsetOverride is only mentioned in the API docs generated by phpDocumentor, which few people visit, so many developers are not aware that the method exists. Please mention it on pages with higher traffic. |
What encoding is used when setCharsetOverride is not set? UTF-8 is the de facto standard used by nearly all websites, so why not default to UTF-8? |
Hi @johnss, It's not a bad suggestion -- my understanding is UTF-8 is fully backwards-compatible with ISO-8859-1. In researching this a bit, I couldn't find a reason not to default to UTF-8, but also it surprised me that Thunderbird defaults to ISO-8859-1 given they're fully compatible. I think the ideal would be to have the default configurable rather than setting an override for a single email... and have the default configured charset UTF-8. I'd be interested to hear from others more knowledgeable on this -- any reason why we shouldn't default to UTF-8? |
Looking more closely at this, UTF-8 and ISO-8859-1 are only the same for 0-127 (ASCII). This causes problems if an email contains non-ASCII characters and expects the default to be considered ISO-8859-1 instead of UTF-8. Setting the default to UTF-8 causes tests/_data/emails/m0009 to fail, but not tests/_data/emails/m0008 -- m0009 is ISO-8859-1 encoded without specifying a charset, m0008 is UTF-8 encoded. You can also see the difference in the files: they contain the same text, but the UTF-8 variant uses multiple bytes to encode codepoints above 127, whereas the ISO-8859-1 variant doesn't. So the default shouldn't simply change to UTF-8, but an option to change the default could be made available, if you're interested in submitting a pull request. |
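To make the byte-level difference concrete, here is a minimal sketch in plain PHP. It does not use the library or the m0008/m0009 test files; the sample word "Grüße" is an illustrative choice, and the mbstring extension is assumed to be available.

```php
<?php
// The same text in two encodings (requires the mbstring extension).
$utf8 = "Gr\u{00FC}\u{00DF}e";   // "Grüße" as UTF-8 bytes
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // 4772c3bcc39f65 (two bytes each for ü and ß)
echo bin2hex($latin1), "\n";  // 4772fcdf65     (one byte each for ü and ß)

// The ISO-8859-1 bytes are not valid UTF-8, so a UTF-8 default misreads them.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// The UTF-8 bytes do decode as ISO-8859-1, but to different (wrong) text.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
var_dump($misread === $utf8); // bool(false)
```

Only the ASCII letters G, r and e share the same bytes in both variants, which is why a charset-less message with non-ASCII text decodes correctly under only one of the two defaults.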
I read the RFC (see #133 (comment)) as if you use non-ASCII characters you must declare a charset in the |
Yeah, although there's no harm in expanding that to either ISO-8859-1 or UTF-8, as they're identical for the first 128 values (0-127). |
First: I'm not sure if Second:
To sum it up, the situation is: The RFC demands that you declare a I wanted to provide some data for this from the mails I'm currently analyzing. (They're mostly German, so probably every single one does contain some non-ASCII characters.) Well, but since |
The point is though, that you can check if a charset isn't set, and use |
Without further details of what you're doing, I can't tell why you get no results for your specific case and emails. |
Well, if I override the existing charset, it's not "default" anymore! Default means: If there is no value, use this one. Do you want me to run this check at all? If yes, please give me the code part I'm missing: Check if there is a charset declared - for the entire message or just for the |
We're running in circles a bit here 😛 . I said "you can check if a charset isn't set, and use setCharsetOverride if it isn't" You can call |
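The difference between a default and an override discussed here can be sketched without the library. The helper below is hypothetical (resolveCharset is not part of the library's API); it only illustrates the "use the declared charset, otherwise fall back" logic. The ISO-8859-1 fallback value is an assumption based on the test behaviour described earlier in the thread.

```php
<?php
/**
 * Hypothetical helper, not part of the library: return the charset declared
 * in a Content-Type header value, or $default when none is declared.
 */
function resolveCharset(?string $contentType, string $default = 'ISO-8859-1'): string
{
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m) === 1) {
        return strtoupper($m[1]); // a declared charset always wins
    }
    return $default; // used only when nothing was declared
}

echo resolveCharset('text/html; charset="utf-8"'), "\n";               // UTF-8
echo resolveCharset('text/html', 'UTF-8'), "\n";                       // UTF-8 (default applied)
echo resolveCharset('text/plain; charset=iso-8859-15', 'UTF-8'), "\n"; // ISO-8859-15
```

An override, by contrast, would replace the declared value as well, which is why it only behaves like a default when applied after checking that no charset is set.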
I can now report from the emails I'm analyzing: With the code from #136 (comment) I now have 0.02% that have text in ISO-8859-1 (or similar) without a So implementing what this issue asks for (a configuration to let the user set the default charset to e.g. UTF-8) is a good idea IMO, since it reduces the problem cases by more than a factor of 10. Just for the record: It looks like most (German-speaking) companies that do not declare a |
It seems that quoted-printable (QP) decoding cannot handle several consecutive encoded bytes (multiple equals signs); it only handles a single one.
For example, =E2=80=93 should convert to – but it shows â��
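The reported symptom can be reproduced in plain PHP, independent of the library. This is a minimal sketch using PHP's built-in quoted_printable_decode and the mbstring extension (assumed available). It suggests the three bytes are decoded correctly and only their charset interpretation differs.

```php
<?php
// Quoted-printable decoding yields three raw bytes: E2 80 93.
$raw = quoted_printable_decode('=E2=80=93');
echo bin2hex($raw), "\n"; // e28093

// Interpreted as UTF-8, the three bytes form one character: U+2013 (en dash).
var_dump($raw === "\u{2013}"); // bool(true)

// Interpreted as ISO-8859-1, each byte becomes its own character:
// U+00E2 (â) followed by two C1 control characters, which typically
// render as replacement glyphs.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
echo bin2hex($asLatin1), "\n";           // c3a2c280c293
var_dump(mb_strlen($asLatin1, 'UTF-8')); // int(3)
```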