Charset should be determined based on the RTF's ansicpg control word #13

Closed
bbottema opened this issue May 25, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@bbottema
Owner

bbottema commented May 25, 2024

Solves bbottema/simple-java-mail#526.

There are two problems, one in the library, and one in the problematic RTF.

  1. The library converts RTF content to HTML using hardcoded Windows charset "CP1252".
  2. The problematic RTF specifies the wrong charset, Windows charset "CP1252", but uses a Chinese font face and its symbols in the content.
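
To illustrate the first problem, here is a minimal sketch of what happens when the same raw bytes are decoded with a hardcoded charset instead of the one the document was written in. The class and method names (`HardcodedCharsetSketch`, `decode`) are hypothetical and not part of the library; the sample bytes are an assumed Russian example, not taken from the attached message.

```java
import java.nio.charset.Charset;

// Hypothetical illustration only; not the library's code.
final class HardcodedCharsetSketch {

    // Decodes raw text bytes (for example, bytes collected from RTF \'xx hex escapes)
    // using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Three bytes that spell a Cyrillic word fragment in windows-1251.
    static byte[] sampleBytes() {
        return new byte[] {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8};
    }
}
```

Decoding `sampleBytes()` with "windows-1251" yields Cyrillic letters, while decoding the same bytes with "windows-1252" yields accented Latin letters, which is the kind of garbling a hardcoded charset produces.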

It's unclear how the email with a wrong charset was obtained, but it's proving problematic to compensate for in the library. I've actually got it working, but it breaks a bunch of other charset use cases, among them Russian and mixed charsets.

chinese message garbled.zip

@bbottema bbottema added the enhancement New feature or request label May 25, 2024
bbottema added a commit that referenced this issue May 25, 2024
Resolved issues where charsets were incorrectly detected when symbol-based fonts (e.g., Chinese fonts) were used, despite the RTF content specifying windows-1252 charset.
bbottema added a commit that referenced this issue May 25, 2024
@bbottema bbottema changed the title from "Charset detection should be more robus by looking at used font faces as well (to detect chinese for example)" to "Charset should be determined based on the RTF's ansicpg control word" May 25, 2024
@bbottema
Owner Author

I'm unsure where I went wrong, but after starting to determine the charset based on the RTF's ansicpg control word, while defaulting to "windows-1252", everything started to work again. I've shelved the more advanced detection heuristics that looked at the font being used and even the character encodings.
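
The approach described above could be sketched roughly as follows. This is not the library's actual implementation: the class name `AnsiCpgCharsetSketch`, the method `detectCharset`, and the regex-based lookup are assumptions made for illustration. A real RTF parser would read the control word from the header while tokenizing rather than searching the whole text.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the library's code.
final class AnsiCpgCharsetSketch {

    // Matches the \ansicpgN control word, e.g. \ansicpg1251, and captures N.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    // Fallback when no usable \ansicpg control word is found.
    private static final Charset DEFAULT_CHARSET = Charset.forName("windows-1252");

    static Charset detectCharset(String rtf) {
        Matcher matcher = ANSICPG.matcher(rtf);
        if (matcher.find()) {
            String codePage = matcher.group(1);
            // Try common Java names for a Windows code page; which names are
            // available depends on the JVM's installed charsets.
            for (String name : new String[] {"windows-" + codePage, "cp" + codePage}) {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            }
        }
        return DEFAULT_CHARSET;
    }
}
```

For an RTF header such as `{\rtf1\ansi\ansicpg1251\deff0 ...}`, `detectCharset` would return the windows-1251 charset on a JVM that supports it; if the control word is missing or names an unsupported code page, it falls back to "windows-1252".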

@bbottema
Owner Author

Released in 1.1.0.
