8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8 #20690
Signed-off-by: Pratiksha.Sawant <Pratiksha.Sawant@ibm.com>
👋 Welcome back psawant19! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
@psawant19 The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
I have attached a test case for the charset issue. Without the charset fix, the following issue is seen:
After applying the fix, characters can be decoded using the ISO-8859-8-I charset.
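The attached test case and its output are not reproduced in this thread. As a rough stand-in, the following sketch probes whether the running JDK recognizes a charset name and decodes a short Hebrew byte sequence when it does. The class name `CharsetProbe` and the helper `tryDecode` are hypothetical, chosen only for illustration; the bytes spell "שלום" in ISO-8859-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical probe, not the test case attached to the PR.
class CharsetProbe {

    /** Decodes the bytes with the named charset, or returns null if the JDK does not know the name. */
    static String tryDecode(byte[] bytes, String charsetName) {
        try {
            return new String(bytes, Charset.forName(charsetName));
        } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Shin, lamed, vav, final mem in ISO-8859-8 (logical order).
        byte[] shalom = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        for (String name : new String[] {"ISO-8859-8", "ISO-8859-8-I"}) {
            String text = tryDecode(shalom, name);
            System.out.println(name + " -> " + (text == null ? "unsupported" : text));
        }
    }
}
```

Whether the second name is reported as unsupported depends on the JDK build in use.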
@jaikiran, could you please review my PR. |
Hello Pratiksha, this is not an area that I have knowledge in. Naoto and Justin review changes in this area and I believe they will take a look at this when they are around. Having said that, I notice that in your comment you mention that you ran a test with this change that fixes the issue. It looks like that was tested as a standalone application? Could you add that as a jtreg test to reproduce the issue and verify the fix? |
Hi, |
@naotoj The origin comes from an old JavaMail ticket that Bill Shannon was working on. The link is here: jakartaee/mail-api#302 I'm picking up where Bill left off and @psawant19 is just addressing the matching JDK bug. I can add the mappings to JakartaMail, but Bill wanted an evaluation of the root issue in the JDK before doing that. The history is in the linked ticket. |
/csr |
@AlanBateman has indicated that a compatibility and specification (CSR) request is needed for this pull request. @psawant19 please create a CSR request for issue JDK-8195686 with the correct fix version. This pull request cannot be integrated until the CSR request is approved. |
I've added the "csr" label as this is adding support for "ISO8859-8-I". Naoto asked me about it but I'm not 100% sure if it's an alias or a different charset. I think this topic may require input from those more familiar with charsets in environment that require bidi processing. Or if the mappings are available then I think we can see if they are identical to ISO8859-8. |
"ISO-8859-8-I" is a charset name for the character encoding "ISO-8859-8" (https://en.wikipedia.org/wiki/ISO-8859-8-I). We found two files in the JDK code base where charset aliases are defined.
The "ISO-8859-8-I" charset is referenced in the headers as the charset of the email contents by a few clients when the email is generated in the Middle East and China. As it is supposed to be a duplicate of ISO-8859-8, and we already support ISO-8859-8-I in EncodingMap.java, supporting this encoding in the charsets file as well makes the behaviour consistent throughout the JDK. There is a ticket raised in angus-mail for a similar issue: eclipse-ee4j/angus-mail#147 |
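For a mail-processing application that cannot wait for a JDK change, one possible workaround is to translate the MIME charset name before asking the JDK for a `Charset`. The sketch below assumes, as this comment argues, that ISO-8859-8-I bytes decode identically to ISO-8859-8. `MimeCharsetResolver` and its override table are hypothetical names, not part of the JDK, Jakarta Mail, or Angus Mail.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical application-level lookup; not an API of any library in this thread.
class MimeCharsetResolver {

    // MIME names (lower-cased) that are redirected to a charset the JDK knows.
    private static final Map<String, String> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put("iso-8859-8-i", "ISO-8859-8");
        OVERRIDES.put("iso8859-8-i", "ISO-8859-8");
    }

    /** Resolves a MIME charset name, applying the override table first. */
    static Charset resolve(String mimeName) {
        String trimmed = mimeName.trim();
        String key = trimmed.toLowerCase(Locale.ROOT);
        return Charset.forName(OVERRIDES.getOrDefault(key, trimmed));
    }

    /** Decodes a message body using the charset named in its header. */
    static String decode(byte[] body, String mimeName) {
        return new String(body, resolve(mimeName));
    }
}
```

Keeping the mapping in the application leaves the decision about visual versus logical ordering with the code that renders the text.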
/label -build |
@magicus |
I looked at this issue a bit more. Looking at the IANA Charset registry (https://www.iana.org/assignments/character-sets/character-sets.xhtml) which |
@naotoj does the mapping need to be removed from: jdk/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/util/EncodingMap.java Line 770 in 5e5942a
I ask because JakartaMail / Angus Mail is a similar use case to this code. |
@jmehrens I would like to, but I don't know the possible issues that would be caused by the removal. So my take is no. |
@naotoj Makes sense. I did find a few links: https://blog.netbsd.org/tnf/entry/handling_non_utf_8_hebrew https://support.oracle.com/knowledge/Oracle%20Cloud/2991085_1.html Any advice on adding the alias to JakartaMail? I see web search results of libraries using what is done in xerces so I'm trying to balance your advice with that. |
Sorry, but I cannot speak for Jakarta Mail. If they consider the ISO-8859-8-I encoding important, they may introduce it as a new charset (again, it is not an alias of ISO-8859-8). |
Understood. I'll close out those tickets then with alternatives.
Well that is not exactly true. The following are all the same ticket from 2018 as a request from JavaMail/JakartaMail:
The OpenJDK ticket JDK-8195686 has not had a proper evaluation since 2018. However, it looks like this PR has that covered and I'm grateful for that. Then in May of 2024 the following was created: eclipse-ee4j/angus-mail#147 by @davecrighton on the Angus Mail project. Then in June @psawant19 commented on that ticket and later created this PR in OpenJDK. So that is 3 unique users, all related to JavaMail/JakartaMail/Angus Mail, on this very topic. It seems pretty clear that we would have to contribute the new Charset implementation to move this forward. |
As the original bug submitter I might add that adding a mapping from ISO-8859-8-i to ISO-8859-8 is almost certainly correct and makes sense in the real world. The character encodings for the ISO-8859-8 and ISO-8859-8-i charsets are exactly the same, and the distinction exists only for historical reasons.

Email clients in the past did not "know about" right-to-left languages. Instead, the text was sent as regular ISO-8859-8 mail, line by line, with each line reversed. The reversed lines are displayed LTR (left-to-right) as-is. This is what's known as "visual ordering", and is required for old email clients.

Newer email clients can do right-to-left, i.e. their text display engines started to support RTL display. So it was no longer necessary to send emails in "visual order" with reversed lines. But now there's a problem: how does the email client know whether the text is in "visual order" (displayed as-is LTR) or in "logical order" (displayed as RTL text)? Thus ISO-8859-8-i was introduced. The charset decoding is exactly the same as ISO-8859-8; the only difference is in instructing the email client to display the lines not as-is LTR, but RTL (more precisely, the "-i" stands for "implicit mode", where the directionality depends on the content). Old email clients cannot show these mails, as they do not know about ISO-8859-8-i and do not support RTL display anyway. (Sidenote: there are also "ISO-8859-8" mails in the wild that are actually in logical order already. RTL applications are pretty good at figuring this out heuristically nowadays.)

The only drawback to adding the alias from ISO-8859-8-i to ISO-8859-8 is if you have a very old application (email client) that cannot do RTL display, doesn't look at the charset, and has no heuristics for RTL, but uses the newest JDK. Instead of showing an "unsupported charset" error it would then read the email as LTR with each line reversed. |
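The difference between the two orderings can be illustrated with a small sketch. It decodes the same Hebrew word from a logical-order and a visual-order byte sequence using the existing ISO-8859-8 decoder, then converts the visual line to logical order by plain reversal. `VisualOrderDemo` and `visualLineToLogical` are hypothetical names, and the naive reversal is only an assumption that holds for a line of pure Hebrew letters; real bidi handling of mixed text (digits, Latin runs) is more involved.

```java
import java.nio.charset.Charset;

// Illustrative only: shows that the byte-to-character mapping is the same
// for both orderings, and that only the character order differs.
class VisualOrderDemo {

    static final Charset HEBREW = Charset.forName("ISO-8859-8");

    /** Naive visual-to-logical conversion: reverses one line of pure RTL text. */
    static String visualLineToLogical(String visualLine) {
        return new StringBuilder(visualLine).reverse().toString();
    }

    public static void main(String[] args) {
        // "shalom" as a logical-order sender (ISO-8859-8-I) would transmit it.
        byte[] logicalBytes = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        // The same word as a visual-order sender would transmit it: bytes reversed.
        byte[] visualBytes = {(byte) 0xED, (byte) 0xE5, (byte) 0xEC, (byte) 0xF9};

        String logical = new String(logicalBytes, HEBREW);
        String visual = new String(visualBytes, HEBREW);

        // Same decoder, same characters; only the order in the string differs.
        System.out.println(logical.equals(visualLineToLogical(visual)));
    }
}
```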
Based on our analysis, we've identified that the file "EncodingMap.java" includes an entry where "ISO-8859-8-I" is defined as an alias for "ISO8859_8". This entry is found in the headstream repository, and we believe it makes sense to include this in the charsets file as well. Moreover, the original bug submitter, jmiserez, has expressed agreement with our proposed solution, as noted in the discussion here. Even if we decide to create a new charset mapping for "ISO-8859-8-I", it would essentially mirror "ISO-8859-8", differing only in the naming convention. This would function similarly to creating an alias in the charsets file. Therefore, we propose that this approach is valid and appropriate for implementation. |
@naotoj In light of @jmiserez's and @psawant19's comments, does this change the position of the OpenJDK team? We are interested because we are currently maintaining a fork of Jakarta Mail in order to allow our customers to use this charset, and we would like to limit the amount of time we need to do this for. For what it is worth, our customer has deployed this into production and is successfully processing ISO-8859-8-i without any complaints from users. Appreciate your work on reviewing this. |
One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 would definitely be correct. Java strings are stored in logical order. That is true for both LTR and RTL languages. This is plainly apparent from the OpenJDK String source code, but is also explicitly explained, e.g. by official tutorials such as this one: https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text ISO-8859-8-i texts are always sent in logical order (by definition). So decoding an ISO-8859-8-i text into a Java string using the ISO-8859-8 alias will result in the correct order of characters in the Java string, i.e. logical order, and thus is always 100% correct by definition. Technically speaking, and for completeness' sake, here is the full list of cases for regular ISO-8859-8 today:
So: Case 2 is what would happen if the alias was added. Now, as long as nobody adds an "auto-reverse visual to logical order" heuristic for RTL ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / mustn't be done), using a simple alias ISO-8859-8-i -> ISO-8859-8 will thus always be correct. The alias will result in case 2, i.e. texts will always be decoded into the correct Java string in logical order. |
This PR is the right way to handle it.
As ISO-8859 encodings are pretty much obsolete, not sure it is worth adding this encoding now.
Yes, but ISO-8859-8-I is still referenced by WHATWG as well. It's up to an application layer to make a distinction as far as the visual or logical order, it doesn't make sense for a converter to try to do anything.
Anyway, it would be ISO-8859-8-E which would have explicit visual controls in it. ISO-8859-8-I as an encoding will match exactly what data in the wild for ISO-8859-8.
IBM and ICU's mapping tables have had this equivalent for 25+ years. Merging this PR corrects the oversight in the ISO-8859-8 compatibility.
I think it would be fine to say that ISO-8859-8-E is not supported here, as it would be ISO-8859-8 / ISO-8859-8-I but with additional controls requiring a shaper. That could be mentioned in a comment.
@jmiserez wrote:
In other words, Java may have been incorrectly handling |
@AlanBateman Since the mapping is just an alias to ISO-8859-8, do we still need a CSR request to be created for the pull request? |
@naotoj does it make sense? |
Sorry, but I still don't believe that making "ISO-8859-8-I" as an alias to "ISO-8859-8" is the right solution, per the IANA character sets definition (https://www.iana.org/assignments/character-sets/character-sets.xhtml). The current PR would make "ISO-8859-8-I" charset appear in That said, I just wonder if this issue can better be addressed exploiting the Charset SPI. This way mail servers can install "ISO-8859-8-I" charset by themselves. This means that mail servers do not need to rely on the underlying JDK which may or may not have that charset. |
I agree. From the Charset specification,
Practically speaking it does seem to be an alias, but implementing as such would violate the Charset specification. So either defining as a new Charset for ISO-8859-8-I (if there is sufficient demand) or as Naoto pointed out, utilize the CharsetProvider would seem like appropriate solutions to me. A pro to the SPI solution is that you can also easily include all the other bidi supported implicit/explicit Charsets as well. |
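A minimal sketch of the SPI route, assuming the decoding really is identical to ISO-8859-8: a `CharsetProvider` that exposes a separate charset named ISO-8859-8-I and borrows the JDK's ISO-8859-8 coders. The class names `Iso88598IProvider` and `Iso88598I` are hypothetical. To be discovered by `Charset.forName`, the provider class would additionally need to be public and listed in a `META-INF/services/java.nio.charset.spi.CharsetProvider` file on the class path; it is kept package-private here only so the snippet compiles in any file.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider; must be declared public when registered for ServiceLoader.
class Iso88598IProvider extends CharsetProvider {

    /** A distinct charset identity whose coders are borrowed from ISO-8859-8. */
    static final class Iso88598I extends Charset {
        private final Charset delegate = Charset.forName("ISO-8859-8");

        Iso88598I() {
            super("ISO-8859-8-I", new String[] {"ISO8859-8-I"});
        }

        @Override
        public boolean contains(Charset cs) {
            return cs.name().equals(name()) || delegate.contains(cs);
        }

        // Note: the returned coders report the delegate from charset(),
        // which is a simplification accepted for this sketch.
        @Override
        public CharsetDecoder newDecoder() {
            return delegate.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return delegate.newEncoder();
        }
    }

    private final Charset charset = new Iso88598I();

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    @Override
    public Charset charsetForName(String name) {
        if (name.equalsIgnoreCase(charset.name())) {
            return charset;
        }
        for (String alias : charset.aliases()) {
            if (alias.equalsIgnoreCase(name)) {
                return charset;
            }
        }
        return null; // not ours; let other providers answer
    }
}
```

Because the charset has its own canonical name, it does not add an alias to the JDK's ISO-8859-8, which is the concern raised in the review.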
Fair enough and feel free to reject my review if need be.
It seems like from an API perspective, you are both saying it should be a new Charset provider (though with identical behavior) but separate and not an alias. That preserves the invariant about IANA registration.
It does still seem that the JDK is probably currently treating ISO-8859-8 as if it were ISO-8859-8-I. I wonder why the implementation was done the way it is, but that’s only of historical interest.
I don't fully understand the conditional acceptance. Can't @psawant19 abandon the alias PR and use the existing ISO-8859-8 source from OpenJDK to create a new ISO-8859-8-I Charset? The level of effort to share common code, proxy wrap, or so forth between two Charsets wouldn't be that much of a lift or long-term debt. If the community is willing to do the work, then acceptance is really a willingness to approve the change. Are all OpenJDK-housed solutions around this a no?
That has been the solution suggested for years. They have been documented in the JavaMail/JakartaMail FAQ. I copied them into the ticket here: I'll leave that AngusMail ticket open until this comes to a close. |
That would seem to be what @naotoj stated would make the API contract (concerning IANA identity) correct. |
Correct. I gathered that point. What I was trying to convey is that the contribution of the intellectual property is from OpenJDK itself, so there is a proven track record of quality for the code. The alias route is dead, done, rejected. Rejecting a PR on that route that is a 'clone of another charset' is either a compatibility concern or an unwillingness to accept the new charset. Just trying to find a path forward on this. Thus my intent is to figure out why the charset approach would be rejected on the grounds that ISO-8859-8-I is "obsolete", does not have "sufficient demand", or is not "important" enough. These are rejection words sprinkled through the thread. Contributors are here to help out with work on this. Working on obsolete, unpopular, unimportant stuff is what we do sometimes. We just need direction. |
I noticed that the embedded xerces treats 8859-8-I as 8859-8 here:
@psawant19 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
@psawant19 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
Mapping ISO-8859-8-I charset to ISO-8859-8.
The following two aliases are added as part of this change:
ISO-8859-8-I
ISO8859-8-I
The corresponding bug report: https://bugs.openjdk.org/browse/JDK-8195686
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20690/head:pull/20690
$ git checkout pull/20690
Update a local copy of the PR:
$ git checkout pull/20690
$ git pull https://git.openjdk.org/jdk.git pull/20690/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20690
View PR using the GUI difftool:
$ git pr show -t 20690
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20690.diff