GB18030 character 龦 can't be parsed by SAXParser #54

Open
zhxiaoliibm opened this issue Jul 14, 2023 · 7 comments

Comments

@zhxiaoliibm

This is the XML file:
GB1803_002.zip

This is the demo code:

         SAXParserFactory spf = SAXParserFactory.newInstance();
         DummySAXEventHandler saxParserHandler = new DummySAXEventHandler();

         try {
             SAXParser saxParser = spf.newSAXParser();
             // Disallow access to external DTDs and schemas.
             saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
             saxParser.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
             XMLReader xmlReader = saxParser.getXMLReader();
             xmlReader.setContentHandler(saxParserHandler);
             xmlReader.setEntityResolver(saxParserHandler);
             xmlReader.parse(xmlFileName);
         }
         catch (SAXException e) {
             // The SAXParseException is caught here; print it instead of swallowing it.
             e.printStackTrace();
         }
         catch (Exception e) {
             e.printStackTrace();
         }

When I try to parse this XML file, the parser throws an org.xml.sax.SAXParseException with this error message:

lineNumber: 8; columnNumber: 11; Element type "aBB〢_郎嗀秊_" must be followed by either attribute specifications, ">" or "/>".
@pshipton
Member

pshipton commented Jul 21, 2023

Does this need GB18030-2022 support? Support for GB18030-2022 is added in the next release. You didn't mention which version/platform you are using. There are some preliminary builds for the next release available as announced in Slack at https://openj9.slack.com/archives/C01C8PL6319/p1689274282032669, copying the details here.

Semeru Open Edition Milestone 2 build for the July release has been published on a subset of platforms
https://github.com/ibmruntimes/semeru8-binaries/releases/tag/jdk8u382-b04_openj9-0.40.0-m2
https://github.com/ibmruntimes/semeru11-binaries/releases/tag/jdk-11.0.20%2B7_openj9-0.40.0-m2
https://github.com/ibmruntimes/semeru17-binaries/releases/tag/jdk-17.0.8%2B6_openj9-0.40.0-m2
https://github.com/ibmruntimes/semeru20-binaries/releases/tag/jdk-20.0.1%2B9_openj9-0.40.0-m2

@pshipton
Member

The support for GB18030-2022 is not in the preliminary jdk8 or jdk20 builds, but will be in the final builds.

@knn-k

knn-k commented Jul 21, 2023

I reproduced the failure using Semeru 11.0.20+7 m2 above.
Interestingly, the SAX parser reads the XML file successfully when I replace every occurrence of the character '龦' (U+9FA6) with '龥' (U+9FA5).

Semeru 17.0.8+6 m2 gives the same result.
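
The comparison above can be reduced to an in-memory check that does not need the attached file. The following is a minimal sketch, assuming the JDK's bundled SAX parser; the class name `NameCharRepro`, the helper `parses`, and the one-element documents are hypothetical stand-ins for the real test case, which uses a longer element name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class NameCharRepro {

    // Returns true if the document parses, false if the parser rejects it.
    static boolean parses(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Reading from a String avoids any dependency on the file encoding.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            System.out.println("  rejected at " + e.getLineNumber() + ":"
                    + e.getColumnNumber() + ": " + e.getMessage());
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are unrelated to the issue.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (char c : new char[] {'\u9FA5', '\u9FA6'}) {
            // Hypothetical minimal document: one empty element whose name contains c.
            String xml = "<a" + c + "/>";
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                    + " in element name: parses=" + parses(xml));
        }
    }
}
```

Based on the results reported in this thread, the U+9FA5 document is expected to parse and the U+9FA6 document to be rejected. Because the input is a `String`, a failure here would suggest the problem is in name validation rather than in GB18030 decoding.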

@knn-k

knn-k commented Jul 21, 2023

The following program gives the same result with 11.0.19, 11.0.20, 17.0.7, and 17.0.8.
Both U+9FA5 and U+9FA6 are defined, and their type is 5 (Character.OTHER_LETTER).

public class CharType {

	public static void showProperties(char c) {
		System.out.println("U+" + Integer.toHexString(c) + ": Type=" + Character.getType(c) + ", isDefined=" + Character.isDefined(c));
	}

	public static void main(String[] args) {
		showProperties('\u9FA5');
		showProperties('\u9FA6');
	}

}
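
Since `Character` reports the same properties for both code points, the difference may come from the parser's own name-character rules rather than from Unicode data. One possible explanation, not confirmed in this thread: XML 1.0 up to its Fourth Edition defines the `Ideographic` production as `[#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]`, which ends exactly at U+9FA5. The sketch below encodes that production. Whether the JDK's bundled parser applies exactly these ranges is an assumption, and `Xml10Ideographic` is a hypothetical helper name.

```java
public class Xml10Ideographic {

    // Ideographic production from XML 1.0 (First to Fourth Edition), Appendix B.
    static boolean isIdeographic(int cp) {
        return (cp >= 0x4E00 && cp <= 0x9FA5)
                || cp == 0x3007
                || (cp >= 0x3021 && cp <= 0x3029);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x9FA5, 0x9FA6}) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + ": Ideographic=" + isIdeographic(cp));
        }
    }
}
```

Under this production U+9FA5 is an allowed name character and U+9FA6 is not, which would be consistent with the parser stopping the element name just before '龦'.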

@pshipton
Member

Does it work on a Temurin build? Until the next release is completed, the most recent builds are nightly builds.
https://adoptium.net/temurin/nightly/

@knn-k

knn-k commented Jul 21, 2023

Temurin 11.0.20-beta fails in the same way.

[Fatal Error] GB1803_002.xml:8:11: Element type "aBB〢_郎嗀秊_" must be followed by either attribute specifications, ">" or "/>".

$ jdk-11.0.20+7/bin/java -version
openjdk version "11.0.20-beta" 2023-07-18
OpenJDK Runtime Environment Temurin-11.0.20+7-202307151707 (build 11.0.20-beta+7-202307151707)
OpenJDK 64-Bit Server VM Temurin-11.0.20+7-202307151707 (build 11.0.20-beta+7-202307151707, mixed mode)

@pshipton
Member

It seems the problem, if it is a problem, should be reported to OpenJDK.
