Unicode breaks xml serialization #1496

Open
lambdaupb opened this issue Feb 24, 2021 · 2 comments

lambdaupb commented Feb 24, 2021

The input HTML is clearly malformed, but I would expect the output to be well-formed XML after re-serializing it.

  • The serialized tag names contain non-ASCII Unicode characters, which conflicts with document.outputSettings().charset("ASCII");

Version: 1.13.1

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.StringReader;

public class Test2 {
    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        Document document = Jsoup.parse("  <div id=\"emid\"> <p\u226F\u0322\u0329\u032B\u0320\u0309\u030A\u0366\u0364\u036D\u030A..\u0337\u0359\u036F\u030A\u033D\u0313\u0346\u0309\u036B.\u0347\u032A\u0367\u0305\u0301>\n    &lt; p=\"\"&gt; \n   </p\u226F\u0322\u0329\u032B\u0320\u0309\u030A\u0366\u0364\u036D\u030A..\u0337\u0359\u036F\u030A\u033D\u0313\u0346\u0309\u036B.\u0347\u032A\u0367\u0305\u0301>&lt;&gt; \n  </div> ");
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        document.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
        document.outputSettings().prettyPrint(true);
        document.outputSettings().charset("ASCII");
        String html = document.html();

        System.out.println(html);

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        saxParser.parse(new InputSource(new StringReader(html)), new DefaultHandler() {
            @Override
            public void warning(SAXParseException e) throws SAXException {
                e.printStackTrace();
            }

            @Override
            public void error(SAXParseException e) throws SAXException {
                e.printStackTrace();
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                e.printStackTrace();
            }
        });

    }
}

output:

<html>
 <head></head>
 <body>
  <div id="emid"> <p≯̢̩̫̠̉̊ͦͤͭ̊..̷͙ͯ̊̽̓͆̉ͫ.͇̪ͧ̅́>
     &lt; p=""&gt; 
   </p≯̢̩̫̠̉̊ͦͤͭ̊..̷͙ͯ̊̽̓͆̉ͫ.͇̪ͧ̅́>&lt;&gt; 
  </div> 
 </body>
</html>
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 21; Element type "p" must be followed by either attribute specifications, ">" or "/>".
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at Test2.main(Test2.java:31)
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 21; Element type "p" must be followed by either attribute specifications, ">" or "/>".
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at Test2.main(Test2.java:31)

Process finished with exit code 1
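
For reference, the parser stops right after "p" because the next character, U+226F, is not permitted in an XML element name. The following is a minimal sketch of that rule, assuming the Name production of XML 1.0 (Fifth Edition); parsers that implement earlier editions use a different but similarly restrictive definition. XmlNameCheck and its methods are hypothetical helpers for illustration and are not part of jsoup or Xerces.

```java
// Sketch only: checks a string against the XML 1.0 (Fifth Edition) Name production.
// XmlNameCheck is a hypothetical helper, not part of jsoup or Xerces.
final class XmlNameCheck {

    // NameStartChar: characters allowed at the start of a name.
    static boolean isNameStartChar(int c) {
        return c == ':' || c == '_'
                || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6)
                || (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D)
                || (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D)
                || (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF)
                || (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF)
                || (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
    }

    // NameChar: NameStartChar plus digits, '-', '.', U+00B7,
    // combining marks U+0300..U+036F and U+203F..U+2040.
    static boolean isNameChar(int c) {
        return isNameStartChar(c)
                || c == '-' || c == '.' || (c >= '0' && c <= '9') || c == 0xB7
                || (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
    }

    // Walks the string by code point so supplementary characters are handled.
    static boolean isXmlName(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!isNameStartChar(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int c = s.codePointAt(i);
            if (!isNameChar(c)) return false;
            i += Character.charCount(c);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isXmlName("p"));             // true
        System.out.println(isXmlName("p\u226F\u0322")); // false: U+226F is not a NameChar
        System.out.println(isXmlName("p\u0322"));       // true: combining marks are NameChars
    }
}
```

Under these ranges the combining marks in the tag name would be acceptable on their own; it is the leading U+226F that makes the serialized name invalid.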

LIKP0 commented Mar 5, 2021

Hi, we are a student group and we would like to take a crack at this. We can't guarantee that we'll be able to complete it with high enough quality, but we'd like to try.

LIKP0 commented Apr 16, 2021

Hello! I think there is no error with document.outputSettings().charset("ASCII"); If you paste "\u226F\u0322\u0329\u032B\u0320\u0309\u030A" into an online Unicode converter, you will see that it does translate to "≯̢̩̫̠̉̊". By the way, a Unicode character like "\u226F" has no corresponding ASCII character.
You can try the code below, which I believe shows that jsoup behaves correctly.

Document document = Jsoup.parse("\u0041\u0042\u0043"); //ABC
document.outputSettings().charset("ASCII");
String html = document.html();
System.out.println(html);
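
The point that "\u226F" has no ASCII counterpart can be checked with the JDK alone. The sketch below is an assumption about how charset-aware escaping generally works, not a copy of jsoup's code: it asks a US-ASCII CharsetEncoder whether each code point is encodable and writes a numeric character reference when it is not. AsciiEscapeSketch and escapeToAscii are hypothetical names, and the method deliberately ignores markup characters such as & and <.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: AsciiEscapeSketch is a hypothetical helper, not jsoup's implementation.
final class AsciiEscapeSketch {

    // Replaces every code point that US-ASCII cannot encode with &#x...;
    static String escapeToAscii(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeToAscii("\u0041\u0042\u0043")); // ABC
        System.out.println(escapeToAscii("\u226F"));             // &#x226f;
    }
}
```

A numeric character reference is legal in text and attribute values, but not inside an element name, so this kind of escaping cannot make a tag name containing U+226F representable in ASCII. An input of only "ABC" never reaches that case.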
