error with PDFTextExtractor #529

Wugengxian · 2021-04-26T04:02:46Z

Describe the bug
The /ToUnicode entry written for the font is wrong, so PdfTextExtractor returns incorrect text.
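
For context, a /ToUnicode CMap tells a reader which Unicode character each character code in the content stream stands for, and text extraction relies on it. The sketch below is not OpenPDF code. `ToUnicodeSketch`, `symbolSubset`, `decode` and `bfchar` are hypothetical names. It assumes single-byte codes from the Adobe Symbol encoding (0x65 for epsilon, 0x74 for tau) purely for illustration; the codes OpenPDF actually writes may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {
    // Hypothetical stand-in for a /ToUnicode CMap: character code -> Unicode code point.
    // Only the two codes needed for the example are included (assumption: Symbol encoding).
    static Map<Integer, Integer> symbolSubset() {
        Map<Integer, Integer> map = new HashMap<>();
        map.put(0x65, 0x03B5); // epsilon
        map.put(0x74, 0x03C4); // tau
        return map;
    }

    // Formats one mapping the way a bfchar entry is written: <srcCode> <dstUnicode>.
    static String bfchar(int code, int codePoint) {
        return String.format("<%02X> <%04X>", code, codePoint);
    }

    // Decodes character codes through the map; unmapped codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, Integer> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Integer codePoint = toUnicode.get(code);
            sb.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x65, 0x74, 0x65};
        System.out.println(bfchar(0x65, 0x03B5));
        System.out.println(decode(codes, symbolSubset()));
    }
}
```

If the right-hand values in such a mapping are wrong, the glyphs still render correctly, but copy and paste and text extraction return the wrong characters, which matches the symptom reported here.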

To Reproduce
The following test reproduces it:

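import java.io.FileOutputStream;
import com.lowagie.text.Chunk;
import com.lowagie.text.Document;
import com.lowagie.text.Font;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfWriter;
import com.lowagie.text.pdf.parser.PdfTextExtractor;
import org.junit.jupiter.api.Assertions;

void testToUnicode() throws Exception {
    // Write uncompressed streams so the /ToUnicode CMap is readable in the output file.
    Document.compress = false;
    Document document = new Document();
    FileOutputStream outputStream = new FileOutputStream("output.pdf");
    PdfWriter.getInstance(document, outputStream);
    document.open();

    // Greek text set in the built-in Symbol font.
    document.add(new Chunk("ετε", new Font(Font.SYMBOL)));
    document.close();

    // Extracting the text back should return the same characters.
    PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(new PdfReader("output.pdf"));
    Assertions.assertEquals("ετε", pdfTextExtractor.getTextFromPage(1));
}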

Actual behavior
When we copy "ετε" in HTML or extract it with PdfTextExtractor, we get "ͧͶͧ", which is wrong.
Expected behavior
When we copy "ετε" in HTML or extract it with PdfTextExtractor, we should get "ετε".
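
When comparing expected and actual output like this, printing the Unicode code points can make the difference easier to see than the rendered glyphs. The helper below is a minimal sketch using only the Java standard library. `CodePointDump` and `describe` are hypothetical names, not part of OpenPDF.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Renders each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Expected text "ετε", written with escapes to avoid source-encoding issues.
        System.out.println(describe("\u03B5\u03C4\u03B5")); // prints "U+03B5 U+03C4 U+03B5"
    }
}
```

Passing the string returned by `getTextFromPage(1)` to `describe` would show which code points the extractor actually produced.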

System (please complete the following information):

OS: Windows 10
Used Font:

Additional context
I have fixed it; the error occurs in the /ToUnicode entry.
The erroneous /ToUnicode:

Wugengxian · 2021-05-02T18:22:12Z

There are also other problems when Font.SYMBOL is used in a Chunk. I will provide a pull request in the future.

Wugengxian added the bug label Apr 26, 2021

Wugengxian mentioned this issue Apr 26, 2021 in "Bug fix 529: error /toUnicode #530" (merged)

Wugengxian closed this as completed May 2, 2021

asturio added this to the 1.3.26 milestone May 2, 2021
