Bug fix 330: Text extract #521

Merged
merged 8 commits into from
Apr 25, 2021


@Wugengxian Wugengxian commented Apr 23, 2021

Description of the new Feature/Bugfix

Although the font uses Identity-H encoding, its text can still be decoded through /ToUnicode. So if the font has a /ToUnicode entry, I decode the string with it and build the char array according to whether the codes are one byte or two bytes wide.
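
To illustrate the idea, here is a minimal, self-contained sketch of decoding a raw string through a code-to-Unicode table with either one-byte or two-byte codes. The class `ToUnicodeDecodeSketch`, its `decode` method, and the sample table are hypothetical and are not OpenPDF's actual implementation; skipping unmapped codes is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a /ToUnicode lookup; not OpenPDF's actual classes.
final class ToUnicodeDecodeSketch {

    // Reads codes that are one or two bytes wide and looks each one up
    // in the supplied code-to-Unicode table.
    static String decode(byte[] raw, Map<Integer, String> toUnicode, boolean twoByteCodes) {
        int width = twoByteCodes ? 2 : 1;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + width <= raw.length; i += width) {
            int code = raw[i] & 0xFF;
            if (twoByteCodes) {
                // Two-byte codes are big-endian: high byte first.
                code = (code << 8) | (raw[i + 1] & 0xFF);
            }
            String mapped = toUnicode.get(code);
            // Assumption: codes without a mapping are skipped.
            if (mapped != null) {
                out.append(mapped);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x0001, "H");
        table.put(0x0002, "i");
        // The same table read with two-byte codes and with one-byte codes.
        System.out.println(decode(new byte[] {0x00, 0x01, 0x00, 0x02}, table, true));
        System.out.println(decode(new byte[] {0x01, 0x02}, table, false));
    }
}
```

The point of the width flag is that the same byte sequence splits into different codes depending on whether the font's mapping uses one-byte or two-byte source codes.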

Related Issue: #330

Unit-Tests for the new Feature/Bugfix

@Test
public void textExtractTest1() throws IOException {
    PdfReader reader = new PdfReader(TextExtractTest.class.getResourceAsStream("/identity-h.pdf"));
    PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(reader);
    Assertions.assertEquals("Hello World", pdfTextExtractor.getTextFromPage(1));
}

@Test
public void textExtractTest2() throws IOException {
    PdfReader reader = new PdfReader(TextExtractTest.class.getResourceAsStream("/HelloWorldMeta.pdf"));
    PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(reader);
    Assertions.assertEquals("Hello World", pdfTextExtractor.getTextFromPage(1));
}

Compatibility Issues

I added a new method to CMapAwareDocumentFont that checks whether the font's /ToUnicode CMap has two-byte mappings.

/**
 * @return true if this font has Unicode information available and it uses two-byte mappings.
 */
public boolean hasTwoByteUnicodeCMAP() {
    return toUnicodeCmap != null && toUnicodeCmap.hasTwoByteMappings();
}
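
As a rough illustration of what such a check relies on, the sketch below models a CMap that files each mapping by the byte length of its source code and reports whether any two-byte mapping exists. `CMapSketch`, `FontSketch`, and `addMapping` are hypothetical names chosen for this example; the real CMap class in the library may be organized differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parsed /ToUnicode CMap; names are illustrative.
final class CMapSketch {
    private final Map<Integer, String> singleByteMappings = new HashMap<>();
    private final Map<Integer, String> doubleByteMappings = new HashMap<>();

    // Registers a mapping, filing it by the byte length of the source code.
    void addMapping(byte[] srcCode, String unicode) {
        if (srcCode.length == 1) {
            singleByteMappings.put(srcCode[0] & 0xFF, unicode);
        } else if (srcCode.length == 2) {
            int code = ((srcCode[0] & 0xFF) << 8) | (srcCode[1] & 0xFF);
            doubleByteMappings.put(code, unicode);
        } else {
            throw new IllegalArgumentException("this sketch supports only 1- or 2-byte codes");
        }
    }

    boolean hasTwoByteMappings() {
        return !doubleByteMappings.isEmpty();
    }

    public static void main(String[] args) {
        CMapSketch cmap = new CMapSketch();
        cmap.addMapping(new byte[] {0x00, 0x41}, "A");
        System.out.println(new FontSketch(cmap).hasTwoByteUnicodeCMAP());
        System.out.println(new FontSketch(null).hasTwoByteUnicodeCMAP());
    }
}

// Hypothetical font wrapper mirroring the null-guarded check.
final class FontSketch {
    private final CMapSketch toUnicodeCmap; // null when the font has no /ToUnicode entry

    FontSketch(CMapSketch toUnicodeCmap) {
        this.toUnicodeCmap = toUnicodeCmap;
    }

    boolean hasTwoByteUnicodeCMAP() {
        return toUnicodeCmap != null && toUnicodeCmap.hasTwoByteMappings();
    }
}
```

The null guard matters because a font without a /ToUnicode entry has no CMap object to query at all.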

Failing test 1:
Failure reason: the same error occurs with iText 5. When Chunk.NEWLINE is written with an Identity-H font, '\n' is replaced with 0x3, and the font's /ToUnicode maps 0x3 to 0x3, which causes the error. The solution is to delete the key 0x3 or to change the way Chunk.NEWLINE is written.
Screenshot from iText 5:
[image]
@Test
void getTextFromPageWithParagraphs_expectsTextHasNoMultipleSpaces() throws IOException {
    // given
    final Paragraph loremIpsumParagraph = new Paragraph(LOREM_IPSUM);
    loremIpsumParagraph.setAlignment(Element.ALIGN_JUSTIFIED);
    byte[] pdfBytes = createSimpleDocumentWithElements(
            loremIpsumParagraph,
            Chunk.NEWLINE,
            loremIpsumParagraph
    );
    final String expected = LOREM_IPSUM + " " + LOREM_IPSUM;
    // when
    final String extracted = new PdfTextExtractor(new PdfReader(pdfBytes)).getTextFromPage(1);
    // then
    assertThat(extracted, equalToCompressingWhiteSpace(expected));
    // two consecutive spaces, as the test name implies
    assertThat(extracted, not(containsString("  ")));
}
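
The failure reason given for this test can be shown in isolation. The following sketch assumes a table that maps code 0x3 to U+0003, as described, and shows the control character leaking into the decoded text until that key is removed. `NewlineMappingSketch` and its `decode` helper are hypothetical, and skipping unmapped codes is an assumption of this example rather than confirmed library behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the 0x3 -> 0x3 mapping problem.
final class NewlineMappingSketch {

    // Looks up each code; codes without a mapping are skipped (an assumption).
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x41, "A");
        table.put(0x42, "B");
        table.put(0x03, "\u0003"); // the problematic identity mapping

        int[] codes = {0x41, 0x03, 0x42};

        // With the 0x3 key present, the control character ends up in the text.
        String before = decode(codes, table);
        System.out.println(before.indexOf('\u0003') >= 0);

        // Deleting the key, one of the two proposed fixes, keeps it out.
        table.remove(0x03);
        String after = decode(codes, table);
        System.out.println(after.indexOf('\u0003') >= 0);
    }
}
```

A stray U+0003 is not whitespace, so a whitespace-compressing comparison of the extracted text against the expected text would not match.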

Failing test 2:
Failure reason: the trailing whitespace in "data\ttable " should be deleted.
@Test
public void TabTest1() throws IOException {
    Document document = new Document(PageSize.A4.rotate(), 10, 10, 10, 10);
    Document.compress = false;
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    try {
        PdfWriter.getInstance(document, stream);
        document.open();
        Chunk a = new Chunk("data\ttable");
        document.add(a);
    } catch (Exception de) {
        de.printStackTrace();
    }
    document.close();
    PdfReader rd = new PdfReader(stream.toByteArray());
    PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(rd);
    Assertions.assertEquals(pdfTextExtractor.getTextFromPage(1), "data\ttable ");
}

Technical details

See section 9.10, "Extraction of Text Content", of PDF32000_2008.pdf.
I will provide a solution for the Chunk.NEWLINE issue later.

@Wugengxian Wugengxian changed the title Text extract Bug fix 330: Text extract Apr 23, 2021
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
No duplication information

@asturio asturio self-assigned this Apr 25, 2021
@asturio asturio linked an issue Apr 25, 2021 that may be closed by this pull request
@asturio asturio merged commit 231c90c into LibrePDF:master Apr 25, 2021
@Wugengxian Wugengxian deleted the Text-Extract branch April 25, 2021 13:27
@Wugengxian Wugengxian mentioned this pull request Apr 26, 2021
2 tasks

Wugengxian commented Apr 26, 2021

@asturio I think you can merge #530. Otherwise, OpenPDF will write an incorrect /ToUnicode and will then be unable to extract text from PDFs that it produced itself.

@asturio asturio added this to the 1.3.26 milestone May 2, 2021
Development

Successfully merging this pull request may close these issues.

Extracting text from PDF with embedded Identity-H font fails