Bug fix 330: Text extract #521

Wugengxian · 2021-04-23T00:14:20Z

Description of the new Feature/Bugfix

Although the font uses Identity-H encoding, its text can still be decoded through /ToUnicode. So when the font has a /ToUnicode entry, I decode the string with it and build the char array according to whether each code is one byte or two bytes.
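To illustrate the idea, here is a minimal sketch of decoding raw string bytes through a code-to-Unicode table, reading either one or two bytes per code. It is not the code from this PR: the class name, the `decode` method, and the plain `Map` standing in for a parsed /ToUnicode CMap are all hypothetical, and skipping unmapped codes is an assumption made for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; a plain Map stands in for a parsed /ToUnicode CMap.
class ToUnicodeDecodeSketch {

    /**
     * Decodes raw string bytes with a code -> Unicode table.
     * Reads two bytes per code when twoByteCodes is true, otherwise one byte.
     * Codes without a mapping are skipped (an assumption of this sketch).
     */
    public static String decode(byte[] bytes, Map<Integer, String> toUnicode, boolean twoByteCodes) {
        StringBuilder out = new StringBuilder();
        int step = twoByteCodes ? 2 : 1;
        for (int i = 0; i + step <= bytes.length; i += step) {
            int code = bytes[i] & 0xFF;
            if (twoByteCodes) {
                // Combine high and low byte into one big-endian code.
                code = (code << 8) | (bytes[i + 1] & 0xFF);
            }
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph codes, as an Identity-H font with /ToUnicode might use them.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x002B, "H");
        toUnicode.put(0x004C, "i");

        byte[] content = {0x00, 0x2B, 0x00, 0x4C};
        System.out.println(decode(content, toUnicode, true)); // prints "Hi"
    }
}
```

The `twoByteCodes` flag plays the role that a check such as `hasTwoByteUnicodeCMAP()` would play in the real extractor: it only decides how many bytes make up one code before the lookup.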

Related Issue: #330

Unit-Tests for the new Feature/Bugfix

OpenPDF/openpdf/src/test/java/com/lowagie/text/pdf/TextExtractTest.java

Lines 10 to 22 in 422b562

    
    @Test
    public void textExtractTest1() throws IOException {
        PdfReader reader = new PdfReader(TextExtractTest.class.getResourceAsStream("/identity-h.pdf"));
        PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(reader);
        Assertions.assertEquals("Hello World", pdfTextExtractor.getTextFromPage(1));
    }

    @Test
    public void textExtractTest2() throws IOException {
        PdfReader reader = new PdfReader(TextExtractTest.class.getResourceAsStream("/HelloWorldMeta.pdf"));
        PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(reader);
        Assertions.assertEquals("Hello World", pdfTextExtractor.getTextFromPage(1));
    }

Compatibilities Issues

I added a new method to CMapAwareDocumentFont that checks whether the font has two-byte mappings.

OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/CMapAwareDocumentFont.java

Lines 226 to 231 in 422b562

    
    /**
     * @return true if this font has unicode information available and if it is two bytes.
     */
    public boolean hasTwoByteUnicodeCMAP() {
        return toUnicodeCmap != null && toUnicodeCmap.hasTwoByteMappings();
    }

Failing test 1:
Failure reason: iText 5 shows the same error. For Chunk.NEWLINE, '\n' is replaced with 0x3 in an Identity-H font whose /ToUnicode maps 0x3 to 0x3, and that mapping causes the error. The solution is to delete the key 0x3 or to change the way Chunk.NEWLINE is written.
Screenshot from iText 5: (image not included)

OpenPDF/openpdf/src/test/java/com/lowagie/text/pdf/parser/PdfTextExtractorTest.java

Lines 85 to 101 in d144eaa

    
    @Test
    void getTextFromPageWithParagraphs_expectsTextHasNoMultipleSpaces() throws IOException {
        // given
        final Paragraph loremIpsumParagraph = new Paragraph(LOREM_IPSUM);
        loremIpsumParagraph.setAlignment(Element.ALIGN_JUSTIFIED);
        byte[] pdfBytes = createSimpleDocumentWithElements(
                loremIpsumParagraph,
                Chunk.NEWLINE,
                loremIpsumParagraph
        );
        final String expected = LOREM_IPSUM + " " + LOREM_IPSUM;
        // when
        final String extracted = new PdfTextExtractor(new PdfReader(pdfBytes)).getTextFromPage(1);
        // then
        assertThat(extracted, equalToCompressingWhiteSpace(expected));
        assertThat(extracted, not(containsString("  ")));
    }
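The first failure can be modelled in a few lines. This is a simplified sketch, not OpenPDF or iText code: the class, the `extract` helper, and the glyph codes other than 0x3 are hypothetical, and it assumes that a code without a /ToUnicode entry contributes nothing to the extracted text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the Chunk.NEWLINE problem described above.
class NewlineMappingSketch {

    /** Maps each glyph code through the table; unmapped codes are dropped (assumption). */
    public static String extract(int[] codes, Map<Integer, Integer> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Integer codePoint = toUnicode.get(code);
            if (codePoint != null) {
                sb.appendCodePoint(codePoint);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> toUnicode = new HashMap<>();
        toUnicode.put(0x24, (int) 'A'); // made-up glyph codes for 'A' and 'B'
        toUnicode.put(0x25, (int) 'B');
        toUnicode.put(0x03, 0x03);      // the problematic 0x3 -> 0x3 entry

        int[] codes = {0x24, 0x03, 0x25}; // 'A', the code written for NEWLINE, 'B'

        // With the entry present, the control character U+0003 leaks into the text.
        System.out.println(extract(codes, toUnicode).contains("\u0003")); // prints "true"

        // Deleting the key, as proposed, keeps the control character out.
        toUnicode.remove(0x03);
        System.out.println(extract(codes, toUnicode)); // prints "AB"
    }
}
```

The sketch covers only the first proposed remedy (deleting the key); changing how Chunk.NEWLINE is written would act on the writer side and is not modelled here.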

Failing test 2:
Failure reason: the trailing whitespace in the expected value "data\ttable " should be deleted.

OpenPDF/openpdf/src/test/java/com/lowagie/text/pdf/TabTest.java

Lines 14 to 32 in d144eaa

    
    @Test
    public void TabTest1() throws IOException {
        Document document = new Document(PageSize.A4.rotate(), 10, 10, 10, 10);
        Document.compress = false;
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        try {
            PdfWriter.getInstance(document,
                    stream);
            document.open();
            Chunk a = new Chunk("data\ttable");
            document.add(a);
        } catch (Exception de) {
            de.printStackTrace();
        }
        document.close();
        PdfReader rd = new PdfReader(stream.toByteArray());
        PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(rd);
        Assertions.assertEquals(pdfTextExtractor.getTextFromPage(1), "data\ttable ");
    }

Technical details

See section 9.10, "Extraction of Text Content", of PDF32000_2008.pdf.
I will provide a fix for the Chunk.NEWLINE issue later.

sonarqubecloud · 2021-04-23T12:23:08Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

Wugengxian · 2021-04-26T05:22:33Z

@asturio I think you can merge #530. Otherwise OpenPDF will write an incorrect /ToUnicode and will then be unable to extract text from PDFs that it produced itself.

Wugengxian added 5 commits April 23, 2021 08:02

Update CMapAwareDocumentFont.java

1c6034b

Update ParsedText.java

e25012d

Update PdfString.java

f1457d2

Create TextExtractTest.java

3f2e605

Create identity-h.pdf

110f5a7

Wugengxian mentioned this pull request Apr 23, 2021

Extracting text from PDF with embedded Identity-H font fails #330

Closed

Wugengxian added 2 commits April 23, 2021 08:26

Update CMapAwareDocumentFont.java

08f301f

Update PdfString.java

422b562

Wugengxian changed the title ~~Text extract~~ Bug fix 330: Text extract Apr 23, 2021

Update TabTest.java

e4d2f3a

asturio self-assigned this Apr 25, 2021

asturio linked an issue Apr 25, 2021 that may be closed by this pull request

Extracting text from PDF with embedded Identity-H font fails #330

Closed

asturio merged commit 231c90c into LibrePDF:master Apr 25, 2021

Wugengxian deleted the Text-Extract branch April 25, 2021 13:27

Wugengxian mentioned this pull request Apr 26, 2021

Bug fix 529: error /toUnicode #530

Merged

2 tasks

asturio added this to the 1.3.26 milestone May 2, 2021
