#561 Added optional getText() argument to return limited number of document pages if set (#562)

* #561 Added optional getText() argument to return limited number of document pages if set.

* Update src/Smalot/PdfParser/Document.php

* Update src/Smalot/PdfParser/Document.php

* fixed changes for PHP <7.4; added basic two tests

* fixed coding style issue

* Usage.md: added a short example how to use new functionality

Co-authored-by: Konrad Abicht <hi@inspirito.de>
alesrebec and k00ni authored Dec 16, 2022
1 parent e2e3581 commit d430fe6
Showing 3 changed files with 36 additions and 1 deletion.
4 changes: 4 additions & 0 deletions doc/Usage.md
@@ -15,10 +15,14 @@ $pdf = $parser->parseContent(file_get_contents('document.pdf'))
A common scenario is to extract text.

```php
// extract text of the whole PDF
$text = $pdf->getText();

// or extract the text of a specific page (in this case the first page)
$text = $pdf->getPages()[0]->getText();

// you can also extract text of a limited amount of pages. here, it will only use the first five pages.
$text = $pdf->getText(5);
```

## Extract text positions
7 changes: 6 additions & 1 deletion src/Smalot/PdfParser/Document.php
@@ -264,11 +264,16 @@ public function getPages()
        throw new \Exception('Missing catalog.');
    }

-   public function getText(): string
+   public function getText(?int $pageLimit = null): string
    {
        $texts = [];
        $pages = $this->getPages();

        // Only use the first X number of pages if $pageLimit is set and numeric.
        if (\is_int($pageLimit) && 0 < $pageLimit) {
            $pages = \array_slice($pages, 0, $pageLimit);
        }

        foreach ($pages as $index => $page) {
            /**
             * In some cases, the $page variable may be null.
26 changes: 26 additions & 0 deletions tests/Integration/DocumentTest.php
@@ -39,6 +39,7 @@
use Smalot\PdfParser\Header;
use Smalot\PdfParser\Page;
use Smalot\PdfParser\Pages;
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\PDFObject;
use Tests\Smalot\PdfParser\TestCase;

@@ -229,4 +230,29 @@ public function testGetPagesMissingCatalog(): void
        $document = $this->getDocumentInstance();
        $document->getPages();
    }

    /**
     * Tests getText method without a given page limit.
     *
     * @see https://github.com/smalot/pdfparser/pull/562
     */
    public function testGetTextNoPageLimit(): void
    {
        $document = (new Parser())->parseFile($this->rootDir.'/samples/bugs/Issue331.pdf');

        self::assertStringContainsString('Medeni Usul ve İcra İflas Hukuku', $document->getText());
    }

    /**
     * Tests getText method with a given page limit.
     *
     * @see https://github.com/smalot/pdfparser/pull/562
     */
    public function testGetTextWithPageLimit(): void
    {
        $document = (new Parser())->parseFile($this->rootDir.'/samples/bugs/Issue331.pdf');

        // given text is on page 2, it has to be ignored because of that
        self::assertStringNotContainsString('Medeni Usul ve İcra İflas Hukuku', $document->getText(1));
    }
}
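
The two tests above cover only the "no limit" and "limit of 1" cases. As a rough sketch of how the guard in `getText()` treats other values, the snippet below reproduces the same `is_int` / `array_slice` check in a stand-alone function. The name `limitPages()` and the string placeholders for pages are assumptions made for illustration; they are not part of the library or of this commit.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of smalot/pdfparser) that mirrors the
 * page-limit guard added to Document::getText() in this commit.
 *
 * @param array<int, mixed> $pages
 *
 * @return array<int, mixed>
 */
function limitPages(array $pages, ?int $pageLimit = null): array
{
    // Same condition as in the diff: only a positive integer trims the list.
    if (\is_int($pageLimit) && 0 < $pageLimit) {
        return \array_slice($pages, 0, $pageLimit);
    }

    // null, 0 and negative values leave the page list untouched.
    return $pages;
}

// Placeholder strings stand in for Page objects.
$pages = ['page 1', 'page 2', 'page 3'];

echo \implode(', ', limitPages($pages, 1)), \PHP_EOL;    // limit applied
echo \implode(', ', limitPages($pages, 5)), \PHP_EOL;    // limit larger than the page count
echo \implode(', ', limitPages($pages, 0)), \PHP_EOL;    // zero is ignored by the guard
echo \implode(', ', limitPages($pages, null)), \PHP_EOL; // default: no limit
```

Under this guard, a limit larger than the page count simply yields every page, and `0` or a negative value behaves like `null` rather than producing an empty result.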