A package for extracting tables from PDF files. The package is a wrapper written entirely in TypeScript, built on top of the popular tabula-java library.
To use this package, install it via your favourite package manager:
pnpm i @krakz999/tabula-node
Then import extractTables and call it with the path to a PDF file and an options object:
import { extractTables } from '@krakz999/tabula-node';
const results = await extractTables("./test.pdf", {
  pages: "all",
  guess: true
});
console.log(results);
The extractTables function accepts an options object with the following properties:
area: Portion of the page to analyze. Example: "269.875,12.75,790.5,561". Accepts top,left,bottom,right (i.e. y1,x1,y2,x2), where all values are in points relative to the top left corner. If all values are between 0 and 100 (inclusive) and preceded by '%', the input is taken as a percentage of the actual height or width of the page. Example: "%0,0,100,50". To specify multiple areas, pass an array. Default is the entire page.
columns: X coordinates of column boundaries. Example: "10.1,20.2,30.3". If all values are between 0 and 100 (inclusive) and preceded by '%', the input is taken as a percentage of the actual width of the page. Example: "%25,50,80.6".
format: Output format: CSV, TSV or JSON. Default is CSV.
guess: Guess the portion of the page to analyze per page.
lattice: Force the PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
noSpreadsheet: Deprecated in favor of stream. Force the PDF not to be extracted using spreadsheet-style extraction (if there are no ruling lines separating each cell).
pages: Comma-separated list of page ranges, or "all". Examples: "1-3,5-7", "3" or "all". Default is "1".
password: Password to decrypt the document. Default is empty.
stream: Force the PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
useLineReturns: Use embedded line returns in cells. (Only in spreadsheet mode.)
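The area and columns strings are easy to get wrong when built by hand, so a small helper can assemble them from numbers. The following is a minimal sketch, not part of @krakz999/tabula-node: the Area interface and the formatArea and formatColumns functions are hypothetical names introduced here for illustration, and they only implement the string formats described in the list above.

```typescript
// Hypothetical helpers (not provided by @krakz999/tabula-node) that build
// the "area" and "columns" option strings described above.

interface Area {
  top: number;    // y1
  left: number;   // x1
  bottom: number; // y2
  right: number;  // x2
}

// Throws if any value falls outside 0-100, the range required for '%' input.
function assertPercentages(values: number[]): void {
  if (values.some((v) => v < 0 || v > 100)) {
    throw new RangeError("Percentage values must be between 0 and 100 (inclusive)");
  }
}

// Builds "top,left,bottom,right", prefixed with '%' when percent is true.
function formatArea(area: Area, percent = false): string {
  const values = [area.top, area.left, area.bottom, area.right];
  if (percent) assertPercentages(values);
  return (percent ? "%" : "") + values.join(",");
}

// Builds a comma-separated list of X coordinates, prefixed with '%' when percent is true.
function formatColumns(boundaries: number[], percent = false): string {
  if (percent) assertPercentages(boundaries);
  return (percent ? "%" : "") + boundaries.join(",");
}

const area = formatArea({ top: 269.875, left: 12.75, bottom: 790.5, right: 561 });
const leftHalf = formatArea({ top: 0, left: 0, bottom: 100, right: 50 }, true);
const columns = formatColumns([25, 50, 80.6], true);

console.log(area);     // 269.875,12.75,790.5,561
console.log(leftHalf); // %0,0,100,50
console.log(columns);  // %25,50,80.6
```

The resulting strings can then be passed as the area and columns properties of the options object, for example extractTables("./test.pdf", { pages: "all", area, columns }).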