Textract Response Parser for JavaScript/TypeScript

This library loads Amazon Textract API response JSONs into structured classes with helper methods, for easier post-processing.

It's designed to work in both NodeJS and browser environments, and to support projects in either JavaScript or TypeScript.

⚠️ Warning: If you're migrating from another TRP implementation such as the Textract Response Parser for Python, please note that the APIs and available features may be substantially different - due to differences between the languages' ecosystems.

Installation

You can use TRP in your JavaScript or TypeScript NPM projects:

$ npm install amazon-textract-response-parser

// ES modules / TypeScript:
import { TextractDocument, TextractExpense } from "amazon-textract-response-parser";
// ...or CommonJS:
const { TextractDocument, TextractIdentity } = require("amazon-textract-response-parser");

...Or link directly in the browser - for example via the unpkg CDN:

<script src="https://unpkg.com/amazon-textract-response-parser@x.y.z"></script>

<script>
  // Use via `trp`:
  var doc = new trp.TextractDocument(...);
</script>

At a low level, the distribution of this library provides multiple builds:

dist/cjs (default main), for CommonJS environments like NodeJS,
dist/es (default module), for ES6/ES2015/esnext capable environments,
dist/browser (default browser), for use directly in the browser with no module framework (IIFE), and
(Deprecated): dist/umd, for other Universal Module Definition-compatible environments. This build is slated to be removed in a future release so please let us know via GitHub issues if you have blockers for migrating to another build.

Loading data

Initialize a TextractDocument (or TextractExpense, TextractIdentity) by providing the parsed response JSON object from the corresponding Amazon Textract APIs such as GetDocumentAnalysis, AnalyzeID, or AnalyzeExpense. In most cases, providing an array of response objects is also supported (for use when a large Amazon Textract response was split/paginated).

For example, loading a response JSON from file in NodeJS:

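const fs = require("fs");

// Read the file as text so it can be passed straight to JSON.parse():
fs.readFile("./my-analyze-document-response.json", "utf-8", (err, resText) => {
  if (err) throw err;
  const doc = new TextractDocument(JSON.parse(resText));
  // ...
});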
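When a large result is paginated, the individual response pages need to be gathered into an array before loading. The following is a minimal sketch of that loop, relying only on the `NextToken` field that Amazon Textract's paginated APIs return. Both `fetchPage` and `collectResponsePages` are hypothetical names, and the fake pages stand in for real API calls:

```javascript
// Made-up response pages: NextToken is present while more pages remain.
const fakePages = [
  { Blocks: [{ BlockType: "PAGE", Id: "p1" }], NextToken: "t1" },
  { Blocks: [{ BlockType: "LINE", Id: "l1" }], NextToken: "t2" },
  { Blocks: [{ BlockType: "WORD", Id: "w1" }] },
];

// Hypothetical stand-in for a paginated API call such as GetDocumentAnalysis.
async function fetchPage(nextToken) {
  const index = nextToken
    ? fakePages.findIndex((page) => page.NextToken === nextToken) + 1
    : 0;
  return fakePages[index];
}

// Follow NextToken until it is absent, collecting each response page in order.
async function collectResponsePages(fetchPageFn) {
  const pages = [];
  let nextToken;
  do {
    const page = await fetchPageFn(nextToken);
    pages.push(page);
    nextToken = page.NextToken;
  } while (nextToken);
  return pages;
}
```

The resulting array is the kind of input you could then pass to the constructor, for example `new TextractDocument(pages)`.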

If you're using TypeScript, you may need to typecast your input JSON while loading it.

The ApiResponsePage input interface exposed and expected by this module is subtly different from - but functionally compatible with - the output types produced by the AWS SDK for JavaScript Textract Client.

import { promises as fs } from "fs";
import { ApiAnalyzeExpenseResponse, TextractExpense } from "amazon-textract-response-parser";
import { TextractClient, AnalyzeExpenseCommand } from "@aws-sdk/client-textract";
const textract = new TextractClient({});

async function main() {
  const textractResponse = await textract.send(
    new AnalyzeExpenseCommand({
      Document: { Bytes: await fs.readFile("...") },
    })
  );
  const expense = new TextractExpense((textractResponse as unknown) as ApiAnalyzeExpenseResponse);
}

With your data loaded into a TRP TextractDocument or similar, you're ready to take advantage of the higher-level TRP.js functions to navigate and analyze the result.

Generic document text navigation

In general, this library avoids directly exposing arrays in results (see the Mutation operations section below). Instead, you can use:

.n*** properties to count items
.list***() functions to return a copy of the underlying array
.iter***() functions to iterate through collections, or
.***At***() functions to fetch a specific item from a collection

For example:

// Navigate the document hierarchy:
console.log(`Opened doc with ${doc.nPages} pages`);
console.log(
  `The first word of the first line is ${doc.pageNumber(1).lineAtIndex(0).wordAtIndex(0).text}`
);

// Iterate through content:
for (const page of doc.iterPages()) {
  // (In Textract's output order...)
  for (const line of page.iterLines()) {
    for (const word of line.iterWords()) {
      console.log(word.text);
    }
  }
}

// ...Or get snapshot arrays instead of iterators, if you need:
const linesArrsByPage = doc.listPages().map((p) => p.listLines());

These arrays are in the raw order returned by Amazon Textract, which is not necessarily a logical human reading order, especially for multi-column documents. See the Other generic document analyses section below for extra content sorting utilities.
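To illustrate how raw order and reading order can differ, here is a deliberately naive sketch that re-sorts raw LINE blocks top-to-bottom, then left-to-right, using the `Geometry.BoundingBox` coordinates in the Textract JSON. The helper `naiveReadingOrder` and the sample data are hypothetical, and this is not how the library's own sorting utilities work:

```javascript
// Made-up LINE blocks in the shape of raw Textract output: BoundingBox values
// are fractions of the page size, with Top measured from the top edge.
const sampleLines = [
  { Text: "second line", Geometry: { BoundingBox: { Top: 0.2, Left: 0.1 } } },
  { Text: "right of title", Geometry: { BoundingBox: { Top: 0.052, Left: 0.6 } } },
  { Text: "title", Geometry: { BoundingBox: { Top: 0.05, Left: 0.1 } } },
];

// Hypothetical helper: bucket lines into rows of height `rowHeight`, then sort
// by row and, within a row, from left to right. Returns a new array.
function naiveReadingOrder(lines, rowHeight = 0.01) {
  const rowOf = (line) => Math.round(line.Geometry.BoundingBox.Top / rowHeight);
  return [...lines].sort(
    (a, b) =>
      rowOf(a) - rowOf(b) ||
      a.Geometry.BoundingBox.Left - b.Geometry.BoundingBox.Left
  );
}

console.log(naiveReadingOrder(sampleLines).map((line) => line.Text));
```

A row-by-row sort like this interleaves the columns of a multi-column page, which is why simple coordinate sorting is not enough for such layouts.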

Forms

As well as looping through the form data key-value pairs in the document, you can query fields by key:

console.log(doc.form.nFields);
const fields = doc.form.listFields();

// Exact match:
const addr = doc.form.getFieldByKey("Address").value?.text;

// Search key containing (case-insensitive):
const addresses = doc.form.searchFieldsByKey("address");
addresses.forEach((addrField) => { console.log(addrField.key.text); });

You can also search form keys at the individual page level, or look up the page number for detected fields:

const fieldByDoc = doc.form.getFieldByKey("Address");
console.log(`Detected Address on page ${fieldByDoc.parentPage.pageNumber}`);

const page = doc.pageNumber(1);
const fieldByPage = page.form.getFieldByKey("Address");
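For context on what these form helpers save you from, the sketch below resolves key-value pairs directly from raw Textract blocks, where a KEY block points to its VALUE block and each points to the WORD blocks holding its text. The sample blocks are made up, and `relatedIds` and `extractKeyValues` are hypothetical helpers, not part of this library:

```javascript
// Made-up sample in the shape of Textract's raw FORMS output.
const formBlocks = [
  { Id: "k1", BlockType: "KEY_VALUE_SET", EntityTypes: ["KEY"],
    Relationships: [{ Type: "VALUE", Ids: ["v1"] }, { Type: "CHILD", Ids: ["w1"] }] },
  { Id: "v1", BlockType: "KEY_VALUE_SET", EntityTypes: ["VALUE"],
    Relationships: [{ Type: "CHILD", Ids: ["w2", "w3", "w4"] }] },
  { Id: "w1", BlockType: "WORD", Text: "Address:" },
  { Id: "w2", BlockType: "WORD", Text: "1" },
  { Id: "w3", BlockType: "WORD", Text: "Main" },
  { Id: "w4", BlockType: "WORD", Text: "Street" },
];

// Collect the target IDs of all relationships of one type on a block.
function relatedIds(block, type) {
  return (block.Relationships || [])
    .filter((rel) => rel.Type === type)
    .flatMap((rel) => rel.Ids);
}

// Build { key, value } text pairs by following VALUE and CHILD relationships.
function extractKeyValues(blocks) {
  const byId = new Map(blocks.map((block) => [block.Id, block]));
  const textOf = (block) =>
    relatedIds(block, "CHILD")
      .map((id) => byId.get(id))
      .filter((child) => child && child.BlockType === "WORD")
      .map((word) => word.Text)
      .join(" ");
  return blocks
    .filter((b) => b.BlockType === "KEY_VALUE_SET" && (b.EntityTypes || []).includes("KEY"))
    .map((keyBlock) => ({
      key: textOf(keyBlock),
      value: relatedIds(keyBlock, "VALUE")
        .map((id) => byId.get(id))
        .filter(Boolean)
        .map(textOf)
        .join(" "),
    }));
}

console.log(extractKeyValues(formBlocks));
```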

Tables

This library's table navigation tools address merged cells by default, for convenience.

console.log(page.nTables);
const table = page.tableAtIndex(0);

// Index cells by row, column, or both:
const headerStrs = table.cellsAt(1, null)?.map(cell => cell.text);
const firstColCells = table.cellsAt(null, 1);
const targetCell = table.cellAt(2, 4);

// Iterate over rows/cells:
for (const row of table.iterRows()) {
  for (const cell of row.iterCells()) {
    console.log(cell.text);
  }
}

Further configuration arguments can be used to change the treatment of merged cells if needed:

// Iterate over rows repeating any cells merged across rows:
for (const row of table.iterRows(true)) {}

// Return split sub-cells instead of merged cells when indexing:
const firstColCellFragments = table.cellsAt(null, 1, true);
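As a conceptual illustration of indexing with merged cells, the sketch below selects cells from simplified, made-up records using 1-based row and column indexes, treating a merged cell as covering every row and column in its span. The record shape and the helpers `covers` and `cellsCovering` are assumptions for illustration only, not the library's implementation:

```javascript
// Simplified, made-up cell records. Indexes are 1-based, and a span greater
// than 1 marks a merged cell.
const sampleCells = [
  { text: "Name", row: 1, col: 1, rowSpan: 1, colSpan: 1 },
  { text: "Contact", row: 1, col: 2, rowSpan: 1, colSpan: 2 }, // spans columns 2-3
  { text: "Ana", row: 2, col: 1, rowSpan: 1, colSpan: 1 },
  { text: "ana@example.com", row: 2, col: 2, rowSpan: 1, colSpan: 1 },
  { text: "555-0100", row: 2, col: 3, rowSpan: 1, colSpan: 1 },
];

// True when the cell's span covers the given row and column; null matches any.
function covers(cell, row, col) {
  const rowOk = row === null || (row >= cell.row && row < cell.row + cell.rowSpan);
  const colOk = col === null || (col >= cell.col && col < cell.col + cell.colSpan);
  return rowOk && colOk;
}

// Select every cell covering the requested row, column, or both.
function cellsCovering(cells, row, col) {
  return cells.filter((cell) => covers(cell, row, col));
}

console.log(cellsCovering(sampleCells, null, 3).map((cell) => cell.text));
```

Under this model the merged "Contact" header is returned for both column 2 and column 3, because its span covers both.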

Expense (invoice and receipt) objects

Since the format of responses for Amazon Textract's Expense results is very different from the general document analysis APIs, you can use the separate TextractExpense class in this library to process these.

const expense = new TextractExpense(textractResponse);

// Iterate through content:
console.log(`Found ${expense.nDocs} expense docs in file`);
const expenseDoc = [...expense.iterDocs()][0];
for (const group of expenseDoc.iterLineItemGroups()) {
  for (const item of group.iterLineItems()) {
    console.log(`Found line item with ${item.nFields} fields`);
    for (const field of item.iterFields()) {
      // ...
    }
  }
}

// Get snapshot arrays instead of iterators, if you need:
const summaryFieldsArrByDoc = expense.listDocs().map((doc) => doc.listSummaryFields());

// Retrieve item fields by their tagged 'type':
const vendorNameFields = expenseDoc.searchSummaryFieldsByType("VENDOR_NAME");
console.log(`Found ${vendorNameFields.length} vendor name fields in doc summary`);
console.log(vendorNameFields[0].fieldType.text); // "VENDOR_NAME"
console.log(vendorNameFields[0].value.text); // e.g. "Amazon.com"
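To show how the expense response format differs from the block-based document format, the sketch below flattens the line items of a raw AnalyzeExpense-style response into plain records keyed by field type. The sample data is made up and `lineItemsAsRecords` is a hypothetical helper working on the raw JSON, not a method of TextractExpense:

```javascript
// Made-up sample in the shape of a raw AnalyzeExpense response.
const expenseResponse = {
  ExpenseDocuments: [
    {
      SummaryFields: [
        { Type: { Text: "VENDOR_NAME" }, ValueDetection: { Text: "Example Cafe" } },
      ],
      LineItemGroups: [
        {
          LineItems: [
            { LineItemExpenseFields: [
              { Type: { Text: "ITEM" }, ValueDetection: { Text: "Coffee" } },
              { Type: { Text: "PRICE" }, ValueDetection: { Text: "3.50" } },
            ] },
            { LineItemExpenseFields: [
              { Type: { Text: "ITEM" }, ValueDetection: { Text: "Bagel" } },
              { Type: { Text: "PRICE" }, ValueDetection: { Text: "2.25" } },
            ] },
          ],
        },
      ],
    },
  ],
};

// Turn each line item into an object such as { ITEM: "Coffee", PRICE: "3.50" }.
// If a type repeats within one line item, the last occurrence wins.
function lineItemsAsRecords(expenseDoc) {
  return (expenseDoc.LineItemGroups || []).flatMap((group) =>
    (group.LineItems || []).map((item) =>
      Object.fromEntries(
        (item.LineItemExpenseFields || []).map((field) => [
          field.Type.Text,
          field.ValueDetection.Text,
        ])
      )
    )
  );
}

console.log(lineItemsAsRecords(expenseResponse.ExpenseDocuments[0]));
```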

Identity document objects

Similarly to expenses mentioned above, Amazon Textract offers specific APIs for identity document analysis. You can use the separate TextractIdentity class in this library to process these.

import { promises as fs } from "fs";
import { ApiAnalyzeIdResponse, TextractIdentity } from "amazon-textract-response-parser";
import { TextractClient, AnalyzeIDCommand } from "@aws-sdk/client-textract";
const textract = new TextractClient({});

async function main() {
  const textractResponse = await textract.send(
    new AnalyzeIDCommand({
      Document: { Bytes: await fs.readFile("...") },
    })
  );
  const identity = new TextractIdentity((textractResponse as unknown) as ApiAnalyzeIdResponse);
}

The library implements some enumerations of known values (for field types, ID types, and so on) to make processing AnalyzeID responses a little simpler:

import { IdDocumentType, IdFieldType } from "amazon-textract-response-parser";

const idDoc = identity.getDocAtIndex(0); // (Or iterate, list docs in a result)

if (idDoc.idType === IdDocumentType.Passport) {
  // Fetch fields by known type:
  const passNumField = idDoc.getFieldByType(IdFieldType.DocumentNumber);
  console.log(
    `Passport number ${passNumField.value}, confidence ${passNumField.valueConfidence}%`
  );

} else if (idDoc.idType === IdDocumentType.DrivingLicense) {
  // ...Or list or iterate the document's fields:
  for (const field of idDoc.iterFields()) {
    console.log(`${field.fieldTypeRaw}: ${field.valueRaw}`);
  }

} else {
  // Produce human-readable representations of fields, documents, or whole responses:
  console.log(idDoc.str());
}
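Since each detected value carries a confidence score, one common follow-up is flagging uncertain fields for manual review. The sketch below does this on a made-up, raw AnalyzeID-style document. The helper `fieldsNeedingReview` and the threshold of 90 are assumptions for illustration, not part of this library:

```javascript
// Made-up sample in the shape of one entry of IdentityDocuments in a raw
// AnalyzeID response. Confidence values are percentages (0 to 100).
const rawIdDocument = {
  IdentityDocumentFields: [
    { Type: { Text: "FIRST_NAME" }, ValueDetection: { Text: "JANE", Confidence: 99.1 } },
    { Type: { Text: "DOCUMENT_NUMBER" }, ValueDetection: { Text: "X1234567", Confidence: 72.4 } },
    { Type: { Text: "MIDDLE_NAME" }, ValueDetection: { Text: "", Confidence: 98.0 } },
  ],
};

// List the types of non-empty fields whose value confidence is below the
// threshold. Empty values are skipped, since there is nothing to review.
function fieldsNeedingReview(idDocument, minConfidence = 90) {
  return (idDocument.IdentityDocumentFields || [])
    .filter(
      (field) =>
        field.ValueDetection.Text !== "" &&
        field.ValueDetection.Confidence < minConfidence
    )
    .map((field) => field.Type.Text);
}

console.log(fieldsNeedingReview(rawIdDocument));
```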

Mutation operations

Easier analysis and querying of Textract results is useful, but what if you want to augment or edit your Textract JSONs with JS/TS Textract Response Parser?

In general:

Where the library classes (TextractDocument, Page, Word, etc) offer mutation operations, these should modify the source API JSON object in-place and ensure self-consistency.
For library classes that are backed by a specific object in the source API JSON, you can access it via the .dict property (word.dict, table.dict, etc), but you are responsible for updating any references held by other objects if you make changes there.
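To give a sense of what updating references can involve when editing the raw JSON yourself, the sketch below removes a WORD block in place and repairs the blocks that referred to it. The sample blocks are made up and `removeWord` is a hypothetical helper, not a mutation operation offered by the library:

```javascript
// Made-up sample: a LINE block lists its WORD blocks as CHILD relationships
// and also stores the joined text.
const textBlocks = [
  { Id: "l1", BlockType: "LINE", Text: "Hello big world",
    Relationships: [{ Type: "CHILD", Ids: ["w1", "w2", "w3"] }] },
  { Id: "w1", BlockType: "WORD", Text: "Hello" },
  { Id: "w2", BlockType: "WORD", Text: "big" },
  { Id: "w3", BlockType: "WORD", Text: "world" },
];

// Remove a WORD block in place, drop its ID from every relationship that
// referenced it, and rebuild the text of any affected LINE.
function removeWord(blocks, wordId) {
  const index = blocks.findIndex((b) => b.Id === wordId && b.BlockType === "WORD");
  if (index < 0) return false;
  blocks.splice(index, 1);
  const byId = new Map(blocks.map((block) => [block.Id, block]));
  for (const block of blocks) {
    for (const rel of block.Relationships || []) {
      if (!rel.Ids.includes(wordId)) continue;
      rel.Ids = rel.Ids.filter((id) => id !== wordId);
      if (block.BlockType === "LINE" && rel.Type === "CHILD") {
        block.Text = rel.Ids
          .map((id) => byId.get(id))
          .filter(Boolean)
          .map((word) => word.Text)
          .join(" ");
      }
    }
  }
  return true;
}

removeWord(textBlocks, "w2");
console.log(textBlocks[0].Text, textBlocks[0].Relationships[0].Ids);
```

Skipping the second step would leave the LINE pointing at a block ID that no longer exists, which is the kind of inconsistency direct edits can introduce.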

In particular for array properties, you'll note that TRP generally exposes getters and iterators (such as table.nRows, table.iterRows(), table.listRows(), table.cellsAt()) rather than direct access to lists - to avoid implying that arbitrary array mutations (such as table.rows.pop()) are properly supported.
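The general idea behind this accessor style can be sketched with a small wrapper class: the array stays private, and callers only receive counts, copies, iterators, or single items. `RowCollection` is a hypothetical class written for illustration and is not taken from the library's source:

```javascript
// Hypothetical wrapper showing the count / list / iterate / fetch-by-index
// pattern over a private array.
class RowCollection {
  #rows;

  constructor(rows) {
    this.#rows = [...rows];
  }

  // Number of rows in the collection.
  get nRows() {
    return this.#rows.length;
  }

  // Snapshot copy: mutating the returned array does not affect the collection.
  listRows() {
    return this.#rows.slice();
  }

  // Iterate over rows without exposing the underlying array.
  *iterRows() {
    yield* this.#rows;
  }

  // Fetch one row by 0-based index, failing loudly when out of range.
  rowAtIndex(index) {
    if (!Number.isInteger(index) || index < 0 || index >= this.#rows.length) {
      throw new RangeError(`Row index ${index} is out of bounds`);
    }
    return this.#rows[index];
  }
}

const rowCollection = new RowCollection(["header", "first", "second"]);
rowCollection.listRows().pop(); // Only the snapshot copy is shortened
console.log(rowCollection.nRows);
```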

Other features and examples

For more examples of the features of the library, you can refer to the tests folder and/or the source code. If you have suggestions for additional features that would be useful, please open a GitHub issue!

Development

The integration tests for this library validate the end-to-end toolchain for calling Amazon Textract and parsing the result, so note that to run the full npm run test command:

Your environment will need to be configured with a login to AWS (e.g. via the AWS CLI)
Billable API requests may be made

You can alternatively run just the local/unit tests via npm run test:unit.
