feat!: resolvePDFJS for missing top-level await support in Cf workers
johannschopplich committed Nov 9, 2023
1 parent ad5eb92 commit 128a01e
Showing 10 changed files with 188 additions and 68 deletions.
93 changes: 82 additions & 11 deletions README.md
@@ -1,14 +1,12 @@
# pdfjs-serverless

A redistribution of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js) for serverless environments such as Deno Deploy and Cloudflare Workers, with zero dependencies. All named exports of the `PDF.js` library are available at roughly 1.4 MB (minified).
A redistribution of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js) for serverless environments such as Deno Deploy and Cloudflare Workers, with zero dependencies. The whole export is about 1.4 MB (minified).

## PDF.js Compatibility

> [!NOTE]
> This package is currently using PDF.js v4.0.189.
If you run into issues with the current version, please open an [issue](https://github.com/johannschopplich/pdfjs-serverless/issues/new/choose) or, even better, open a [pull request](https://github.com/johannschopplich/pdfjs-serverless/compare).

## Installation

Run the following command to add `pdfjs-serverless` to your project.
@@ -24,28 +22,44 @@ npm install pdfjs-serverless
yarn add pdfjs-serverless
```

## How It Works
## Usage

First, some string replacements in the `PDF.js` library are necessary, i.e. removing browser context references and checks like `typeof window`. Additionally, we enforce Node.js compatibility (this might sound paradoxical at first, bear with me), i.e. mocking the `canvas` module and setting the `isNodeJS` flag to `true`.
PDF.js v4 migrated to ESM, which is great. However, it also uses top-level await, which Cloudflare Workers do not support yet. Therefore, we have to wrap all named exports in a function that resolves the PDF.js library:

PDF.js uses a worker to parse and work with PDF documents. This worker is a separate file that is loaded by the main library. For the serverless build, we need to inline the worker code into the main library.
```ts
declare function resolvePDFJS(): Promise<typeof PDFJS>
```

To achieve the final nodeless build, [`unenv`](https://github.com/unjs/unenv) does the heavy lifting by converting Node.js-specific code to be platform-agnostic. This ensures that Node.js built-in modules like `fs` are mocked.
So, instead of importing the named exports directly:

See the [`rollup.config.ts`](./rollup.config.ts) file for more information.
```ts
import { getDocument } from 'pdfjs-serverless'
```

We have to use the `resolvePDFJS` function to get the named exports:

```ts
import { resolvePDFJS } from 'pdfjs-serverless'
const { getDocument } = await resolvePDFJS()
```

## Example Usage
> [!NOTE]
> Once Cloudflare Workers support top-level await, we can remove this wrapper and export all named exports directly again.
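
Because the wrapper has to be awaited before any PDF.js export can be used, it can make sense to resolve it once and share the result between requests. The following is a minimal sketch under that assumption, not part of the package: `once` and `getPDFJS` are hypothetical names, and `fakeResolvePDFJS` is a stand-in for `resolvePDFJS` so that the snippet runs on its own. Whether repeated `resolvePDFJS()` calls are expensive in practice has not been measured here.

```ts
// Generic helper: run an async initializer at most once and share its promise.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined
  return () => {
    if (cached)
      return cached
    // Drop the cached promise on failure so that a later call can retry.
    const pending = init().catch((error) => {
      cached = undefined
      throw error
    })
    cached = pending
    return pending
  }
}

// Hypothetical stand-in for `resolvePDFJS` (assumption: the real function
// has the same "no arguments, returns a promise" shape).
let initCount = 0
async function fakeResolvePDFJS(): Promise<{ version: string }> {
  initCount++
  return { version: 'stub' }
}

// Every caller awaits the same promise, so the initializer runs only once.
const getPDFJS = once(fakeResolvePDFJS)
```

In a real project, `once(resolvePDFJS)` would replace the stub, and handlers would call `await getPDFJS()` instead of `await resolvePDFJS()`.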

### 🦕 Deno

```ts
import { getDocument } from 'https://esm.sh/pdfjs-serverless'
import { resolvePDFJS } from 'https://esm.sh/pdfjs-serverless'
const data = Deno.readFileSync('dummy.pdf')
// Initialize PDF.js
const { getDocument } = await resolvePDFJS()
const data = Deno.readFileSync('sample.pdf')
const doc = await getDocument(data).promise
console.log(await doc.getMetadata())
// Iterate through each page and fetch the text content
for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i)
  const textContent = await page.getTextContent()
@@ -54,6 +68,63 @@ for (let i = 1; i <= doc.numPages; i++) {
}
```

### 🌩 Cloudflare Workers

```ts
import { resolvePDFJS } from 'pdfjs-serverless'
addEventListener('fetch', (event) => {
  event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
  if (request.method !== 'POST')
    return new Response('Method Not Allowed', { status: 405 })
  // Get the PDF file from the POST request body as a buffer
  const data = await request.arrayBuffer()
  // Initialize PDF.js
  const { getDocument } = await resolvePDFJS()
  const doc = await getDocument(data).promise
  // Get metadata and initialize output object
  const metadata = await doc.getMetadata()
  const output = {
    metadata,
    pages: []
  }
  // Iterate through each page and fetch the text content
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const textContent = await page.getTextContent()
    const contents = textContent.items.map(item => item.str).join(' ')
    // Add page content to output
    output.pages.push({
      pageNumber: i,
      content: contents
    })
  }
  // Return the results as JSON
  return new Response(JSON.stringify(output), {
    headers: { 'Content-Type': 'application/json' }
  })
}
```
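
The page loop in the examples above can be pulled into a helper that receives `getDocument` as an argument, which keeps it testable without loading PDF.js. This is a sketch only: `extractPages` and the `*Like` types are hypothetical names that describe just the parts of the PDF.js API used in this README, not the library's full typings. The `?? ''` fallback is an addition for items that carry no `str` property.

```ts
// Minimal structural types covering only what this sketch touches.
interface TextItemLike { str?: string }
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>
}
interface DocumentLike {
  numPages: number
  getPage(pageNumber: number): Promise<PageLike>
}
type GetDocumentLike = (data: ArrayBuffer | Uint8Array) => {
  promise: Promise<DocumentLike>
}

interface PageText { pageNumber: number, content: string }

// Collect the text of every page, in page order.
async function extractPages(
  getDocument: GetDocumentLike,
  data: ArrayBuffer | Uint8Array,
): Promise<PageText[]> {
  const doc = await getDocument(data).promise
  const pages: PageText[] = []
  // PDF.js page numbers start at 1.
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const textContent = await page.getTextContent()
    // Fall back to an empty string for items without a `str` property.
    const content = textContent.items.map(item => item.str ?? '').join(' ')
    pages.push({ pageNumber: i, content })
  }
  return pages
}
```

A handler would then call `extractPages(getDocument, data)` after `const { getDocument } = await resolvePDFJS()`.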

## How It Works

First, some string replacements in the `PDF.js` library are necessary, i.e. removing browser context references and checks like `typeof window`. Additionally, we enforce Node.js compatibility (this might sound paradoxical at first, bear with me), i.e. mocking the `canvas` module and setting the `isNodeJS` flag to `true`.

PDF.js uses a worker to parse and work with PDF documents. This worker is a separate file that is loaded by the main library. For the serverless build, we need to inline the worker code into the main library.

To achieve the final nodeless build, [`unenv`](https://github.com/unjs/unenv) does the heavy lifting by converting Node.js-specific code to be platform-agnostic. This ensures that Node.js built-in modules like `fs` are mocked.

See the [`rollup.config.ts`](./rollup.config.ts) file for more information.
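
To illustrate the string replacement step described above, here is a rough sketch of how `typeof window` checks could be neutralized in source text. The replacement table and the `applyReplacements` helper are hypothetical; the actual patterns live in `rollup.config.ts` and may look quite different.

```ts
// Hypothetical replacement table (assumption, not the project's real list).
const replacements: Array<[RegExp, string]> = [
  // Make the check evaluate the way it would outside a browser.
  [/typeof window\b/g, '"undefined"'],
  [/typeof document\b/g, '"undefined"'],
]

// Apply every replacement in order to the given source text.
function applyReplacements(source: string): string {
  return replacements.reduce(
    (code, [pattern, replacement]) => code.replace(pattern, replacement),
    source,
  )
}
```

The `\b` word boundary keeps identifiers such as `windowSize` untouched. A bundler plugin could run a function like this over each module before minification.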

## Inspiration

- [`pdf.mjs`](https://github.com/bru02/pdf.mjs), a nodeless build of PDF.js v2.
5 changes: 5 additions & 0 deletions package.json
@@ -58,5 +58,10 @@
"tslib": "^2.6.2",
"typescript": "5.2.2",
"unenv": "^1.7.4"
},
"pnpm": {
"patchedDependencies": {
"pdfjs-dist@4.0.189": "patches/pdfjs-dist@4.0.189.patch"
}
}
}
22 changes: 22 additions & 0 deletions patches/pdfjs-dist@4.0.189.patch
@@ -0,0 +1,22 @@
diff --git a/build/pdf.mjs b/build/pdf.mjs
index aa2be7dae0e57a4602cf87f3846e042b33ae419c..69959962a681710a684c0055b239ba2a22a720d4 100644
--- a/build/pdf.mjs
+++ b/build/pdf.mjs
@@ -20,6 +20,7 @@
* JavaScript code in this page
*/

+export async function initPDFJS() {
/******/ var __webpack_modules__ = ({

/***/ 640:
@@ -16877,7 +16878,8 @@ const AnnotationPrefix = "pdfjs_internal_id_";
/******/ var __webpack_exports__shadow = __webpack_exports__.shadow;
/******/ var __webpack_exports__updateTextLayer = __webpack_exports__.updateTextLayer;
/******/ var __webpack_exports__version = __webpack_exports__.version;
-/******/ export { __webpack_exports__AbortException as AbortException, __webpack_exports__AnnotationEditorLayer as AnnotationEditorLayer, __webpack_exports__AnnotationEditorParamsType as AnnotationEditorParamsType, __webpack_exports__AnnotationEditorType as AnnotationEditorType, __webpack_exports__AnnotationEditorUIManager as AnnotationEditorUIManager, __webpack_exports__AnnotationLayer as AnnotationLayer, __webpack_exports__AnnotationMode as AnnotationMode, __webpack_exports__CMapCompressionType as CMapCompressionType, __webpack_exports__DOMSVGFactory as DOMSVGFactory, __webpack_exports__FeatureTest as FeatureTest, __webpack_exports__GlobalWorkerOptions as GlobalWorkerOptions, __webpack_exports__ImageKind as ImageKind, __webpack_exports__InvalidPDFException as InvalidPDFException, __webpack_exports__MissingPDFException as MissingPDFException, __webpack_exports__OPS as OPS, __webpack_exports__PDFDataRangeTransport as PDFDataRangeTransport, __webpack_exports__PDFDateString as PDFDateString, __webpack_exports__PDFWorker as PDFWorker, __webpack_exports__PasswordResponses as PasswordResponses, __webpack_exports__PermissionFlag as PermissionFlag, __webpack_exports__PixelsPerInch as PixelsPerInch, __webpack_exports__PromiseCapability as PromiseCapability, __webpack_exports__RenderingCancelledException as RenderingCancelledException, __webpack_exports__UnexpectedResponseException as UnexpectedResponseException, __webpack_exports__Util as Util, __webpack_exports__VerbosityLevel as VerbosityLevel, __webpack_exports__XfaLayer as XfaLayer, __webpack_exports__build as build, __webpack_exports__createValidAbsoluteUrl as createValidAbsoluteUrl, __webpack_exports__getDocument as getDocument, __webpack_exports__getFilenameFromUrl as getFilenameFromUrl, __webpack_exports__getPdfFilenameFromUrl as getPdfFilenameFromUrl, __webpack_exports__getXfaPageViewport as getXfaPageViewport, __webpack_exports__isDataScheme as isDataScheme, __webpack_exports__isPdfFile as isPdfFile, 
__webpack_exports__noContextMenu as noContextMenu, __webpack_exports__normalizeUnicode as normalizeUnicode, __webpack_exports__renderTextLayer as renderTextLayer, __webpack_exports__setLayerDimensions as setLayerDimensions, __webpack_exports__shadow as shadow, __webpack_exports__updateTextLayer as updateTextLayer, __webpack_exports__version as version };
+/******/ return { AbortException: __webpack_exports__AbortException, AnnotationEditorLayer: __webpack_exports__AnnotationEditorLayer, AnnotationEditorParamsType: __webpack_exports__AnnotationEditorParamsType, AnnotationEditorType: __webpack_exports__AnnotationEditorType, AnnotationEditorUIManager: __webpack_exports__AnnotationEditorUIManager, AnnotationLayer: __webpack_exports__AnnotationLayer, AnnotationMode: __webpack_exports__AnnotationMode, CMapCompressionType: __webpack_exports__CMapCompressionType, DOMSVGFactory: __webpack_exports__DOMSVGFactory, FeatureTest: __webpack_exports__FeatureTest, GlobalWorkerOptions: __webpack_exports__GlobalWorkerOptions, ImageKind: __webpack_exports__ImageKind, InvalidPDFException: __webpack_exports__InvalidPDFException, MissingPDFException: __webpack_exports__MissingPDFException, OPS: __webpack_exports__OPS, PDFDataRangeTransport: __webpack_exports__PDFDataRangeTransport, PDFDateString: __webpack_exports__PDFDateString, PDFWorker: __webpack_exports__PDFWorker, PasswordResponses: __webpack_exports__PasswordResponses, PermissionFlag: __webpack_exports__PermissionFlag, PixelsPerInch: __webpack_exports__PixelsPerInch, PromiseCapability: __webpack_exports__PromiseCapability, RenderingCancelledException: __webpack_exports__RenderingCancelledException, UnexpectedResponseException: __webpack_exports__UnexpectedResponseException, Util: __webpack_exports__Util, VerbosityLevel: __webpack_exports__VerbosityLevel, XfaLayer: __webpack_exports__XfaLayer, build: __webpack_exports__build, createValidAbsoluteUrl: __webpack_exports__createValidAbsoluteUrl, getDocument: __webpack_exports__getDocument, getFilenameFromUrl: __webpack_exports__getFilenameFromUrl, getPdfFilenameFromUrl: __webpack_exports__getPdfFilenameFromUrl, getXfaPageViewport: __webpack_exports__getXfaPageViewport, isDataScheme: __webpack_exports__isDataScheme, isPdfFile: __webpack_exports__isPdfFile, noContextMenu: __webpack_exports__noContextMenu, normalizeUnicode: 
__webpack_exports__normalizeUnicode, renderTextLayer: __webpack_exports__renderTextLayer, setLayerDimensions: __webpack_exports__setLayerDimensions, shadow: __webpack_exports__shadow, updateTextLayer: __webpack_exports__updateTextLayer, version: __webpack_exports__version };
/******/
+}

//# sourceMappingURL=pdf.mjs.map
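
The patch above follows a general pattern: a module body that relies on top-level await is moved into an async function, and the former `export { … }` statement becomes a returned object. A toy version of the same idea, with hypothetical names (`loadConfig`, `initModule`, `greet`) that have nothing to do with PDF.js itself:

```ts
// Before (conceptually), the module body ran at import time:
//   const config = await loadConfig()
//   export function greet() { return `hello ${config.name}` }

// Hypothetical async setup step standing in for whatever the module awaits.
async function loadConfig(): Promise<{ name: string }> {
  return { name: 'pdf' }
}

// After: the same body lives inside an async function, so importing the
// module no longer requires top-level await. Callers opt in explicitly.
async function initModule() {
  const config = await loadConfig()

  function greet(): string {
    return `hello ${config.name}`
  }

  // The former named exports are returned as properties of one object.
  return { greet }
}
```

Consumers then write `const { greet } = await initModule()`, which mirrors `const { getDocument } = await resolvePDFJS()`. One consequence of this pattern is that each call runs the wrapped body again.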
10 changes: 8 additions & 2 deletions pnpm-lock.yaml

Generated file; diff not rendered.

96 changes: 52 additions & 44 deletions src/index.mjs
@@ -11,47 +11,55 @@ import * as __pdfjsWorker__ from 'pdfjs-dist/build/pdf.worker.mjs'
// Although we just need: `getDocument`, `OPS` and `version`, we export
// everything, since the bundle size doesn't change, due to PDF.js's
// bundle structure by webpack.
export {
AbortException,
AnnotationEditorLayer,
AnnotationEditorParamsType,
AnnotationEditorType,
AnnotationEditorUIManager,
AnnotationLayer,
AnnotationMode,
build,
CMapCompressionType,
createValidAbsoluteUrl,
DOMSVGFactory,
FeatureTest,
getDocument,
getFilenameFromUrl,
getPdfFilenameFromUrl,
getXfaPageViewport,
GlobalWorkerOptions,
ImageKind,
InvalidPDFException,
isDataScheme,
isPdfFile,
MissingPDFException,
noContextMenu,
normalizeUnicode,
OPS,
PasswordResponses,
PDFDataRangeTransport,
PDFDateString,
PDFWorker,
PermissionFlag,
PixelsPerInch,
PromiseCapability,
RenderingCancelledException,
renderTextLayer,
setLayerDimensions,
shadow,
UnexpectedResponseException,
updateTextLayer,
Util,
VerbosityLevel,
version,
XfaLayer,
} from 'pdfjs-dist/build/pdf.mjs'
// TODO: Enable again when Cloudflare supports top-level await.
// export {
// AbortException,
// AnnotationEditorLayer,
// AnnotationEditorParamsType,
// AnnotationEditorType,
// AnnotationEditorUIManager,
// AnnotationLayer,
// AnnotationMode,
// build,
// CMapCompressionType,
// createValidAbsoluteUrl,
// DOMSVGFactory,
// FeatureTest,
// getDocument,
// getFilenameFromUrl,
// getPdfFilenameFromUrl,
// getXfaPageViewport,
// GlobalWorkerOptions,
// ImageKind,
// InvalidPDFException,
// isDataScheme,
// isPdfFile,
// MissingPDFException,
// noContextMenu,
// normalizeUnicode,
// OPS,
// PasswordResponses,
// PDFDataRangeTransport,
// PDFDateString,
// PDFWorker,
// PermissionFlag,
// PixelsPerInch,
// PromiseCapability,
// RenderingCancelledException,
// renderTextLayer,
// setLayerDimensions,
// shadow,
// UnexpectedResponseException,
// updateTextLayer,
// Util,
// VerbosityLevel,
// version,
// XfaLayer,
// } from "pdfjs-dist/build/pdf.mjs";

// Wrap PDF.js exports to circumvent Cloudflare's top-level await limitation.
import { initPDFJS } from 'pdfjs-dist/build/pdf.mjs'

export function resolvePDFJS() {
return initPDFJS()
}
15 changes: 9 additions & 6 deletions src/mock/canvas.mjs
@@ -1,7 +1,10 @@
export default new Proxy({}, {
get(target, prop) {
return () => {
throw new Error(`[pdfjs-serverless] canvas.${prop} is not implemented`)
}
export default new Proxy(
{},
{
get(target, prop) {
return () => {
throw new Error(`[pdfjs-serverless] canvas.${prop} is not implemented`)
}
},
},
})
)
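
For readers unfamiliar with this kind of mock: the `Proxy` answers every property lookup with a function that throws, so importing the module is harmless and an error only surfaces when a member is actually called. A small standalone sketch of that behavior follows; `canvasMock` and `describeCall` are hypothetical names, and the sketch uses `String(prop)` to also cover symbol keys, which is a small deviation from the source.

```ts
// Same idea as the mock above: every property resolves to a throwing function.
const canvasMock = new Proxy<Record<string | symbol, unknown>>(
  {},
  {
    get(_target, prop) {
      return () => {
        // String(prop) also handles symbol keys, which a template literal rejects.
        throw new Error(`[mock] canvas.${String(prop)} is not implemented`)
      }
    },
  },
)

// Call a member of the mock and report what happened.
function describeCall(name: string): string {
  try {
    const member = canvasMock[name] as () => void
    member()
    return 'no error'
  }
  catch (error) {
    return (error as Error).message
  }
}
```

Reading `canvasMock.createCanvas` alone does not throw. Only invoking it does, which is why code paths that never touch `canvas` keep working.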
2 changes: 1 addition & 1 deletion src/mock/path2d-polyfill.mjs
@@ -3,7 +3,7 @@ export default new Proxy(
{
get(target, prop) {
return () => {
throw new Error(`[unpdf] path2d-polyfill.${prop} is not implemented`)
throw new Error(`[pdfjs-serverless] path2d-polyfill.${prop} is not implemented`)
}
},
},
8 changes: 6 additions & 2 deletions src/rollup/plugins.ts
@@ -5,10 +5,14 @@ export function pdfjsTypes(): Plugin {
return {
name: 'pdfjs-serverless:types',
async writeBundle() {
const data = 'export * from \'./types/src/pdf.d.ts\'\n'
const data = `
import * as PDFJS from './types/src/pdf'
declare function resolvePDFJS(): Promise<typeof PDFJS>
export { resolvePDFJS }
`.trimStart()

for (const filename of ['index.d.ts', 'index.d.mts'])
await writeFile(`dist/${filename}`, data, 'utf-8')
await writeFile(`dist/${filename}`, data, 'utf8')
},
}
}
5 changes: 3 additions & 2 deletions test/deno.ts
@@ -1,7 +1,8 @@
/* eslint-disable no-console */
import { getDocument } from '../dist/index.mjs'
import { resolvePDFJS } from '../dist/index.mjs'

const data = Deno.readFileSync('fixtures/dummy.pdf')
const { getDocument } = await resolvePDFJS()
const data = Deno.readFileSync('fixtures/sample.pdf')
const doc = await getDocument(data).promise

console.log(await doc.getMetadata())
File renamed without changes.
