Skip to content

Commit

Permalink
Merge pull request #60 from xi-00/es6
Browse files Browse the repository at this point in the history
Refactor PDF Library - Better compatibility and breaking changes
  • Loading branch information
ol-th authored Sep 27, 2024
2 parents c6e13be + ca89efa commit ba6580b
Show file tree
Hide file tree
Showing 21 changed files with 581 additions and 1,425 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -103,4 +103,7 @@ dist
# TernJS port file
.tern-port

# Webstorm idea folder
.idea

examples/outputImages/*.png
15 changes: 14 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,20 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Unreleased

### Added
- **ES6 Module Support**: Switched from `require()` to ES6 `import` for better compatibility with modern JavaScript frameworks.
- **Parallel Page Rendering**: Introduced parallel rendering of PDF pages using `Promise.all()` for improved performance.
- **GlobalWorkerOptions Setup**: Added explicit `GlobalWorkerOptions.workerSrc` to handle workers in ES6 environments more effectively.
- **Examples and Server Setup**: Added detailed examples and a simple Node.js server setup in the "examples" directory.
- **Improved Error Handling**: Added more informative error messages for invalid page numbers, dimensions, and scale values.
- **Updated `package.json` Keywords**: Added new keywords such as `es6`, `esm`, and `module` to improve discoverability in modern JavaScript environments.

### Fixed
- **Security**: Addressed a critical security vulnerability in previously used npm packages by updating `pdfjs-dist` to the latest versions.
- **Scaling Logic**: Refined the handling of `width`, `height`, and `scale` parameters to cover more edge cases in image rendering.

### Updated
- **README**: Revised the `README.md` to reflect new features and usage examples.
## 1.2.1 - 2023-04-16

## 1.2.0 - 2023-03-05
Expand Down
306 changes: 256 additions & 50 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,12 @@
# pdf-img-convert.js
**A pure javascript package to convert a PDF into images**

**This package is powered mainly by Mozilla's [PDF.js](https://github.com/mozilla/pdf.js)**
**A lightweight JavaScript package for converting PDFs into images,
built on Mozilla's powerful [PDF.js](https://github.com/mozilla/pdf.js) library.**

## Motivation

There are a lot of solutions for converting PDFs with javascript already but they all make excessive use of the filesystem in the form of
temporary files and use non-native binaries like ghostscript.
While there are numerous JavaScript solutions for converting PDFs, many rely heavily on the filesystem by creating temporary files and using non-native binaries like Ghostscript.

This solution solely uses javascript arrays, cleaning up the pipeline significantly and (hopefully) making it faster.
This solution simplifies the process by exclusively using JavaScript arrays, significantly streamlining the pipeline and, ideally, enhancing performance.

## Installation

Expand All @@ -18,86 +16,294 @@ npm install pdf-img-convert

## Usage

The package returns an `Array` of `Uint8Array` objects, each of which represents an image encoded in png format.
The package by default returns an `Array` of `Uint8Array` objects,
each of which represents an image encoded in png format or a base64-encoded image output if required

Here are some examples of its usage - obviously import the module first:
Here are some examples of how to use the module. First, make sure to import it

```javascript
var pdf2img = require('pdf-img-convert');
const pdf2img = await import("pdf-img-convert");
```
The package provides a single function, convert, which accepts the following PDF formats as input:

The package has 1 function - `convert`. It accepts the following pdf formats as input:

* URL of a PDF (e.g. www.example.com/a.pdf)
* URL of a PDF (e.g., www.example.com/sample.pdf)

* Path to a local pdf file (e.g. ../example.pdf)
* Path to a local PDF file (e.g., ./examples/sample.pdf)

* A `Buffer` object containing PDF data

* A `Uint8Array` object containing PDF data

* Base64-encoded PDF data

**NB: it is an asynchronous function so returns a `promise` object.**
**Note: The convert function is asynchronous and returns a `Promise` object.**

The output can be manipulated using the `conversion_config` argument mentioned below.

Here's an example of how to use it in synchronous code:
Here's an example of how to use it in synchronous code, `uint8arrays` as the default output:

```javascript
// Both HTTP and local paths are supported
var outputImages1 = pdf2img.convert('http://www.example.com/pdf_online.pdf');
var outputImages2 = pdf2img.convert('../pdf_in_local_filesystem.pdf');

// From here, the images can be used for other stuff or just saved if that's required:

var fs = require('fs');
import fs from 'fs';
import path from 'path';

async function processPDFs() {
const pdf2img = await import("pdf-img-convert");

// Both HTTP, HTTPS, and local paths are supported
const outputWithExternalLink = await pdf2img.convert('https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf');
const outputWithLocalSample = await pdf2img.convert('./test_pdfs/sample.pdf');

// OUTPUT OPTIONS ARRAY
const outputs = [outputWithExternalLink, outputWithLocalSample];

const pdfArray = outputs[0]; // Change the index to select different outputs

function saveImages(pdfArray) {
pdfArray.forEach((image, index) => {
const outputPath = path.join('./outputImages', `saveImages_${index}.png`);
fs.writeFile(outputPath, image, (error) => {
if (error) {
console.error(`Error saving image ${index + 1}:`, error);
} else {
console.log(`Image ${index + 1} saved successfully`);
}
});
});
}
// Call the function to save images
saveImages(pdfArray);
}
// Call the async function
processPDFs();
```

outputImages1.then(function(outputImages) {
for (i = 0; i < outputImages.length; i++)
fs.writeFile("output"+i+".png", outputImages[i], function (error) {
if (error) { console.error("Error: " + error); }
Here's an example of how to use it in synchronous code, `base64-encoded` output as the chosen option:
```javascript
import fs from 'fs';
import path from 'path';

let config = {
base64: true,
scale: 2
};
async function processPDF() {
const pdf2img = await import("pdf-img-convert");
const outputWithLocalSampleAndConfig = await pdf2img.convert('./examples/example.pdf', config);
const pdfArray = outputWithLocalSampleAndConfig;
function saveBase64Images(pdfArray) {
console.log('Processing base64Images...');
pdfArray.forEach((base64Data, index) => {
// Convert Base64 string to binary buffer
const buffer = Buffer.from(base64Data, 'base64');
// Define an output path
const outputPath = path.join('./outputImages', `saveBase64Image_${index}.png`);
// Write the buffer to a PNG file
fs.writeFile(outputPath, buffer, (error) => {
if (error) {
console.error(`Error saving image ${index + 1} (base64):`, error);
} else {
console.log(`Image ${index + 1} (base64) saved successfully`);
}
});
});
});
}
// Example usage: call this function with the Base64 array
saveBase64Images(pdfArray);
}
// Call the async function
processPDF();
```

It's a lot easier and cleaner to implement inside an `async function` using `await`:
Here's an example of how to use it in Next.js api folder, `base64-encoded` output as the chosen option:

```javascript
*Ensure you have a dotenv file with `ROOT_PATH` pointing to your root directory*

```dotenv
ROOT_PATH=./
```

*You have a `/public/uploads directory` in the root entry of your project, if not create it.*

(async function () {
pdfArray = await pdf2img.convert('http://www.example.com/pdf_online.pdf');
console.log("saving");
for (i = 0; i < pdfArray.length; i++){
fs.writeFile("output"+i+".png", pdfArray[i], function (error) {
if (error) { console.error("Error: " + error); }
}); //writeFile
} // for
})();

```javascript
import path from "path";
import fs from "fs";

const UPLOAD_DIR = path.resolve(process.env.ROOT_PATH ?? "", "public/uploads");

/*SHARP ENHANCEMENT*/
const writeToDir = async (imageData, fileName, filePath) => {
try {
await fs.promises.writeFile(filePath, imageData);
console.log(`Image processed and saved successfully: ${filePath}`);
return {
fileName: fileName,
filePath: filePath
};
} catch (error) {
console.error("Error processing the image:", error);
return null;
}
};
export const POST = async (req: NextRequest) => {
const pdf2img = await import("pdf-img-convert");
const formData = await req.formData();
const pdfFiles = []
const convertedImages = []

// Loop over the FormData content
for (const [key, value] of formData.entries()) {
if (value instanceof Blob) {
if (key === "pdfFiles") {
pdfFiles.push(value);
}
}
}
// check for file uploads
if (pdfFiles.length === 0) {
return NextResponse.json({
success: false,
message: "No files uploaded"
});
}
for (const file of pdfFiles) {
const pdfBuffer = Buffer.from(await file.arrayBuffer());
const imagePages = await pdf2img.convert(pdfBuffer, { base64: true }); // Ensure base64 is true to get Base64 data

for (let i = 0; i < imagePages.length; i++) {
const imageData = imagePages[i];
const base64Data = imageData.replace(/^data:image\/\w+;base64,/, ""); // Remove Base64 header
const imageBuffer = Buffer.from(base64Data, "base64"); // Convert Base64 to Buffer

const fileName = `${path.basename(file.name, path.extname(file.name))}_page_${i + 1}.png`;
const filePath = path.join(UPLOAD_DIR, fileName);

const writeFile = await writeToDir(imageBuffer, fileName, filePath);
if (writeFile) {
convertedImages.push(writeFile);
}
}
}
// if successfully
return NextResponse.json({
success: true,
message: "Files uploaded successfully",
});
}
```
Here is how to fetch single or multiple PDF documents from the client side using a `.jsx or .tsx` component:
*In this case my `api` route points to `/app/api/pdf2img`, you can change it to suite your own naming*
```jsx
"use client";
import React, { useRef, useState } from "react";
export default function PDFUploader() {
const MAX_FILES = 4;
const [fileNames, setFileNames] = useState([]);
const [pdfFiles, setPdfFiles] = useState([]);
const [errorMessage, setErrorMessage] = useState("");
const fileInputRef = useRef(null);

const handleFileChange = (e) => {
const files = Array.from(e.target.files);
if (files.length > MAX_FILES) {
setErrorMessage(`You can only upload a maximum of ${MAX_FILES} files.`);
clearStates();
} else {
const names = files.map((file) => file.name);
const pdfs = files.filter(file => file.type === "application/pdf");
setStates(pdfs, names);
}
};

const handleSubmit = async (e) => {
e.preventDefault();
if (pdfFiles.length === 0) {
setErrorMessage("No PDF files selected.");
return;
}
try {
const formData = new FormData();
pdfFiles.forEach((file) => {
formData.append("pdfFiles", file); // Key is "pdfFiles"
});

const response = await fetch("/api/pdf2img", {
method: "POST",
body: formData
});

const result = await response.json();
if (result.success) {
clearStates();
alert(`Upload successful: ${result.message}`);
} else {
alert(`Upload failed: ${result.message}`);
}
} catch (error) {
console.error("Upload error:", error);
alert("An error occurred during upload.");
}
};

const clearStates = () => {
setFileNames([]);
setPdfFiles([]);
if (fileInputRef.current) {
fileInputRef.current.value = ""; // Reset the file input via ref
}
};

const setStates = (pdfs, names) => {
setPdfFiles(pdfs);
setFileNames(names);
};

return (
<html>
<body style={{padding:0, left: 0}}>
<form onSubmit={handleSubmit}>
<input
ref={fileInputRef}
onChange={handleFileChange}
type="file"
name="fileUpload"
multiple
accept=".pdf"
/>
<button type="submit">Submit</button>
</form>
{errorMessage && (
<div>
<p style={{ textAlign: "center", color: "red" }}>{errorMessage}</p>
</div>
)}
</body>
</html>
);
}
```
There is also an optional second `conversion_config` argument which accepts an object like this:

```javascript
{
width: 100, //Number in px
height: 100, // Number in px
page_numbers: [1, 2, 3], // A list of pages to render instead of all of them
base64: true,
scale: 2.0
width: 100 //Number in px
height: 100 // Number in px
scale: 2
page_numbers: [1, 2, 3] // A list of pages to render instead of all of them
base64: true
}
```
The following attributes are optional and can be omitted from the configuration object:
(Any of these attributes can be omitted from the object - they're all optional)

* `width` or `height` control the scale of the output images - One or the other, it ignores height if width is supplied too.
* `width` or `height`: Controls the scale of the output images. If both are provided, only width will be used, and height will be ignored.
* `page_numbers` controls which pages are rendered - pages are 1-indexed.
* `page_numbers`: Specifies which pages to render. Pages are 1-indexed.
* `base64` should be set to `true` if a base64-encoded image output is required. Otherwise it'll just output an array of `Uint8Array`s.
* `base64`: Set to true to receive base64-encoded image output. If omitted or set to false, the output will be an array of Uint8Arrays.
* `scale` is the viewport scale ratio, which defaults to 1 (original width and height).
* `scale`: Defines the viewport scale ratio, defaulting to 1 (original width and height).
## Contributing
Expand Down
Loading

0 comments on commit ba6580b

Please sign in to comment.