PDF pages > concatenate > resize > jpg (but faster) #3774
The timings here suggest it's the relatively simple PNG decode to JPEG encode round trip that is the slowest part. Given you're using PDF inputs, I presume you're also using a globally-installed libvips compiled with support for a PDF library. The choice of PNG and zlib libraries will have an impact on PNG decode time; which alternatives have you tried? As with any performance question, a standalone repo with all code, dependencies and images that allows someone else to reproduce would be useful.
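(For reference, a minimal diagnostic sketch, assuming the standard sharp JavaScript API, to see which libvips build and format loaders a given install is actually using:)

```ts
import sharp from 'sharp';

// Versions of libvips and the libraries it was built against (png, zlib, etc.).
console.log(sharp.versions);

// Per-format support reported by the linked libvips; a 'pdf' entry with
// input support confirms a PDF loader is available.
console.log(sharp.format);
```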
@lovell Thanks for the quick reply. I'm new to Sharp and just learning how to optimize it as I take over some existing customer code. We have an image processing service (that spins up child processes to process queue jobs). The service keeps restarting when processing large files using the composite chain of commands I shared above, with docs of 30 pages and above. The cluster detects the service becoming unresponsive and spins up a new one in its place. I have yet to monitor and conclude whether it's a CPU or memory spike; I'm leaning towards the second. Also, the service's Dockerfile doesn't have libvips explicitly installed. This runs on generic EC2 instances in AWS via EKS, if that's of interest to you. I also see examples of Dockerfiles with SHARP_IGNORE_GLOBAL_LIBVIPS=1, but I don't know where to read more about it or whether it's advised.
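(Aside: SHARP_IGNORE_GLOBAL_LIBVIPS=1 is an install-time environment variable covered in sharp's installation documentation; it tells the install step to ignore any globally-installed libvips and use the prebuilt binaries instead. On the memory-spike side, here is a hedged sketch of runtime knobs that can reduce per-process memory pressure; whether they help this particular workload is an assumption:)

```ts
import sharp from 'sharp';

// Shrink libvips' operation cache so large intermediate images are not retained
// between tasks (the numbers here are illustrative, not recommendations).
sharp.cache({ memory: 50, files: 0, items: 100 });

// Limit the threads libvips uses per pipeline inside each child process.
sharp.concurrency(1);

// Snapshot of queued vs in-flight tasks, useful when a worker looks unresponsive.
console.log(sharp.counters());
```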
@simoami Were you able to make any progress with this?
Hi @lovell, yes, and sorry the last message lacked focus. The PDF-to-image conversion I wish to implement needs to scale up to hundreds of concurrent uses. To that end, I started working on a process that limits memory usage by using the file system when the number of pages is above a predefined threshold. I will post the code and profiling charts shortly.
@lovell Below is the code I wrote to implement the process workflow from my previous comment, along with the corresponding CPU/memory profiles (run on a Mac M2 Max, 32 GB). The memory-based process takes ~14s, while the file-system-based process takes ~19s. Timing breakdown for memory-based processing:
Timing breakdown for the file-based processing:
Follow-up questions:
Let me know if you see any possibility of improving this process, especially the file-based one.
Profiling
Memory-based PDF conversion (50-page PDF doc)
File System-based PDF conversion (50-page PDF doc)
Source (Partial)
/**
* Maximum dimension allowed for JPEG images.
*/
const MAX_JPEG_DIMENSION = 65_500;
const PAGE_CONVERSION_CONCURRENCY = 3;
/**
* threshold for saving to disk if the PDF page count exceeds this value
*/
const MIN_PAGES_FOR_DISK_SAVE = 10;
/**
* Main method
*/
async function pdfToImage(
pdfFileOrBuffer: string | ArrayBufferLike,
props?: PdfToImageOptions,
): Promise<sharp.OutputInfo> {
const pages = await _pdfToPages(pdfFileOrBuffer, props);
const shouldSaveToDisk = pages.length >= MIN_PAGES_FOR_DISK_SAVE;
const scale =
props?.viewportScale !== undefined
? props.viewportScale
: (PDF_CONVERSION_OPTIONS_DEFAULTS.viewportScale as number);
// find the page with the max width and return its value. We're going to use it as the reference width for all pages.
const maxWidth = _getMaxWidth(pages, scale);
const imagesToConcat: sharp.OverlayOptions[] = await _resizeImages(pages, maxWidth, shouldSaveToDisk);
// total height represents the vertical space occupied by all the pdf pages stacked up on top of each other.
// This is after stretching pages to have the same width and keeping their aspect ratio
const totalHeight = _getTotalHeight(imagesToConcat);
log.info(`generate a single page from ${imagesToConcat.length} page(s) with size ${maxWidth} x ${totalHeight}`);
const fullImage = await sharp({
create: {
width: maxWidth,
height: totalHeight,
channels: 3,
background: { r: 255, g: 255, b: 255 },
},
limitInputPixels: false,
})
.composite(imagesToConcat)
.raw()
.toBuffer({ resolveWithObject: true });
log.info(`output image produced with size ${fullImage.info.width} x ${fullImage.info.height}`);
// check if resulting height is larger than the JPEG size limit
// If so, resize the output image down to that height
const imageToSave: sharp.Sharp = await sharp(fullImage.data, {
raw: fullImage.info,
// prevents error: Input image exceeds pixel limit
limitInputPixels: false,
});
if (fullImage.info.height > MAX_JPEG_DIMENSION) {
imageToSave.resize(null, MAX_JPEG_DIMENSION);
log.info(
`output image resized to safe jpg limits with size ${Math.round(
(fullImage.info.width * MAX_JPEG_DIMENSION) / fullImage.info.height,
)} x ${MAX_JPEG_DIMENSION}`,
);
}
const outputFile = path.resolve(props?.outputFolder || __dirname, 'composite.jpg');
log.info(`save image to ${outputFile}`);
const output = imageToSave
// Enhance text clarity for OCR
// .normalise()
.jpeg({ quality: 80 })
.toFile(outputFile);
return output;
}
async function _pdfToPages(
pdfFileOrBuffer: string | ArrayBufferLike,
props?: PdfToImageOptions,
): Promise<PageOutput[]> {
try {
const isBuffer: boolean = Buffer.isBuffer(pdfFileOrBuffer);
const pdfFileBuffer: ArrayBuffer = isBuffer
? (pdfFileOrBuffer as ArrayBuffer)
: await readFile(pdfFileOrBuffer as string);
const canvasFactory = new NodeCanvasFactory();
const docInitParams = _getPDFDocInitParams(props);
docInitParams.data = new Uint8Array(pdfFileBuffer);
docInitParams.canvasFactory = canvasFactory;
const pdfDocument: pdfApiTypes.PDFDocumentProxy = await pdfjsLib.getDocument(docInitParams).promise;
const pageNumbers: number[] = Array.from({ length: pdfDocument.numPages }, (_, index) => index + 1);
const shouldSaveToDisk = pdfDocument.numPages >= MIN_PAGES_FOR_DISK_SAVE;
let pageName = PDF_CONVERSION_OPTIONS_DEFAULTS.outputFileMask;
if (props?.outputFileMask) {
pageName = props.outputFileMask;
}
if (!pageName && !isBuffer) {
pageName = path.parse(pdfFileOrBuffer as string).name;
}
const pageOutputs: PageOutput[] = await Bluebird.map(
pageNumbers,
(pageNumber) => _renderSinglePage(pdfDocument, pageNumber, pageName, canvasFactory, shouldSaveToDisk, props),
// no concurrency if saving to disk to reduce memory usage
{ concurrency: shouldSaveToDisk ? 1 : PAGE_CONVERSION_CONCURRENCY },
);
await pdfDocument.cleanup();
return pageOutputs;
} catch (err) {
log.error(err as Error);
throw err;
}
}
function _getMaxWidth(pages: PageOutput[], scale: number) {
return Math.min(
Math.floor(pages.reduce((previous, page) => Math.max(page.width, previous), 0) * scale),
MAX_JPEG_DIMENSION,
);
}
function _getTotalHeight(images: sharp.OverlayOptions[]) {
return images.reduce((total, current) => total + (current.raw?.height ?? 0), 0);
}
async function _resizeImages(pages: PageOutput[], targetWidth: number, shouldSaveToDisk: boolean) {
const imagesToConcat: sharp.OverlayOptions[] = [];
let totalHeight = 0;
for (let i = 0; i < pages.length; i++) {
const { content, path: filePath } = pages[i];
if (shouldSaveToDisk) {
// resize and save image to disk
const parsedPath = path.parse(filePath);
const newFilePath = path.join(parsedPath.dir, `${parsedPath.name}_resized${parsedPath.ext}`);
const resizedImage = await sharp(filePath).resize(targetWidth).jpeg({ quality: 80 }).toFile(newFilePath);
const roundedWidth = Math.floor(resizedImage.width);
const roundedHeight = Math.floor(resizedImage.height);
log.info(`resized page ${pages[i].pageNumber} to ${roundedWidth} x ${roundedHeight}`);
imagesToConcat.push({
input: filePath,
raw: { width: roundedWidth, height: roundedHeight, channels: resizedImage.channels },
left: 0,
top: totalHeight,
limitInputPixels: false,
});
totalHeight += roundedHeight;
} else {
// retain image as buffer
const resizedImage = await sharp(content).resize(targetWidth).raw().toBuffer({ resolveWithObject: true });
const roundedWidth = Math.floor(resizedImage.info.width);
const roundedHeight = Math.floor(resizedImage.info.height);
log.info(`resized page ${pages[i].pageNumber} to ${roundedWidth} x ${roundedHeight}`);
imagesToConcat.push({
input: resizedImage.data,
raw: { width: roundedWidth, height: roundedHeight, channels: resizedImage.info.channels },
left: 0,
top: totalHeight,
limitInputPixels: false,
});
totalHeight += roundedHeight;
}
}
return imagesToConcat;
}
Please see #1580
If I understand correctly, you could add a second pipeline to resize the concatenated image to within JPEG limits.
You could experiment with random access input, e.g. sharp(input, { sequentialRead: false }), which reduces disk I/O at the cost of increased memory usage. However this question is probably about the compositing step, so #1580 is relevant again, plus #179 might also be of interest.
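(For concreteness, a sketch of how those suggestions might fit the code above; maxWidth, totalHeight, imagesToConcat, MAX_JPEG_DIMENSION and outputFile are reused from the earlier snippet, and the exact pipeline shape is an assumption rather than a confirmed recommendation:)

```ts
// When reading each page back from disk, random access trades memory for fewer
// repeated reads, e.g. sharp(filePath, { sequentialRead: false }).

// Pipeline 1: composite all pages onto a white canvas and keep the raw pixels.
const composited = await sharp({
  create: {
    width: maxWidth,
    height: totalHeight,
    channels: 3,
    background: { r: 255, g: 255, b: 255 },
  },
  limitInputPixels: false,
})
  .composite(imagesToConcat)
  .raw()
  .toBuffer({ resolveWithObject: true });

// Pipeline 2: clamp the concatenated image to the JPEG height limit and encode.
await sharp(composited.data, { raw: composited.info, limitInputPixels: false })
  .resize({
    height: Math.min(composited.info.height, MAX_JPEG_DIMENSION),
    withoutEnlargement: true,
  })
  .jpeg({ quality: 80 })
  .toFile(outputFile);
```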
I hope this information helped. Please feel free to re-open with more details if further assistance is required.
Hello, I have the following workflow to convert a PDF file to a JPEG image for web viewing. This workflow becomes slow with large PDFs (50+ pages) and I would like to find out if there are avenues to improve processing speed:
I added some timing checkpoints to show the timing breakdown:
Is there a chance some of the intermediary steps can be omitted? e.g. toBuffer() is called 3 times, and png() is used only to make the chained commands work even though the format isn't needed. Any tips to improve performance are welcome.
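(A hedged sketch of one way to drop the PNG round trips: keep intermediate results as raw pixel buffers and re-wrap them with the raw option, so nothing is encoded until the final JPEG. pageBuffer and targetWidth are hypothetical stand-ins for the per-page data in the workflow above:)

```ts
// Resize a page and keep the result as raw pixels instead of encoding to PNG.
const resized = await sharp(pageBuffer)
  .resize(targetWidth)
  .raw()
  .toBuffer({ resolveWithObject: true });

// Subsequent steps accept the raw buffer directly, avoiding another decode.
const nextStep = sharp(resized.data, {
  raw: resized.info,
  limitInputPixels: false,
});
```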