ValueError: integer out of range converting 10585497845 from a 8-byte signed type to a 4-byte signed type #836
Comments
Some follow-up. To try to determine where in the series of commands things were going wrong, I modified my process in several ways. One, I had it process just one file at a time, since one theory was that it might be a memory error. Two, I disabled OCR and ran only the file optimization, since that seems to be where the error is happening. Three, I remade the PDFs using the JP2 images available at Archive.org, in case the PDF itself was somehow at fault. I also enabled verbose logging to learn more about what was happening (and also because looking at a terminal that appears to be doing nothing for several hours is frustrating). The command now looks like this:
For the first file that completed, this avoided the error; however, image optimization did not improve the file, which is a surprise. PS: the verbose logging showed what is probably a typo of "image" in lines like:
Looks like the typo is in optimize.py at line 100.
The optimizer does not try to optimize JPEG2000 images, because JPEG2000 is widely used in medical imaging but, to my knowledge, hasn't been adopted in many other fields. Also, if you are using whole-page color images, there won't be anything to compress with JBIG2, since there will be no monochrome images. (We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and would probably need a corporate sponsor so that I could work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.) Your new method avoids the error entirely because the optimizer has nothing to do (assuming the error does, in fact, lie in the optimizer). If you're able to, please run the original file with the -k switch.
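To make that concrete, here is a minimal pikepdf sketch (my own illustration, not part of OCRmyPDF; the file name and the choice of the per-page images accessor are assumptions) that tallies the compression filter of every image in a PDF, which shows whether the optimizer has anything to work with:

```python
import pikepdf
from collections import Counter

# Tally the compression filters of every image in the PDF. JPEG2000
# streams use /JPXDecode, which the optimizer leaves alone; JBIG2
# recompression applies only to monochrome (1 bit per component) images.
filters = Counter()
with pikepdf.open("GamalNorsk-JP2.pdf") as pdf:  # file name assumed
    for page in pdf.pages:
        for _, raw in page.images.items():
            f = raw.get("/Filter")
            if f is None:
                filters["(uncompressed)"] += 1
            elif isinstance(f, pikepdf.Array):  # /Filter may be an array
                filters.update(str(x) for x in f)
            else:
                filters[str(f)] += 1
print(filters)  # e.g. Counter({'/JPXDecode': 648}) for an all-JP2 book
```

If every image reports /JPXDecode, there are no JPEGs to transcode and no monochrome images for JBIG2, which would match the "optimization did not improve the file" result above.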
I should note that I made an error in my original post. The PDFs I linked to were not the issue; they were fine. In fact, after further testing, I confirmed the error occurs only with PDFs made from JP2 files via img2pdf, and only with some of them. For example, I ran my commands on PDF documents of 100, 200, 300, 400, and 500 pages made from JP2 files and they were fine. However, making a PDF out of the JP2 files of the book at https://archive.org/details/gamalnorskordbok00haeguoft/, which is 648 pages, consistently fails. Here is that PDF: https://www.dropbox.com/s/o7q5a8u4nujzsf6/GamalNorsk-JP2.pdf?dl=0 I ran the commands with the -k switch. However, the resulting temp directory is 86 GB. Is there a particular part of it you would most like to see? I'm enclosing a directory listing and the debug log, as they may be most helpful. I think you are right: something is happening at the end, when optimization is supposed to occur. The "metafix.pdf" file in the temp directory (10.3 GB, or else I would attach it) is a fully formed PDF that has OCRed text where it should be and seems to be correct in all respects. Created right after it are an empty "images" folder and a zero-byte "optimize.opt.pdf" file, after which the process failed and nothing further was done. If I then run an optimization command on metafix.pdf, skipping OCR,
I ran into this error as well. Here is my stack trace in case that helps:
Looks like it might be occurring in pikepdf.
@elliotwaite Can you try these two on your file?

```
qpdf input.pdf output.pdf
```

```python
import pikepdf

with pikepdf.open('input.pdf') as p:
    p.save('output.pdf')
```

Do you get a similar "integer out of range" error from either of these?
@grantbarrett I could not reproduce this with:

```
# issue836.pdf is GamalNorsk-JP2.pdf
ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf issue836.pdf issue836_.pdf
```

Perhaps you ran out of disk space? What version of libqpdf is installed?
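For anyone scripting the same test, here is roughly the equivalent call through OCRmyPDF's Python API (a sketch; the file names are carried over from the command above):

```python
import ocrmypdf

# Approximately the same as the CLI invocation above. A Tesseract
# timeout of 0 means no OCR text is produced, so in effect only the
# optimizer runs; skip_text tolerates pages that already contain text.
ocrmypdf.ocr(
    "issue836.pdf",
    "issue836_.pdf",
    tesseract_timeout=0,
    skip_text=True,
    jbig2_lossy=True,
    optimize=3,
    output_type="pdf",
)
```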
I haven't tried those yet. Here's my version info:
And here's my OS info: I have 128 GB of free disk space, so I don't think it was a disk space issue.
Also having this issue with macOS + Homebrew. The 64-bit integer (4403520865) that causes the overflow error is suspiciously close to the PDF file size, which is 4403741686 bytes.
Stacktrace
Software versions
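That observation is consistent with a byte offset near the end of a roughly 4.4 GB file being narrowed to a 4-byte signed integer somewhere in the stack. A quick check of the arithmetic (my own sketch, using the two numbers reported above):

```python
offset = 4_403_520_865     # the value from the error message
file_size = 4_403_741_686  # the reported PDF size in bytes

print(offset > 2**31 - 1)   # True: too large for a signed 32-bit integer
print(offset <= 2**63 - 1)  # True: fits easily in a signed 64-bit integer
print(file_size - offset)   # 220821: the offset points near the end of the file
```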
@jbarlow83 qpdf is at 10.4.0. I have more than 190 GB of space free. For problem files, for the time being, I am converting the JP2 images to JPG, then running img2pdf, and then my usual ocrmypdf commands.
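A sketch of that workaround, as I understand the description (the directory and file names are assumptions; reading .jp2 requires a Pillow build with OpenJPEG support, and the JPEG re-encode is lossy, with quality=90 an arbitrary choice):

```python
from pathlib import Path

import img2pdf
from PIL import Image

# Convert each JP2 page to JPEG, then assemble the JPEGs into a PDF.
# img2pdf embeds the JPEG data as-is, so no further quality is lost.
jpegs = []
for jp2 in sorted(Path("pages").glob("*.jp2")):
    jpg = jp2.with_suffix(".jpg")
    Image.open(jp2).convert("RGB").save(jpg, quality=90)
    jpegs.append(str(jpg))

with open("book.pdf", "wb") as f:
    f.write(img2pdf.convert(jpegs))
```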
Also couldn't reproduce with: Apple Silicon, macOS 12.0.1, GamalNorsk-JP2.pdf, ocrmypdf 13.1.1 (Homebrew), Python 3.9.9 (Homebrew), qpdf 10.4.0, img2pdf 0.4.3, same command line as previously attempted. I'll try my old Intel Mac....
Also not reproducible on Intel Mac, macOS 10.15 Catalina, same code and file.
Closing due to inactivity and older versions.
Trying OCRmyPDF for the first time today and ran into this issue. In case it is helpful for troubleshooting, this is what I am seeing in my WSL2 Ubuntu 20.04 LTS environment:
Since earlier in the thread it seemed app versions were contributing to the problem, I spun up another WSL2 environment running Ubuntu 22.x. The error goes away (this is the same PDF file).
My novice conclusion is that old software versions may be the root cause of this issue. Happy to provide additional detail if needed.
Same issue here. Traceback:
Also getting the same issue. Was anyone able to determine how to fix it?
For anyone still seeing this issue: it is because you have a very old version of pikepdf and qpdf. I am locking the thread because that is the answer. For Anaconda users, please note that Anaconda's version of OCRmyPDF is 2 major releases behind and its version of pikepdf is 3 major releases behind. Anaconda packages often lag significantly behind pretty much everything. Please consider switching to the standard Python ecosystem, which is often much more stable. If you are able to reproduce on ocrmypdf 16+ or pikepdf 8+, open a new issue.
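If you are not sure which versions your environment actually resolves to, a quick check (a sketch; pikepdf exposes the libqpdf it was built against as __libqpdf_version__):

```python
import ocrmypdf
import pikepdf

# Print the versions actually imported; a stale Anaconda install can
# shadow a newer pip-installed copy.
print("ocrmypdf:", ocrmypdf.__version__)
print("pikepdf:", pikepdf.__version__)
print("libqpdf:", pikepdf.__libqpdf_version__)
```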
When processing certain files, an "integer out of range" error occurs. The error on two files I processed today:
The command I usually run is something like the one I ran today. I typically have a directory with a number of files that require OCR using the same language(s), and this runs through the directory, OCRing everything. The high timeout is to accommodate dense pages of text and multiple languages, as large multilingual dictionaries (which I do a lot of) are slow to process.
The files I was working with can be downloaded from Archive.org here (note that Archive.org is having uptime problems as of today, Sept. 19, 17:00 UTC):
https://archive.org/download/gamalnorskordbok00haeguoft/gamalnorskordbok00haeguoft.pdf
https://archive.org/download/nynorsketymologi00torp/nynorsketymologi00torp.pdf
Expected behavior
Usually these errors do not occur and the post-processing completes fine. It is not clear to me why it happens on some files and not on others. It does not seem to be related to file length. For example, the files it happened on today were 908 pages and 652 pages, which are far from the upper boundaries of the files I work with; I process a lot of multilingual dictionaries with high page counts and most of them do not run into this problem.
Here is the full text of the exception leading up to the error:
System