Corrupted PDF file export with hOCR data #486

Closed
tukusejssirs opened this issue Jan 12, 2021 · 2 comments

Comments

@tukusejssirs

@manisandro, I believe it is either not yet fixed or not fully fixed.

I installed gIR v3.3.1 (GTK) from the Fedora repos and loaded 846 images (mostly text, three simple tables, plus the front and back covers as images). All images were processed in ST Advanced at 600 dpi. A typical image is around 500 KiB; 5 images are between 1.2 and 2.9 MiB, and the covers are 16 MiB and 36 MiB respectively. Altogether, they total 251 MiB.

gIR crashes a lot, and loading the hOCR HTML takes some time (this computer has 10 GB of RAM and a CPU with 2 cores and 4 threads).

However, sometimes it works (albeit slowly). I’ve exported the PDF file, but both Evince and Adobe Acrobat Reader (both on Linux) fail to open it and report that it is corrupted/damaged.
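To narrow down what "corrupted" means for an exported file, one quick check is whether the basic PDF markers are present at all. The sketch below is only a sanity check under the assumption that the export was cut short (for example by a crash during writing); it is not a validator, and the function name `pdf_sanity_check` and the path `export.pdf` are hypothetical.

```python
from pathlib import Path


def pdf_sanity_check(path, tail_bytes=1024):
    """Return a list of problems found; an empty list means the basic markers are present."""
    data = Path(path).read_bytes()
    problems = []
    # A PDF file is expected to begin with a "%PDF-" version header.
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The end of the file should carry the end-of-file marker and the
    # "startxref" keyword pointing at the cross-reference data.
    tail = data[-tail_bytes:]
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker near end of file")
    if b"startxref" not in tail:
        problems.append("missing startxref near end of file")
    return problems


# Hypothetical usage:
# print(pdf_sanity_check("export.pdf"))
```

If the end markers are missing, the file was most likely truncated while being written; if they are all present, the damage is somewhere inside the file and this check cannot say where.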

I have no idea how to generate a PDF file with hOCR data. I’ve read about the hocr-tools package and its hocr-pdf command, and about hocr2pdf from the exactImage package, but I could not produce such a PDF with either (from image files or a pre-created PDF file, plus the hOCR data).

Originally posted by @tukusejssirs in #424 (comment)

@tukusejssirs
Author

@manisandro, you requested some traces. I have none, nor could I reproduce the crash ad hoc. However, I do have some core dumps available. The 41 dumps I have from 3 January 2021 take up 13 GiB, so instead I compressed the text that the coredumpctl [pid] command outputs. You could check them all, but I suggest starting with the latest (number 40). When you find something, tell me the number and I’ll upload that dump somewhere.

Note that not all core dumps are connected with this issue. Some crashes are connected with #479 (in general, with HTML entities, especially the ampersand character), others with the image data or with something else I couldn’t figure out.

dump_texts.zip

@manisandro
Owner

This looks like a crash in Gtk - if you run the application from the terminal, do you see any output when this happens?
