
feat: properly treat errors from recording out of bounds positions in very large files #290

Merged


galkahana
Owner

@galkahana galkahana commented Dec 7, 2024

The library's PDF output is currently limited to ~10 GB.
This is due to a limitation of PDF when writing regular xrefs: 10 digits are used to represent a position, so anything larger than what 10 digits can represent is simply out of bounds. It seems I always had a check at object start for whether the object being written already exceeds this position, but never bothered propagating that fact. There's a recent issue about writing very large PDF files (#289), and this neglected failure produces bad artifacts.
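
To make the 10-digit limit concrete, here is a minimal sketch of formatting a classic in-use xref entry: a 10-digit offset, a 5-digit generation number, the `n` marker, and a 2-byte line ending, 20 bytes in total. `FormatXrefEntry` and `kMaxXrefOffset` are hypothetical names for illustration only and are not the library's code.

```cpp
#include <cstdio>
#include <string>

// Largest byte offset that fits in the 10-digit field of a classic xref entry.
constexpr long long kMaxXrefOffset = 9999999999LL;

// Hypothetical helper (not part of the library): writes a 20-byte in-use
// xref entry into `outEntry`. Returns false, leaving `outEntry` untouched,
// when the offset or generation cannot be represented.
bool FormatXrefEntry(long long offset, int generation, std::string& outEntry) {
    if (offset < 0 || offset > kMaxXrefOffset) return false;
    if (generation < 0 || generation > 65535) return false;

    char buffer[32];
    // 10-digit offset, space, 5-digit generation, space, 'n', CR LF.
    std::snprintf(buffer, sizeof(buffer), "%010lld %05d n\r\n", offset, generation);
    outEntry = buffer;
    return true;
}
```

The point is that an offset of 10,000,000,000 or more has no valid representation, so the only correct behavior is to report failure rather than write an entry.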

PDF 1.5 and higher provide a way to write xrefs as xref streams and, more importantly, to set the length of the position descriptors, so you can go beyond 10 digits if you need to create larger files. See towards the end for plans to support this.
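
As a rough illustration of what "setting the length" means: in an xref stream the offset is stored as a fixed-width big-endian binary field, and the width in bytes is declared in the stream's `W` array. The sketch below picks the smallest width for a given maximum offset and encodes a value at that width. `OffsetFieldWidth` and `EncodeBigEndian` are hypothetical helpers, not library functions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: smallest number of bytes (1..8) that can hold
// `maxOffset` as an unsigned big-endian integer.
int OffsetFieldWidth(std::uint64_t maxOffset) {
    int width = 1;
    while (width < 8 && (maxOffset >> (8 * width)) != 0) {
        ++width;
    }
    return width;
}

// Hypothetical helper: encodes `value` into exactly `width` bytes,
// most significant byte first. Bits that do not fit are dropped, so the
// caller should choose `width` with OffsetFieldWidth.
std::vector<unsigned char> EncodeBigEndian(std::uint64_t value, int width) {
    std::vector<unsigned char> bytes(static_cast<std::size_t>(width));
    for (int i = width - 1; i >= 0; --i) {
        bytes[static_cast<std::size_t>(i)] = static_cast<unsigned char>(value & 0xFF);
        value >>= 8;
    }
    return bytes;
}
```

With this scheme a 5-byte field already covers offsets up to 2^40 - 1 (about 1.1 TB), well past the 10-digit limit of regular xrefs.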

Now, I knew this would be a terrible one to fix, given that it's such basic, low-level functionality in the library, directly affecting things like starting a new object in either fashion. Still, what needs fixing will be fixed.

Most of the methods were able to properly propagate status without breaking the API. I sometimes had to convert a void function to return a status, which doesn't in itself break existing usages (though it may be good to start consulting the result). Other than that, things seem OK.

One area where this change affects the convenience of the API is the content context. It's really nice to be able to write output without having to check status, and this kind of ruins it.
So, to keep things simple, I coded it so that you can write multiple commands without consulting the returned status, and when done you can check the status of the context to learn whether there was a failure. So not after each command, but rather after a sequence of commands. Use AbstractContentContext::GetCurrentStatusCode() to get the current status. It can only go from eSuccess to eFailure, so once it does, there's no point in continuing.
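
The "write several commands, check once" idea can be sketched with a sticky status flag. Everything below is a self-contained stand-in, assuming a simplified context that only counts bytes: `SketchContentContext`, its `Write` method, and the local `EStatusCode` enum are hypothetical and only mirror the naming in the description. The one name taken from the library is `GetCurrentStatusCode()`.

```cpp
#include <string>

// Local stand-in for the library's status type (assumption for this sketch).
enum EStatusCode { eSuccess, eFailure };

// Hypothetical content context: tracks a write position and fails once
// the position would exceed a representable maximum.
class SketchContentContext {
public:
    explicit SketchContentContext(long long maxPosition)
        : mMaxPosition(maxPosition), mPosition(0), mStatus(eSuccess) {}

    // Writes a command; returns *this so calls can be chained
    // without checking a status after each one.
    SketchContentContext& Write(const std::string& command) {
        if (mStatus != eSuccess) return *this;  // sticky: once failed, stay failed
        long long size = static_cast<long long>(command.size());
        if (mPosition + size > mMaxPosition) {
            mStatus = eFailure;                 // eSuccess -> eFailure, never back
            return *this;
        }
        mPosition += size;
        return *this;
    }

    EStatusCode GetCurrentStatusCode() const { return mStatus; }

private:
    long long mMaxPosition;
    long long mPosition;
    EStatusCode mStatus;
};
```

A caller would chain a sequence such as `ctx.Write("q\n").Write("Q\n")` and then test `ctx.GetCurrentStatusCode()` once at the end, stopping as soon as it reports eFailure.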

Do note that even with all these options to check status, things behave the same as before. Only if you manage to exceed 10 GB or so will the new behavior properly propagate the inability to represent positions of that size. You are probably not doing that now if your files do not come out corrupted. The only difference is that instead of getting an error on PDFWriter::EndPDF(), you'll get it earlier, at the point where the library notices that it has exceeded what the xref can represent.

While it's nice to get an early warning, it'd be nicer to be able to create very large files (well, at least until we get to long long, which is how file sizes are represented here). I'm thinking of taking care of this by providing 1.5-style xref streams. I've been dealing with those since adding file modification years ago; I just need to provide a method to write them on regular files as well, and an option to choose them (and the byte size). "I'll think positively about doing that" is about what I'm willing to say at this point ;).

@galkahana galkahana merged commit 015bd5e into master Dec 7, 2024
7 checks passed
@galkahana galkahana deleted the galk.safe_xref_writing.check_status_when_starting_object branch December 8, 2024 20:22