ENH: compress pdf files merging identical objects #2795

Merged (19 commits) on Aug 13, 2024
20 changes: 7 additions & 13 deletions docs/user/file-size.md
@@ -9,23 +9,17 @@ Some PDF documents contain the same object multiple times. For example, if an
image appears three times in a PDF it could be embedded three times. Or it can
be embedded once and referenced twice.

-This can be done by reading and writing the file:
+When adding data to a PdfWriter, the data is copied while respecting the original format.
+For example, if two pages include the same image and that image is duplicated in the source document, the object will also be duplicated in the PdfWriter object.

-```python
-from pypdf import PdfReader, PdfWriter

-reader = PdfReader("big-old-file.pdf")
-writer = PdfWriter()
+Additionally, when you delete objects in a document, pypdf cannot easily identify whether the objects are used elsewhere or whether the user wants to keep them. When the PDF file is written, these objects remain in the file but are hidden (not displayed).

-for page in reader.pages:
-    writer.add_page(page)
+In order to reduce the file size, use a compression call: `writer.compress_identical_objects(remove_identicals=True, remove_orphans=True)`

-if reader.metadata is not None:
-    writer.add_metadata(reader.metadata)
+* `remove_identicals` enables/disables compression by merging identical objects.
+* `remove_orphans` enables/disables removal of unused objects.

-with open("smaller-new-file.pdf", "wb") as fp:
-    writer.write(fp)
-```
+It is recommended to apply this process just before writing to a file or stream.

It depends on the PDF how well this works, but we have seen an 86% file
reduction (from 5.7 MB to 0.8 MB) within a real PDF.
22 changes: 17 additions & 5 deletions pypdf/_text_extraction/_layout_mode/_font.py
@@ -44,7 +44,7 @@ def __post_init__(self) -> None:
                 self.font_dictionary["/DescendantFonts"]
             ):
                 while isinstance(d_font, IndirectObject):
-                    d_font = d_font.get_object()  # type: ignore[assignment]
+                    d_font = d_font.get_object()
                 self.font_dictionary["/DescendantFonts"][d_font_idx] = d_font
             ord_map = {
                 ord(_target): _surrogate
@@ -75,20 +75,32 @@ def __post_init__(self) -> None:
                         {
                             ord_map[_cidx]: _width
                             for _cidx, _width in zip(
-                                range(cast(int, start_idx), cast(int, start_idx) + len(width_list), 1),
+                                range(
+                                    cast(int, start_idx),
+                                    cast(int, start_idx) + len(width_list),
+                                    1,
+                                ),
                                 width_list,
                             )
                             if _cidx in ord_map
                         }
                     )
                     skip_count = 1
                 # check for format (2): `int int int`
-                elif isinstance(w_next_entry, (int, float)) and isinstance(_w[idx + 2].get_object(), (int, float)):
-                    start_idx, stop_idx, const_width = w_entry, w_next_entry, _w[idx + 2].get_object()
+                elif isinstance(w_next_entry, (int, float)) and isinstance(
+                    _w[idx + 2].get_object(), (int, float)
+                ):
+                    start_idx, stop_idx, const_width = (
+                        w_entry,
+                        w_next_entry,
+                        _w[idx + 2].get_object(),
+                    )
                     self.width_map.update(
                         {
                             ord_map[_cidx]: const_width
-                            for _cidx in range(cast(int, start_idx), cast(int, stop_idx + 1), 1)
+                            for _cidx in range(
+                                cast(int, start_idx), cast(int, stop_idx + 1), 1
+                            )
                             if _cidx in ord_map
                         }
                     )
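
For readers unfamiliar with the two width-array formats this hunk handles, the following standalone sketch may help. It is not pypdf's implementation: `parse_widths` is a hypothetical helper, plain Python ints and lists stand in for PDF objects, and the result is keyed by raw character code rather than going through `ord_map`.

```python
from typing import Dict, List, Sequence, Union

Number = Union[int, float]


def parse_widths(w: Sequence[Union[Number, List[Number]]]) -> Dict[int, Number]:
    """Expand a width array into a {character code: width} map (hypothetical helper)."""
    widths: Dict[int, Number] = {}
    idx = 0
    while idx < len(w):
        start = int(w[idx])
        next_entry = w[idx + 1]
        if isinstance(next_entry, list):
            # Format (1): `int [int int ...]` assigns consecutive widths from start.
            for offset, width in enumerate(next_entry):
                widths[start + offset] = width
            idx += 2
        else:
            # Format (2): `int int int` assigns one width to the range start..stop.
            stop, const_width = int(next_entry), w[idx + 2]
            for code in range(start, stop + 1):
                widths[code] = const_width
            idx += 3
    return widths


example = parse_widths([45, [500, 600, 700], 50, 52, 250])
print(example)  # {45: 500, 46: 600, 47: 700, 50: 250, 51: 250, 52: 250}
```

The sketch advances the index explicitly instead of using the `skip_count` bookkeeping seen in the diff; both approaches consume two entries for format (1) and three for format (2).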