ENH: Add level parameter to compress_content_streams #2044

Merged 2 commits on Aug 2, 2023
27 changes: 21 additions & 6 deletions docs/user/file-size.md
@@ -1,4 +1,4 @@
# Reduce PDF Size
# Reduce PDF File Size

There are multiple ways to reduce the size of a given PDF file. The easiest
one is to remove content (e.g. images) or pages.
@@ -77,7 +77,8 @@ pypdf supports the FlateDecode filter which uses the zlib/deflate compression
method. It is a lossless compression, meaning the resulting PDF looks exactly
the same.

Deflate compression can be applied to a page via [`page.compress_content_streams`](https://pypdf.readthedocs.io/en/latest/modules/PageObject.html#pypdf._page.PageObject.compress_content_streams):
Deflate compression can be applied to a page via
[`page.compress_content_streams`](https://pypdf.readthedocs.io/en/latest/modules/PageObject.html#pypdf._page.PageObject.compress_content_streams):

```python
from pypdf import PdfReader, PdfWriter
@@ -96,15 +97,29 @@ with open("out.pdf", "wb") as f:
    writer.write(f)
```

`page.compress_content_streams` uses [`zlib.compress`](https://docs.python.org/3/library/zlib.html#zlib.compress)
and supports the `level` parameter: `level=0` means no compression,
`level=9` is the highest compression, and the default `level=-1` selects
zlib's default trade-off between speed and size.

Using this method, we have seen a reduction by 70% (from 11.8 MB to 3.5 MB)
with a real PDF.
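
The effect of `level` can be sketched with the standard library alone. This is an illustration rather than pypdf code: the sample bytes are an invented stand-in for a page content stream, and real PDFs will compress differently.

```python
import zlib

# Invented stand-in for a page content stream: repetitive PDF operators.
sample = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 500

# -1 is zlib's default level, 0 stores the data without compressing it,
# and 9 asks for the strongest (slowest) compression.
sizes = {level: len(zlib.compress(sample, level)) for level in (0, 1, -1, 9)}

for level, size in sizes.items():
    print(f"level={level:>2}: {size} bytes (input: {len(sample)} bytes)")
```

Note that `level=0` still wraps the data in the zlib container, so its output is slightly larger than the input.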

## Removing Sources

When a page is removed from the page list, its content will still be present in the PDF file. This means that the data may still be used elsewhere.
When a page is removed from the page list, its content will still be present in
the PDF file. This means that the data may still be used elsewhere.

Simply removing a page from the page list will reduce the page count but not the file size. In order to exclude the content completely, the pages should not be added to the PDF using the PdfWriter.append() function. Instead, only the desired pages should be selected for inclusion (note: [PR #1843](https://github.com/py-pdf/pypdf/pull/1843) will add a page deletion feature).
Simply removing a page from the page list will reduce the page count but not the
file size. In order to exclude the content completely, the pages should not be
added to the PDF using the PdfWriter.append() function. Instead, only the
desired pages should be selected for inclusion
(note: [PR #1843](https://github.com/py-pdf/pypdf/pull/1843) will add a page
deletion feature).

There can be issues with poor PDF formatting, such as when all pages are linked to the same resource. In such cases, dropping references to specific pages becomes useless because there is only one source for all pages.
There can be issues with poor PDF formatting, such as when all pages are linked
to the same resource. In such cases, dropping references to specific pages
becomes useless because there is only one source for all pages.

Cropping is an ineffective method for reducing the file size because it only adjusts the viewboxes and not the external parts of the source image. Therefore, the content that is no longer visible will still be present in the PDF.
Cropping is an ineffective method for reducing the file size because it only
adjusts the viewboxes and not the external parts of the source image. Therefore,
the content that is no longer visible will still be present in the PDF.
4 changes: 2 additions & 2 deletions pypdf/_page.py
@@ -1763,7 +1763,7 @@ def scaleTo(self, width: float, height: float) -> None: # deprecated
         deprecation_with_replacement("scaleTo", "scale_to", "3.0.0")
         self.scale_to(width, height)

-    def compress_content_streams(self) -> None:
+    def compress_content_streams(self, level: int = -1) -> None:
         """
         Compress the size of this page by joining all content streams and
         applying a FlateDecode filter.
@@ -1773,7 +1773,7 @@ def compress_content_streams(self) -> None:
         """
         content = self.get_contents()
         if content is not None:
-            content_obj = content.flate_encode()
+            content_obj = content.flate_encode(level)
             try:
                 content.indirect_reference.pdf._objects[  # type: ignore
                     content.indirect_reference.idnum - 1  # type: ignore
5 changes: 3 additions & 2 deletions pypdf/filters.py
@@ -225,17 +225,18 @@ def _decode_png_prediction(data: str, columns: int, rowlength: int) -> bytes:
         return output.getvalue()

     @staticmethod
-    def encode(data: bytes) -> bytes:
+    def encode(data: bytes, level: int = -1) -> bytes:
         """
         Compress the input data using zlib.

         Args:
             data: The data to be compressed.
+            level: See https://docs.python.org/3/library/zlib.html#zlib.compress

         Returns:
             The compressed data.
         """
-        return zlib.compress(data)
+        return zlib.compress(data, level)


 class ASCIIHexDecode:
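
As a rough check of the patched `encode` signature, the sketch below uses a hypothetical stand-in function (`flate_encode_bytes`, not part of pypdf) that forwards `level` to `zlib.compress` in the same way. It shows that every accepted level round-trips losslessly, and that the patch adds no validation of its own, so an out-of-range level surfaces as a `zlib.error`.

```python
import zlib


def flate_encode_bytes(data: bytes, level: int = -1) -> bytes:
    """Hypothetical stand-in mirroring the patched ``encode(data, level)``."""
    return zlib.compress(data, level)


# Invented sample resembling content-stream operators.
data = b"q 1 0 0 1 0 0 cm /Im0 Do Q\n" * 200

# Whatever the level (-1 and 0..9), decompression returns the original bytes.
for level in range(-1, 10):
    assert zlib.decompress(flate_encode_bytes(data, level)) == data

# Levels outside that range are rejected by zlib itself.
rejected = False
try:
    flate_encode_bytes(data, 10)
except zlib.error:
    rejected = True

print("level 10 rejected:", rejected)
```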
4 changes: 2 additions & 2 deletions pypdf/generic/_data_structures.py
@@ -880,7 +880,7 @@ def flateEncode(self) -> "EncodedStreamObject": # deprecated
         deprecation_with_replacement("flateEncode", "flate_encode", "3.0.0")
         return self.flate_encode()

-    def flate_encode(self) -> "EncodedStreamObject":
+    def flate_encode(self, level: int = -1) -> "EncodedStreamObject":
         from ..filters import FlateDecode

         if SA.FILTER in self:
@@ -909,7 +909,7 @@ def flate_encode(self) -> "EncodedStreamObject":
         retval[NameObject(SA.FILTER)] = f
         if parms is not None:
             retval[NameObject(SA.DECODE_PARMS)] = parms
-        retval._data = FlateDecode.encode(self._data)
+        retval._data = FlateDecode.encode(self._data, level)
         return retval


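
Taken together, the three hunks thread one optional argument from the page method down to `zlib.compress`. The sketch below models that call chain with simplified, hypothetical classes (`FakeStream` and `FakePage` are not pypdf classes and omit all filter handling). It suggests why `-1` is a backward-compatible default: it is also the default of `zlib.compress`, so callers that pass nothing get the same bytes as before the change.

```python
import zlib


class FakeStream:
    """Hypothetical stand-in for a content stream (not a pypdf class)."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    def flate_encode(self, level: int = -1) -> "FakeStream":
        # Forward the caller's level unchanged to zlib.
        return FakeStream(zlib.compress(self.data, level))


class FakePage:
    """Hypothetical stand-in for a page holding one content stream."""

    def __init__(self, content: FakeStream) -> None:
        self.content = content

    def compress_content_streams(self, level: int = -1) -> None:
        # Replace the content with its compressed form at the given level.
        self.content = self.content.flate_encode(level)


raw = b"0 0 m 100 100 l S\n" * 300  # invented sample operators

default_page = FakePage(FakeStream(raw))
default_page.compress_content_streams()  # no level given: uses -1

stored_page = FakePage(FakeStream(raw))
stored_page.compress_content_streams(level=0)  # stored, not compressed

print(len(raw), len(default_page.content.data), len(stored_page.content.data))
```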