TST: Add LzwCodec for encoding #2883

MartinThoma · 2024-09-28T11:14:03Z

This is a change I have wanted to make for a while :-)

While we might only need decoding for pypdf, having both decoding and encoding in one class massively helps with testing. We can still get it wrong, but it's harder to get both the encoder and the decoder wrong in a consistent way.

This PR adds an abstract Codec class as well as an LzwCodec implementation.
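The PR's code is not quoted in this thread, so the following is only a rough sketch of what "an abstract class plus one implementation" can look like. The method names `encode` and `decode` and the class `ToyLzwCodec` are assumptions made for illustration. The toy codec is textbook LZW with fixed 16-bit codes, not the variable-width format that PDF's LZWDecode filter uses.

```python
import struct
from abc import ABC, abstractmethod


class Codec(ABC):
    """Hypothetical interface: one object that can both encode and decode."""

    @abstractmethod
    def encode(self, data: bytes) -> bytes:
        """Return the encoded form of data."""

    @abstractmethod
    def decode(self, data: bytes) -> bytes:
        """Return the original bytes for encoded data."""


class ToyLzwCodec(Codec):
    """Textbook LZW storing every code as a big-endian 16-bit integer.

    Illustration only: no clear-table or end-of-data markers and no
    variable code width, so the output is NOT valid PDF LZWDecode data.
    """

    MAX_CODES = 1 << 16  # a 16-bit code can address at most 65536 entries

    def encode(self, data: bytes) -> bytes:
        table = {bytes([i]): i for i in range(256)}
        codes = []
        current = b""
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                current = candidate  # keep extending the match
            else:
                codes.append(table[current])
                if len(table) < self.MAX_CODES:
                    table[candidate] = len(table)
                current = bytes([byte])
        if current:
            codes.append(table[current])
        return struct.pack(f">{len(codes)}H", *codes)

    def decode(self, data: bytes) -> bytes:
        codes = struct.unpack(f">{len(data) // 2}H", data)
        if not codes:
            return b""
        table = {i: bytes([i]) for i in range(256)}
        previous = table[codes[0]]
        out = [previous]
        for code in codes[1:]:
            if code in table:
                entry = table[code]
            elif code == len(table):
                # The code refers to the entry that is about to be created.
                entry = previous + previous[:1]
            else:
                raise ValueError(f"Invalid LZW code: {code}")
            out.append(entry)
            if len(table) < self.MAX_CODES:
                table[len(table)] = previous + entry[:1]
            previous = entry
        return b"".join(out)
```

Keeping both directions behind one interface means a test can be written once against `Codec` and reused for every implementation.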

We could even use hypothesis for property-based testing for all codecs :-)
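As a rough illustration of the round-trip property such a test would check, here is a sketch that uses only the standard library. `hypothesis` would generate and shrink the inputs itself; `random` stands in for it here so the example runs without extra dependencies. The helper name `check_round_trip` is hypothetical, and `zlib` is used purely as a stand-in codec.

```python
import random
import zlib
from typing import Callable


def check_round_trip(
    encode: Callable[[bytes], bytes],
    decode: Callable[[bytes], bytes],
    runs: int = 200,
    seed: int = 0,
) -> None:
    """Raise AssertionError if decode(encode(data)) != data for any sample."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(runs):
        length = rng.randrange(0, 512)
        # Alternate between highly repetitive and arbitrary byte content.
        alphabet = rng.choice([b"ab", bytes(range(256))])
        data = bytes(rng.choice(alphabet) for _ in range(length))
        restored = decode(encode(data))
        if restored != data:
            raise AssertionError(f"Round trip failed for {data!r}")


# Stand-in codec from the standard library; any encode/decode pair fits.
check_round_trip(zlib.compress, zlib.decompress)
```

The same helper could be pointed at any codec that exposes an encode/decode pair.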

pubpub-zz · 2024-09-28T11:19:16Z

Can you clarify what you intend to do with the encoding?

codecov · 2024-09-28T11:22:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.27%. Comparing base (ab21802) to head (6564768).
Report is 1 commits behind head on main.

@@            Coverage Diff             @@
##             main    #2883      +/-   ##
==========================================
+ Coverage   96.24%   96.27%   +0.02%     
==========================================
  Files          51       52       +1     
  Lines        8625     8692      +67     
  Branches     1722     1734      +12     
==========================================
+ Hits         8301     8368      +67     
  Misses        187      187              
  Partials      137      137


MartinThoma · 2024-09-28T12:38:57Z

I want to have an easy way to check if the decoding does the right thing.

This allows us to change the LZW implementation with the confidence that we don't break workflows.

MartinThoma · 2024-09-29T08:51:43Z

@Lucas-C Maybe the encoder is interesting for fpdf2? It wasn't in any discussion, but only mentioned in py-pdf/fpdf2#691

stefan6419846 · 2024-09-29T08:58:08Z

I would recommend making the encoder's interface roughly the same as the decoder's, i.e. passing the data to `__init__` instead.

MartinThoma · 2024-09-29T10:55:45Z

@stefan6419846 Done :-) 👍

stefan6419846 · 2024-09-29T11:03:48Z

Are we sure that we want to expose this as a public module while we do not officially support encoding objects with LZW? I would lean towards making codecs an internal module for now.

pubpub-zz · 2024-09-29T11:12:43Z

Codecs might be kept private and called directly from filters.

As the tests currently stand, you have proved that you have a function and its inverse. I would have liked at least a minimal test that also checks the compressed data.
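To make the concern concrete, here is a small sketch of a consistently wrong encoder/decoder pair that still round-trips. Hex encoding via `binascii` is only a stand-in for LZW because its expected output is easy to state; all function names are hypothetical.

```python
import binascii


def good_encode(data: bytes) -> bytes:
    return binascii.hexlify(data)


def good_decode(data: bytes) -> bytes:
    return binascii.unhexlify(data)


def buggy_encode(data: bytes) -> bytes:
    # Bug: the bytes are reversed before encoding.
    return binascii.hexlify(data[::-1])


def buggy_decode(data: bytes) -> bytes:
    # The matching "bug" in the decoder hides the mistake from round trips.
    return binascii.unhexlify(data)[::-1]


def round_trips(encode, decode, data: bytes) -> bool:
    return decode(encode(data)) == data


def matches_known_answer(encode, data: bytes, expected: bytes) -> bool:
    return encode(data) == expected


# A documented (non-encoded, encoded) pair: "abc" is 61 62 63 in hex.
KNOWN_PAIR = (b"abc", b"616263")

print(round_trips(buggy_encode, buggy_decode, b"abc"))    # True
print(matches_known_answer(buggy_encode, *KNOWN_PAIR))    # False
print(matches_known_answer(good_encode, *KNOWN_PAIR))     # True
```

Only the comparison against fixed encoded bytes exposes the buggy pair, which is why a test on the compressed data complements the round-trip check.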

MartinThoma · 2024-09-29T12:24:04Z

Good point, I made it private. I'm uncertain about the module name, but as it is private it should not matter too much.

MartinThoma · 2024-09-29T12:30:59Z

@pubpub-zz You're absolutely right. I've added two examples. I would feel even better if there were documented (non-encoded, encoded) pairs that we could add to the test suite, but for now that should be fine.

Lucas-C · 2024-10-04T11:46:40Z

> @Lucas-C Maybe the encoder is interesting for fpdf2? It wasn't in any discussion, but only mentioned in py-pdf/fpdf2#691

Yes, it could make for an interesting addition!
Thank you for the ping.

I opened py-pdf/fpdf2#1271 to suggest this feature.
Maybe it will be picked up during Hacktoberfest!

Just to be clear, @MartinThoma: are you explicitly allowing fpdf2 to include the code from this PR?

@hpierre001

## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specificier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)

TST: Add lzw.lzw_encode

8a34e3f

Add failing test

609cd7a

MartinThoma added 3 commits September 29, 2024 10:00

Use Codec-class

32c18bc

Refinements

fa6dfb3

Fix off-by-one errors

5cc1048

MartinThoma changed the title ~~TST: Add lzw.lzw_encode~~ TST: Add LzwCodec for encoding Sep 29, 2024

Merge branch 'main' into lzw-refactoring

9e12a4a

MartinThoma requested review from stefan6419846 and pubpub-zz September 29, 2024 08:37

MartinThoma added 2 commits September 29, 2024 10:46

Code reuse

59ec6b9

Add ABC

42a2d98

MartinThoma added the is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements label Sep 29, 2024

MartinThoma added 2 commits September 29, 2024 12:53

Merge branch 'main' into lzw-refactoring

553c0c1

pass data to init

1ab8163

Make it private

5d1248d

Test encoded value

6564768

stefan6419846 approved these changes Sep 29, 2024

MartinThoma merged commit 42de71a into main Sep 29, 2024
16 checks passed

MartinThoma deleted the lzw-refactoring branch September 29, 2024 13:58

Lucas-C mentioned this pull request Oct 4, 2024

Add support for LZWDecode compression py-pdf/fpdf2#1271

Closed
