Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST: Add LzwCodec for encoding #2883

Merged
merged 12 commits into from
Sep 29, 2024
Merged

TST: Add LzwCodec for encoding #2883

merged 12 commits into from
Sep 29, 2024

Conversation

MartinThoma
Copy link
Member

@MartinThoma MartinThoma commented Sep 28, 2024

This is a change I wanted to do for a while :-)

While we might only need decoding for pypdf, having both decoding and encoding in one class massively helps with testing. We can still get it wrong, but it's harder to get both the encoder and the decoder wrong in a consistent way.

This PR adds an abstract Codec class as well as an LzwCodec implementation.

We could even use hypothesis for property-based testing for all codecs :-)

@pubpub-zz
Copy link
Collaborator

can you clarify what you intend to do with the encoding?

Copy link

codecov bot commented Sep 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.27%. Comparing base (ab21802) to head (6564768).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2883      +/-   ##
==========================================
+ Coverage   96.24%   96.27%   +0.02%     
==========================================
  Files          51       52       +1     
  Lines        8625     8692      +67     
  Branches     1722     1734      +12     
==========================================
+ Hits         8301     8368      +67     
  Misses        187      187              
  Partials      137      137              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MartinThoma
Copy link
Member Author

I want to have an easy way to check if the decoding does the right thing.

This allows us to change the LZW implementation with the confidence that we don't break workflows.

@MartinThoma MartinThoma changed the title TST: Add lzw.lzw_encode TST: Add LzwCodec for encoding Sep 29, 2024
@MartinThoma
Copy link
Member Author

@Lucas-C Maybe the encoder is interesting for fpdf2? It wasn't in any discussion, but only mentioned in py-pdf/fpdf2#691

@MartinThoma MartinThoma added the is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements label Sep 29, 2024
@stefan6419846
Copy link
Collaborator

I would recommend to make the encoder roughly the same as the decoder, id est passing data to __init__ instead.

@MartinThoma
Copy link
Member Author

@stefan6419846 Done :-) 👍

@stefan6419846
Copy link
Collaborator

Are we sure that we want to expose this as a public module while we do not officially support encoding objects with LZW? I tend to make codecs an internal module for now.

@pubpub-zz
Copy link
Collaborator

Codecs might be kept private and called directly from filters.

For the tests, as it currently is, you have proved that you have a function and the inverted function : I would have liked to have a minimum test that check also the compressed data

@MartinThoma
Copy link
Member Author

Good point, I made it private. I'm uncertain about the module name, but as it is private it should not matter too much.

@MartinThoma
Copy link
Member Author

@pubpub-zz You're absolutely right. I've added two examples. I would feel even better if there was documented (non-encoded, encoded) pairs that we could add to the test suite, but for now that should be fine.

@MartinThoma MartinThoma merged commit 42de71a into main Sep 29, 2024
16 checks passed
@MartinThoma MartinThoma deleted the lzw-refactoring branch September 29, 2024 13:58
@Lucas-C
Copy link
Member

Lucas-C commented Oct 4, 2024

@Lucas-C Maybe the encoder is interesting for fpdf2? It wasn't in any discussion, but only mentioned in py-pdf/fpdf2#691

Yes, it could make for an interesting addition!
Thank you for the ping.

I opened py-pdf/fpdf2#1271 to suggest this feature.
Maybe it will be picked up during Hacktoberfest!

Just to be clear @MartinThoma : are you explicitly allowing fpdf2 to include the code included in this PR?

stefan6419846 added a commit that referenced this pull request Oct 27, 2024
## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specificier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants