Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add arguments to make_zim_file and relax dependencies constraints #147

Merged
merged 5 commits into from
Mar 25, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Mar 19, 2024

Changes

  • Add support for disable_metadata_checks, ignore_duplicates and compression arguments in make_zim_file function ("zimwritefs-mode")
    • We forgot about this disable_metadata_checks when adding this functionality ... while it is the first (and maybe even main) use case ...
      • This is a minor fix, but mandatory for migration of TED scraper to Zimscraperlib 3.x
    • We should support as well ignore_duplicates in make_zim_file function ("zimwritefs-mode")
      • It will default to True, since it most cases this is what we want (we usually do not run a deduplication function before-hand on files stored in the filesystem
      • But we could (should ?) do it add replace these duplicates by aliases / redirections
  • Relax dependencies versions
    • Tests are running fine, this is especially important for lxml + beautifulsoup4 where some scraper might need features from a more recent version and usage in zimscraperlib is so limited that it probably doesn't hurt to relax constraints (at least all tests are green)
  • Upgrade dev/qa dependencies

I intend to release this as 3.3.2 as soon as it is merged since we want to proceed with TED upgrade to benefit from new encoder presets.

@benoit74 benoit74 self-assigned this Mar 19, 2024
@benoit74 benoit74 added this to the 3.3.2 milestone Mar 19, 2024
@benoit74 benoit74 force-pushed the scraperlib_3.3.2 branch 2 times, most recently from 1a38020 to 91addfe Compare March 19, 2024 08:07
Copy link

codecov bot commented Mar 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (b479038) to head (d18275c).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #147   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           32        32           
  Lines         1399      1399           
  Branches       240       240           
=========================================
  Hits          1399      1399           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Add support for `disable_metadata_checks`, `ignore_duplicates` and `compression` arguments in `make_zim_file` function ("zimwritefs-mode")
@benoit74 benoit74 changed the title Add arguments to make_zim_file and fix reencode return types Add arguments to make_zim_file and relax dependencies constraints Mar 19, 2024
@benoit74
Copy link
Collaborator Author

Nota: this has already been tested locally to work fine with ted scraper

@benoit74 benoit74 marked this pull request as ready for review March 19, 2024 08:20
@benoit74 benoit74 requested a review from rgaudin March 19, 2024 08:20
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if there's a reason behind those additions or if you were just filling a gap

src/zimscraperlib/zim/filesystem.py Outdated Show resolved Hide resolved
src/zimscraperlib/zim/filesystem.py Outdated Show resolved Hide resolved
@benoit74
Copy link
Collaborator Author

I was just filling a gap, since it was intentional I will remove this right now.

@benoit74
Copy link
Collaborator Author

Change done and first comment updated to reflect the new PR content.

@benoit74 benoit74 requested a review from rgaudin March 19, 2024 15:45
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the ignore_duplicates? We do want this one. Especially on old scrapers still using this to bundle assets

@benoit74
Copy link
Collaborator Author

Arf, I added two args to fill the gap and you made two remarks, so I misread them as being linked to these two args. I did not realized your first remark was about the workaround_nocancel which was already there before this PR. Let's discuss it live to be sure we are aligned

@benoit74
Copy link
Collaborator Author

After discussion:

  • we keep workaround_nocancel as-is
  • we add ignore_duplicates which is always True (not exposed)

After some thought, I think that we have to expose ignore_duplicates, someone very cautious in his scraper could run a deduplication on the files before calling make_zim_file, add needed redirects / aliases after that, and expect the libzim to stop if it detects a duplicate.

@benoit74 benoit74 requested a review from rgaudin March 22, 2024 06:51
@rgaudin
Copy link
Member

rgaudin commented Mar 22, 2024

👍 ; as long as it defaults to True :)

@rgaudin
Copy link
Member

rgaudin commented Mar 22, 2024

But someone cautious would not be using the zimwriterfs mode 😅

@benoit74
Copy link
Collaborator Author

Does your 👍 means you've reviewed this PR and I can merge or that you've approved my proposition?

@rgaudin
Copy link
Member

rgaudin commented Mar 22, 2024

Yes

@benoit74 benoit74 merged commit 5f126ab into main Mar 25, 2024
10 checks passed
@benoit74 benoit74 deleted the scraperlib_3.3.2 branch March 25, 2024 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants