Ignore resource WARC records for now #198

Merged
merged 1 commit into warc2zim2 from ignore_resources
Mar 4, 2024

benoit74 (Collaborator) commented Feb 29, 2024

Fix #197

add_items_for_warc_record is already called only for records of type resource, response and revisit (i.e. request records are ignored) due to the filtering logic of iter_warc_records.
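
The filtering described above can be sketched as follows. This is a simplified illustration, not the project's actual code: the `Record` class and `HANDLED_TYPES` set are hypothetical stand-ins, and this version of `iter_warc_records` takes already-parsed records rather than WARC files.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


# Hypothetical stand-in for a parsed WARC record; only the type matters here.
@dataclass
class Record:
    rec_type: str
    url: str


# Record types that reach add_items_for_warc_record, per the description above.
HANDLED_TYPES = {"resource", "response", "revisit"}


def iter_warc_records(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only the handled record types; request records are dropped."""
    for record in records:
        if record.rec_type in HANDLED_TYPES:
            yield record
```

With such a filter in place, a request record never reaches the per-record handler, so the handler only has to distinguish between the three remaining types.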

Previously, the logic that adds a real item to the ZIM was skipped only when the record type was revisit, i.e. it ran for both resource and response records.

As explained in #197, resource WARC records must not be added as-is to the ZIM: they are not regular web responses (as their WARC type implies).

The logic is hence inverted, which also makes it future-proof:

  • apply the default logic to add an item to the ZIM only if the record is a response
  • apply the logic to compute revisits only if the record is a revisit
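
A minimal sketch of the inversion, assuming the decision depends only on the record type string. The three helper functions are hypothetical names introduced for illustration; the real `add_items_for_warc_record` does considerably more than this.

```python
def adds_item_before(rec_type: str) -> bool:
    # Old check (deny-list): every type except revisit was added as an item.
    return rec_type != "revisit"


def adds_item_after(rec_type: str) -> bool:
    # New check (allow-list): only response records become ZIM items.
    return rec_type == "response"


def computes_revisit_after(rec_type: str) -> bool:
    # Revisit handling now runs only for revisit records.
    return rec_type == "revisit"


# Compare both checks on the types that pass the filter, plus another WARC
# record type ("conversion") that might be let through some day.
for rec_type in ("response", "resource", "revisit", "conversion"):
    print(rec_type, adds_item_before(rec_type), adds_item_after(rec_type))
```

The allow-list form is what makes the change future-proof: a record type that is later let through the filter is ignored by default instead of being silently added to the ZIM.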

@benoit74 benoit74 self-assigned this Feb 29, 2024
@benoit74 benoit74 changed the base branch from main to warc2zim2 February 29, 2024 17:16
@benoit74 benoit74 changed the title Ignore resources Ignore resource WARC records for now Feb 29, 2024
@benoit74 benoit74 marked this pull request as ready for review February 29, 2024 17:30
codecov bot commented Feb 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.12%. Comparing base (6fedf54) to head (8120a1f).
Report is 1 commits behind head on warc2zim2.

Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #198      +/-   ##
=============================================
- Coverage      86.22%   86.12%   -0.11%     
=============================================
  Files             13       13              
  Lines            973      973              
  Branches         176      176              
=============================================
- Hits             839      838       -1     
- Misses           110      111       +1     
  Partials          24       24              


rgaudin (Member) left a comment

LGTM

@benoit74 benoit74 requested a review from mgautierfr March 4, 2024 09:40
@benoit74 benoit74 merged commit 233bd3e into warc2zim2 Mar 4, 2024
5 of 6 checks passed
@benoit74 benoit74 deleted the ignore_resources branch March 4, 2024 14:50
Linked issue: Do not include resource WARC records inside the ZIM