Upgrade browsertrix crawler and remove redirect handling #285

Merged: 6 commits, Mar 7, 2024

@benoit74 benoit74 commented Feb 29, 2024

Fix #256
Fix #284
Fix #166

This PR adopts browsertrix crawler 1.0.0-beta.6 (initially 1.0.0-beta5).

Among other things, this release now handles redirects nicely (webrecorder/browsertrix-crawler#476).

We therefore have to remove the redirect handling we previously did on our side, which caused issues (#256). We only keep the cleaning of the URL (removing default ports 443 and 80).
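As an illustration of the URL cleaning that is kept, here is a minimal sketch. It is not the actual code in `src/zimit/zimit.py`: the function name `strip_default_port` is hypothetical, and the choice to drop a port only when it is the default for the URL's own scheme is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

# Default port per scheme (assumption: only these two schemes matter here)
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url without an explicit port when it is the scheme's default.

    Hypothetical helper for illustration only, not zimit's implementation.
    """
    parts = urlsplit(url)
    if parts.port is None or DEFAULT_PORTS.get(parts.scheme) != parts.port:
        # No explicit port, or a non-default one: leave the URL untouched
        return url
    # An explicit port is present, so the last ":" in netloc precedes it
    netloc = parts.netloc.rsplit(":", 1)[0]
    return urlunsplit(parts._replace(netloc=netloc))


print(strip_default_port("https://metafilter.com:443"))
```

With such a helper, the `--url="https://metafilter.com:443"` value used in the redirect test would be normalized to `https://metafilter.com` before being passed to the crawler, while a URL with a non-default port would be left unchanged.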

As a side effect, this will also solve #166, since browsertrix crawler is already permissive about SSL certificate issues. The only SSL issues which will continue to be blocked are the ones where the browser cannot establish the connection at all, like https://panzer-war.com/ where the browser has no cipher in common with the server.

Redirect handling has been tested with https://metafilter.com:

docker run -v $PWD/output:/output:rw --name zimit2_test --rm local-zimit:zimit2 zimit --limit 10 --adminEmail="contact+zimfarm@kiwix.org" --description="Test" --lang="en" --name="metafilter.com_en_all" --output="/output" --publisher="openZIM" --scopeType="prefix" --statsFilename="/output/task_progress.json" --title="Metafilter" --url="https://metafilter.com:443" --verbose

Handling of insecure connections has been tested with https://www.moneyinstructor.com (which still fails without the simplification of check_url):

docker run -v $PWD/output:/output:rw --name zimit2_test --rm local-zimit:zimit2 zimit --limit 10 --adminEmail="contact+zimfarm@kiwix.org" --description="Test" --lang="en" --name="www.moneyinstructor.com_en_all" --output="/output" --publisher="openZIM" --scopeType="prefix" --statsFilename="/output/task_progress.json" --title="MoneyInstructor" --url="https://www.moneyinstructor.com/" --verbose

This PR should not be merged before openzim/warc2zim#196

@benoit74 benoit74 self-assigned this Feb 29, 2024

codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 16.66667% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 14.91%. Comparing base (857ae56) to head (5c71674).
Report is 1 commits behind head on zimit2.

| Files | Patch % | Lines |
|---|---|---|
| src/zimit/zimit.py | 16.66% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           zimit2     #285      +/-   ##
==========================================
+ Coverage   14.88%   14.91%   +0.03%     
==========================================
  Files           1        1              
  Lines         262      248      -14     
  Branches       38       35       -3     
==========================================
- Hits           39       37       -2     
+ Misses        223      211      -12     

@benoit74
Collaborator Author

"Luckily", tests are failing due to openzim/warc2zim#198 (but even once this is merged, we still need to wait for openzim/warc2zim#196)

@benoit74 benoit74 marked this pull request as ready for review February 29, 2024 17:23
@benoit74
Collaborator Author

@mgautierfr I did not ask you for a formal review of this since, as far as I understand, you are less experienced with zimit, but do not hesitate to have a look and comment as well.

@benoit74
Collaborator Author

benoit74 commented Mar 7, 2024

I had to fix the tests by updating the number of expected WARC records from 8 to 7, because the "weird / unexpected" https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic is no longer in the WARC (item #3 below).

Before:

| WARC item | Comment |
|---|---|
| response https://isago.rskg.org/ | 1 |
| request https://isago.rskg.org/ | not included (request) |
| response https://isago.rskg.org/static/favicon256.png | 2 |
| request https://isago.rskg.org/static/favicon256.png | not included (request) |
| response https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic | 3 |
| request https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic | not included (request) |
| response https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css | 4 |
| request https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css | not included (request) |
| response https://isago.rskg.org/static/tarifs-isago.pdf | 5 |
| request https://isago.rskg.org/static/tarifs-isago.pdf | not included (request) |
| response https://isago.rskg.org/conseils | 6 |
| request https://isago.rskg.org/conseils | not included (request) |
| response https://isago.rskg.org/a-propos | 7 |
| request https://isago.rskg.org/a-propos | not included (request) |
| response https://isago.rskg.org/faq | 8 |
| request https://isago.rskg.org/faq | not included (request) |

After:

| WARC item | Comment |
|---|---|
| response http://isago.rskg.org/ | not included (redirect) |
| request http://isago.rskg.org/ | not included (request) |
| response https://isago.rskg.org/ | 1 |
| request https://isago.rskg.org/ | not included (request) |
| response https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css | 4 |
| request https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css | not included (request) |
| response https://isago.rskg.org/static/favicon256.png | 2 |
| request https://isago.rskg.org/static/favicon256.png | not included (request) |
| resource urn:pageinfo:http://isago.rskg.org/ | not included (resource) |
| response https://isago.rskg.org/conseils | 6 |
| request https://isago.rskg.org/conseils | not included (request) |
| resource urn:pageinfo:https://isago.rskg.org/conseils | not included (resource) |
| response https://isago.rskg.org/faq | 8 |
| request https://isago.rskg.org/faq | not included (request) |
| resource urn:pageinfo:https://isago.rskg.org/faq | not included (resource) |
| response https://isago.rskg.org/a-propos | 7 |
| request https://isago.rskg.org/a-propos | not included (request) |
| resource urn:pageinfo:https://isago.rskg.org/a-propos | not included (resource) |
| response https://isago.rskg.org/static/tarifs-isago.pdf | 5 |
| request https://isago.rskg.org/static/tarifs-isago.pdf | not included (request) |
| resource urn:pageinfo:https://isago.rskg.org/static/tarifs-isago.pdf | not included (resource) |
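The counting rule implied by the "After" listing can be sketched as follows. This is an illustrative model only, not the test code of this PR nor warc2zim's logic: the `Record` type, its `is_redirect` flag and the `count_included` function are hypothetical names, and the records are transcribed by hand from the listing above rather than read from a real WARC file.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """Hypothetical, simplified view of a WARC record (illustration only)."""
    rec_type: str              # "response", "request" or "resource"
    url: str
    is_redirect: bool = False  # assumption: redirect responses are flagged


def count_included(records: List[Record]) -> int:
    """Count records that are counted: non-redirect responses only."""
    return sum(
        1 for r in records
        if r.rec_type == "response" and not r.is_redirect
    )


SITE = "https://isago.rskg.org"
PAGES = ["/conseils", "/faq", "/a-propos", "/static/tarifs-isago.pdf"]

# Records transcribed from the "After" listing
AFTER = [
    Record("response", "http://isago.rskg.org/", is_redirect=True),
    Record("request", "http://isago.rskg.org/"),
    Record("response", SITE + "/"),
    Record("request", SITE + "/"),
    Record("response", "https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"),
    Record("request", "https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"),
    Record("response", SITE + "/static/favicon256.png"),
    Record("request", SITE + "/static/favicon256.png"),
    Record("resource", "urn:pageinfo:http://isago.rskg.org/"),
]
for page in PAGES:
    AFTER += [
        Record("response", SITE + page),
        Record("request", SITE + page),
        Record("resource", "urn:pageinfo:" + SITE + page),
    ]

print(count_included(AFTER))  # 7
```

Under this model, requests, `urn:pageinfo:` resources and the redirect response are all skipped, which leaves the 7 counted items (numbered 1, 2, 4, 5, 6, 7 and 8 in the listing).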

@benoit74 benoit74 requested a review from rgaudin March 7, 2024 08:53
@benoit74
Collaborator Author

benoit74 commented Mar 7, 2024

Review welcome again: changing a test "to make it work" probably needs to be confirmed as OK 🤣

Member

@rgaudin rgaudin left a comment

LGTM, but the commit (296b104) must include the relevant information (for future blame's sake): why we had and expected 8 before, and why we have and expect 7 now.

@benoit74
Collaborator Author

benoit74 commented Mar 7, 2024

Done, commit updated.

@rgaudin
Member

rgaudin commented Mar 7, 2024

👍

@benoit74 benoit74 merged commit 867d14f into zimit2 Mar 7, 2024
6 checks passed
@benoit74 benoit74 deleted the crawler_beta5 branch March 7, 2024 10:25