Revisit decoding of documents from binary to string #260

Merged (1 commit) on May 21, 2024

benoit74
Collaborator

Fix #221

Changes:

  • decoding of binary WARC content:
    • try only the most probable encoding suggested by chardet, instead of trying all suggested encodings in a row
    • if all decoding alternatives have failed, fall back to the original encoding indicated in the HTTP headers and simply ignore bad characters
    • log a warning (with details) when the encoding used is not the one indicated in the HTTP headers, or when characters had to be ignored
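
For readers skimming the change list, here is a minimal sketch of the fallback order described above. It is not the code from `src/warc2zim/utils.py`: the function name `to_string`, its parameters, the order of the first two attempts, and the UTF-8 fallback when no usable header encoding exists are all assumptions made for illustration. The detector is passed in as a hypothetical `guess_encoding` callable (a thin wrapper around `chardet.detect` could fill that role) so the sketch runs with the standard library only.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def to_string(
    content: bytes,
    http_encoding: Optional[str],
    guess_encoding: Callable[[bytes], Optional[str]],
) -> str:
    """Decode bytes to str using the fallback order sketched in this PR.

    `guess_encoding` is a hypothetical hook returning the single most
    probable encoding name (or None); it is not part of warc2zim.
    """
    if not content:
        return ""

    # 1. Strict decode with the encoding declared in the HTTP headers.
    if http_encoding:
        try:
            return content.decode(http_encoding)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown encoding, move on to the next attempt

    # 2. Strict decode with only the most probable detected encoding.
    guessed = guess_encoding(content)
    if guessed:
        try:
            result = content.decode(guessed)
        except (UnicodeDecodeError, LookupError):
            pass
        else:
            logger.warning(
                "Decoded with %s instead of header encoding %s",
                guessed,
                http_encoding,
            )
            return result

    # 3. Last resort: header encoding again, silently dropping bad bytes.
    #    Falling back to UTF-8 here is an assumption of this sketch.
    fallback = http_encoding or "utf-8"
    try:
        result = content.decode(fallback, errors="ignore")
    except LookupError:
        fallback = "utf-8"
        result = content.decode(fallback, errors="ignore")
    logger.warning("Decoded with %s while ignoring bad characters", fallback)
    return result
```

The warning is emitted only on the two degraded paths, which matches the intent of the third bullet: a clean decode with the header encoding stays silent.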

@benoit74 self-assigned this May 17, 2024
@benoit74 marked this pull request as ready for review May 17, 2024 15:36
@benoit74 requested a review from rgaudin May 17, 2024 15:36
@benoit74 force-pushed the simplify_decoding branch 3 times, most recently from 49d44dd to 56341fe May 21, 2024 08:15
@benoit74 requested a review from rgaudin May 21, 2024 08:17
@rgaudin (Member) left a comment

LGTM

@benoit74 merged commit 86fbe25 into warc2zim2 May 21, 2024
4 checks passed
@benoit74 deleted the simplify_decoding branch May 21, 2024 08:29