-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: report file name of file that chardet fails to read #3524
base: main
Are you sure you want to change the base?
feat: report file name of file that chardet fails to read #3524
Conversation
resolves codespell-project#3519 Tested and it works now, reporting the file name: ``` codespell --write-changes -i3 -C 5 -H -f -e --count -s --builtin clear,rare,names Failed to decode file ./pep_sphinx_extensions/tests/pep_lint/test_pep_number.py using detected encoding Windows-1254. Traceback (most recent call last): File "/Users/corneliusromer/micromamba/envs/codespell/bin/codespell", line 8, in <module> sys.exit(_script_main()) ^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 1103, in _script_main return main(*sys.argv[1:]) ^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 1300, in main bad_count += parse_file( ^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 945, in parse_file lines, encoding = file_opener.open(filename) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 232, in open return self.open_with_chardet(filename) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 246, in open_with_chardet lines = self.get_lines(f) ^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 303, in get_lines lines = f.readlines() ^^^^^^^^^^^^^ File "/Users/corneliusromer/micromamba/envs/codespell/lib/python3.12/encodings/cp1254.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1349: character maps to <undefined> ```
raise | ||
else: | ||
lines = self.get_lines(f) | ||
f.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To minimize changes, I would suggest:
try:
f = open(filename, encoding=encoding, newline="")
except LookupError:
print(
f"ERROR: Don't know how to handle encoding {encoding}: {filename}",
file=sys.stderr,
)
raise
else:
try:
lines = f.readlines()
except UnicodeDecodeError:
print(f"ERROR: Could not detect encoding: {filename}", file=sys.stderr)
raise
finally:
f.close()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to minimize changes? I'm happy to go with whatever you want but let me try to convince you (with Zen of Python):
- "There should be an obvious way to do it": context managers are the way one should open files. Not use finally, it's messy
- "Flat is better than nested": Your suggestion has loads of try/else/try/finally nesting, it's hard to grok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really think this is not much more readable?
try:
with open(filename, encoding=encoding, newline="") as f:
lines = self.get_lines(f)
except LookupError: # Raised by open() if encoding is unknown
error_msg = f"ERROR: Chardet returned unknown encoding for: {filename}."
print(error_msg, file=sys.stderr)
raise
except UnicodeDecodeError: # Raised by self.get_lines() if decoding fails
error_msg = f"ERROR: Failed decoding file: {filename}"
print(error_msg, file=sys.stderr)
raise
Also note that you introduced a bug by replacing self.get_lines(f)
with f.readlines()
in your suggestion 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I agree that in general flat is better, but I would also like to limit the code under
try
to the strict minimum, as a way to document which exceptions each piece of code is expected to raise. Some linters do enforce that.open
only raisesLookupError
readlines
only raisesUnicodeDecodeError
- I'd like to keep the codebase consistent. See Fix uncaught exception on empty files #2195.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree with using context managers to open files. I just didn't know how to make it compatible with try
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, nice to keep try local, but it's a tradeoff with nesting. I commented instead to make clear what raises what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more specific regarding your "I would like to keep codebase consistent"? I don't know what exactly you would like me to change. Is anything inconsistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open_with_chardet
and open_with_internal
(already fixed in #2195) should be kept as similar as possible and even share code if possible at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification!
…nges. We require the type info, otherwise mypy fails
I've added tests because codecov failed otherwise. Not sure these are super important but what's done is done! Learned something about testing on the way - and mocking! |
I am happy with codecov failing on exceptions, but I don't have rights to merge when CI tests fail. With that said, adding tests looks like the best option. |
Indeed, I will make an issue there if I remember
…On Tue, Aug 20, 2024, 09:50 Dimitri Papadopoulos Orfanos < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In codespell_lib/_codespell.py
<#3524 (comment)>
:
>
- return lines, f.encoding
+ return lines, encoding
Indeed. The chardet documentation should describe the return values in
more detail.
—
Reply to this email directly, view it on GitHub
<#3524 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF77AQNAITVX26F5UWGKDP3ZSLYM5AVCNFSM6AAAAABMWRLYDWVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENBXGIZDSNBYG4>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
resolves #3519
Tested and it works now, reporting the file name: