Automatically detect character encoding of YAML files and ignore files #630
I just noticed that one of the checks for this PR is failing. The coverage for […]
At the moment, no. I'm sorry, please excuse the delay, this is a big change with much impact, I need a large time slot to review this, which I couldn't find yet.
adrienverge
left a comment
Hello Jason, please excuse the very long delay for reviewing this... This was a big piece and I needed time. I apologize.
The 6 commits are well split, well explained, and make the review much easier. Thanks a lot!
In my opinion this PR is good to go. I suspect it can solve problems in several cases (including the issues you pointed out), but I also see a small risk of breakage on exotic systems the day it's released. If this happens, will you be around to help find a solution?
A few notes:
- I notice that you used encoding names with underscores (e.g. `utf_8` vs. `utf-8`). I just read on https://docs.python.org/fr/3/library/codecs.html#standard-encodings that not only are they valid, but they also seem to be the "right" notation:
  > Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.
- I feared that using `open().decode()` would put the whole contents of files in memory before even linting them, and affect performance. But this is already what yamllint does currently.
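As a quick check of the alias behaviour quoted above, here is a minimal sketch that asks Python's `codecs` registry to resolve both spellings. It is only an illustration and is not code from this PR.

```python
import codecs

# Look up both spellings in Python's codec registry.
hyphen = codecs.lookup("utf-8")
underscore = codecs.lookup("utf_8")

# Both lookups resolve to the same codec, so the reported names match.
same_codec = hyphen.name == underscore.name

# Encoding the same text through either spelling gives identical bytes.
same_bytes = "é: v".encode("utf-8") == "é: v".encode("utf_8")
```

Either spelling can therefore be passed as an `encoding` argument without changing behaviour.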
You shouldn’t be apologizing, I should be thanking you. Thank you for taking the time to review this PR and to write valuable review comments. Sometimes, maintainers don’t take the time to thoroughly review my contributions. When that happens, my contributions end up being rejected without being understood, which is frustrating and saddening. Your review comments are different, though. They clearly demonstrate that you took the time to read, understand and think about this PR. I’d much rather receive a thoughtful review than a timely one.
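Sure! I had thought about adding a `--force-encoding` option that would disable encoding autodetection. The idea is that someone could use `--force-encoding shift_jis` to make yamllint decode everything using Shift JIS or `--force-encoding cp1252` to make yamllint decode everything using code page 1252. I decided against adding the `--force-encoding` in this PR because this PR was already so big.

Now that you mention the small risk of breakage on exotic systems, it’s making me think about the `--force-encoding` idea again. It might be wise to implement `--force-encoding` after this PR gets merged and before there’s a new stable release of yamllint. That way, we can give users an easy workaround if encoding autodetection breaks something. If you think that that is a good idea, then I can open another PR that adds a `--force-encoding` option after this PR gets merged.

I just pushed a new version of this pull request with the following changes: […]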
adrienverge
left a comment
Thanks for following up. Again, it's great work.
Amended commits are still perfectly clear and anticipate some of my questions :)
> Sure! I had thought about adding a `--force-encoding` option that would disable encoding autodetection. The idea is that someone could use `--force-encoding shift_jis` to make yamllint decode everything using Shift JIS or `--force-encoding cp1252` to make yamllint decode everything using code page 1252. I decided against adding the `--force-encoding` in this PR because this PR was already so big.
>
> Now that you mention the small risk of breakage on exotic systems, it’s making me think about the `--force-encoding` idea again. It might be wise to implement `--force-encoding` after this PR gets merged and before there’s a new stable release of yamllint. That way, we can give users an easy workaround if encoding autodetection breaks something. If you think that that is a good idea, then I can open another PR that adds a `--force-encoding` option after this PR gets merged.
If I understand correctly, the YAML specification says that files should be Unicode-encoded, but in the real world not all YAML files are.
But I'm not sure whether today, with yamllint 1.35.1, it's possible to successfully run yamllint on non-Unicode-encoded YAML files. Do you know whether it's the case?
Here is what I tried on my system (a recent GNU/Linux):
- Python's `sys.getdefaultencoding()` always equals `utf-8`, even when setting environment variables `PYTHONIOENCODING=cp1252`, `PYTHONUTF8=0` and `LANG=fr_FR.CP1252`.
- After creating a CP1252-encoded file (`python -c 'with open("/tmp/cp1252.yaml", "wb") as f: f.write("- éçà".encode("cp1252"))'`), running `PYTHONIOENCODING=cp1252 yamllint /tmp/cp1252.yaml` fails with `UnicodeDecodeError: 'utf-8' codec can't decode …`.
I'm not an expert on how Python handles encoding, so your input is more than welcome on this 🙂
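The failure described above can be reproduced without touching the filesystem or the locale. The following is a minimal sketch; the helper name `try_decode` is made up for illustration and is not part of yamllint.

```python
# The same text as in the experiment above, encoded with code page 1252.
data = "- éçà".encode("cp1252")


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid for that codec."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The bytes 0xE9 0xE7 0xE0 do not form a valid UTF-8 sequence, so this is None.
as_utf8 = try_decode(data, "utf-8")

# Decoding with the codec that produced the bytes recovers the original text.
as_cp1252 = try_decode(data, "cp1252")
```

This only shows that CP1252 bytes are rejected when something decodes them as UTF-8; it says nothing about which encoding yamllint picks in a given environment.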
- If it's not the case, then no need for an extra option `--force-encoding`.
- If it's the case, then I agree it would be better to add a way to avoid breakage for those users. However I'm not a fan of a new command line option (`--force-encoding`), because ideally this would be temporary and removed after a few yamllint versions (after users fixed their files to use UTF-8), so it would break usage the day when we remove this option.
  Instead, we could use an environment variable `YAMLLINT_IO_ENCODING`, that `detect_encoding()` would use to override the encoding if defined (then we could unit-test 2 or 3 cases in a new `test_detect_encoding_with_env_var_override()`). Removing support for this env var in the future won't break command-line options. In my opinion, this should be fairly simple and should go in the same commit.
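The override idea can be sketched in a few lines. This is a hypothetical illustration, not yamllint's implementation: the function name `choose_encoding` and its parameters are assumptions, and the variable name is simply the one proposed in this comment.

```python
import os


def choose_encoding(autodetected, environ=None, var_name="YAMLLINT_IO_ENCODING"):
    """Return the override from the environment if it is set, else the autodetected codec.

    `autodetected` stands in for whatever the detection step returned.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(var_name)
    # An unset or empty variable means "no override".
    return override or autodetected
```

Passing the environment as a parameter keeps the function easy to unit-test with a plain dictionary, for example `choose_encoding("utf_8", {"YAMLLINT_IO_ENCODING": "cp1252"})`.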
> I revised the commit message for “decoder: Autodetect detect encoding of YAML files”
I like the new version too, it's clearer about the Windows case. Did you mean Autodetect detect → Autodetect?
> I revised the commit message for “decoder: Autodetect detect encoding of YAML files” to (hopefully) make it easier to understand for people who are unfamiliar with this issue.
Indeed, it's slightly clearer 👍 and will allow grepping warn_default_encoding or EncodingWarning inside Git history.
    # An empty string
    PreEncodedTestStringInfo(
        b'',
        None,
Good idea.
I believe you are expecting something wrong here. From the docs:

> PYTHONIOENCODING: If this is set before running the interpreter, it overrides the encoding used for stdin/stdout/stderr, […]

(I did not check if you do/did this.)

It could be interesting to run tests with https://docs.python.org/3/library/io.html#io-encoding-warning to detect where the code relies on default encoding.
I just pushed a new version of this PR. Here’s what’s new:
Checking […]

That being said, I think that the documentation might actually be wrong here. When I run […]

I agree with @dalito about […]

That being said, I’m not sure why yamllint was still using UTF-8 when you tested it. I can think of a few possibilities, though: […]
If you can’t get it to work on your host system, you can always try this Nix flake that I created. That flake will automatically create a virtual machine with a GB 18030 locale and then run yamllint in that virtual machine in order to demonstrate that yamllint 1.35.1 can indeed be used to lint GB 18030–encoded YAML files.
OK, that sounds like a good idea. I’ve implemented it, but I decided to call the environment variable `YAMLLINT_FILE_ENCODING` instead of `YAMLLINT_IO_ENCODING` […]
Yep. Thanks for pointing that out.
I agree, […]
adrienverge
left a comment
@dalito @Jayman2000 thanks for your help and explanations on system encodings! Indeed I misunderstood PYTHONIOENCODING.
@Jayman2000, https://jasonyundt.website/posts/terminal-in-ascii-on-linux is a nice and useful article 👍 For information, as you suspected, fr_FR.CP1252 was not in my glibc locales list. Since fr_FR.iso88591 is present, next time I'll use this one.
I rereviewed the code and it looks ready for merging. Before that, I have 2 remarks:
- > OK, that sounds like a good idea. I’ve implemented it, but I decided to call the environment variable `YAMLLINT_FILE_ENCODING` instead of `YAMLLINT_IO_ENCODING`

  Your argument makes sense, `YAMLLINT_FILE_ENCODING` sounds better to avoid confusion. But yamllint can also lint stdin (`echo -e 'é: v\né: v' | yamllint -`) and write reports to stdout (`duplication of key "é" in mapping`). Does your latest implementation with `YAMLLINT_FILE_ENCODING` also handle the encoding of stdin/stdout?

  By the way, good idea to properly warn about the temporariness of `YAMLLINT_FILE_ENCODING`.
- The new documentation page on "Character Encoding Override" is a great addition. Thanks! To prepare for potential future additions, may I suggest renaming it to "Character Encoding", and putting the last 2 paragraphs in a section "Override character encoding"?
Before this change, build_temp_workspace() would always encode a path using UTF-8 and the strict error handler [1]. Most of the time, this is fine, but systems do not necessarily use UTF-8 and the strict error handler for paths [2].

[1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
[2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>
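The difference between the two ways of turning a path into bytes can be shown with the standard library alone. This is a minimal sketch and not the code of `build_temp_workspace()`; the path is an arbitrary example.

```python
import os
import sys

path = os.path.join("workspace", "a.yaml")

# str.encode() defaults to UTF-8 with the "strict" error handler.
utf8_bytes = path.encode()

# os.fsencode() uses the filesystem encoding and error handler instead,
# which is what the operating system interface expects for paths.
fs_bytes = os.fsencode(path)

# These two values describe what os.fsencode() used on this system.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()

# os.fsdecode() is the inverse, so the round trip recovers the original path.
round_trip = os.fsdecode(fs_bytes)
```

For an ASCII-only path like this one the two byte strings normally agree; they can differ for non-ASCII file names on systems whose filesystem encoding is not UTF-8.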
Before this commit, test_run_default_format_output_in_tty() changed the value of sys.stdout, but it would never change it back to the original value. This commit makes sure that it gets changed back.

At the moment, this commit doesn’t make a user-visible difference. A future commit will add a new test named test_ignored_from_file_with_multiple_encodings(). That new test requires that stdout gets restored, or else it will fail.
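One common way to guarantee that `sys.stdout` is restored is a `try`/`finally` block. The sketch below illustrates that general pattern; the helper name `run_with_fake_stdout` is hypothetical, and the actual test in this PR may restore stdout differently.

```python
import io
import sys


def run_with_fake_stdout(func):
    """Call func() with sys.stdout replaced, then always restore the original."""
    original = sys.stdout
    fake = io.StringIO()
    sys.stdout = fake
    try:
        func()
    finally:
        # Runs even if func() raises, so later tests see the real stdout.
        sys.stdout = original
    return fake.getvalue()
```

The standard library's `contextlib.redirect_stdout` offers the same guarantee as a context manager.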
The motivation behind this change is to make it easier to create a future commit. That future commit will make yamllint change its behavior if an environment variable named YAMLLINT_FILE_ENCODING is found. That new environment variable will potentially cause interference with many different tests.

Before this change, environment variables would only be deleted when the tests.test_cli module was used. At the moment, it’s OK to do that because that’s the only test module that will fail if certain environment variables are set. Once yamllint is updated to look for the YAMLLINT_FILE_ENCODING variable, pretty much every test will be likely to fail if YAMLLINT_FILE_ENCODING is set to certain values. This change makes the code for deleting environment variables get run for all tests (not just tests.test_cli).

As an alternative, we could have kept most of the code for deleting environment variables in tests/test_cli.py, and only included code for deleting YAMLLINT_FILE_ENCODING in tests/__init__.py. I decided to put all of the environment variable deletion code in tests/__init__.py in order to make things more consistent and easier to understand.

I had also considered adding a function for deleting environment variables to tests/common.py and then adding this to every test module that needs to have environment variables deleted:

    from tests.common import remove_env_vars_that_might_interfere

    setUpModule = remove_env_vars_that_might_interfere()

I decided to not do that because pretty much every single test module will fail if YAMLLINT_FILE_ENCODING is set to certain values, and there are a lot of test modules.
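The general idea of clearing interfering variables before tests run can be sketched as follows. This is a hypothetical illustration, not the code in tests/__init__.py: the helper names and the contents of the tuple are assumptions.

```python
import os

# Hypothetical list; the real set of variables is whatever the test suite needs.
VARS_THAT_MIGHT_INTERFERE = ("YAMLLINT_FILE_ENCODING",)


def remove_env_vars(names=VARS_THAT_MIGHT_INTERFERE, environ=None):
    """Delete the given variables from environ and return what was removed."""
    environ = os.environ if environ is None else environ
    removed = {}
    for name in names:
        if name in environ:
            removed[name] = environ.pop(name)
    return removed


def restore_env_vars(removed, environ=None):
    """Put back variables previously returned by remove_env_vars()."""
    environ = os.environ if environ is None else environ
    environ.update(removed)
```

Returning the removed values makes it possible to restore the environment once the tests have finished.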
Before this change, yamllint would open YAML files using open()’s default encoding. As long as UTF-8 mode isn’t enabled, open() defaults to using the system’s locale encoding [1][2]. This can cause problems in multiple different scenarios.

The first scenario involves linting UTF-8 YAML files on Linux systems. Most of the time, the locale encoding on Linux systems is set to UTF-8 [3][4], but it can be set to something else [5]. In the unlikely event that someone was using Linux with a locale encoding other than UTF-8, there was a chance that yamllint would crash with a UnicodeDecodeError.

The second scenario involves linting UTF-8 YAML files on Windows systems. The locale encoding on Windows systems is the system’s ANSI code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8 by default [7]. In the very likely event that someone was using Windows with a locale encoding other than UTF-8, there was a chance that yamllint would crash with a UnicodeDecodeError.

Additionally, using open()’s default encoding is a violation of the YAML spec. Chapter 5.2 says:

“On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported. If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (x00) characters.” [8]

In most cases, this change fixes all of those problems by implementing the YAML spec’s character encoding detection algorithm. Now, as long as YAML files begin with either a byte order mark or an ASCII character, yamllint will (in most cases) automatically detect them as being UTF-8, UTF-16 or UTF-32. Other character encodings are not supported at the moment.

Even with this change, there is still one specific situation where yamllint still uses the wrong character encoding. Specifically, this change does not affect the character encoding used for stdin. This means that at the moment, these two commands may use different character encodings when decoding file.yaml:

    $ yamllint file.yaml
    $ cat file.yaml | yamllint -

A future commit will update yamllint so that it uses the same character encoding detection algorithm for stdin.

It’s possible that this change will break things for existing yamllint users. This change allows users to use the YAMLLINT_FILE_ENCODING to override the autodetection algorithm just in case they’ve been using yamllint on weird nonstandard YAML files.

Credit for the idea of having tests with pre-encoded strings and having an environment variable for overriding the character encoding autodetection algorithm goes to @adrienverge [9].

Fixes #218. Fixes #238. Fixes #347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
[9]: <#630 (comment)>
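The detection rule quoted from chapter 5.2 of the YAML spec can be sketched as a check on the first few bytes. The function below is a hedged illustration written for this discussion, not yamllint's actual implementation; the name `sketch_detect_encoding` and the choice of returned codec names are assumptions.

```python
import codecs


def sketch_detect_encoding(data: bytes) -> str:
    """Guess a Python codec name from the leading bytes, following YAML 1.2.2 chapter 5.2."""
    # UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
    # begins with the UTF-16-LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf_32"  # this codec reads the BOM to pick the byte order
    if len(data) >= 4 and data[:3] == b"\x00\x00\x00":
        return "utf_32_be"  # 00 00 00 xx: ASCII character, big-endian
    if len(data) >= 4 and data[1:4] == b"\x00\x00\x00":
        return "utf_32_le"  # xx 00 00 00: ASCII character, little-endian
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf_16"
    if len(data) >= 2 and data[0] == 0:
        return "utf_16_be"  # 00 xx
    if len(data) >= 2 and data[1] == 0:
        return "utf_16_le"  # xx 00
    if data.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"  # strips the BOM while decoding
    return "utf_8"  # default when no other pattern matches
```

For example, `"- a".encode("utf_16_le")` begins with the bytes `2D 00`, so the sketch returns `"utf_16_le"`. Input that starts with a non-ASCII character and has no byte order mark falls outside what this rule can detect.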
Before this change, yamllint would use a character encoding autodetection algorithm in order to determine the character encoding of all YAML files that it processed, unless the YAML file was sent to yamllint via stdin. This change makes it so that yamllint always uses the character encoding detection algorithm, even if the YAML file is sent to yamllint via stdin.

Before this change, one of yamllint’s tests would replace sys.stdin with a StringIO object. This change makes it so that that test replaces sys.stdin with a file object instead of a StringIO object. Before this change, it was OK to use a StringIO object because yamllint never tried to access sys.stdin.buffer. It’s no longer OK to use a StringIO because yamllint now tries to access sys.stdin.buffer. File objects do have a buffer attribute, so we can use a file object instead.
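The `buffer` difference described above is easy to observe. In this sketch an `io.TextIOWrapper` around an `io.BytesIO` stands in for a real file object; that substitution is an assumption made to keep the example self-contained, and the test in this PR uses an actual file.

```python
import io

# A StringIO holds text only, so there is no underlying byte stream to expose.
text_only = io.StringIO("key: value\n")
has_buffer = hasattr(text_only, "buffer")

# A text wrapper around a byte stream keeps the raw bytes reachable via .buffer,
# which is what code that detects the encoding itself needs to read.
raw = "key: value\n".encode("utf_16")
fake_stdin = io.TextIOWrapper(io.BytesIO(raw), encoding="utf_8")
raw_back = fake_stdin.buffer.read()
```

Reading from `.buffer` bypasses the wrapper's own `encoding`, so the bytes come back unchanged and can be decoded with whatever codec detection selects.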
Before this change, yamllint would decode files on the ignore-from-file list using open()’s default encoding [1][2]. This can cause decoding to fail in some situations (see the previous commit message for details).

This change makes yamllint automatically detect the encoding for files on the ignore-from-file list. It uses the same algorithm that it uses for detecting the encoding of YAML files, so the same limitations apply: files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a byte order mark or an ASCII character.

[1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
[2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>
In general, using open()’s default encoding is a mistake [1]. This change makes sure that every time open() is called, the encoding parameter is specified. Specifically, it makes it so that all tests succeed when run like this:

    python -X warn_default_encoding -W error::EncodingWarning -m unittest discover

[1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>
The previous few commits have removed all calls to open() that use its default encoding. That being said, it’s still possible that code added in the future will contain that same mistake. This commit makes it so that the CI test job will fail if that mistake is made again.

Unfortunately, it doesn’t look like coverage.py allows you to specify -X options [1] or warning filters [2] when running your tests [3]. To work around this problem, I’m running all of the Python code, including coverage.py itself, with -X warn_default_encoding and -W error::EncodingWarning. As a result, the CI test job will also fail if coverage.py uses open()’s default encoding. Hopefully, coverage.py won’t do that. If it does, then we can always temporarily revert this commit.

[1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
[2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
[3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>
I just pushed a new version of this PR. Here’s what’s new:
Thank you for bringing up […]

I think that we need to make sure that we use the right character encoding for […]

I don’t think that we need to make sure that we use the right character encoding for `stdout`. I think that Python will choose the right character encoding for us. I took a look at each of yamllint’s output formats to try and figure out what the ideal character encoding would be: […]

In short, it seems to me that the best possible option in all situations is to use the locale encoding for […]

I also don’t think that we need to make sure that we use the right character encoding for […]

The new version uses the character encoding autodetection algorithm on […]
Done.
adrienverge
left a comment
Hello Jason,
> I don’t think that we need to make sure that we use the right character encoding for `stdout`. I think that Python will choose the right character encoding for us. I took a look at each of yamllint’s output formats to try and figure out what the ideal character encoding would be:
Agreed. In detail, I agree with your reasoning about the 4 output `--format`s for stdout.
About stderr, it seems logical to do the same as for stdout (here, let Python use the locale encoding).
The commits are still well split and give all the context needed. Thank you for that.
Let's release this!
Now that this pull request [1] has been merged, the upstream version of yamllint contains the fix that I wanted. I no longer need to use my own fork of the yamllint repo in order to get that fix.

Fixes #22.

[1]: <adrienverge/yamllint#630>
This PR makes sure that yamllint never uses `open()`’s default encoding. Specifically, it uses the character encoding detection algorithm specified in chapter 5.2 of the YAML spec when reading both YAML files and files that are on the `ignore-from-file` list.

There are two other PRs that are similar to this one. Here’s how this PR compares to those two:

- […] `ignore-from-file` list. Those two PRs only detect the encoding of files being linted.
- […] `chardet`. This PR doesn’t add any dependencies.
- […] `yamllint` package is simpler.
- […] `test` package is much more complicated, but hopefully it tests things more thoroughly.

Fixes #218. Fixes #238. Fixes #347.
Closes #240. Closes #581.