Conversation

@Jayman2000
Contributor

@Jayman2000 Jayman2000 commented Jan 3, 2024

This PR makes sure that yamllint never uses open()’s default encoding. Specifically, it uses the character encoding detection algorithm specified in chapter 5.2 of the YAML spec when reading both YAML files and files that are on the ignore-from-file list.
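
For readers who aren’t familiar with that algorithm, here is a minimal sketch of what chapter 5.2 describes (the helper name detect_encoding() and the exact return values are illustrative, not necessarily this PR’s real code): the encoding comes from the byte order mark if one is present, and is otherwise deduced from where null bytes sit around the initial ASCII character.

    import codecs

    def detect_encoding(stream_bytes: bytes) -> str:
        """Guess the encoding of a YAML byte stream per YAML 1.2, chapter 5.2."""
        # A byte order mark wins. UTF-32 BOMs must be checked before UTF-16
        # ones, because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
        if stream_bytes.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
            return 'utf_32'      # this codec reads the BOM itself
        if stream_bytes.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
            return 'utf_16'
        if stream_bytes.startswith(codecs.BOM_UTF8):
            return 'utf_8_sig'   # strips the BOM while decoding
        # No BOM: the stream must begin with an ASCII character, so the
        # pattern of null bytes around it reveals the encoding.
        if stream_bytes.startswith(b'\x00\x00\x00'):
            return 'utf_32_be'
        if stream_bytes[1:4] == b'\x00\x00\x00':
            return 'utf_32_le'
        if stream_bytes.startswith(b'\x00'):
            return 'utf_16_be'
        if stream_bytes[1:2] == b'\x00':
            return 'utf_16_le'
        return 'utf_8'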

There are two other PRs that are similar to this one. Here’s how this PR compares to those two:

  • This PR doesn’t have any merge conflicts.
  • This PR has a cleaner commit history. You can run the tests and flake8 on each commit in this PR, and they’ll report no errors. I don’t think that you can do that with Detect encoding per yaml spec (fix #238) #240.
  • This PR has longer commit messages. I really tried to explain why I think that my changes make sense.
  • This PR detects the encoding of files being linted, config files, and files on the ignore-from-file list. Those two PRs only detect the encoding of files being linted.
  • Detect encoding per yaml spec (fix #238) #240 adds a dependency on chardet. This PR doesn’t add any dependencies.
  • This PR only supports UTF-8, UTF-16 and UTF-32. Both of those PRs support additional encodings.
  • Unicode yaml #581 adds support for running tests on Windows. This PR doesn’t.
  • The code that this PR adds to the yamllint package is simpler.
  • The code that this PR adds to the test package is much more complicated, but hopefully it tests things more thoroughly.

Fixes #218. Fixes #238. Fixes #347.
Closes #240. Closes #581.

@coveralls

coveralls commented Jan 3, 2024

Coverage Status

coverage: 99.815% (-0.01%) from 99.825%
when pulling 475679b on Jayman2000:auto-detect-encoding
into 325fafa on adrienverge:master.

@Jayman2000
Contributor Author

I just noticed that one of the checks for this PR is failing. The coverage for yamllint/config.py went down, but that’s just because the total number of relevant lines went down. There are only two lines that aren’t covered, and those same two lines aren’t covered in the master branch. Is there anything that I need to do here?

@adrienverge
Owner

Is there anything that I need to do here?

At the moment, no. I'm sorry, please excuse the delay: this is a big change with a lot of impact, and I need a large time slot to review it, which I haven't found yet.

Owner

@adrienverge adrienverge left a comment

Hello Jason, please excuse the very long delay for reviewing this... This was a big piece and I needed time. I apologize.

The 6 commits are well split, well explained, and make the review much easier. Thanks a lot!

In my opinion this PR is good to go. I suspect it can solve problems in several cases (including the issues you pointed out), but I also see a small risk of breakage on exotic systems the day it's released. If this happens, will you be around to help find a solution?

A few notes:

  • I notice that you used encoding names with underscores (e.g. utf_8 vs. utf-8). I just read on https://docs.python.org/fr/3/library/codecs.html#standard-encodings that not only are they valid, but they also seem to be the "right" notation:

    Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.

  • I feared that using open().decode() would put the whole contents of files in memory before even linting them, and affect performance. But this is already what yamllint does currently.

@Jayman2000
Contributor Author

Jayman2000 commented Nov 29, 2024

Hello Jason, please excuse the very long delay for reviewing this... This was a big piece and I needed time. I apologize.

You shouldn’t be apologizing, I should be thanking you.

Thank you for taking the time to review this PR and to write valuable review comments. Sometimes, maintainers don’t take the time to thoroughly review my contributions. When that happens, my contributions end up being rejected without being understood, which is frustrating and saddening. Your review comments are different, though. They clearly demonstrate that you took the time to read, understand and think about this PR. I’d much rather receive a thoughtful review than a timely one.

In my opinion this PR is good to go. I suspect it can solve problems in several cases (including the issues you pointed out), but I also see a small risk of breakage on exotic systems the day it's released. If this happens, will you be around to help find a solution?

Sure! I had thought about adding a --force-encoding option that would disable encoding autodetection. The idea is that someone could use --force-encoding shift_jis to make yamllint decode everything using Shift JIS or --force-encoding cp1252 to make yamllint decode everything using code page 1252. I decided against adding the --force-encoding in this PR because this PR was already so big.

Now that you mention the small risk of breakage on exotic systems, it’s making me think about the --force-encoding idea again. It might be wise to implement --force-encoding after this PR gets merged and before there’s a new stable release of yamllint. That way, we can give users an easy workaround if encoding autodetection breaks something. If you think that that is a good idea, then I can open another PR that adds a --force-encoding option after this PR gets merged.


I just pushed a new version of this pull request with the following changes:

  • I rebased it on 8513d9b (the tip of master, at the moment).
  • I implemented the review suggestions (see the conversations that just recently got resolved).
  • The variable in tests/common.py that used to be named test_strings is now called TEST_STRINGS_TO_ENCODE_AT_RUNTIME. I made this change for two reasons:
    1. I needed to add a new variable that contained a different type of test strings. I needed to add that variable in order to implement one of the suggestions from your review comments.
    2. The variable name should have been all uppercase anyway because the variable is a constant.
  • For similar reasons, I also renamed the error1 and error2 variables to ERROR1 and ERROR2 in the function that used to be named test_detect_encoding().
  • I added a repr() call to the msg="auto_decode({repr(input_bytes)}) returned the wrong value." line in tests/test_decoder.py. The repr() call helps ensure that the error message has proper syntax.
  • I fixed a dead link in one of the commit messages (https://sourceware.org/glibc/manual/html_node/Extended-Char-Intro.html → https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html).
  • I revised the commit message for “decoder: Autodetect detect encoding of YAML files” to (hopefully) make it easier to understand for people who are unfamiliar with this issue.
  • I made the commit message for “decoder: Autodetect encoding for ignore-from-file” slightly simpler.
  • I improved the commit message for “CI: Fail when open()’s default encoding is used”. I reworded part of it in order to make it more clear why coverage.py is affected by default encoding errors.
  • I updated copyright years for some files to account for the fact that I worked on them in both 2023 and 2024.

Owner

@adrienverge adrienverge left a comment

Thanks for following up, again it's a great work.
Amended commits are still perfectly clear and anticipate some of my questions :)

Sure! I had thought about adding a --force-encoding option that would disable encoding autodetection. The idea is that someone could use --force-encoding shift_jis to make yamllint decode everything using Shift JIS or --force-encoding cp1252 to make yamllint decode everything using code page 1252. I decided against adding the --force-encoding in this PR because this PR was already so big.

Now that you mention the small risk of breakage on exotic systems, it’s making me think about the --force-encoding idea again. It might be wise to implement --force-encoding after this PR gets merged and before there’s a new stable release of yamllint. That way, we can give users an easy workaround if encoding autodetection breaks something. If you think that that is a good idea, then I can open another PR that adds a --force-encoding option after this PR gets merged.

If I understand correctly, YAML specification says that files should be Unicode-encoded, but in the real world not all YAML files are.
But I'm not sure whether today, with yamllint 1.35.1, it's possible to successfully run yamllint on non-Unicode-encoded YAML files. Do you know whether it's the case?
Here is what I tried on my system (a recent GNU/Linux):

  • Python's sys.getdefaultencoding() always equals utf-8, even when setting environment variables PYTHONIOENCODING=cp1252, PYTHONUTF8=0 and LANG=fr_FR.CP1252.
  • After creating a CP1252-encoded file (python -c 'with open("/tmp/cp1252.yaml", "wb") as f: f.write("- éçà".encode("cp1252"))'), running PYTHONIOENCODING=cp1252 yamllint /tmp/cp1252.yaml fails with UnicodeDecodeError: 'utf-8' codec can't decode ….

I'm not an expert of how Python handles encoding, so your input is more than welcome on this 🙂

  • If it's not the case, then no need for an extra option --force-encoding.
  • If it's the case, then I agree it would be better to add a way to avoid breakage for those users. However I'm not a fan of a new command line option (--force-encoding), because ideally this would be temporary and removed after a few yamllint versions (after users fixed their files to use UTF-8), so it would break usage the day when we remove this option.
    Instead, we could use an environment variable YAMLLINT_IO_ENCODING, that detect_encoding() would use to override the encoding if defined (then we could unit-test 2 or 3 cases in a new test_detect_encoding_with_env_var_override()). Removing support for this env var in the future won't break command-line options. In my opinion, this should be fairly simple and should go in the same commit.

I revised the commit message for “decoder: Autodetect detect encoding of YAML files”

I like the new version too, it's clearer about the Windows case. Did you mean “Autodetect detect” → “Autodetect”?

I revised the commit message for “decoder: Autodetect detect encoding of YAML files” to (hopefully) make it easier to understand for people who are unfamiliar with this issue.

Indeed, it's slightly clearer 👍 and will allow grepping warn_default_encoding or EncodingWarning inside Git history.

# An empty string
PreEncodedTestStringInfo(
b'',
None,
Owner

Good idea.

@dalito

dalito commented Feb 15, 2025

After creating a CP1252-encoded file (python -c 'with open("/tmp/cp1252.yaml", "wb") as f: f.write("- éçà".encode("cp1252"))'), running PYTHONIOENCODING=cp1252 yamllint /tmp/cp1252.yaml fails with UnicodeDecodeError: 'utf-8' codec can't decode ….

I believe you are expecting something wrong here. From the docs: “PYTHONIOENCODING: If this is set before running the interpreter, it overrides the encoding used for stdin/stdout/stderr.”
So setting "PYTHONIOENCODING" has no effect on the encoding used when opening files; that encoding comes from locale.getencoding(), which still returns utf-8.

(I did not check if you do/did this.) It could be interesting to run tests with https://docs.python.org/3/library/io.html#io-encoding-warning to detect where the code relies on default encoding.

@Jayman2000
Contributor Author

I just pushed a new version of this PR. Here’s what’s new:

  • I rebased this branch on top of e427005 (the tip of master, at the moment) which means that there are no more merge conflicts.
  • The commit called “tests: Restore stdout and stderr” is now called “tests: Restore stdout”. This was done to account for the changes from e427005.
  • The YAMLLINT_FILE_ENCODING environment variable can now be used to override the encoding used to decode files (see below).
  • I fixed the “Autodetect detect” thing in that one commit message’s subject line.
  • I implemented the review suggestions (see the conversations that just recently got resolved).
  • For files that already existed, already had a copyright notice and that I significantly contributed to, I added myself to the copyright notice (I added “Copyright (C) <years> Jason Yundt”).
  • For the files that I created myself, I updated the copyright years in order to account for the fact that it’s now 2025.

@adrienverge

If I understand correctly, YAML specification says that files should be Unicode-encoded, but in the real world not all YAML files are. But I'm not sure whether today, with yamllint 1.35.1, it's possible to successfully run yamllint on non-Unicode-encoded YAML files. Do you know whether it's the case? Here is what I tried on my system (a recent GNU/Linux):

  • Python's sys.getdefaultencoding() always equals utf-8, even when setting environment variables PYTHONIOENCODING=cp1252, PYTHONUTF8=0 and LANG=fr_FR.CP1252.

  • After creating a CP1252-encoded file (python -c 'with open("/tmp/cp1252.yaml", "wb") as f: f.write("- éçà".encode("cp1252"))'), running PYTHONIOENCODING=cp1252 yamllint /tmp/cp1252.yaml fails with UnicodeDecodeError: 'utf-8' codec can't decode ….

Checking sys.getdefaultencoding() is a good idea, but I actually don’t think that that particular function is relevant here. Here’s what the Python documentation has to say about the open() function’s encoding argument:

In text mode, if encoding is not specified the encoding used is platform-dependent: locale.getencoding() is called to get the current locale encoding.

That being said, I think that the documentation might actually be wrong here. locale.getencoding() seems to always return the actual locale encoding, regardless of whether UTF-8 mode is enabled or disabled. I think that the documentation for open() should probably say locale.getpreferredencoding() instead of locale.getencoding().

When I run python -c 'import locale; print(locale.getpreferredencoding())' with a plain-ASCII locale and UTF-8 mode disabled, it prints ANSI_X3.4-1968. When I run that same command with a plain-ASCII locale and UTF-8 mode enabled, it prints utf-8. When I run that same command with a UTF-8 locale, it prints UTF-8.

I agree with @dalito about PYTHONIOENCODING. It’s probably also a red herring.

That being said, I’m not sure why yamllint was still using UTF-8 when you tested it. I can think of a few possibilities, though:

  1. fr_FR.CP1252 is not on glibc’s supported locales list. It’s possible that this is a bug with glibc and that it would work better if you tried a locale that’s on that list.
  2. It’s possible that there isn’t actually a locale on your system that’s named fr_FR.CP1252. You could try running locale --all-locales to see if it’s on the list.
  3. fr_FR.CP1252 is just a name. Technically, you could have accidentally created a locale that has CP1252 in its name, but actually uses UTF-8. You could try running LANG=fr_FR.CP1252 locale --keyword-name LC_CTYPE to verify that the charmap really is set to CP1252.
  4. It’s also possible that your version of glibc doesn’t actually support the CP1252 character encoding. You could try running locale --charmaps to check and see if CP1252 is available on your system.

If you can’t get it to work on your host system, you can always try this Nix flake that I created. That flake will automatically create a virtual machine with a GB 18030 locale and then run yamllint in that virtual machine in order to demonstrate that yamllint 1.35.1 can indeed be used to lint GB 18030–encoded YAML files.


  • If it's not the case, then no need for an extra option --force-encoding.

  • If it's the case, then I agree it would be better to add a way to avoid breakage for those users. However I'm not a fan of a new command line option (--force-encoding), because ideally this would be temporary and removed after a few yamllint versions (after users fixed their files to use UTF-8), so it would break usage the day when we remove this option.
    Instead, we could use an environment variable YAMLLINT_IO_ENCODING, that detect_encoding() would use to override the encoding if defined (then we could unit-test 2 or 3 cases in a new test_detect_encoding_with_env_var_override()). Removing support for this env var in the future won't break command-line options. In my opinion, this should be fairly simple and should go in the same commit.

OK, that sounds like a good idea. I’ve implemented it, but I decided to call the environment variable YAMLLINT_FILE_ENCODING instead of YAMLLINT_IO_ENCODING. YAMLLINT_IO_ENCODING sounds too similar to PYTHONIOENCODING, and (as far as I know) the PYTHONIOENCODING variable affects stdin, stdout and stderr, not regular files. People who are familiar with PYTHONIOENCODING might see the name YAMLLINT_IO_ENCODING and assume that it has something to do with stdin, stdout and stderr.
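
For illustration, the override could look roughly like this; decode_bytes() and the test name are made up for this sketch, and detect_encoding() is the autodetection helper sketched near the top of this thread, so the PR’s real code may be organized differently:

    import os
    import unittest
    from unittest import mock

    def decode_bytes(raw: bytes) -> str:
        # An explicit YAMLLINT_FILE_ENCODING value wins; otherwise fall back
        # to the YAML 1.2 autodetection algorithm.
        encoding = os.environ.get('YAMLLINT_FILE_ENCODING')
        if encoding is None:
            encoding = detect_encoding(raw)
        return raw.decode(encoding)

    class EnvVarOverrideTestCase(unittest.TestCase):
        def test_decode_bytes_with_env_var_override(self):
            raw = 'clé: valeur\n'.encode('cp1252')
            with mock.patch.dict(os.environ,
                                 {'YAMLLINT_FILE_ENCODING': 'cp1252'}):
                self.assertEqual(decode_bytes(raw), 'clé: valeur\n')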


I like the new version too, it's clearer about the Windows case. Did you mean “Autodetect detect” → “Autodetect”?

Yep. Thanks for pointing that out.


@dalito

(I did not check if you do/did this.) It could be interesting to run tests with https://docs.python.org/3/library/io.html#io-encoding-warning to detect where the code relies on default encoding.

I agree, -X warn_default_encoding is a really useful way to track down these kinds of bugs. I’m really glad that was added to Python. A lot of the changes in this PR come from me running python -X warn_default_encoding -W error::EncodingWarning -m unittest discover and seeing what broke. The last commit in this PR makes the CI tests run with -X warn_default_encoding -W error::EncodingWarning to help us avoid these kinds of mistakes in the future.

Owner

@adrienverge adrienverge left a comment

@dalito @Jayman2000 thanks for your help and explanations on system encodings! Indeed I misunderstood PYTHONIOENCODING.
@Jayman2000, https://jasonyundt.website/posts/terminal-in-ascii-on-linux is a nice and useful article 👍 For information, as you suspected, fr_FR.CP1252 was not in my glibc locales list. Since fr_FR.iso88591 is present, next time I'll use this one.

I rereviewed the code and it looks ready for merging. Before that, I have 2 remarks:

  1. OK, that sounds like a good idea. I’ve implemented it, but I decided to call the environment variable YAMLLINT_FILE_ENCODING instead of YAMLLINT_IO_ENCODING

    Your argument makes sense, YAMLLINT_FILE_ENCODING sounds better to avoid confusion. But yamllint can also lint stdin (echo -e 'é: v\né: v' | yamllint -) and write reports to stdout (duplication of key "é" in mapping). Does your latest implementation with YAMLLINT_FILE_ENCODING also handle the encoding of stdin/stdout?

    By the way, good idea to properly warn about the temporariness of YAMLLINT_FILE_ENCODING.

  2. The new documentation page on "Character Encoding Override" is a great addition. Thanks! To prepare for potential future additions, may I suggest to rename it to "Character Encoding", and put the last 2 paragraphs in a section "Override character encoding"?

Before this change, build_temp_workspace() would always encode a path
using UTF-8 and the strict error handler [1]. Most of the time, this is
fine, but systems do not necessarily use UTF-8 and the strict error
handler for paths [2].

[1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
[2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>
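
As a standalone illustration of that difference (not code from the PR): str.encode() always defaults to UTF-8 with the strict error handler, while os.fsencode() uses whatever filesystem encoding and error handler the system actually reports.

    import os

    path = 'workspace/é.yaml'
    utf8_bytes = path.encode()     # always UTF-8 + strict by default
    fs_bytes = os.fsencode(path)   # filesystem encoding + error handler,
                                   # e.g. surrogateescape on POSIX systems
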
Before this commit, test_run_default_format_output_in_tty() changed the
value of sys.stdout, but it would never change it back to the original
value. This commit makes sure that it gets changed back.

At the moment, this commit doesn’t make a user-visible difference. A
future commit will add a new test named
test_ignored_from_file_with_multiple_encodings(). That new test requires
that stdout gets restored, or else it will fail.
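
The pattern being described is roughly the following (a simplified sketch with illustrative names; the real test replaces stdout with a TTY-like stream rather than a StringIO):

    import sys
    import unittest
    from io import StringIO

    class OutputTest(unittest.TestCase):
        def test_run_with_replaced_stdout(self):
            saved_stdout = sys.stdout
            # Guarantee the original stream is put back even if an assertion
            # fails, so that later tests still see the real sys.stdout.
            self.addCleanup(setattr, sys, 'stdout', saved_stdout)
            sys.stdout = StringIO()
            print('hello')
            self.assertEqual(sys.stdout.getvalue(), 'hello\n')
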
The motivation behind this change is to make it easier to create a
future commit. That future commit will make yamllint change its behavior
if an environment variable named YAMLLINT_FILE_ENCODING is found. That
new environment variable will potentially cause interference with many
different tests.

Before this change, environment variables would only be deleted when the
tests.test_cli module was used. At the moment, it’s OK to do that
because that’s the only test module that will fail if certain
environment variables are set. Once yamllint is updated to look for the
YAMLLINT_FILE_ENCODING variable, pretty much every test will be likely
to fail if YAMLLINT_FILE_ENCODING is set to a certain values. This
change makes the code for deleting environment variables get run for all
tests (not just tests.test_cli).

As an alternative, we could have kept most of the code for deleting
environment variables in tests/test_cli.py, and only included code for
deleting YAMLLINT_FILE_ENCODING in tests/__init__.py. I decided to put
all of the environment variable deletion code in tests/__init__.py in
order to make things more consistent and easier to understand.

I had also considered adding a function for deleting environment
variables to tests/common.py and then adding this to every test module
that needs to have environment variables deleted:

	from tests.common import remove_env_vars_that_might_interfere
	setUpModule = remove_env_vars_that_might_interfere()

I decided to not do that because pretty much every single test module
will fail if YAMLLINT_FILE_ENCODING is set to certain values, and
there’s a lot of test modules.
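
For illustration, package-level cleanup along these lines is enough (the exact variable list and mechanism in the PR may differ):

    # tests/__init__.py (sketch): drop environment variables that could
    # change yamllint's behaviour before any test module runs.
    import os

    for name in ('YAMLLINT_FILE_ENCODING', 'YAMLLINT_CONFIG_FILE'):
        os.environ.pop(name, None)
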
Before this change, yamllint would open YAML files using open()’s
default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
to using the system’s locale encoding [1][2]. This can cause problems in
multiple different scenarios.

The first scenario involves linting UTF-8 YAML files on Linux systems.
Most of the time, the locale encoding on Linux systems is set to UTF-8
[3][4], but it can be set to something else [5]. In the unlikely event
that someone was using Linux with a locale encoding other than UTF-8,
there was a chance that yamllint would crash with a UnicodeDecodeError.

The second scenario involves linting UTF-8 YAML files on Windows
systems. The locale encoding on Windows systems is the system’s ANSI
code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8
by default [7]. In the very likely event that someone was using Windows
with a locale encoding other than UTF-8, there was a chance that
yamllint would crash with a UnicodeDecodeError.

Additionally, using open()’s default encoding is a violation of the YAML
spec. Chapter 5.2 says:

	“On input, a YAML processor must support the UTF-8 and UTF-16
	character encodings. For JSON compatibility, the UTF-32
	encodings must also be supported.

	If a character stream begins with a byte order mark, the
	character encoding will be taken to be as indicated by the byte
	order mark. Otherwise, the stream must begin with an ASCII
	character. This allows the encoding to be deduced by the pattern
	of null (x00) characters.” [8]

In most cases, this change fixes all of those problems by implementing
the YAML spec’s character encoding detection algorithm. Now, as long as
YAML files begin with either a byte order mark or an ASCII character,
yamllint will (in most cases) automatically detect them as being UTF-8,
UTF-16 or UTF-32. Other character encodings are not supported at the
moment.

Even with this change, there is still one specific situation where
yamllint uses the wrong character encoding. Specifically, this
change does not affect the character encoding used for stdin. This means
that at the moment, these two commands may use different character
encodings when decoding file.yaml:

	$ yamllint file.yaml
	$ cat file.yaml | yamllint -

A future commit will update yamllint so that it uses the same character
encoding detection algorithm for stdin.

It’s possible that this change will break things for existing yamllint
users. This change allows users to use the YAMLLINT_FILE_ENCODING
environment variable to override the autodetection algorithm just in
case they’ve been using yamllint on weird nonstandard YAML files.

Credit for the idea of having tests with pre-encoded strings and having
an environment variable for overriding the character encoding
autodetection algorithm goes to @adrienverge [9].

Fixes #218. Fixes #238. Fixes #347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
[9]: <#630 (comment)>

Before this change, yamllint would use a character encoding
autodetection algorithm in order to determine the character encoding of
all YAML files that it processed, unless the YAML file was sent to
yamllint via stdin. This change makes it so that yamllint always uses
the character encoding detection algorithm, even if the YAML file is
sent to yamllint via stdin.

Before this change, one of yamllint’s tests would replace sys.stdin with
a StringIO object. This change makes it so that that test replaces
sys.stdin with a file object instead of a StringIO object. Before this
change, it was OK to use a StringIO object because yamllint never tried
to access sys.stdin.buffer. It’s no longer OK to use a StringIO because
yamllint now tries to access sys.stdin.buffer. File objects do have a
buffer attribute, so we can use a file object instead.
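
Roughly, the stdin path then becomes something like this (a sketch; detect_encoding() is the helper sketched earlier, and the real code in the PR may be structured differently):

    import sys

    def read_stdin() -> str:
        # Read raw bytes so the same YAML 1.2 autodetection used for files
        # can be applied, instead of letting the text wrapper pick a
        # locale-dependent encoding.
        raw = sys.stdin.buffer.read()
        return raw.decode(detect_encoding(raw))
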
Before this change, yamllint would decode files on the ignore-from-file
list using open()’s default encoding [1][2]. This can cause decoding to
fail in some situations (see the previous commit message for details).

This change makes yamllint automatically detect the encoding for files
on the ignore-from-file list. It uses the same algorithm that it uses
for detecting the encoding of YAML files, so the same limitations apply:
files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a
byte order mark or an ASCII character.

[1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
[2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>
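
A sketch of what that looks like for the ignore-from-file path (function and variable names here are illustrative):

    def read_ignore_patterns(paths):
        # Open each ignore file in binary mode and decode it with the same
        # autodetection used for YAML files, instead of relying on
        # fileinput's default encoding.
        patterns = []
        for path in paths:
            with open(path, 'rb') as ignore_file:
                raw = ignore_file.read()
            patterns.extend(raw.decode(detect_encoding(raw)).splitlines())
        return patterns
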
In general, using open()’s default encoding is a mistake [1]. This
change makes sure that every time open() is called, the encoding
parameter is specified. Specifically, it makes it so that all tests
succeed when run like this:

	python -X warn_default_encoding -W error::EncodingWarning -m unittest discover

[1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>
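
In practice the fix is mechanical: every call that opens a text file spells out its encoding, for example (the path here is illustrative):

    # Before: relies on the locale encoding and raises EncodingWarning
    # under -X warn_default_encoding -W error::EncodingWarning.
    with open('some_file.yaml') as f:
        content = f.read()

    # After: the intended encoding is explicit.
    with open('some_file.yaml', encoding='utf_8') as f:
        content = f.read()
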
The previous few commits have removed all calls to open() that use its
default encoding. That being said, it’s still possible that code added
in the future will contain that same mistake. This commit makes it so
that the CI test job will fail if that mistake is made again.

Unfortunately, it doesn’t look like coverage.py allows you to specify -X
options [1] or warning filters [2] when running your tests [3]. To work
around this problem, I’m running all of the Python code, including
coverage.py itself, with -X warn_default_encoding and
-W error::EncodingWarning. As a result, the CI test job will also fail
if coverage.py uses open()’s default encoding. Hopefully, coverage.py
won’t do that. If it does, then we can always temporarily revert this
commit.

[1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
[2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
[3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>

@Jayman2000
Contributor Author

I just pushed a new version of this PR. Here’s what’s new:

  • I rebased this branch on top of 325fafa (the tip of master, at the moment).
  • I renamed the “decoder: Autodetect encoding of YAML files” commit to “decoder: Autodetect encoding of most YAML files”. I changed its commit message in order to make it clear that that commit doesn’t affect the encoding of stdin.
  • I added a new commit named “decoder: Autodetect decoding of stdin”. (See below for details).
  • I discovered that a lot of the tests fail if you set YAMLLINT_FILE_ENCODING to cp037. In order to fix this problem, I added a new commit named “tests: Move code for deleting env vars to __init__”, so that I could make all tests delete the YAMLLINT_FILE_ENCODING environment variable (if it exists) before running.

Your argument makes sense, YAMLLINT_FILE_ENCODING sounds better to avoid confusion. But yamllint can also lint stdin (echo -e 'é: v\né: v' | yamllint -) and write reports to stdout (duplication of key "é" in mapping). Does your latest implementation with YAMLLINT_FILE_ENCODING also handle the encoding of stdin/stdout?

Thank you for bringing up stdin and stdout. The previous versions of this PR did not handle the encoding of stdin or stdout. I hadn’t thought about those two at all.

I think that we need to make sure that we use the right character encoding for stdin. It would be confusing if there was a chance that yamllint file.yaml and cat file.yaml | yamllint - would use different character encodings for file.yaml. Plus, stdin is technically a stream of arbitrary data. It’s not necessarily a stream of characters. Programs are free to interpret stdin however they want (example: cat Archive.tar | tar --extract). The most natural thing for yamllint to do would be to assume that stdin is supposed to be valid YAML and give errors if it’s not (just like how tar assumes that stdin is supposed to be a valid tar file and gives errors if it isn’t).

I don’t think that we need to make sure that we use the right character encoding for stdout. I think that Python will choose the right character encoding for us. I took a look at each of yamllint’s output formats to try and figure out what the ideal character encoding would be:

(Details for each output format)
  • --format parsable: From what I can tell, there isn’t any formal specification for this format, so we can’t consult a standard to figure out what its character encoding should be. However, there is this example code. That example pipes its output to tee. Here’s what POSIX.1-2024 has to say about tee:

    The following environment variables shall affect the execution of tee:

    • […]

    • LC_CTYPE

      Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments).

    In that example script, tee ends up sending its output to awk. Here’s what POSIX.1-2024 has to say about awk:

    The following environment variables shall affect the execution of awk:

    • […]

    • LC_CTYPE

      Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files), the behavior of character classes within regular expressions, the identification of characters as letters, and the mapping of uppercase and lowercase characters for the toupper and tolower functions.

    From what I can tell, command-line programs on UNIX-like systems are supposed to use LC_CTYPE’s character encoding for stdin, stdout and stderr unless there’s a good reason not to (for example, curl https://example.com/image.webp > image.webp). Since we’re sending the output to tee and awk, we should use LC_CTYPE’s character encoding. I believe that that’s what Python does by default (unless UTF-8 mode is turned on).

  • --format standard: This format is meant to be displayed on a terminal screen. In order to ensure that the text looks correct, we need to use the same character encoding that the terminal is using. There’s no way to know for sure what character encoding the terminal is using, but it’s pretty safe to assume that the terminal is using the locale encoding. If the terminal is not using the locale encoding, then the user needs to fix their system’s configuration. Python defaults to using the locale encoding for stdout.

  • --format colored: Everything that I wrote about the standard format also applies to the colored format.

  • --format github: This one’s tricky. The documentation for GitHub workflow commands says:

    Most workflow commands use the echo command in a specific format, while others are invoked by writing to a file. For more information, see Environment files.

    The Environment files section that that quote links to says:

    You will need to use UTF-8 encoding when writing to these files to ensure proper processing of the commands.

    So, if we decide to redirect yamllint’s output to one of those files, we know that yamllint should use UTF-8 for stdout in that particular scenario. If the output is not being redirected, what character encoding should be used? Unfortunately, I wasn’t able to find any documentation that answers that question. That being said, throughout the documentation page for GitHub workflow commands, they use examples like echo "::workflow-command parameter1={data},parameter2={data}::{command value}". The POSIX standard doesn’t technically say that echo is supposed to use the locale encoding (unless the XSI extension is implemented), but it’s implied that echo is supposed to use the locale encoding. If GitHub wants the results of those echo commands to be UTF-8, then GitHub should really make sure that the runners use a UTF-8 locale. If GitHub wants the results of those echo commands to be something else, then GitHub should really make sure that the runners use a different locale. If GitHub doesn’t do one of those two things, then that’s a GitHub bug, not a yamllint bug.

In short, it seems to me that the best possible option in all situations is to use the locale encoding for stdout. Python does that by default, so we don’t need to do anything.
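
This is easy to verify, since Python exposes the encoding it picked on the stream objects themselves:

    import sys

    # With no overrides (PYTHONIOENCODING, UTF-8 mode), these report the
    # locale encoding, which is presumably what the terminal uses too.
    print(sys.stdout.encoding)
    print(sys.stderr.encoding)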

I also don’t think that we need to make sure that we use the right character encoding for stderr. Most of the references to stderr are in the tests/ directory. There are only three references (1 2 3) to stderr that aren’t in that directory. All three of those lines are just trying to display text on a terminal screen. When displaying text on a terminal screen, it’s best to use the locale encoding in order to help prevent mojibake. Python does that by default, so we don’t need to do anything.

The new version uses the character encoding autodetection algorithm on stdin. As a result, the new version now allows you to set the character encoding used for stdin by setting YAMLLINT_FILE_ENCODING.


  1. The new documentation page on "Character Encoding Override" is a great addition. Thanks! To prepare for potential future additions, may I suggest to rename it to "Character Encoding", and put the last 2 paragraphs in a section "Override character encoding"?

Done.

Owner

@adrienverge adrienverge left a comment

Hello Jason,

I don’t think that we need to make sure that we use the right character encoding for stdout. I think that Python will choose the right character encoding for us. I took a look at each of yamllint’s output formats to try and figure out what the ideal character encoding would be:

Agreed. In detail, I agree with your reasoning about the 4 output --formats for stdout.

About stderr, it seems logical to do the same as for stdout (here, let Python use the locale encoding).

The commits are still well split and give all the context needed. Thank you for that.

Let's release this!

@adrienverge adrienverge merged commit 8323394 into adrienverge:master Mar 23, 2025
6 of 7 checks passed
adrienverge pushed a commit that referenced this pull request Mar 23, 2025
@Jayman2000 Jayman2000 deleted the auto-detect-encoding branch March 23, 2025 12:20
Jayman2000 added a commit to Jayman2000/jasons-pre-commit-hooks that referenced this pull request Mar 23, 2025
Now that this pull request [1] has been merged, the upstream version of
yamllint contains the fix that I wanted. I no longer need to use my own
fork of the yamllint repo in order to get that fix.

Fixes #22.

[1]: <adrienverge/yamllint#630>

Successfully merging this pull request may close these issues.

  • Character set issue on Windows
  • Failure to read file in 1.21.0 with Python 2.7 and non-ascii yaml files.
  • Can not parse utf-8 strings
