Can not parse utf-8 strings #218

Open
mattn opened this issue Dec 16, 2019 · 9 comments · May be fixed by #630

Comments

@mattn

mattn commented Dec 16, 2019

Traceback (most recent call last):
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\msys64\mingw64\bin\yamllint.exe\__main__.py", line 7, in <module>
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\cli.py", line 189, in run
    prob_level = show_problems(problems, 'stdin', args_format=args.format)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\cli.py", line 91, in show_problems
    for problem in problems:
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\linter.py", line 198, in _run
    syntax_error = get_syntax_error(buffer)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\linter.py", line 179, in get_syntax_error
    list(yaml.parse(buffer, Loader=yaml.BaseLoader))
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\__init__.py", line 73, in parse
    loader = Loader(stream)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\loader.py", line 14, in __init__
    Reader.__init__(self, stream)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\reader.py", line 74, in __init__
    self.check_printable(stream)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\reader.py", line 143, in check_printable
    raise ReaderError(self.name, position, ord(character),
yaml.reader.ReaderError: unacceptable character #xdc82: special characters are not allowed
  in "<unicode string>", position 279

I'm aware of #20 and #2, but those concern non-Windows systems. On Windows, LANG and LC_CTYPE are generally not set. I think yamllint should provide a way to read UTF-8 strings even if LANG/LC_CTYPE is not set.

@adrienverge
Owner

Can you provide a way to reproduce your problem, especially an input file that triggers the error + a yamllint version?

@mattn
Author

mattn commented Dec 18, 2019

test.yaml

---
テスト: 'コード'

C:\temp>yamllint -v
yamllint 1.19.0

C:\temp>yamllint test.yaml
Traceback (most recent call last):
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\msys64\mingw64\bin\yamllint.exe\__main__.py", line 7, in <module>
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\cli.py", line 175, in run
    problems = linter.run(f, conf, filepath)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\linter.py", line 237, in run
    content = input.read()
UnicodeDecodeError: 'cp932' codec can't decode byte 0x86 in position 6: illegal multibyte sequence

@mattn
Author

mattn commented Dec 18, 2019

Setting PYTHONIOENCODING=UTF-8 fixes this for stdin:

C:\temp>yamllint - < test.yaml
Traceback (most recent call last):
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\msys64\mingw64\bin\yamllint.exe\__main__.py", line 7, in <module>
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\cli.py", line 189, in run
    prob_level = show_problems(problems, 'stdin', args_format=args.format)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\cli.py", line 91, in show_problems
    for problem in problems:
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\linter.py", line 198, in _run
    syntax_error = get_syntax_error(buffer)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\linter.py", line 179, in get_syntax_error
    list(yaml.parse(buffer, Loader=yaml.BaseLoader))
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\__init__.py", line 73, in parse
    loader = Loader(stream)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\loader.py", line 14, in __init__
    Reader.__init__(self, stream)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\reader.py", line 74, in __init__
    self.check_printable(stream)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yaml\reader.py", line 143, in check_printable
    raise ReaderError(self.name, position, ord(character),
yaml.reader.ReaderError: unacceptable character #xdc86: special characters are not allowed
  in "<unicode string>", position 5

C:\temp>set PYTHONIOENCODING=UTF-8

C:\temp>yamllint - < test.yaml

But file input still fails.

C:\temp>yamllint test.yaml
Traceback (most recent call last):
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\msys64\mingw64\lib\python3.8\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\msys64\mingw64\bin\yamllint.exe\__main__.py", line 7, in <module>
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\cli.py", line 175, in run
    problems = linter.run(f, conf, filepath)
  File "C:\msys64\mingw64\lib\python3.8\site-packages\yamllint\linter.py", line 237, in run
    content = input.read()
UnicodeDecodeError: 'cp932' codec can't decode byte 0x86 in position 6: illegal multibyte sequence
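
For readers following along, both failure modes above can be reproduced without a Japanese Windows machine. This is only a sketch: it assumes test.yaml is UTF-8 with LF line endings (the byte offsets in the tracebacks suggest so), and that the `#xdc86` in the stdin traceback comes from the `surrogateescape` error handler (an inference from the traceback, not something confirmed in this thread). `SAMPLE`, `decode_strict` and `decode_escaped` are names made up for illustration.

```python
# Bytes of test.yaml, assuming UTF-8 with LF line endings.
SAMPLE = "---\nテスト: 'コード'\n".encode("utf-8")


def decode_strict(data, encoding="cp932"):
    """Roughly what open(path).read() does under a cp932 locale."""
    return data.decode(encoding)


def decode_escaped(data, encoding="cp932"):
    """Decode with surrogateescape: undecodable bytes become lone surrogates."""
    return data.decode(encoding, errors="surrogateescape")


# File input: strict decoding with the wrong codec raises.
try:
    decode_strict(SAMPLE)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# stdin: the bad byte 0x86 survives as U+DC86, which PyYAML then rejects.
escaped = decode_escaped(SAMPLE)

# Decoding as UTF-8 recovers the original text.
utf8_text = SAMPLE.decode("utf-8")
```

PYTHONIOENCODING only changes the encoding of the standard streams, which would explain why it helps `yamllint - < test.yaml` but not `yamllint test.yaml`.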

@adrienverge
Owner

On Linux, your example file works perfectly. It looks like the default encoding on Windows is not Unicode.

yamllint uses PyYAML to parse YAML. Could you try the following command to see whether PyYAML is able to load the file?

python -c 'import yaml; yaml.safe_load(open("test.yaml").read());'

@mattn
Author

mattn commented Dec 18, 2019

C:\temp>python -c "import yaml; yaml.safe_load(open('test.yaml').read());"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode byte 0x86 in position 6: illegal multibyte sequence

@rhysd

rhysd commented Dec 19, 2019

Might be related to yaml/pyyaml#123 (comment). Probably the following would work.

python -c 'import yaml; yaml.safe_load(open("test.yaml", encoding="utf8").read());'
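
The same idea as a small script, for anyone who wants to try it outside a one-liner. This is a sketch, not yamllint's code: `read_yaml_text` is a hypothetical helper, and the temporary file stands in for test.yaml. The resulting string could then be passed to `yaml.safe_load`, which is left out here so the snippet runs without PyYAML installed.

```python
import tempfile
from pathlib import Path


def read_yaml_text(path):
    """Read a file as UTF-8, ignoring the locale's default encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for test.yaml, written as raw UTF-8 bytes.
    sample_path = Path(tmp) / "test.yaml"
    sample_path.write_bytes("---\nテスト: 'コード'\n".encode("utf-8"))
    text = read_yaml_text(sample_path)
```

Passing `encoding` explicitly means the result no longer depends on LANG, LC_CTYPE or the Windows ANSI code page.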

@mattn
Author

mattn commented Dec 19, 2019

I confirmed that @rhysd's code works.

c18t pushed a commit to c18t/pre-commit-hooks that referenced this issue Jan 18, 2020
- Create a yamllint hook that supports Windows by having it read YAML from standard input
  - cf. adrienverge/yamllint#218
@sandstrom

I'm doing some issue gardening 🌱🌿 🌷 and came upon this issue. Since it's quite old I just wanted to ask if this is still relevant? If it isn't, maybe we can close this issue?

By closing some old issues we reduce the list of open issues to a more manageable set.

@adrienverge
Owner

I think it's related to #238, #239 and #240, and should not be closed (or closed as duplicate, if confirmed).

Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 3, 2024
Before this change, yamllint would open YAML files using open()’s
default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
to using the system’s locale encoding [1][2].

Most of the time, the locale encoding on Linux systems is UTF-8 [3][4],
but it doesn’t have to be [5]. Additionally, the locale encoding on
Windows systems is the system’s ANSI code page [6]. As a result, you
would have to either enable UTF-8 mode, give Python a custom manifest or
enable a beta feature in Windows settings in order to lint UTF-8 YAML
files on Windows [2][7].

Finally, using open()’s default encoding is a violation of the YAML
spec. Chapter 5.2 says:

	“On input, a YAML processor must support the UTF-8 and UTF-16
	character encodings. For JSON compatibility, the UTF-32
	encodings must also be supported.

	If a character stream begins with a byte order mark, the
	character encoding will be taken to be as indicated by the byte
	order mark. Otherwise, the stream must begin with an ASCII
	character. This allows the encoding to be deduced by the pattern
	of null (x00) characters.” [8]

This change fixes all of those problems by implementing the YAML spec’s
character encoding detection algorithm. Now, as long as a YAML file
begins with either a byte order mark or an ASCII character, yamllint
will automatically detect it as being UTF-8, UTF-16 or UTF-32. Other
character encodings are not supported at the moment.

Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://sourceware.org/glibc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
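
The detection rule quoted from the spec can be sketched roughly as follows. This is not the code from the referenced commit; `detect_encoding` is a hypothetical helper that follows the byte patterns in section 5.2 of the YAML 1.2.2 spec and returns Python codec names. The four-byte UTF-32 patterns are checked before the two-byte UTF-16 ones because the UTF-32LE byte order mark starts with the UTF-16LE one.

```python
import codecs


def detect_encoding(data: bytes) -> str:
    """Guess the encoding of a YAML byte stream from its first bytes."""
    # UTF-32: byte order mark, or an ASCII character padded with three nulls.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"  # this codec reads the BOM and picks the byte order
    if len(data) >= 4 and data[:3] == b"\x00\x00\x00":
        return "utf-32-be"
    if len(data) >= 4 and data[1:4] == b"\x00\x00\x00":
        return "utf-32-le"
    # UTF-16: byte order mark, or an ASCII character padded with one null.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if len(data) >= 2 and data[0] == 0:
        return "utf-16-be"
    if len(data) >= 2 and data[1] == 0:
        return "utf-16-le"
    # UTF-8 with a byte order mark; plain UTF-8 is the default.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return "utf-8"


def decode_yaml_bytes(data: bytes) -> str:
    """Decode a byte stream using the detected encoding."""
    return data.decode(detect_encoding(data))
```

For example, `decode_yaml_bytes(open("test.yaml", "rb").read())` would read the file from this issue without consulting the locale encoding.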
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 4, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 4, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 5, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 10, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 13, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jan 20, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Feb 5, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Feb 8, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Feb 15, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Feb 25, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Jul 19, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Sep 20, 2024
Jayman2000 added a commit to Jayman2000/yamllint-pr that referenced this issue Nov 29, 2024
Before this change, yamllint would open YAML files using open()’s
default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
to using the system’s locale encoding [1][2]. This can cause problems in
several scenarios.
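
For illustration, here is a minimal sketch of how the fallback encoding can be inspected, and how passing `encoding=` explicitly removes the dependence on the locale. The helper names `default_text_encoding` and `read_utf8_text` are hypothetical and are not part of yamllint.

```python
import codecs
import locale


def default_text_encoding() -> str:
    """Normalized name of the encoding open() uses when `encoding` is omitted."""
    # With do_setlocale=False this reports the locale encoding, or UTF-8
    # when Python's UTF-8 mode is enabled.
    return codecs.lookup(locale.getpreferredencoding(False)).name


def read_utf8_text(path: str) -> str:
    """Read a file as UTF-8 regardless of the locale (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Calling `default_text_encoding()` on different machines is a quick way to see which of the scenarios below applies to a given system.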

The first scenario involves linting UTF-8 YAML files on Linux systems.
Most of the time, the locale encoding on Linux systems is set to UTF-8
[3][4], but it can be set to something else [5]. In the unlikely event
that someone was using Linux with a locale encoding other than UTF-8,
there was a chance that yamllint would crash with a UnicodeDecodeError.

The second scenario involves linting UTF-8 YAML files on Windows
systems. The locale encoding on Windows systems is the system’s ANSI
code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8
by default [7]. In the very likely event that someone was using Windows
with a locale encoding other than UTF-8, there was a chance that
yamllint would crash with a UnicodeDecodeError.
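
The “there was a chance” wording can be illustrated with a small sketch. It decodes UTF-8 bytes with cp1252, which is used here only as an assumed example of a Windows ANSI code page; `try_decode` is a hypothetical helper, not yamllint code. Some byte sequences have no mapping in that code page and raise UnicodeDecodeError, while others decode without error into the wrong characters.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# "Á" is C3 81 in UTF-8; byte 0x81 is undefined in Python's cp1252 codec,
# so this call returns a UnicodeDecodeError instance.
crash = try_decode("Á".encode("utf-8"), "cp1252")

# "é" is C3 A9 in UTF-8; both bytes are valid cp1252, so decoding succeeds
# but produces the two wrong characters "Ã©" instead of "é".
garbled = try_decode("é".encode("utf-8"), "cp1252")

# Decoding with the encoding the bytes were written in gives "é" back.
correct = try_decode("é".encode("utf-8"), "utf-8")
```

Whether a given file crashes or is silently misread therefore depends on which bytes it happens to contain.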

Additionally, using open()’s default encoding is a violation of the YAML
spec. Section 5.2 says:

	“On input, a YAML processor must support the UTF-8 and UTF-16
	character encodings. For JSON compatibility, the UTF-32
	encodings must also be supported.

	If a character stream begins with a byte order mark, the
	character encoding will be taken to be as indicated by the byte
	order mark. Otherwise, the stream must begin with an ASCII
	character. This allows the encoding to be deduced by the pattern
	of null (x00) characters.” [8]

This change fixes all of those problems by implementing the YAML spec’s
character encoding detection algorithm. Now, as long as YAML files
begin with either a byte order mark or an ASCII character, yamllint
will automatically detect them as being UTF-8, UTF-16 or UTF-32. Other
character encodings are not supported at the moment.
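
The detection rule quoted above can be sketched roughly as follows. This is an illustrative reading of the spec’s byte order mark and null-byte patterns, not yamllint’s actual implementation; `detect_encoding` and `decode_yaml` are hypothetical names, and the returned strings are Python codec names.

```python
import codecs


def detect_encoding(data: bytes) -> str:
    """Guess a Python codec name from the first bytes of a YAML stream."""
    # Byte order marks. UTF-32 is tested first because the UTF-32-LE BOM
    # (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"  # this codec reads the BOM and strips it
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"  # likewise reads and strips the BOM
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: the first character must be ASCII, so the position of the
    # null bytes around it reveals the code unit width and byte order.
    if len(data) >= 4 and data[:3] == b"\x00\x00\x00":
        return "utf-32-be"
    if len(data) >= 4 and data[1:4] == b"\x00\x00\x00":
        return "utf-32-le"
    if len(data) >= 2 and data[0] == 0:
        return "utf-16-be"
    if len(data) >= 2 and data[1] == 0:
        return "utf-16-le"
    return "utf-8"  # default when no other pattern matches


def decode_yaml(data: bytes) -> str:
    """Decode raw file bytes using the detected encoding."""
    return data.decode(detect_encoding(data))
```

A caller would open the file in binary mode and pass the bytes to `decode_yaml`, so the result no longer depends on the locale encoding.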

Credit for the idea of having tests with pre-encoded strings goes to
@adrienverge [9].

Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
[9]: <adrienverge#630 (comment)>