subprocess.run() defaults to the wrong text encoding under Windows #105312

Open
kunom opened this issue Jun 5, 2023 · 8 comments
Labels
OS-windows topic-subprocess Subprocess issues. type-bug An unexpected behavior, bug, or error

Comments

@kunom

kunom commented Jun 5, 2023

This is on a Western European Windows 11 machine:

Python 3.11.3 (tags/v3.11.3:f3909b8, Apr  4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)] on win32
>>> import subprocess
>>> subprocess.run("echo ö", shell=True, text=True, stdout=subprocess.PIPE).stdout.strip("\n")
'”'

As you can see, there is codepage confusion. You don't get back what you wrote out.

Windows has different codepage settings applied, depending on context. File encoding (also called ANSI codepage) is not necessarily identical with console encoding (also called OEM codepage), see https://stackoverflow.com/a/43194047. The OEM codepage contains legacy graphical symbols like "╣" or "▒".

On my machine:

locale.getencoding()='cp1252'
ctypes.windll.kernel32.GetACP()=1252
ctypes.windll.kernel32.GetConsoleCP()=850
ctypes.windll.kernel32.GetConsoleOutputCP()=850
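
For anyone who wants to compare their own machine, the values above can be collected in one place. This is only a sketch: `windows_code_pages` is a hypothetical helper name, and `GetOEMCP()` is an extra query that is not in the listing above. On non-Windows systems it returns an empty dict.

```python
import ctypes
import sys


def windows_code_pages():
    """Return the code pages relevant to this issue, or {} when not on Windows."""
    if sys.platform != "win32":
        return {}
    kernel32 = ctypes.windll.kernel32
    return {
        "ansi": kernel32.GetACP(),                        # process ANSI code page
        "oem": kernel32.GetOEMCP(),                       # process OEM code page (added for comparison)
        "console_input": kernel32.GetConsoleCP(),         # 0 if no console is attached
        "console_output": kernel32.GetConsoleOutputCP(),  # 0 if no console is attached
    }


pages = windows_code_pages()
print(pages)
```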

The character "ö" is encoded as the byte 0x94 in CP850. In CP1252, that same byte maps to "”".
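
The mix-up can be reproduced on any platform with Python's built-in codecs, without spawning a process. This is a minimal sketch that assumes only the two code pages from the listing above (850 for the console, 1252 for ANSI):

```python
# Encode with the console (OEM) code page, as CMD's echo does when writing to a pipe.
raw = "ö".encode("cp850")

# Decode with the ANSI code page, which is what text=True ends up using on this machine.
garbled = raw.decode("cp1252")

print(raw)             # b'\x94'
print(ascii(garbled))  # '\u201d', i.e. the right double quotation mark
```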

The suggestion here is that subprocess-related code should not leave the choice of the default encoding to io.TextIOWrapper (which is documented to use locale.getencoding()), but should instead default to the value returned by GetConsoleCP().

This would be exactly the same as what the Go people decided to do.

@kunom kunom added the type-bug An unexpected behavior, bug, or error label Jun 5, 2023
@kunom kunom changed the title suprocess.run uses the wrong text encoding under Windows subprocess.run() defaults to the wrong text encoding under Windows Jun 5, 2023
@terryjreedy
Member

I get the same on main with a US machine.

@eryksun
Contributor

eryksun commented Jun 6, 2023

If Python is attached to a console session, the console's current input code page is os.device_encoding(0), and the current output code page is os.device_encoding(1).

The CMD shell's internal commands such as echo use the current output code page. For example:

>>> encoding = os.device_encoding(1)
>>> encoding
'cp850'
>>> p = subprocess.run("echo ö", shell=True, stdout=subprocess.PIPE, encoding=encoding)
>>> p.stdout.strip()
'ö'

Alternatively, since the console defaults to using the system OEM code page, you can use encoding='oem' if you haven't otherwise changed it via the console API, "chcp.com", or the "CodePage" setting of a named console session under the registry key "HKCU\Console\<title>".
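
Putting the two options together, a caller could pick the encoding for CMD's internal commands roughly like this. It is a sketch under assumptions: `cmd_output_encoding` is a hypothetical helper, not part of subprocess, and the 'oem' fallback is only right if the console code page has not been changed from its default.

```python
import locale
import os
import sys


def cmd_output_encoding():
    """Guess the encoding that CMD's internal commands use when writing to a pipe."""
    try:
        # Current console output code page (e.g. 'cp850'); None if fd 1 is not a console/terminal.
        encoding = os.device_encoding(1)
    except OSError:
        encoding = None
    if encoding:
        return encoding
    if sys.platform == "win32":
        # The 'oem' codec alias exists only on Windows.
        return "oem"
    # Outside Windows there is no OEM code page; fall back to the locale encoding.
    return locale.getpreferredencoding(False)


print(cmd_output_encoding())
```

It would then be passed along as `subprocess.run("echo ö", shell=True, stdout=subprocess.PIPE, encoding=cmd_output_encoding())`.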


There is no universal I/O encoding or API query. For example, "sort.exe" uses the process OEM code page instead of the console output code page. In the following example, I've set the console code pages to 850, and I chose the character "¢" because it's encoded differently in each of the code pages: 437 (process OEM), 1252 (process ANSI), and 850.

>>> open('f.txt', 'w', encoding='utf-16').write('¢')
1
>>> p = subprocess.run('sort.exe "f.txt"', stdout=subprocess.PIPE, encoding='oem')
>>> p.stdout.strip()
'¢'
>>> p = subprocess.run('sort.exe "f.txt"', stdout=subprocess.PIPE, encoding='ansi')
>>> p.stdout.strip() # wrong
'›'
>>> p = subprocess.run('sort.exe "f.txt"', stdout=subprocess.PIPE, encoding='cp850')
>>> p.stdout.strip() # wrong
'ø'

For another example, "attrib.exe" uses the process ANSI code page.

>>> os.mkdir('spam')
>>> open(r'spam\¢', 'w').close()
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, text=True)
>>> p.stdout.strip()
'A                    C:\\Temp\\spam\\¢'
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, encoding='ansi')
>>> p.stdout.strip()
'A                    C:\\Temp\\spam\\¢'
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, encoding='oem')
>>> p.stdout.strip() # wrong
'A                    C:\\Temp\\spam\\ó'
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, encoding='cp850')
>>> p.stdout.strip() # wrong
'A                    C:\\Temp\\spam\\ó'

The list of mutually inconsistent examples could go on. There is no standard. Common choices are the process ANSI code page, process OEM code page, UTF-16, UTF-8, or the current input code page or current output code page of a console session.
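
The wrong results in the "sort.exe" and "attrib.exe" examples can be checked without Windows, since they follow directly from the three code page tables. A small sketch (the helper name `decode_with` is made up for illustration):

```python
CODE_PAGES = ("cp437", "cp1252", "cp850")

# The single byte that each code page uses for the cent sign.
encoded = {name: "¢".encode(name) for name in CODE_PAGES}


def decode_with(raw, encodings=CODE_PAGES):
    """Decode the same bytes under each candidate code page."""
    return {name: raw.decode(name, errors="replace") for name in encodings}


# sort.exe wrote the OEM (cp437) byte; attrib.exe wrote the ANSI (cp1252) byte.
from_sort = decode_with(encoded["cp437"])
from_attrib = decode_with(encoded["cp1252"])

print(encoded)      # {'cp437': b'\x9b', 'cp1252': b'\xa2', 'cp850': b'\xbd'}
print(from_sort)    # only the cp437 entry is the cent sign
print(from_attrib)  # only the cp1252 entry is the cent sign
```

When the producing tool's encoding is unknown, capturing bytes (no `text=` or `encoding=` argument) and decoding afterwards keeps the choice open.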

The current console code page in general has nothing to do with the user locale (e.g. day/month names, number/currency symbols) or the user's preferred UI language (text resources, messages). It's a bad choice for the locale encoding, unless it's UTF-8. The best choice in general is the ANSI code page of the user locale, unless the process ANSI code page is UTF-8. Next best is the process ANSI code page, which is normally based on the system locale and commonly matches the user locale. Python uses the process ANSI code page as the locale encoding, unless it's overridden by UTF-8 mode.


This would be exactly the same as what the Go people decided to do.

That revision changed their readConsole() function, which reads encoded bytes from console files via ReadFile() and then decodes to text using the console input code page from GetConsoleCP().

Since 3.6, Python's I/O stack uses the io._WindowsConsoleIO class that's based on the wide-character console API functions ReadConsoleW() and WriteConsoleW(). So console code pages aren't directly relevant to Python, except for the low-level os.read() and os.write() functions.

@kunom
Author

kunom commented Jun 6, 2023

This is great detail and information about the behaviour of Windows. Thanks, @eryksun !

Still, is the expectation that any user of subprocess.run(shell=True, ...) knows of that? They are just calling the default command interpreter triggered by shell=True. For that one, shouldn't we know how it behaves, and shouldn't Python be able to adapt to that?

So, while I agree with you that text encoding clearly needs to be configurable because there is no clear standard, I come back to my initial impression that the default is wrongly chosen.

FWIW (probably not too much): the OEM codepage is also used by any dotnet executable (no need to include its source code here):

>>> subprocess.run("SayHi.exe", text=True, stdout=subprocess.PIPE).stdout
'”\n'

@epogrebnyak

Still, is the expectation that any user of subprocess.run(shell=True, ...) knows of that? They are just calling the default

Some user expectations: I came across this with subprocess when testing a command-line application that needed to print some non-Latin characters. The minimal example is this:

print(subprocess.run(["python", "-c", "print('Я')"], shell=True, text=True, capture_output=True).stderr)

results in the following:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "d:\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u042f' in position 0: character maps to <undefined>

python -c "print('Я')" runs fine on the command line.
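
The failure can be reproduced in a single process by standing in for the child's piped stdout with a text wrapper that uses cp1252, the codec named in the traceback. This is a sketch of the mechanism only, not of how subprocess sets up its pipes:

```python
import io

# Stand-in for the child's stdout when it is a pipe: text layer over bytes, encoded as cp1252.
pipe = io.BytesIO()
child_stdout = io.TextIOWrapper(pipe, encoding="cp1252")

try:
    print("Я", file=child_stdout, flush=True)
    failed = False
except UnicodeEncodeError as exc:
    # cp1252 has no mapping for U+042F, so the write fails before any bytes reach the pipe.
    failed = True
    print(exc)
```

At an interactive console the same print succeeds because the text never goes through cp1252.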

@zooba
Member

zooba commented Jan 31, 2024

Unfortunately, we're somewhat restricted in how many people we can break here.

Python on Windows uses a different encoding based on whether it's attached to a console (no compatibility requirements) or to a file/pipe (many compatibility requirements). In the former, it uses the native console APIs with UTF-16-LE text to write output, and so virtually any character will work.

For files and pipes, we use the normal method of looking at PYTHONIOENCODING and the current locale settings (there's no console in this scenario). Otherwise everyone trying to override it would start failing.

If you set PYTHONIOENCODING=utf-8 before launching, it will write UTF-8, and provided your subprocess call is expecting it (through using the encoding= argument, rather than the discouraged text=True argument), you shouldn't have a problem.

If someone comes up with a clever way we can change the default encoding to UTF-8 without upsetting everyone who needs it to behave the way it currently does, or at least safely deprecating/warning them about an upcoming change, we'd be open to changing it. Until then, we're kinda stuck.

@epogrebnyak

If you set PYTHONIOENCODING=utf-8 before launching, it will write UTF-8, and provided your subprocess call is expecting it (through using the encoding= argument, rather than the discouraged text=True argument), you shouldn't have a problem.

Thank you for explaining in detail, this works like a charm:

$ set PYTHONIOENCODING=utf-8
$ python -c "import subprocess; subprocess.run([\"python\", \"-c\", \"print('Делавер')\"], encoding='utf-8')"
Делавер

Also tried with no encoding parameter, works as well:

$ python -c "import subprocess; subprocess.run([\"python\", \"-c\", \"print('Делавер')\"])"

As a side note - one might want to change the codepage if output is in ???? characters; for example, the code below with echo will not print properly on codepage 437 (my default), so one should switch to 65001.

$ chcp 65001
$ echo import subprocess; subprocess.run(["python", "-c", "print('Делавер')"], encoding="utf-8") | python

@eryksun
Contributor

eryksun commented Feb 1, 2024

$ set PYTHONIOENCODING=utf-8
$ python -c "import subprocess; subprocess.run([\"python\", \"-c\", \"print('Делавер')\"], encoding='utf-8')"
Делавер

The encoding parameter is irrelevant here because you're not capturing output with pipes. PYTHONIOENCODING is irrelevant here because it sets the initial encoding to use for sys.stdin, sys.stdout, and sys.stderr when they're not console files. If stdout is a console file on Windows, then the base of the I/O stack in Python 3.6+ will be an instance of io._WindowsConsoleIO, and the encoding of sys.stdout will be UTF-8.

Your original example captured stdout and stderr with the call subprocess.run(["python", "-c", "print('Я')"], text=True, capture_output=True). In this case, the child python.exe process has pipes for its stdout and stderr. These pipe files in the child process default to using the system code page, which in your case is code page 1252. You can override the default standard I/O encoding by setting PYTHONIOENCODING, but you'll still have a mojibake problem due to the text=True argument. For example:

>>> os.environ['PYTHONIOENCODING']
'UTF-8'
>>> p = subprocess.run(["python", "-c", "print('Я')"], text=True, capture_output=True)
>>> p.stdout
'Ð¯\n'

On the parent side of the pipes, the text=True argument sets up the stdout and stderr files of the subprocess.Popen instance to use the default I/O encoding. The PYTHONIOENCODING setting doesn't affect that default, so it's still using the system code page. Instead of using text=True, combine the use of PYTHONIOENCODING with the argument encoding='utf-8'. For example:

>>> os.environ['PYTHONIOENCODING']
'UTF-8'
>>> p = subprocess.run(["python", "-c", "print('Я')"], encoding='utf-8', capture_output=True)
>>> p.stdout
'Я\n'
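
The same combination can be applied per call instead of relying on the parent's environment. A minimal sketch, where `run_python_utf8` is a hypothetical helper name and the child is assumed to be another Python interpreter:

```python
import os
import subprocess
import sys


def run_python_utf8(code):
    """Run a child Python and exchange text with it as UTF-8 over pipes."""
    # Child side: make sys.stdout/sys.stderr write UTF-8 when they are pipes.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        encoding="utf-8",  # parent side: decode the pipes as UTF-8 instead of text=True
        capture_output=True,
    )


result = run_python_utf8("print('Я')")
print(ascii(result.stdout))  # '\u042f\n'
```

Both halves are needed: the environment variable controls what the child writes, and `encoding=` controls how the parent reads it.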

Alternatively, override all I/O to UTF-8 by setting PYTHONUTF8, or pass the command-line argument -X utf8. For example:

>>> os.environ['PYTHONUTF8']
'1'
>>> sys.flags.utf8_mode
1
>>> p = subprocess.run(["python", "-c", "print('Я')"], text=True, capture_output=True)
>>> p.stdout
'Я\n'
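
If the parent is not itself in UTF-8 mode, the -X utf8 option can be passed to just the child. In that case text=True in the parent would still use the system code page, so this sketch keeps an explicit encoding='utf-8' on the parent side (it assumes PYTHONIOENCODING is not set to something else in the environment):

```python
import subprocess
import sys

# -X utf8 switches only the child to UTF-8 mode; the parent decodes explicitly.
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", "print('Я')"],
    encoding="utf-8",
    capture_output=True,
)

print(sys.flags.utf8_mode)   # the parent's own mode, for comparison with the example above
print(ascii(result.stdout))  # '\u042f\n'
```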

As a side note - one might want to change the codepage if output is in ???? characters; for example, the code below with echo will not print properly on codepage 437 (my default), so one should switch to 65001.

The "chcp.com" command gets and sets the current code page of the console session. I assume your shell is either CMD or another that works similarly. The CMD shell is partly a legacy console application. When attached to a console session, CMD uses the console code page as the I/O encoding for files and pipes, for example when text from the echo command is piped to a child process, as in echo print('Делавер') | python.exe. Without a console session, CMD uses the system code page as the I/O encoding. For console I/O (e.g. echo Делавер without redirection to a file or pipe), CMD uses Unicode via the console's wide-character API.

Python is a legacy Windows application that uses the system code page as its default I/O encoding for a file or pipe, whether or not it's attached to a console session. Like the CMD shell, Python 3.6+ uses Unicode for console I/O via the console's wide-character API.

Note that the system code page (i.e. system ANSI) and default console code page (i.e. system OEM) are potentially inconsistent with text that's based on the locale (e.g. number, currency, and time formatting characters; names of weekdays and months) and UI language (e.g. the messages and strings used by the system or a library/application). Usually the locale and UI language are configured consistently with each other for the current user, but the system code page and default console code page are configured at the system level. A good option for the I/O encoding is to use the code page that's configured for the current user locale, but override to UTF-8 if either the user locale is Unicode-only (e.g. Hindi, India) or the system code page is set to UTF-8 (i.e. code page 65001). That said, this is becoming a non-issue as more and more systems configure the system code page and default console code page as UTF-8. All of this mess with legacy encodings will one day be a footnote in history.

@xiezhipeng-git

subprocess.check_output also has the same error in Python 3.10.10.


8 participants