Problem with default sys.stdout.encoding producing UnicodeEncodeError #2940

Open
Lukasdoe opened this issue Feb 9, 2023 · 10 comments

Comments

@Lukasdoe

Lukasdoe commented Feb 9, 2023

I just hit a really weird bug: every Rust String I passed to Python raised this error as soon as I tried to print or convert it: "UnicodeEncodeError('ascii', 'asdf´´', 4, 6, 'ordinal not in range(128)')"

I just now found the problem which caused this error to occur consistently:
The shell in which the Rust executable was running didn't have the "LANG" environment variable set. This makes the Python interpreter choose the default encoding "ASCII", which is used (among other things) for output encoding. The error is easily overlooked because everything works fine as long as you only use ASCII characters (UTF-8 is compatible with ASCII in that range). It only fails for a character that is not included in ASCII, i.e. one with a code point above 127.
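As a rough illustration of that failure mode, the following sketch encodes the string from the error message with the ASCII codec, which is what an ASCII-configured stdout ends up doing on print. The helper name `try_encode` is made up for this example and is not part of PyO3 or Python.

```python
# The string from the error message: four ASCII characters, then two U+00B4.
text = "asdf\u00b4\u00b4"


def try_encode(s, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


ascii_only = try_encode("asdf", "ascii")  # pure ASCII input encodes fine
failure = try_encode(text, "ascii")       # fails on the two non-ASCII characters
as_utf8 = try_encode(text, "utf-8")       # UTF-8 can represent every character

print(ascii_only)
# Print only ASCII-safe fields so this also runs on an ASCII stdout.
print(failure.encoding, failure.start, failure.end, failure.reason)
print(as_utf8)
```

The `start` and `end` fields of the exception are the same indices that show up in the reported error.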

Rust Strings are (by default) stored as utf-8 encoded bytes and the conversion between a Rust and a Python String apparently directly hands a pointer to the utf-8 encoded byte-array to the Python interpreter to produce a Python String.

Feature Request: Somehow tell the user that the Python output encoding should be UTF-8 (and this means properly setting the "LANG" variable), because otherwise this weird error might occur when dealing with special characters.

@birkenfeld
Member

Can you show some code or details where the string comes from and what you do with it? In particular, is there stdin/stdout or files involved?

Normal string conversions between Rust and Python are independent of any system encodings.

@Lukasdoe
Author

Compile the following code:

use pyo3::{prepare_freethreaded_python, types::PyDict, Python};

fn main() {
    println!("LANG: {:?}", std::env::var("LANG"));
    prepare_freethreaded_python();
    Python::with_gil(|py| {
        let s1: String = "non_ascii_char-->ß<--".to_string();
        let dict = PyDict::new(py);
        dict.set_item("s1", s1).unwrap();
        py.eval("print(s1)", Some(dict), None).unwrap();
    })
}

And execute the following commands:

cargo run

-> should execute normally

LANG="bogus" cargo run

-> should panic

@Lukasdoe
Author

I know that the string conversion between Rust and Python is independent of the used encoding and that was my problem in the first place. If the standard encoding used by Rust (UTF-8) and by Python (depending on the LANG environment variable) differ, this unexpected error will occur.

@birkenfeld
Member

The Rust string is not involved here, neither is Rust's string encoding. You get the same exception from this:

py.eval("print('ß')", None, None).unwrap();

or even by importing a second module with the print statement.

The problem is that sys.stdout.encoding is set to ascii here, since Python cannot get more info out of the LANG setting.

Not a PyO3 bug IMHO.
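A minimal sketch of this point in plain Python, with no Rust involved: an in-memory `io.TextIOWrapper` stands in for `sys.stdout`, and the only thing that changes between the two calls is the stream's encoding. The helper `write_with_encoding` is hypothetical and exists only for this illustration.

```python
import io


def write_with_encoding(text, encoding):
    """Write text to an in-memory text stream that stands in for sys.stdout."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)  # the text is encoded here, so errors surface here
    stream.flush()
    return raw.getvalue()


# "\u00df" is the character 'ß', written as an escape to keep the source ASCII.
utf8_bytes = write_with_encoding("\u00df", "utf-8")

try:
    write_with_encoding("\u00df", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, ascii_failed)
```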

@Lukasdoe
Author

> Not a PyO3 bug IMHO.

Well, that's why I marked this as an "enhancement" and not a bug.

> The Rust string is not involved here, neither is Rust's string encoding.

I think this is not entirely correct. Rust Strings and &strs are UTF-8 encoded [0]. The raw, byte representation (UTF-8 encoded) is then directly handed over to the Python interpreter, no matter how the string is transferred. That's just how a Rust &str is turned into a PyString [1] [2].

> The problem is that sys.stdout.encoding is set to ascii here, since Python cannot get more info out of the LANG setting.

This is true, however this is also the problem. The output encoding is automatically determined from the locale, set via the "LANG" environment variable, which can define a number of encodings. Besides UTF-8, you could e.g. also use "en_GB.iso88591" or "en_GB.iso885915", which are also not compatible with UTF-8.
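To make the incompatibility concrete, here is a small sketch that encodes two sample characters with the encodings mentioned above. The choice of sample characters and the helper `encode_or_none` are assumptions made for this illustration.

```python
samples = ["\u00df", "\u20ac"]  # the characters 'ß' and '€', written as escapes
encodings = ["utf-8", "iso8859-1", "iso8859-15", "ascii"]


def encode_or_none(ch, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent ch."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


table = {(ch, enc): encode_or_none(ch, enc) for ch in samples for enc in encodings}

for (ch, enc), data in table.items():
    # ascii() keeps the output printable even on an ASCII-only stdout.
    print(ascii(ch), enc, data)
```

The same character maps to different bytes (or to no bytes at all) depending on the encoding, so a stream configured for one of the ISO 8859 variants does not produce UTF-8 output.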

If it is not defined, Python uses the C / POSIX default locale with the ASCII encoding (related PEP: [3]). PyO3 just assumes that the shell that executes the Rust process uses a locale with UTF-8 encoding, or leaves the task of reconfiguring the IO encoding of the Python interpreter to the user.

I would appreciate it if the documentation simply listed this problem and its solution. The current error states the problem (an ordinal with an out-of-range value for the current encoding), but doesn't tell you why. I do not expect the average developer to know about the locale coercion of the Python interpreter, nor about the (historic) POSIX defaults.
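One possible user-side workaround, sketched here as an assumption rather than as anything PyO3 documents, is to reconfigure the text stream from Python code. `io.TextIOWrapper.reconfigure` exists since Python 3.7; the sketch uses an in-memory stream as a stand-in for `sys.stdout`, and it only applies when the real `sys.stdout` is a `TextIOWrapper`.

```python
import io

raw = io.BytesIO()
# Stand-in for a sys.stdout that was set up with the ASCII encoding.
stream = io.TextIOWrapper(raw, encoding="ascii")

# On the real stream this would be: sys.stdout.reconfigure(encoding="utf-8")
stream.reconfigure(encoding="utf-8")

stream.write("\u00df")  # 'ß' now encodes without raising
stream.flush()
written = raw.getvalue()

print(stream.encoding, written)
```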

@birkenfeld
Member

> I think this is not entirely correct. Rust Strings and &strs are UTF-8 encoded [0]. The raw, byte representation (UTF-8 encoded) is then directly handed over to the Python interpreter, no matter how the string is transferred. That's just how a Rust &str is turned into a PyString [1] [2].

The Python API that PyO3 uses (PyUnicode_FromString) is defined to take a UTF-8 encoded string, so this is correct in all cases.

But as I showed, it is also not relevant here: you could have pure-ASCII source code by encoding the ß as a Unicode escape and still get the exception. It happens purely because Python does not know how to encode the string for output to its stdout.
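The two statements above can be sketched in plain Python. The `bytes.decode("utf-8")` call is only an analogy for what a UTF-8 based constructor such as PyUnicode_FromString does; the variable names are made up for this example.

```python
# UTF-8 bytes of the Rust String used in the reproducer.
rust_bytes = b"non_ascii_char-->\xc3\x9f<--"

# Decoding UTF-8 does not depend on LANG or on any stream encoding.
from_utf8 = rust_bytes.decode("utf-8")

# The same string written with a Unicode escape, so the source is pure ASCII.
from_escape = "non_ascii_char-->\u00df<--"

same_string = from_utf8 == from_escape

# Both spellings fail identically once an ASCII encoder has to emit them.
try:
    from_escape.encode("ascii")
    error_index = None
except UnicodeEncodeError as exc:
    error_index = exc.start

print(same_string, error_index)
```

The string object is the same either way; the failure only appears at the point where it has to be turned into bytes for an ASCII stream.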

> If it is not defined, Python uses the C / POSIX default locale with the ASCII encoding (related PEP: [3]). PyO3 just assumes that the shell that executes the Rust process uses a locale with UTF-8 encoding or leaves the task of reconfiguring the IO encoding of their Python interpreter to the user.

This is standard Python behavior: try LANG=en_US.iso-8859-1 python -c "import sys; print(sys.stdout.encoding)". PyO3 is not involved.
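The same check can be scripted. The sketch below starts the standalone interpreter with a modified environment and reports its `sys.stdout.encoding`. The helper `stdout_encoding` is hypothetical; what it reports for a given `LANG` depends on the Python version and on which locales are installed, so no output is shown for those cases.

```python
import os
import subprocess
import sys

# Variables that influence the stdout encoding; cleared so that only the
# overrides passed by the caller take effect.
_CLEARED = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONIOENCODING", "PYTHONUTF8")


def stdout_encoding(overrides):
    """Run a child interpreter and return the sys.stdout.encoding it reports."""
    env = {k: v for k, v in os.environ.items() if k not in _CLEARED}
    env.update(overrides)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


for overrides in ({"LANG": "en_US.iso-8859-1"}, {"LANG": "bogus"},
                  {"PYTHONIOENCODING": "utf-8"}):
    print(overrides, stdout_encoding(overrides))
```

The last case uses the standard `PYTHONIOENCODING` environment variable, which sets the stream encoding explicitly regardless of the locale.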

@birkenfeld
Member

HOWEVER: I just realized that with LANG=C (or LANG=bogus) the standalone interpreter and a PyO3-generated executable with embedded Python behave differently: the standalone interpreter defaults to UTF-8 stdout encoding, while the embedded one doesn't. (All with a 3.10 interpreter.)

Digging through the code, this is because of this config setting.

In an embedded environment initialized by Py_InitializeEx, this is set to 0 in _PyPreConfig_InitCompatConfig, while the standalone interpreter sets it to -1 which enables the heuristic described here: "The Python UTF-8 Mode is enabled if the LC_CTYPE locale is C or POSIX at Python startup (see the PyConfig_Read() function)."

So the actionable item here is (cc @davidhewitt) if we should switch from Py_InitializeEx to the mechanism described here and emulate the interpreter's new heuristic behavior.
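For reference, the effect of UTF-8 mode can be observed on the standalone interpreter with the `-X utf8` command-line option (Python 3.7+). This sketch only illustrates the flag on a standalone interpreter; it says nothing about how an embedded interpreter initialized by PyO3 is configured. The helper `probe` is made up for this example.

```python
import os
import subprocess
import sys


def probe(extra_args):
    """Return 'utf8_mode stdout_encoding' as reported by a child interpreter."""
    # Drop variables that would override the option being demonstrated.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONIOENCODING", "PYTHONUTF8")}
    code = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding)"
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print("default:", probe([]))             # depends on version and locale
print("-X utf8:", probe(["-X", "utf8"]))  # UTF-8 mode explicitly enabled
```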

@davidhewitt
Member

Sorry it's taken me a few days to get to this discussion. Agreed that Rust <-> Python string conversions are irrelevant here, both are unicode encoded.

On the locale, yes I absolutely agree that we can do better. Python UTF-8 mode was added in Python 3.7 (i.e. our minimum Python version), so I would be in favour of changing to support it better in PyO3. With issues like #2817 and #1741 kicking around, some reworking to our embedded interpreter initialization seems overdue. (Probably better defaults and a more flexible API combined.)

Given that PEP 686 has declared that Python 3.15 will use UTF-8 mode as the default, I'm personally tempted to suggest that we already make all PyO3 embedded interpreters use UTF-8 mode by default (i.e. set the config setting you link above to 1, not -1). This seems consistent with Rust's strong preference for UTF-8.

Let's make this something to action for 0.19

@davidhewitt davidhewitt added this to the 0.19 milestone Feb 14, 2023
@birkenfeld
Member

I'm not sure that we should switch from not matching python in one way to doing it in another way :) The least surprising behavior would be to match the interpreter we're building with, IMHO.

@davidhewitt
Member

Quite possibly, yes. I think there's going to be quite a few of these defaults to figure out when implementing "better" Python initialization...

@davidhewitt davidhewitt removed this from the 0.19 milestone Jun 16, 2023