Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-126742: allow to use non-UTF8 exception messages #126746

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

picnixz
Copy link
Contributor

@picnixz picnixz commented Nov 12, 2024

@picnixz picnixz changed the title gh-126742: allow to use translated exception messages gh-126742: allow to use non-UTF8 exception messages Nov 12, 2024
Python/errors.c Outdated Show resolved Hide resolved
Python/errors.c Outdated Show resolved Hide resolved
@ZeroIntensity
Copy link
Member

Technically, since this is a bugfix, let's wait until #126555 has landed before we merge this, and then address all the uses of dlerror() in this PR.

@ZeroIntensity ZeroIntensity added DO-NOT-MERGE needs backport to 3.12 bug and security fixes needs backport to 3.13 bugs and security fixes labels Nov 12, 2024
@picnixz
Copy link
Contributor Author

picnixz commented Nov 12, 2024

then address all the uses of dlerror() in this PR.

I actually thought of doing it two PRs. Technically, this specific PR can be merged without waiting for the other and then you can use this interface in the other (rather the other way around).

@ZeroIntensity
Copy link
Member

Eh, I would rather not. Let's not overwhelm someone with thread states and the private API on their first PR :)

@ZeroIntensity ZeroIntensity removed needs backport to 3.12 bug and security fixes needs backport to 3.13 bugs and security fixes DO-NOT-MERGE labels Nov 13, 2024
@ZeroIntensity
Copy link
Member

I've removed backport labels, per what Petr said in the issue. I think he's right--let's just focus on main with this.

@encukou
Copy link
Member

encukou commented Nov 15, 2024

Hmm, just had an idea: if we add a “char* in locale encoding” format to PyUnicode_FromFormat, it'll be possible to use it with PyErr_Format -- but also for non-error messages.

@ZeroIntensity
Copy link
Member

I like that idea! But we're exposing this as a public API--how often might users need this? (I've never needed to mess with locale in any of my C API usage, but to be fair I'm typically not calling APIs that return strings.)

@picnixz
Copy link
Contributor Author

picnixz commented Nov 15, 2024

Hmm, just had an idea: if we add a “char* in locale encoding” format to

I'm not very fond of adding a case to additionally check when there's not a lot of use cases (unless proven otherwise). It shouldn't affect performances much (but this needs verification) but this would complicate an already complicate function I think (and people need to remember this new format as well).

@ZeroIntensity
Copy link
Member

No no, I don't think he meant a variation of PyUnicode_FromFormat that takes a locale, he meant a format value %whatever that automatically calls PyUnicode_DecodeLocale or something like that.

@picnixz
Copy link
Contributor Author

picnixz commented Nov 15, 2024

No no, I don't think he meant a variation of PyUnicode_FromFormat that takes a locale, he meant a format value %whatever that automatically calls PyUnicode_DecodeLocale or something like that.

And I understood. I don't think a new code should be added unless there is a real need.

@ZeroIntensity
Copy link
Member

And I understood.

Ah! I couldn't tell, sorry.

A code search for PyUnicode_DecodeLocale is showing not too much real world usage--maybe that's a bad thing? Depending on how common it is for a non-UTF8 locale to break things, maybe it would be a good idea to add a format character just so people start handling it.

@encukou
Copy link
Member

encukou commented Nov 19, 2024

It could also show that we're mainly handling errors from the OS at one point -- the OSError constructor. That would be a good thing :)

Anyway, for this PR: @picnixz, do you want to add calls to the new helper(s)?

@ZeroIntensity
Copy link
Member

FWIW, Bénédikt is on vacation right now. I don't think he'll be able to implement anything until next week.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the purpose of these internal functions since they are now used. This PR only adds dead code. I would prefer to see how these functions are used.

If only dlerror() uses it, why not using code using dlerror()?

Include/internal/pycore_pyerrors.h Outdated Show resolved Hide resolved
@encukou encukou marked this pull request as draft November 20, 2024 09:46
@encukou
Copy link
Member

encukou commented Dec 2, 2024

The internal fix makes a lot sense, thank you!

Decoding with surrogateescape shouldn't fail for data-related reason. AFAIK, the expected failures should be something like MemoryError, a broken codec, or a KeyboardInterrupt. These should not be ignored.

Also, since _PyErr_SetLocaleString is not called directly, there could just be the PyErr_SetLocaleString that would call PyErr_SetObject.

Like Serhyi, I don't think a combanation of two public-API functions needs to be added to the public API, let alone Stable API. But, that's something for the C API WG to decide, if you bring it there.

@picnixz
Copy link
Contributor Author

picnixz commented Dec 2, 2024

I'll make it private for now and we can see in the future if we want or not to make it public.

AFAIK, the expected failures should be something like MemoryError, a broken codec, or a KeyboardInterrupt. These should not be ignored

Then, what should we do with the code that currently ignores them?

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a practical way to get a non-ASCII error messages not encoded to UTF-8 for the following functions?

  • dlerror()
  • gdbm_strerror()

How can I test (manually) modified code paths?

Doc/data/refcounts.dat Outdated Show resolved Hide resolved
Include/internal/pycore_pyerrors.h Outdated Show resolved Hide resolved
@picnixz
Copy link
Contributor Author

picnixz commented Dec 2, 2024

Is there a practical way to get a non-ASCII error messages not encoded to UTF-8 for the following functions?

I don't know :( Maybe changing the locale env. var. would be sufficient (assuming you have the translated messages) butI'm really bad at setting up locales so I don't really know =/

@vstinner
Copy link
Member

vstinner commented Dec 2, 2024

I confirm that dlerror() is encoded the current locale encoding:

import ctypes
import locale

RTLD_NOW = 2

locale.setlocale(locale.LC_ALL, "")

libdl = ctypes.cdll.LoadLibrary('libdl.so.2')
libdl.dlerror.restype = ctypes.c_char_p

libdl.dlopen(b'donotexist.so', RTLD_NOW)
errmsg = libdl.dlerror()
print(errmsg)

Output:

$ LANG=fr_FR.iso88591 ./python x.py 
b"donotexist.so: Ne peut ouvrir le fichier d'objet partag\xe9: Aucun fichier ou dossier de ce nom"

$ LANG=fr_FR.utf8 ./python x.py 
b"donotexist.so: Ne peut ouvrir le fichier d'objet partag\xc3\xa9: Aucun fichier ou dossier de ce nom"

The character "é" (U+00E9) is encoded to ISO-8859-1 (\xe9) or UTF-8 (\xc3\xa9).

We will add internal functions later in one go instead one-by-one.
Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM, but I didn't test it (manually).

Include/internal/pycore_pyerrors.h Outdated Show resolved Hide resolved
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translated error messages most likely depend on the LC_MESSAGE environment variable. It is a separate difficult issue if its encoding differs from LC_CTYPE's encoding.

But if it is non-ASCII because it includes file paths, we should use the FS encoding.

@picnixz
Copy link
Contributor Author

picnixz commented Dec 2, 2024

It is a separate difficult issue if its encoding differs from LC_CTYPE's encoding.

What should we do for that? should we fix this in a follow-up PR?

Co-authored-by: Victor Stinner <vstinner@python.org>
@vstinner
Copy link
Member

vstinner commented Dec 3, 2024

Translated error messages most likely depend on the LC_MESSAGE environment variable. It is a separate difficult issue if its encoding differs from LC_CTYPE's encoding.

IMO it's reasonable to expect LC_MESSAGE and LC_CTYPE locales to use the same encoding in this API. It's already an enhancement compared to the current code which always expect UTF-8.

@encukou
Copy link
Member

encukou commented Dec 4, 2024

Then, what should we do with the code that currently ignores them?

Raise the error from _PyErr_SetLocaleString.
I should have caught that when they were switched to "surrogateescape". Hopefully it's not too late now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants