Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-126742: allow to use non-UTF8 exception messages #126746

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

picnixz
Copy link
Contributor

@picnixz picnixz commented Nov 12, 2024

@picnixz picnixz changed the title gh-126742: allow to use translated exception messages gh-126742: allow to use non-UTF8 exception messages Nov 12, 2024
Python/errors.c Outdated Show resolved Hide resolved
Python/errors.c Outdated Show resolved Hide resolved
@ZeroIntensity
Copy link
Member

Technically, since this is a bugfix, let's wait until #126555 has landed before we merge this, and then address all the uses of dlerror() in this PR.

@ZeroIntensity ZeroIntensity added DO-NOT-MERGE needs backport to 3.12 bug and security fixes needs backport to 3.13 bugs and security fixes labels Nov 12, 2024
@picnixz
Copy link
Contributor Author

picnixz commented Nov 12, 2024

then address all the uses of dlerror() in this PR.

I actually thought of doing it two PRs. Technically, this specific PR can be merged without waiting for the other and then you can use this interface in the other (rather the other way around).

@ZeroIntensity
Copy link
Member

Eh, I would rather not. Let's not overwhelm someone with thread states and the private API on their first PR :)

@ZeroIntensity ZeroIntensity removed needs backport to 3.12 bug and security fixes needs backport to 3.13 bugs and security fixes DO-NOT-MERGE labels Nov 13, 2024
@ZeroIntensity
Copy link
Member

I've removed backport labels, per what Petr said in the issue. I think he's right--let's just focus on main with this.

@encukou
Copy link
Member

encukou commented Nov 15, 2024

Hmm, just had an idea: if we add a “char* in locale encoding” format to PyUnicode_FromFormat, it'll be possible to use it with PyErr_Format -- but also for non-error messages.

@ZeroIntensity
Copy link
Member

I like that idea! But we're exposing this as a public API--how often might users need this? (I've never needed to mess with locale in any of my C API usage, but to be fair I'm typically not calling APIs that return strings.)

@picnixz
Copy link
Contributor Author

picnixz commented Nov 15, 2024

Hmm, just had an idea: if we add a “char* in locale encoding” format to

I'm not very fond of adding a case to additionally check when there's not a lot of use cases (unless proven otherwise). It shouldn't affect performances much (but this needs verification) but this would complicate an already complicate function I think (and people need to remember this new format as well).

@ZeroIntensity
Copy link
Member

No no, I don't think he meant a variation of PyUnicode_FromFormat that takes a locale, he meant a format value %whatever that automatically calls PyUnicode_DecodeLocale or something like that.

@picnixz
Copy link
Contributor Author

picnixz commented Nov 15, 2024

No no, I don't think he meant a variation of PyUnicode_FromFormat that takes a locale, he meant a format value %whatever that automatically calls PyUnicode_DecodeLocale or something like that.

And I understood. I don't think a new code should be added unless there is a real need.

@ZeroIntensity
Copy link
Member

And I understood.

Ah! I couldn't tell, sorry.

A code search for PyUnicode_DecodeLocale is showing not too much real world usage--maybe that's a bad thing? Depending on how common it is for a non-UTF8 locale to break things, maybe it would be a good idea to add a format character just so people start handling it.

@encukou
Copy link
Member

encukou commented Nov 19, 2024

It could also show that we're mainly handling errors from the OS at one point -- the OSError constructor. That would be a good thing :)

Anyway, for this PR: @picnixz, do you want to add calls to the new helper(s)?

@ZeroIntensity
Copy link
Member

FWIW, Bénédikt is on vacation right now. I don't think he'll be able to implement anything until next week.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the purpose of these internal functions since they are now used. This PR only adds dead code. I would prefer to see how these functions are used.

If only dlerror() uses it, why not using code using dlerror()?

Include/internal/pycore_pyerrors.h Outdated Show resolved Hide resolved
@encukou encukou marked this pull request as draft November 20, 2024 09:46
Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a practical way to get a non-ASCII error messages not encoded to UTF-8 for the following functions?

  • dlerror()
  • gdbm_strerror()

How can I test (manually) modified code paths?

Doc/data/refcounts.dat Outdated Show resolved Hide resolved
Include/internal/pycore_pyerrors.h Outdated Show resolved Hide resolved
@picnixz
Copy link
Contributor Author

picnixz commented Dec 2, 2024

Is there a practical way to get a non-ASCII error messages not encoded to UTF-8 for the following functions?

I don't know :( Maybe changing the locale env. var. would be sufficient (assuming you have the translated messages) butI'm really bad at setting up locales so I don't really know =/

@vstinner
Copy link
Member

vstinner commented Dec 2, 2024

I confirm that dlerror() is encoded the current locale encoding:

import ctypes
import locale

RTLD_NOW = 2

locale.setlocale(locale.LC_ALL, "")

libdl = ctypes.cdll.LoadLibrary('libdl.so.2')
libdl.dlerror.restype = ctypes.c_char_p

libdl.dlopen(b'donotexist.so', RTLD_NOW)
errmsg = libdl.dlerror()
print(errmsg)

Output:

$ LANG=fr_FR.iso88591 ./python x.py 
b"donotexist.so: Ne peut ouvrir le fichier d'objet partag\xe9: Aucun fichier ou dossier de ce nom"

$ LANG=fr_FR.utf8 ./python x.py 
b"donotexist.so: Ne peut ouvrir le fichier d'objet partag\xc3\xa9: Aucun fichier ou dossier de ce nom"

The character "é" (U+00E9) is encoded to ISO-8859-1 (\xe9) or UTF-8 (\xc3\xa9).

We will add internal functions later in one go instead one-by-one.
Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM, but I didn't test it (manually).

Include/internal/pycore_pyerrors.h Outdated Show resolved Hide resolved
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translated error messages most likely depend on the LC_MESSAGE environment variable. It is a separate difficult issue if its encoding differs from LC_CTYPE's encoding.

But if it is non-ASCII because it includes file paths, we should use the FS encoding.

@picnixz
Copy link
Contributor Author

picnixz commented Dec 2, 2024

It is a separate difficult issue if its encoding differs from LC_CTYPE's encoding.

What should we do for that? should we fix this in a follow-up PR?

Co-authored-by: Victor Stinner <vstinner@python.org>
@vstinner
Copy link
Member

vstinner commented Dec 3, 2024

Translated error messages most likely depend on the LC_MESSAGE environment variable. It is a separate difficult issue if its encoding differs from LC_CTYPE's encoding.

IMO it's reasonable to expect LC_MESSAGE and LC_CTYPE locales to use the same encoding in this API. It's already an enhancement compared to the current code which always expect UTF-8.

@encukou
Copy link
Member

encukou commented Dec 4, 2024

Then, what should we do with the code that currently ignores them?

Raise the error from _PyErr_SetLocaleString.
I should have caught that when they were switched to "surrogateescape". Hopefully it's not too late now.

@picnixz
Copy link
Contributor Author

picnixz commented Dec 5, 2024

Oh ok! I'll change it in a few hours (coming back home right now)

@picnixz picnixz marked this pull request as draft December 8, 2024 13:31
@picnixz picnixz force-pushed the fix/locale-set-object-exception-126742 branch from ad8ad6f to 6f9ee0b Compare December 8, 2024 13:40
@picnixz picnixz marked this pull request as ready for review December 8, 2024 13:40
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add tests?

@picnixz
Copy link
Contributor Author

picnixz commented Dec 8, 2024

Could you please add tests?

Do you want tests for the additional messages or tests for the new function itself? For the first case, I don't really know how to do it since we need to check for the presence of the translation files but I can do it for the second.

@serhiy-storchaka
Copy link
Member

Try to use Python API that calls corresponding functions (like dlopen() and gdbm_open()) in non-UTF-8 locale. Current code fails if translations are installed.

Try to use it with non-ASCII name. It seems that the dlopen() error message contains the file name -- current code fails trying to decode it with UTF-8. This will work even if no translations are installed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants