gh-126700: pygettext: Support more gettext functions #126912

tomasr8 · 2024-11-16T21:53:01Z

pygettext currently only supports the single-argument form _('foo'). This can be extended via the --keyword CLI argument, but all such functions are assumed to be single-argument as well. That means no ngettext, pgettext, etc. pygettext is currently not able to emit msgctxt or msgid_plural at all.
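For reference, here is a small sketch of the argument layouts of the standard library `gettext` functions. The domain name `pygettext_demo` is a made-up placeholder. With no catalog installed, each call simply falls back to its untranslated message, which makes it easy to see which argument is the msgid, the plural form, or the context.

```python
import gettext

DOMAIN = 'pygettext_demo'  # hypothetical domain; no catalog is installed for it
n = 2

results = {
    # (msgid)
    'gettext': gettext.gettext('file'),
    # (msgid, msgid_plural, n)
    'ngettext': gettext.ngettext('file', 'files', n),
    # (msgctxt, msgid)
    'pgettext': gettext.pgettext('menu', 'Open'),
    # (msgctxt, msgid, msgid_plural, n)
    'npgettext': gettext.npgettext('menu', 'file', 'files', n),
    # The d* variants take the domain as an extra first argument.
    'dgettext': gettext.dgettext(DOMAIN, 'file'),
    'dngettext': gettext.dngettext(DOMAIN, 'file', 'files', n),
    'dpgettext': gettext.dpgettext(DOMAIN, 'menu', 'Open'),
    'dnpgettext': gettext.dnpgettext(DOMAIN, 'menu', 'file', 'files', n),
}

for name, value in results.items():
    print(f'{name}: {value}')
```

An extractor that only looks at the first argument would record the context or the domain as the msgid for most of these calls.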

This PR fixes that by adding support for all standard gettext functions:

gettext
ngettext
pgettext
npgettext
dgettext
dngettext
dpgettext
dnpgettext

These can be extended using --keyword=KEYWORD, but all such keywords are still assumed to be single-argument. Adding CLI support requires parsing the xgettext keyword specifier format (e.g. pgettext:1c,2), which can be done in a follow-up PR.
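As a rough illustration of what parsing the keyword specifier could involve, here is a minimal sketch. It is not part of this PR. The function name `parse_keyword_spec` is hypothetical, the output mapping mirrors the `{0: 'msgid'}` shape used in the diff, and only plain argument numbers and the `c` (context) suffix are handled. Other parts of the xgettext format, such as comments and total argument counts, are ignored.

```python
def parse_keyword_spec(spec):
    """Parse e.g. 'pgettext:1c,2' into ('pgettext', {0: 'msgctxt', 1: 'msgid'}).

    Hypothetical helper: xgettext argument numbers are 1-based, while the
    mapping returned here uses 0-based positions.
    """
    name, sep, args = spec.partition(':')
    if not sep:
        # A bare keyword is treated as single-argument.
        return name, {0: 'msgid'}

    positions = {}
    # The first plain number is the msgid, the second the plural form.
    message_fields = iter(('msgid', 'msgid_plural'))
    for arg in args.split(','):
        arg = arg.strip()
        if arg.endswith('c'):
            positions[int(arg[:-1]) - 1] = 'msgctxt'
            continue
        field = next(message_fields, None)
        if field is None:
            raise ValueError(f'too many message arguments in {spec!r}')
        positions[int(arg) - 1] = field
    return name, positions


print(parse_keyword_spec('pgettext:1c,2'))
print(parse_keyword_spec('ngettext:1,2'))
```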

Adding support for other gettext functions means that pygettext now extracts calls that may seem a bit questionable, such as _(x="kwargs"), but this is consistent with other extractors like xgettext and pybabel. It is also the price of using a token-based extractor: if we want to keep the code reasonably simple, we'll need to accept some false positives. I would eventually like to see pygettext switch to an AST-based extractor, which would eliminate these issues.
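To illustrate why an AST makes this easier, here is a minimal sketch (not the implementation from any PR; `extract_msgids` is a hypothetical name). The `ast` module separates positional arguments from keyword arguments, so a call like `_(x="kwargs")` is not picked up unless the extractor explicitly looks at keywords.

```python
import ast


def extract_msgids(source, keywords=('_',)):
    """Collect first-positional-argument string literals of keyword calls."""
    msgids = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if not (isinstance(node.func, ast.Name) and node.func.id in keywords):
            continue
        # node.args holds positional arguments only; x="kwargs" lives in
        # node.keywords and is therefore skipped here.
        if node.args and isinstance(node.args[0], ast.Constant) \
                and isinstance(node.args[0].value, str):
            msgids.append(node.args[0].value)
    return msgids


print(extract_msgids('_("foo")\n_(x="kwargs")\n'))
```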

I tried to keep the diff minimal, but some larger changes were needed. Feedback welcome!

Issue: pygettext: Add support for multi-argument gettext functions #126700

tomasr8 · 2024-11-16T21:54:17Z

Tools/i18n/pygettext.py

-DEFAULTKEYWORDS = ', '.join(default_keywords)
-
-EMPTYSTRING = ''
+__version__ = '1.6'


I bumped the version since this adds some new capabilities, but let me know if it's not needed!

It makes sense if the script is distributed separately. But when it is part of the Python distribution, I think that we should use the Python version. We can discuss this in a separate issue.

Good point, I reverted that change. My original reasoning was that, since the version is written to the POT file, we might want to bump it, but I agree that it should use the Python version itself, not a separate version.

tomasr8 · 2024-11-16T21:56:02Z

Lib/test/test_tools/test_i18n.py

@@ -332,14 +332,14 @@ def test_calls_in_fstring_with_multiple_args(self):
        msgids = self.extract_docstrings_from_str(dedent('''\
        f"{_('foo', 'bar')}"
        '''))
-        self.assertNotIn('foo', msgids)
+        self.assertIn('foo', msgids)


Both xgettext and pybabel extract this and it would take a lot of effort to disallow this in general with the current extractor, so I'd leave this for now at least.

tomasr8 · 2024-11-16T21:58:13Z

Tools/i18n/pygettext.py

@@ -613,7 +683,9 @@ class Options:
    make_escapes(not options.escape)

    # calculate all keywords
-    options.keywords.extend(default_keywords)
+    options.keywords = {kw: {0: 'msgid'} for kw in options.keywords}


--keyword still works, but the keywords are assumed to be single-argument. A follow-up PR could add support for more.
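A sketch of how such a keyword table might look, assuming the `{position: field}` mapping shown in the diff above. The table `DEFAULT_KEYWORDS` and the helper `build_keywords` are illustrative names, and the positions are inferred from the standard `gettext` signatures rather than copied from the PR.

```python
# Hypothetical table: 0-based argument position -> field in the POT entry.
DEFAULT_KEYWORDS = {
    '_': {0: 'msgid'},
    'gettext': {0: 'msgid'},
    'ngettext': {0: 'msgid', 1: 'msgid_plural'},
    'pgettext': {0: 'msgctxt', 1: 'msgid'},
    'npgettext': {0: 'msgctxt', 1: 'msgid', 2: 'msgid_plural'},
    # For the d* variants, position 0 is the domain and is not extracted.
    'dgettext': {1: 'msgid'},
    'dngettext': {1: 'msgid', 2: 'msgid_plural'},
    'dpgettext': {1: 'msgctxt', 2: 'msgid'},
    'dnpgettext': {1: 'msgctxt', 2: 'msgid', 3: 'msgid_plural'},
}


def build_keywords(cli_keywords, use_defaults=True):
    """Treat every --keyword value as single-argument, then add defaults."""
    keywords = {kw: {0: 'msgid'} for kw in cli_keywords}
    if use_defaults:
        keywords.update(DEFAULT_KEYWORDS)
    return keywords


print(build_keywords(['tr'])['tr'])
```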

serhiy-storchaka

Great! I just took a look and will do a more detailed review tomorrow. So far I see two problems:

It does not work correctly when the first argument of dgettext is a complex expression that includes parentheses or commas. Nested parentheses should be counted as in __suiteseen.
No warning is issued when the expected argument is not a string literal. Such bugs were recently fixed in argparse. The new i18n argparse tests would catch such bugs, because pygettext emits warnings, but with this PR it silently ignores them.

tomasr8 · 2024-11-17T11:08:40Z

It does not work correctly when the first argument of dgettext is a complex expression that includes parentheses or commas. Nested parentheses should be counted as in __suiteseen.

That should be fixed now! I used the same mechanism that __suiteseen does.

No warning is issued when the expected argument is not a string literal. Such bugs were recently fixed in argparse. The new i18n argparse tests would catch such bugs, because pygettext emits warnings, but with this PR it silently ignores them.

I restored the warnings. I initially removed them in order to allow extraction of keyword arguments (e.g. _(x="foo")) which is supported by xgettext and pybabel. Now that the warnings are restored, this is not allowed anymore since properly parsing those would add a lot of complexity (it wasn't allowed before either so the behaviour of pygettext does not change).

Note that there are still a lot of edge cases when it comes to extraction which I didn't want to address in this PR. For instance, extraction of nested constructs such as _('foo', param=_('bar')) and f-strings. Addressing those would be considerably easier if we eventually switched to a parser-based approach as in #104402. If you think it's worthwhile I'd be happy to continue work on that PR once this lands 🙂
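The two points above (counting nested enclosures and flagging non-literal arguments) can be sketched with the standard `tokenize` module. This is a simplified standalone illustration, not the code in pygettext. The helper names `split_call_args` and `literal_value` are hypothetical, and the input is assumed to be a single call expression.

```python
import ast
import io
import tokenize

SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER}


def split_call_args(call_source):
    """Split the arguments of one call at commas that are not nested."""
    args, current, depth, opened = [], [], 0, False
    for tok in tokenize.generate_tokens(io.StringIO(call_source).readline):
        if tok.type in SKIPPED:
            continue
        if not opened:
            # Skip the function name; start collecting after the first '('.
            opened = tok.type == tokenize.OP and tok.string == '('
            continue
        if tok.type == tokenize.OP:
            if depth == 0 and tok.string == ')':
                if current:
                    args.append(current)
                break
            if depth == 0 and tok.string == ',':
                args.append(current)
                current = []
                continue
            if tok.string in '([{':
                depth += 1
            elif tok.string in ')]}':
                depth -= 1
        current.append(tok)
    return args


def literal_value(arg_tokens):
    """Return the string value if the argument is only string literals."""
    if not arg_tokens or any(t.type != tokenize.STRING for t in arg_tokens):
        return None  # a real extractor would emit a warning here
    try:
        return ast.literal_eval(' '.join(t.string for t in arg_tokens))
    except (ValueError, SyntaxError):
        return None  # e.g. an f-string tokenized as a single STRING


args = split_call_args("dgettext(domain(a, b), 'foo' 'bar')")
print([literal_value(arg) for arg in args])
```

Because commas and closing parentheses only end an argument at depth zero, the nested `domain(a, b)` call stays inside the first argument instead of shifting the msgid position.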

Lib/test/test_tools/i18n_data/messages.py

serhiy-storchaka · 2024-11-18T09:22:19Z

Lib/test/translationdata/argparse/msgids.txt

@@ -8,6 +8,8 @@ argument %(argument_name)s: %(message)s
 argument '%(argument_name)s' is deprecated
 can't open '%(filename)s': %(error)s
 command '%(parser_name)s' is deprecated
+conflicting option string: %s


Should msgid_plural also be output? Or do this in the following PR?

We might also want to include msgctxt even though I don't think any messages use pgettext currently. I was thinking instead of using a list of msgids, why not use the generated POT file?

We initially rejected that idea when adding the snapshots because of potentially changing line locations, but pygettext has an option to turn those off.

We could even add an option to not emit the header (pybabel has this for instance). Then the snapshots would only change if the strings themselves change.
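For context on what a POT-based snapshot would contain, here is a small sketch of how a single entry with msgctxt and msgid_plural is laid out in the PO format. The helper `format_pot_entry` is hypothetical and much simpler than pygettext's actual writer: it omits the header, comments, locations, and line wrapping.

```python
def escape(text):
    """Minimal escaping for a PO string (backslash, quote, newline)."""
    return text.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')


def format_pot_entry(msgid, msgctxt=None, msgid_plural=None):
    """Format one template entry; translations are left empty."""
    lines = []
    if msgctxt is not None:
        lines.append(f'msgctxt "{escape(msgctxt)}"')
    lines.append(f'msgid "{escape(msgid)}"')
    if msgid_plural is not None:
        lines.append(f'msgid_plural "{escape(msgid_plural)}"')
        # Plural entries use indexed msgstr slots instead of a single msgstr.
        lines.append('msgstr[0] ""')
        lines.append('msgstr[1] ""')
    else:
        lines.append('msgstr ""')
    return '\n'.join(lines)


print(format_pot_entry('file', msgctxt='menu', msgid_plural='files'))
```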

Misc/NEWS.d/next/Tools-Demos/2024-11-16-20-47-20.gh-issue-126700.ayrHv4.rst

serhiy-storchaka · 2024-11-18T09:26:10Z

Tools/i18n/pygettext.py

-DEFAULTKEYWORDS = ', '.join(default_keywords)
-
-EMPTYSTRING = ''
+__version__ = '1.6'


It makes sense if the script is distributed separately. But when it is part of the Python distribution, I think that we should use the Python version. We can discuss this in a separate issue.

serhiy-storchaka · 2024-11-18T09:29:04Z

Tools/i18n/pygettext.py

 class TokenEater:
    def __init__(self, options):
        self.__options = options
        self.__messages = {}
        self.__state = self.__waiting
-        self.__data = []
+        self.__data = defaultdict(str)


Why use defaultdict?

So that I can do this one-liner 😄 :

https://github.com/python/cpython/pull/126912/files#diff-27bf55510663a73d2fdea1e604efdb59e0115378530202b5c55d04656dedece2R513
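The linked line is not reproduced here, so the following is only a guess at the kind of one-liner a `defaultdict(str)` enables: appending string pieces to a field without first checking whether the field exists. The field names and pieces are made up for illustration.

```python
from collections import defaultdict

data = defaultdict(str)

# Hypothetical stream of (field, piece) pairs, e.g. from implicitly
# concatenated string literals such as 'foo' 'bar'.
pieces = [('msgid', 'foo'), ('msgid', 'bar'), ('msgid_plural', 'foos')]
for field, piece in pieces:
    data[field] += piece  # a missing key starts out as ''

print(dict(data))
```

With a plain dict, the same loop would need `data[field] = data.get(field, '') + piece` or an explicit membership check.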

Tools/i18n/pygettext.py

serhiy-storchaka · 2024-11-18T09:37:12Z

Tools/i18n/pygettext.py

+            elif tstring in ')]}':
+                self.__enclosurecount -= 1
+            elif expect_string_literal:
+                # We are inside an argument which is a translatable string and


I think this can be merged with the below. But I can be wrong.

Hmm, I don't think it can, unless I am misunderstanding what you mean.

It is not so important now, after you moved the ugly part of the code into warn_unexpected_token(), but you can avoid the code duplication by using earlier returns.

    if ttype == tokenize.OP and self.__enclosurecount == 0:
        if tstring == ')':
            ...
            return
        if tstring == ',':
            ...
            return
    if expect_string_literal:
        ...  # handle string literals, comments, etc
    else:
        ...  # handle parentheses

Nice! I wasn't considering early returns but I think it's more digestible this way. I updated the code and added more test cases :)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

serhiy-storchaka · 2024-11-21T10:55:56Z

Tools/i18n/pygettext.py

+            elif tstring in '([{':
+                self.__enclosurecount += 1
+            elif tstring in ')]}':
+                self.__enclosurecount -= 1


These are invalid if expect_string_literal is true.

I suspect that this does not work correctly for _('string'[i]).

See the comment below.
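To make the `_('string'[i])` concern concrete, here is a short sketch that lists the tokens the tokenizer produces for that call. The helper name `token_strings` is hypothetical. The string literal is followed by an opening bracket inside the same argument, so the argument is not a plain string literal even though it starts with one.

```python
import io
import tokenize

IGNORED = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def token_strings(source):
    """Return (token type name, token text) pairs for a source string."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[t.type], t.string)
            for t in tokens if t.type not in IGNORED]


for name, text in token_strings("_('string'[i])"):
    print(name, text)
```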

serhiy-storchaka · 2024-11-21T11:04:08Z

Tools/i18n/pygettext.py

+            elif tstring in ')]}':
+                self.__enclosurecount -= 1
+            elif expect_string_literal:
+                # We are inside an argument which is a translatable string and


It is not so important now, after you moved the ugly part of the code into warn_unexpected_token(), but you can avoid the code duplication by using earlier returns.

    if ttype == tokenize.OP and self.__enclosurecount == 0:
        if tstring == ')':
            ...
            return
        if tstring == ',':
            ...
            return
    if expect_string_literal:
        ...  # handle string literals, comments, etc
    else:
        ...  # handle parentheses

serhiy-storchaka

LGTM. 👍

tomasr8 added 4 commits November 16, 2024 22:48

Support multi-argument gettext functions

22f7803

Update snapshots

06686c9

Bump pygettext version

e5c2746

Add news entry

511a0c0

tomasr8 requested a review from serhiy-storchaka November 16, 2024 21:53

tomasr8 requested a review from savannahostrowski as a code owner November 16, 2024 21:53

bedevere-app bot added the awaiting review label Nov 16, 2024

bedevere-app bot mentioned this pull request Nov 16, 2024

pygettext: Add support for multi-argument gettext functions #126700

Closed

serhiy-storchaka reviewed Nov 16, 2024

tomasr8 added 3 commits November 17, 2024 11:10

Correctly count enclosures

62d6455

Restore warnings for invalid arguments

496f5d9

Remove extra space

06186a0

Simplify code

3d67a7a

Update comment

48070d5

serhiy-storchaka reviewed Nov 18, 2024

tomasr8 and others added 4 commits November 18, 2024 18:27

Only extract when __enclosure_count is 0

d6fd789

Keep the old version

3497690

Improve news entry

192187e

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

Update snapshots

9cfc901

tomasr8 requested a review from serhiy-storchaka November 21, 2024 09:28

serhiy-storchaka reviewed Nov 21, 2024

Refactor __openseen

24851c5

serhiy-storchaka approved these changes Nov 22, 2024

bedevere-app bot added awaiting merge and removed awaiting review labels Nov 22, 2024

serhiy-storchaka merged commit 0a1944c into python:main Nov 22, 2024
40 checks passed

bedevere-app bot removed the awaiting merge label Nov 22, 2024

tomasr8 deleted the ngettext branch November 22, 2024 19:01

tomasr8 mentioned this pull request Nov 30, 2024

gh-104400: pygettext: use an AST parser instead of a tokenizer #104402

Open
