-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-126700: pygettext: Support more gettext functions #126912
Conversation
Tools/i18n/pygettext.py
Outdated
DEFAULTKEYWORDS = ', '.join(default_keywords) | ||
|
||
EMPTYSTRING = '' | ||
__version__ = '1.6' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bumped the version since this adds some new capabilities, but let me know if it's not needed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense if the script is separately distributed. But when it is the part of the Python distribution, I think that we should use the Python version. We can discuss this in a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I reverted that change. My original reasoning was that since the version is written to the POT file we might want to bump it up but I agree that it should use the Python version itself, not a separate version.
@@ -332,14 +332,14 @@ def test_calls_in_fstring_with_multiple_args(self): | |||
msgids = self.extract_docstrings_from_str(dedent('''\ | |||
f"{_('foo', 'bar')}" | |||
''')) | |||
self.assertNotIn('foo', msgids) | |||
self.assertIn('foo', msgids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both xgettext
and pybabel
extract this and it would take a lot of effort to disallow this in general with the current extractor, so I'd leave this for now at least.
@@ -613,7 +683,9 @@ class Options: | |||
make_escapes(not options.escape) | |||
|
|||
# calculate all keywords | |||
options.keywords.extend(default_keywords) | |||
options.keywords = {kw: {0: 'msgid'} for kw in options.keywords} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--keyword
still works but the keywords are assumed to be single-argument. A followup PR could add support for more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I just took a look and will do a more detailed review tomorrow. So far I see two problems:
- It does not work correctly when the first argument ща
dgettext
is a complex expression that includes parentheses or commas. Nested parentheses should be counted as in__suiteseen
. - No warning is issued when the expected argument is not a string literal. Such bug were recently fixed in
argparse
. The new i18nargparse
tests would catch such bugs, becausepygettext
emits warnings, but with this PR it silently ignores them.
That should be fixed now! I used the same mechanism that
I restored the warnings. I initially removed them in order to allow extraction of keyword arguments (e.g. Note that there are still a lot of edge cases when it comes to extraction which I didn't want to address in this PR. For instance, extraction of nested constructs such as |
@@ -8,6 +8,8 @@ argument %(argument_name)s: %(message)s | |||
argument '%(argument_name)s' is deprecated | |||
can't open '%(filename)s': %(error)s | |||
command '%(parser_name)s' is deprecated | |||
conflicting option string: %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should msgid_plural
also be output? Or do this in the following PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also want to include msgctxt
even though I don't think any messages use pgettext
currently. I was thinking instead of using a list of msgids, why not use the generated POT file?
We initially rejected that idea when adding the snapshots because of potentially changing line locations, but pygettext has an option to turn those off.
We could even add an option to not emit the header (pybabel has this for instance). Then the snapshots would only change if the strings themselves change.
Misc/NEWS.d/next/Tools-Demos/2024-11-16-20-47-20.gh-issue-126700.ayrHv4.rst
Outdated
Show resolved
Hide resolved
Tools/i18n/pygettext.py
Outdated
DEFAULTKEYWORDS = ', '.join(default_keywords) | ||
|
||
EMPTYSTRING = '' | ||
__version__ = '1.6' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense if the script is separately distributed. But when it is the part of the Python distribution, I think that we should use the Python version. We can discuss this in a separate issue.
class TokenEater: | ||
def __init__(self, options): | ||
self.__options = options | ||
self.__messages = {} | ||
self.__state = self.__waiting | ||
self.__data = [] | ||
self.__data = defaultdict(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use defaultdict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that I can do this one-liner 😄 :
Tools/i18n/pygettext.py
Outdated
elif tstring in ')]}': | ||
self.__enclosurecount -= 1 | ||
elif expect_string_literal: | ||
# We are inside an argument which is a translatable string and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be merged with the below. But I can be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I don't think it can, unless I am misunderstading what you mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not so important now, after you moved the ugly part of the code into warn_unexpected_token()
, but you can avoid the code duplication by using earlier return
s.
if ttype == tokenize.OP and self.__enclosurecount == 0:
if tstring == ')':
...
return
if tstring == ',':
...
return
if expect_string_literal:
... # handle string literals, comments, etc
else:
... # handle parentheses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I wasn't considering early returns but I think it's more digestible this way. I updated the code and added more test cases :)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Tools/i18n/pygettext.py
Outdated
elif tstring in '([{': | ||
self.__enclosurecount += 1 | ||
elif tstring in ')]}': | ||
self.__enclosurecount -= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are invalid if expect_string_literal
is true.
I suspect that this does not work correctly for _('string'[i])
.
See the comment below.
Tools/i18n/pygettext.py
Outdated
elif tstring in ')]}': | ||
self.__enclosurecount -= 1 | ||
elif expect_string_literal: | ||
# We are inside an argument which is a translatable string and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not so important now, after you moved the ugly part of the code into warn_unexpected_token()
, but you can avoid the code duplication by using earlier return
s.
if ttype == tokenize.OP and self.__enclosurecount == 0:
if tstring == ')':
...
return
if tstring == ',':
...
return
if expect_string_literal:
... # handle string literals, comments, etc
else:
... # handle parentheses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. 👍
pygettext
currently only supports the single-argument form_('foo')
. This can be extended via the--keyword
CLI argument but all functions are assumed to be single-argument as well. That means nongettext
,pgettext
, etc..pygettext
is currently not able to emitmsgctxt
andmsgid_plural
at all.This PR fixes that by adding support for all standard gettext functions:
These can be extended using
--keyword=KEYWORD
but all such keywords are still assumed to be single-argument. Adding CLI support requires parsing thexgettext
keyword specifier format (e.g.pgettext:1c,2
) which can be done in a followup PR.Adding support for other gettext functions means that pygettext now extracts calls that may seem a bit questionable such as
_(x="kwargs")
but it is consistent with other extractors likexgettext
andpybabel
. This is also the price for using a token-based extractor. If we want to keep the code reasonably simple, we'll need to accept some false positives. I would eventually like to see pygettext switch to an AST-based extractor which would eliminate these issues.I tried to keep the diff minimal, but some larger changes were needed. Feedback welcome!