-
-
Notifications
You must be signed in to change notification settings - Fork 488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I18n: Add check to detect string wrapped in HTML #1856
Conversation
WordPress/Sniffs/WP/I18nSniff.php
Outdated
@@ -80,6 +80,8 @@ class I18nSniff extends AbstractFunctionRestrictionsSniff { | |||
[bcdeEfFgGosuxX] # Type specifier. | |||
)/x'; | |||
|
|||
const WRAPPED_IN_HTML_REGEX = '#^(<[^>]+>).*(<[^>]+>)$#'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also include the forward slash /
in the closing tag, so it doesn't match <foo>123<foo>
? i.e. ^(<[^>]+>).*(<\/[^>]+>)$
Also, not sure if it matters, but this will also match empty content (e.g. <foo><foo>
) or non-alphanumeric tags (e.g. <<>>123<<>
). Maybe using \w
instead of [^>]
would handle these? 🤔
Sorry if I'm just adding noise, only tested briefly here: https://regex101.com/r/rIp4dV/3/tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit more noise: here's same tests with \w
https://regex101.com/r/rIp4dV/4/tests :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah definitely some noise. HTML5 has optional closing slashes
whoops totally irrelevant comment - HTML5 has optional self closing slashes for void elements i believe. but that's totally irrelevant here :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you've pointed out in Slack, it's really hard to parse HTML with a RegEx -- even if that HTML is supposedly simply, there can be nuance such as e.g. an attribute containing >
, such as in <address id="%s" data-type=">foo" value="%s">my address</address>
See this famous Stack Overflow post: https://stackoverflow.com/a/1732454
The correct way would probably be to use one of PHP's included HTML parsing facilities, but OTOH, that seems a bit overblown.
Some of these cases are arguably rather pathological, so I'm going to stop tweaking the RegEx for now. I think it might be fine for starters, and it can be extended to include use cases that other people might run into over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, agree 100%. 😃
See this famous Stack Overflow post: https://stackoverflow.com/a/1732454
Best post on Stack Overflow! 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ockham Thank you for this PR. The principle of it looks good, the fine details still need some work.
I hope you'll find the feedback I've left useful. Please let me know if you have any questions.
Also: please make sure the build passes on your PR. In this case, comments are missing punctuation, but you can review the Travis output after pushing updates to see the details on why a build is failing (if it fails).
// phpcs:set WordPress.WP.I18n text_domain[] my-slug | ||
|
||
__( '<address>123 Fake Street</address>', 'my-slug' ); // Bad, string shouldn't be wrapped in HTML. | ||
__( 'I live at <address>123 Fake Street</address>', 'my-slug' ); // Okay, no consensus on HTML tags withing strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see some more unit tests. This feels very minimal for a new feature.
I think this might be ready for another look 😊 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ockham Thanks for making those changes. Definitely going in the right direction, though it still needs a little more work. Hope you'll find my remarks useful. Let me know if you have any questions.
Also: once done, could you squash the commits into one ?
__( 'I live at <address>123 Fake Street</address>', 'my-slug' ); // Okay, no consensus on HTML tags withing strings. | ||
__( '<div class="my-class">Translatable content</div>', 'my-slug' ); // Bad | ||
__( '<option id="%1$s" value="%2$s">Translatable option name</option>', 'my-slug' ); // Wrapping is okay, since there are placeholders | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to see some additional tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added your example and another variation of an existing one, but seem to struggle to come up with meaningful ones overall. Could you give me some pointers as to what you're thinking of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, try this as a good starting point for finding more tests: try running the sniff over WP core and see which false positives it throws up for this new error code. Those cases will all need to be accounted for and the patterns found which caused the false positives should be covered by unit tests.
Also see the link in the original issue to Core trac where there are also quite some examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, try this as a good starting point for finding more tests: try running the sniff over WP core and see which false positives it throws up for this new error code. Those cases will all need to be accounted for and the patterns found which caused the false positives should be covered by unit tests.
Ah yeah, makes sense, thanks. wp-includes
doesn't yield any occurrences, wp-content
has the following:
Those seem legit, but I think we've covered that class of issue with our current unit tests, no?
wp-admin
has quite a few, e.g.
and so on.
A lot of these are <a />
links where only the link text needs translating, and the actual <a />
could be moved out of the __()
call.
Also see the link in the original issue to Core trac where there are also quite some examples.
Uh, not sure which issue you mean 😅 Maybe this list (which #1419 (comment) refers to)? I checked a few of those.
<span>Previewing:</span> %s
Plugin <strong>resumed</strong>.
<span class="screen-reader-text">link to view ratings opens in a new tab</span>
These also seem pretty much covered I'd say 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these are links where only the link text needs translating, and the actual could be moved out of the __() call.
Sorry, but that's not true. Those are legit translation calls including the tag as there may be a localized page available for that URL and part of translating is changing the URL to the localized version.
I think it would be a good idea to allow for those, i.e. if the tag is an a
AND has a href
with a hard-coded URL, there should be no warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these are links where only the link text needs translating, and the actual could be moved out of the __() call.
Sorry, but that's not true. Those are legit translation calls including the tag as there may be a localized page available for that URL and part of translating is changing the URL to the localized version.
Right, I didn't think of that.
I think it would be a good idea to allow for those, i.e. if the tag is an
a
AND has ahref
with a hard-coded URL, there should be no warning.
I'm not totally sure it should be generalized that much though? This seems like a fairly 'semantic' addition to the sniff that implies quite some background information (there can be localized link targets), which might not be true for all codebases alike. One could argue that the localized link targets are more the exception than the rule, that the sniff doesn't necessarily need to know about them, and that it'd be fine to require a sniff ignore to be added by the author in case it's needed 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the warning would be kept for link tags with hard-coded URLs, I'd like to see a separate error code for those.
That would at least allow codebases - including WP Core - which do have localized URLs and therefore do this with good reason to exclude the warning for those particular text strings from their custom ruleset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added per 1294413
I was just talking about this PR with a friend of mine and he suggested (rightly) to also consider the following case for which the HTML tags should not be removed: __( '<span>Text</span> More text <span>Text</span>', 'domain' ); |
d689d4c
to
c766d3c
Compare
Fixes the non-controversial part of WordPress#1419 by adding a sniff that detects translated strings wrapped in HTML tags, and a corresponding unit test. Quoting WordPress#1419 (comment): > Examples where the markup _should_ be removed from inside of the string would be: > > ``` > <?php __( '<h1>Settings Page</h1>', 'text-domain ); ?> > should be > <h1><?php __( 'Settings Page', 'text-domain' ); ?></h1> > because the markup only wraps the string. > ``` Note that I had to add code to `I18nUnitTest.php` to reset the text domain as set via `setConfigData()` for one test file, as it would otherwise persist for every test file after that one.
c766d3c
to
a43ae6d
Compare
Rewrote the sniff using Also ran the sniff on the WordPress codebase (results are here), which seems to indicate no false positives so far, and that our current unit tests are at least representative of the kind of code encountered in a real-world codebase. Think this is in good enough shape to be considered for merging, and where potential refinements could be addressed in follow-up PRs? 😅 |
I'll have another look tomorrow but this PR will definitely not go into WPCS 2.2.1, which is due to be released tomorrow. I'd want this feature to be in place for beta testing for at least a little while so we can see if any more issues are reported. |
👋 @jrfnl Would you mind giving this another look? 🙂 |
@ockham Has my remark in #1856 (comment) been addressed ? |
Oh and please also include some tests with malformed/incomplete HTML and verify that the code handles this gracefully. I seem to remember that XMLreader does not handle that well. Some examples of the top of my head (please try and find some more/more specific ones for known issues): __( '<div>text to translate', 'domain' );
__( 'text to <div>translate', 'domain' );
__( 'text < to translate', 'domain' );
__( 'text to > translate', 'domain' ); |
I added a few commits to add your malformed HTML examples, plus some that Christos had in his regex tests. I also added a rule for |
I'd like to get some more eyes on this. @dingo-d @GaryJones Would you please have a look ? |
Looks good just by looking at it, I have started to run the i18n sniff on themes repo that I have locally (this will take a while), just to see if anything interesting pops up, since themes can get creative when it comes to what they add in the translation strings 😄 |
I ran this over my local themes repo and from what I see it caught everything correctly https://gist.github.com/dingo-d/aecd13368ec1a91e1bd40772d0d94a21 I ran over the themes using vendor/bin/phpcs --sniffs=WordPress.WP.i18n ../WordPress-Themes/themes/ --report-file=../phpcs-result.txt --error-severity=0 --extensions=php --report=code This caught both |
|
||
// Strings wrapped in '<a href="...">...</a>'. These get a different warning, as the link target might actually be localized. | ||
// phpcs:disable WordPress.WP.I18n.NoHtmlWrappedStrings | ||
__( '<a href="http://wordpress.org">WordPress</a>', 'my-slug' ); // Bad, strings shouldn't be wrapped in <a href="...">...</a>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the one that would concern me. By adding a warning, we're effectively recommending that URLs shouldn't be localized, especially when the comments say that this is "bad".
I'd rather see this one as good, and the placeholder test below possibly as bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this comment, and the one below.
I ended up adding a separate warning for this case (NoAHrefWrappedStrings
vs NoHtmlWrappedStrings
) so that users would be able to separately disable the <a href=".">...</a>
rule (while retaining the more 'generic' one about wrapping strings in HTML tags), to effectively get the behavior that you're describing. Does that sound okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had read that, and Juliette has the same concerns that I do.
My concern, is that the person writing the code, may not know of any localised link, but is being a good citizen by making the URL available for translation. Yet, the code in this PR is penalising them for that.
If instead of:
__( '<a href="http://wordpress.org">WordPress</a>', 'my-slug' );
It was:
__( '<a href="http://wordpress.org">WordPress</a> Rocks!', 'my-slug' );
then this sniff would have no issue with the hard-coded URL.
I'm all for the rest, and I know this currently has a different violation code, but right now, I'd rather let any strings wrapped in <a ...>...</a>
to not be flagged, regardless of attribute values or placeholders, and then maybe tighten it up in the future if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the one that would concern me. By adding a warning, we're effectively recommending that URLs shouldn't be localized, especially when the comments say that this is "bad".
This is incorrect. The recommended approach would be this: <a href="<?php echo esc_url( __( 'https://wordpress.org/' ) ); ?>"><?php _e( 'WordPress Rocks' ); ?></a>
, and not "not localizing" the URL.
In the approach you recommend (including the <a>
tag incl the href
value), translation systems like GlotPress (on translate.wordpress.org) will actually throw a warning if you want to submit a translation with a changed tag (and changing the URL in a tag is changing the tag), so I'd consider this discouraged as well.
This is why I think it's preferrable to wrap the URL separately in gettext and have the sniff advise so (i.e. change the message Strings should not be wrapped in <a href="...">...</a>
to something like Strings should not be wrapped in <a href="...">...</a>, please wrap the URL separately
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recommended approach would be this:
<a href="<?php echo esc_url( __( 'https://wordpress.org/' ) ); ?>"><?php _e( 'WordPress Rocks' ); ?></a>
, and not "not localizing" the URL.
Your example doesn't match what I put. In mine, WordPress
was linked, while Rocks
was not. By following your approach, the string would be split.
https://developer.wordpress.org/apis/handbook/internationalization/internationalization-guidelines/#best-practices-for-writing-strings says "Do not leave URLs for translation, unless they could have a version in another language.".
In the case of my example, I'd want to switch it to https://en-gb.wordpress.org
.
An examples where it wouldn't be switched, would be when it points to an internal settings page.
At this stage, I wouldn't want to get into analysing the URL itself to decide whether to flag it or not, and so, would still like to see this not flagged for this first pass. With more discussion, and consensus, maybe a variant of this check for a href could be added.
translation systems like GlotPress (on translate.wordpress.org) will actually throw a warning if you want to submit a translation with a changed tag
Correct, along with warnings for other validation failures as well - that's just a way to highlight it to Translation Editors, rather than let potentially spam URLs to make it through to production translation sets. It's not to discourage valid localised URLs from being submitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your example doesn't match what I put. In mine, WordPress was linked, while Rocks was not. By following your approach, the string would be split.
This was for the sake of an argument. The whole point of this PR is to encourage leaving out as much as HTML possbile from Gettext string. Your string of argument is distracting from this approach.
Correct, along with warnings for other validation failures as well - that's just a way to highlight it to Translation Editors, rather than let potentially spam URLs to make it through to production translation sets. It's not to discourage valid localised URLs from being submitted.
We can disagree about the reasons behind the validation failure. In general your approach encourages translators to change HTML tags which in my opinion should not be their business and they should actually be protected. It's easy to introduce invalid HTML and just dismiss the warning. So actually the fact that you can override warnings in GlotPress is not helpful in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point of this PR is to encourage leaving out as much as HTML possbile from Gettext string.
Your definition of "as possible" and mine clearly differ. I'll refer to my previous comment on the issue as for why.
In general your approach encourages translators to change HTML tags which in my opinion should not be their business and they should actually be protected.
Your opinion disagrees with the gettext
manual, WP best practices document, and one or two handbooks.
Besides that, the prior art here is that the translators using GlotPress have been handling (wrapping) markup in strings from developers who didn't know better for a number of years now. That, on its own, isn't necessarily reason enough to continue to allow it, but I can't see what major issue has happened that would warrant the overly strict set of checks that you want to see imposed in the first pass.
As I've said, I'm happy to continue this discussion for wrapping anchor links, but by moving it outside of this PR, we can unblock the majority of the benefit that the rest of it brings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting that strings wrapped with anchor tags are not flagged, for this first improvement of the sniff.
Removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ockham Thanks for making those changes.
Sorry I've been quiet over the last week. I'm in full conference prep mode and will be traveling for the next two weeks, so won't have a chance to have another look until after that time. Though I haven't reviewed the latest version of the technical implementation, I do believe my concerns about the actual functionality have been addressed. I'll leave it up to the other committers to decide whether they feel confident merging this as it is now or whether they want to wait until I'm back. They have my blessing. |
Fixes the non-controversial part of #1419 by adding a sniff that detects translated strings wrapped in HTML tags, and a corresponding unit test. Quoting #1419 (comment):
Note that I had to add code to
I18nUnitTest.php
to reset the text domain as set viasetConfigData()
for one test file, as it would otherwise persist for every test file after that one.Based on a WordPress.com-internal check. /cc @akirk