NonASCIICharacterChecker should inspect Token.rawText, not Token.text #274

latkin · 2017-08-22T17:55:49Z

Fixes #273

 * @param text -- the text associated with the token after unicode escaping
 * @param rawText -- the text associated with the token before unicode escaping

NonASCIICharacterChecker currently looks at text (source code with escape sequences applied); it should look at rawText (source code in raw form).
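
To illustrate the difference, here is a minimal sketch. It assumes a hypothetical `Tok` case class that carries only the two fields described above; it is not the real token class, and `isAscii` is an illustrative helper rather than the checker's actual code. The escaped and decoded strings are built from code points so the example does not depend on how the compiler treats escapes in source.

```scala
// Hypothetical stand-in for the lexer token: only the two fields discussed here.
final case class Tok(text: String, rawText: String)

object RawTextSketch {
  // True when every UTF-16 code unit is in the 7-bit ASCII range.
  def isAscii(s: String): Boolean = s.forall(_ <= 127)

  // Build the six-character escapes from a backslash char (code 92),
  // so no escape processing can happen in this source file.
  val backslash: Char = 92.toChar
  val escaped: String = s"${backslash}ud83c${backslash}udf4e"

  // The character the escapes decode to (code point U+1F34E).
  val decoded: String = new String(Character.toChars(0x1F34E))

  // A string-literal token as it would look before and after unescaping.
  val tok: Tok = Tok(text = "\"" + decoded + "\"", rawText = "\"" + escaped + "\"")

  def main(args: Array[String]): Unit = {
    println(isAscii(tok.text))    // false: the decoded text is non-ASCII
    println(isAscii(tok.rawText)) // true: the raw source is plain ASCII
  }
}
```

Under these assumptions, a check on `text` flags a literal whose source is entirely ASCII, while a check on `rawText` does not.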

codecov-io · 2017-08-22T18:04:36Z

Codecov Report

Merging #274 into master will not change coverage.
The diff coverage is 0%.

@@          Coverage Diff          @@
##           master   #274   +/-   ##
=====================================
  Coverage       0%     0%           
=====================================
  Files          59     59           
  Lines        1451   1451           
  Branches      142    139    -3     
=====================================
  Misses       1451   1451

Impacted Files	Coverage Δ
...lastyle/scalariform/NonASCIICharacterChecker.scala	`0% <0%> (ø)`	⬆️
...scalastyle/scalariform/AbstractMethodChecker.scala	`0% <0%> (ø)`	⬆️
src/main/scala/org/scalastyle/Checker.scala	`0% <0%> (ø)`	⬆️
src/main/scala/org/scalastyle/Output.scala	`0% <0%> (ø)`	⬆️
...calastyle/scalariform/CovariantEqualsChecker.scala	`0% <0%> (ø)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36bb802...405fecc.

latkin · 2017-08-22T18:40:36Z

src/test/scala/org/scalastyle/scalariform/NonASCIICharacterCheckerTest.scala

+                   |// non-ascii in string via unicode escape - ok
+                   |class OK {
+                   |  val s = "%s"
+                   |}""".stripMargin.format("\\ud83c\\udf4e")


If somebody knows how to correctly produce the desired string in a cleaner way I'd be happy to update.

I found that using \ud83c\udf4e in a triple-quote string resulted in 🍎 (i.e. the escape was applied), and using \\ud83c\\udf4e resulted in \\ud83c\\udf4e (i.e. the double backslashes are left in literal form). I'm unsure how to cleanly get the desired outcome of a literal \ud83c\udf4e within a triple-quote string.
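
One possible alternative to the `format` call, offered as a sketch rather than the approach used in this PR: build the escape text from a backslash character and splice it in with the `s` interpolator. The object and value names here are hypothetical. Because the source file then contains no backslash at all, the result should not depend on how a given Scala version handles escapes inside triple-quoted strings.

```scala
object EscapeFixtureSketch {
  // Backslash built from its character code (92), so this file has no escapes.
  val backslash: Char = 92.toChar

  // The twelve literal characters of the two escape sequences.
  val appleEscape: String = s"${backslash}ud83c${backslash}udf4e"

  // Test fixture whose string literal contains the escape text verbatim.
  val source: String =
    s"""class OK {
       |  val s = "$appleEscape"
       |}""".stripMargin

  def main(args: Array[String]): Unit = println(source)
}
```

The fixture stays pure ASCII, which is the case the new test is meant to exercise.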

marconilanna · 2017-10-10T21:06:29Z

@latkin Do you think it would be possible to add an option to allow international characters in string literals?

While I agree that both

val s = "🍎"

case "value" ⇒ println("matched")

are bad, when writing in a language other than English non-ASCII characters in string literals are needed:

val greeting = "olá"

A regex like [\p{Alnum}\p{Punct}] should probably be sufficient for most cases.

Please note that I'm suggesting it only for string literals, not for identifiers.
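
A sketch of how such a character class behaves, assuming it is evaluated with `java.util.regex`: there, `\p{Alnum}` and `\p{Punct}` cover only ASCII unless the `Pattern.UNICODE_CHARACTER_CLASS` flag is set, so the flag matters for a word like "olá". The object and value names are hypothetical, and the trailing `*` is added here only so that a whole string can be matched.

```scala
import java.util.regex.Pattern

object LiteralCharsetSketch {
  // The suggested class, repeated so it can match an entire string.
  val regex = "[\\p{Alnum}\\p{Punct}]*"

  // Default mode: the POSIX classes are ASCII-only.
  val asciiOnly: Pattern = Pattern.compile(regex)

  // Unicode mode: the same classes follow Unicode properties.
  val unicodeAware: Pattern = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)

  // Sample inputs built from code points to keep this file ASCII.
  val greeting: String = "ol" + 0xE1.toChar
  val apple: String = new String(Character.toChars(0x1F34E))

  def accepts(p: Pattern, s: String): Boolean = p.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    println(accepts(asciiOnly, greeting))    // false
    println(accepts(unicodeAware, greeting)) // true
    println(accepts(unicodeAware, apple))    // false
  }
}
```

With the flag set, accented letters pass while the emoji is still rejected, which matches the intent of the suggestion.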

matthewfarwell · 2017-10-11T17:52:05Z

Hi,

Thanks for this. If you do a squash, and rebase onto master, I'll merge this.

latkin · 2017-10-11T18:47:44Z

@marconilanna that is certainly a reasonable request, but this PR does not aim to change the rule's functionality. It only aims to correct a bug in how the rule (strict ASCII or otherwise) is applied. I will leave it to project owners whether to adjust the restrictions, add an alternative rule, etc.

@matthewfarwell will take care of that shortly, thanks

It is reasonable to enforce a rule that prevents non-ASCII text from appearing directly in source code. However, the current implementation also flags uses of unicode escape sequences, which consist of only ASCII chars (e.g. \ud83c\udf4e). NonASCIICharacterChecker should inspect Token.rawText, which represents the literal source prior to applying unicode escapes. Token.text, which is currently being used, already has unicode escapes applied, and thus doesn't represent the actual content of the source code.

latkin · 2017-10-11T20:35:43Z

@matthewfarwell done - squashed to 1 commit and rebased to latest master

matthewfarwell · 2017-10-12T06:40:32Z

Cool. Thanks!

latkin force-pushed the latkin-fix-escaped-nonascii branch from 57413d2 to 405fecc Compare October 11, 2017 20:34

matthewfarwell merged commit 93acfba into scalastyle:master Oct 12, 2017

latkin deleted the latkin-fix-escaped-nonascii branch October 12, 2017 17:09
