Unify whitespace handling in linter and formatter #962

TTOzzi · 2025-03-14T15:28:18Z

Resolve #960

In Xcode, using certain unicode characters (such as U+2028) in code causes errors. Following the same logic, I modified the linter so that it no longer treats these characters as valid and instead emits a message prompting their removal. Comments containing these characters will also no longer cause crashes when lint.

To ensure scalability, I introduced a UnicodeException enum that maintains a collection of unprocessable unicode characters. When adding a new unexpected unicode character in the future, simply defining a new case and its corresponding utf8Bytes will allow it to be properly handled as an unexpected unicode character.

ahoppen

Maybe one high-level question: Why do we need to perform a preliminary validation pass over the input? Wouldn’t it be possible to resolve the crash at the point where it’s happening – I haven’t reproduced the issue myself, so this might not be possible, but I would like to explore that option.

TTOzzi · 2025-03-15T07:52:22Z

Maybe one high-level question: Why do we need to perform a preliminary validation pass over the input? Wouldn’t it be possible to resolve the crash at the point where it’s happening – I haven’t reproduced the issue myself, so this might not be possible, but I would like to explore that option.

My understanding is that WhitespaceLinter generates diagnostics by comparing the formatted text, processed by PrettyPrinter, with the original text, specifically focusing on removed whitespace characters.

During linting, it iterates through the text step by step, incrementing the index, and calls contiguousWhitespace to extract non-whitespace portions from both the original and formatted text.
Under normal circumstances, the non-whitespace content of both texts should be identical, meaning no issues should arise. However, in this issue, the original text contains more characters that are not recognized as whitespace compared to the formatted text. As a result, during iteration, formattedIndex eventually exceeds the valid range of formattedText, leading to an out-of-bounds issue.

Given this, I believe the current behavior of the linter is reasonable.

After reviewing your comment and re-examining the behavior, I found that the root cause is the discrepancy in whitespace detection criteria between PrettyPrinter and WhitespaceLinter.
When PrettyPrinter generates formattedText, it removes trailing whitespace using trimmingTrailingWhitespace, which detects whitespace at the String level. This process removes characters like U+2028.
However, WhitespaceLinter converts the string into [UTF8.CodeUnit] and checks for whitespace at the byte level. As a result, it fails to recognize U+2028 as a single whitespace character and instead processes its UTF-8 representation [0xE2, 0x80, 0xA8] as three separate units, failing to filter it out correctly.

IMO, the best approach would be to align the whitespace detection logic between PrettyPrinter and WhitespaceLinter 🤔
Let me revise this again!

TTOzzi · 2025-03-15T08:40:18Z

Sources/SwiftFormat/PrettyPrint/WhitespaceLinter.swift

+    // To ensure consistency with PrettyPrinter, the data is converted to a String
+    // before processing whitespace.
+    let substring = String(decoding: data[offset...], as: UTF8.self)
+    var stringIndex = substring.startIndex
+    while stringIndex < substring.endIndex, substring[stringIndex].isWhitespace {
+      substring.formIndex(after: &stringIndex)
    }
-    return data[offset..<whitespaceEnd]
+    let utf8Count = substring.utf8.distance(from: substring.startIndex, to: stringIndex)
+    return data[offset..<offset + utf8Count]


I modified the logic to determine whitespace to match the one used in PrettyPrinter.
However... after looking at #915 and running performance tests, I found that it became significantly slower 🫠

.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax before: 50321785572 after: 4392453228823

I think I'll need to look for an alternative approach.

TTOzzi · 2025-03-15T09:14:05Z

I modified the existing logic to better align with PrettyPrinter's isWhiteSpace without adding new diagnostics.
Still, it's slightly slower compared to the original implementation.

.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax

before: 50321785572
after: 54372214903

It seems that the trade-off is quite high compared to the problem we’re trying to address, so we might consider a workaround that simply prevents the crash.
Before calling contiguousWhitespace, we could add a guard like this:

guard userIndex < userText.endIndex, formattedIndex < formattedText.endIndex else { break }

This would prevent the crash, although it might result in diagnostics being generated oddly. Alternatively, we could replace the assert with a guard:

guard safeCodeUnit(at: userWhitespace.endIndex, in: userText)
      == safeCodeUnit(at: formattedWhitespace.endIndex, in: formattedText)
else { break }

In this case, if unexpected unicode characters cause a discrepancy between formattedText and userText, the linter would simply not run, working around the crash.

allevato · 2025-03-17T14:39:58Z

From some quick testing, it looks like this problem only repros when the U+2028 character is at the end of a line. That looks like it throws off all of the subsequent whitespace calculations by one line, causing it to crash at the end. If it's in the middle of a line, it executes fine. (At least, that's the only condition where I could cause it to crash.)

What's more interesting is that in format mode, something in the formatter is removing that U+2028, because it's not in the output. I didn't track down exactly where it's happening, though.

So I think this might be easier to solve. If we can figure out where the U+2028 is being dropped, perhaps we could just keep it? If someone wants to have one of those in a comment, the compiler doesn't care, so maybe we shouldn't either. Then, all the offsets will line up as expected.

It's likely that there's code somewhere that's trimming "whitespace" (where whitespace is defined in the Unicode sense), causing it to cover more possibilities than what whitespace linter expects. If we make that code only recognize ASCII whitespace, that might fix it.

TTOzzi · 2025-03-17T15:50:27Z

From some quick testing, it looks like this problem only repros when the U+2028 character is at the end of a line. That looks like it throws off all of the subsequent whitespace calculations by one line, causing it to crash at the end. If it's in the middle of a line, it executes fine. (At least, that's the only condition where I could cause it to crash.)

What's more interesting is that in format mode, something in the formatter is removing that U+2028, because it's not in the output. I didn't track down exactly where it's happening, though.

So I think this might be easier to solve. If we can figure out where the U+2028 is being dropped, perhaps we could just keep it? If someone wants to have one of those in a comment, the compiler doesn't care, so maybe we shouldn't either. Then, all the offsets will line up as expected.

It's likely that there's code somewhere that's trimming "whitespace" (where whitespace is defined in the Unicode sense), causing it to cover more possibilities than what whitespace linter expects. If we make that code only recognize ASCII whitespace, that might fix it.

You're right. The issue is that the formatter and linter have different definitions of whitespace. The formatter considers U+2028 as whitespace and removes it here.

I initially thought the linter should align with the formatter's behavior, so I tried addressing it here. But if, as you mentioned, there's no real need to remove U+2028, then it might make more sense to adjust the formatter to match the linter instead.

I'll revise the approach accordingly. Thanks for sharing your thoughts!

TTOzzi · 2025-03-17T16:40:51Z

Sources/SwiftFormat/PrettyPrint/Comment.swift

  func trimmingTrailingWhitespace() -> String {
    if isEmpty { return String() }
-    let scalars = unicodeScalars
-    var idx = scalars.index(before: scalars.endIndex)
-    while scalars[idx].properties.isWhitespace {
-      if idx == scalars.startIndex { return String() }
-      idx = scalars.index(before: idx)
+    let utf8Array = Array(utf8)
+    var idx = utf8Array.endIndex - 1
+    while utf8Array[idx].isWhitespace {
+      if idx == utf8Array.startIndex { return String() }
+      idx -= 1
+    }
+    return String(decoding: utf8Array[...idx], as: UTF8.self)
+  }
+}


I have updated the formatter's whitespace criteria to match the linter's.
The formatter no longer considers characters like U+2028 as whitespace, so they will no longer be removed.

This change has no significant impact on performance.

.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax before: 50321785572 after: 49557836028

allevato

Can you add a test or two to WhitespaceLintTests to make sure we don't accidentally regress this in the future? We don't need to worry about the situation where a character like U+2028 would appear outside of a comment (or a string literal I guess) because that would be a compiler error, but it would be good to have a test that puts one at the end of a comment line and make sure it doesn't blow up.

That does mean that if we had a comment like this:

// foo bar baz    <U+2028>

the whitespace after baz wouldn't be recognized as trailing whitespace. I guess that's "fine"; if we ever wanted to do something more Unicode-smart in the future, we'd want to do it holistically through the whole formatter.

TTOzzi · 2025-03-17T17:08:33Z

Can you add a test or two to WhitespaceLintTests to make sure we don't accidentally regress this in the future? We don't need to worry about the situation where a character like U+2028 would appear outside of a comment (or a string literal I guess) because that would be a compiler error, but it would be good to have a test that puts one at the end of a comment line and make sure it doesn't blow up.

That does mean that if we had a comment like this:
// foo bar baz    <U+2028>
the whitespace after baz wouldn't be recognized as trailing whitespace. I guess that's "fine"; if we ever wanted to do something more Unicode-smart in the future, we'd want to do it holistically through the whole formatter.

Yes, sounds good. I've added a test case to WhitespaceLintTests to demonstrate that the linter does not treat characters like U+2028 as whitespace.

allevato

Thanks!

ahoppen

Nice. Thank you @TTOzzi, this is a great solution and the improved performance is an amazing bonus 🚀

TTOzzi requested review from ahoppen, allevato, bnbarham and shawnhyam as code owners March 14, 2025 15:28

TTOzzi force-pushed the unicode-exception branch from e27cf40 to 6bca986 Compare March 14, 2025 15:33

ahoppen reviewed Mar 14, 2025

View reviewed changes

TTOzzi force-pushed the unicode-exception branch from 6bca986 to b80ef05 Compare March 15, 2025 08:19

TTOzzi commented Mar 15, 2025

View reviewed changes

TTOzzi force-pushed the unicode-exception branch from b80ef05 to a960372 Compare March 15, 2025 09:06

TTOzzi force-pushed the unicode-exception branch from a960372 to a991ece Compare March 17, 2025 16:33

TTOzzi commented Mar 17, 2025

View reviewed changes

TTOzzi changed the title ~~Handle unprocessable whitespace-related unicode characters~~ Unify whitespace handling in linter and formatter Mar 17, 2025

allevato reviewed Mar 17, 2025

View reviewed changes

Unify whitespace handling in linter and formatter

57ec53b

TTOzzi force-pushed the unicode-exception branch from a991ece to 57ec53b Compare March 17, 2025 17:06

allevato approved these changes Mar 17, 2025

View reviewed changes

allevato enabled auto-merge March 17, 2025 17:10

allevato merged commit 2d964e4 into swiftlang:main Mar 17, 2025
18 checks passed

ahoppen reviewed Mar 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unify whitespace handling in linter and formatter #962

Unify whitespace handling in linter and formatter #962

Uh oh!

TTOzzi commented Mar 14, 2025

Uh oh!

ahoppen left a comment

Uh oh!

TTOzzi commented Mar 15, 2025

Uh oh!

TTOzzi Mar 15, 2025 •

edited

Loading

Uh oh!

TTOzzi commented Mar 15, 2025 •

edited

Loading

Uh oh!

allevato commented Mar 17, 2025

Uh oh!

TTOzzi commented Mar 17, 2025

Uh oh!

TTOzzi Mar 17, 2025

Uh oh!

allevato left a comment

Uh oh!

TTOzzi commented Mar 17, 2025

Uh oh!

allevato left a comment

Uh oh!

Uh oh!

ahoppen left a comment

Uh oh!

Uh oh!

Unify whitespace handling in linter and formatter #962

Unify whitespace handling in linter and formatter #962

Uh oh!

Conversation

TTOzzi commented Mar 14, 2025

Uh oh!

ahoppen left a comment

Choose a reason for hiding this comment

Uh oh!

TTOzzi commented Mar 15, 2025

Uh oh!

TTOzzi Mar 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TTOzzi commented Mar 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

allevato commented Mar 17, 2025

Uh oh!

TTOzzi commented Mar 17, 2025

Uh oh!

TTOzzi Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

allevato left a comment

Choose a reason for hiding this comment

Uh oh!

TTOzzi commented Mar 17, 2025

Uh oh!

allevato left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ahoppen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TTOzzi Mar 15, 2025 •

edited

Loading

TTOzzi commented Mar 15, 2025 •

edited

Loading