Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify whitespace handling in linter and formatter #962

Merged
merged 1 commit into from
Mar 17, 2025

Conversation

TTOzzi
Copy link
Member

@TTOzzi TTOzzi commented Mar 14, 2025

Resolve #960

In Xcode, using certain unicode characters (such as U+2028) in code causes errors. Following the same logic, I modified the linter so that it no longer treats these characters as valid and instead emits a message prompting their removal. Comments containing these characters will also no longer cause crashes when lint.

To ensure scalability, I introduced a UnicodeException enum that maintains a collection of unprocessable unicode characters. When adding a new unexpected unicode character in the future, simply defining a new case and its corresponding utf8Bytes will allow it to be properly handled as an unexpected unicode character.

Copy link
Member

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe one high-level question: Why do we need to perform a preliminary validation pass over the input? Wouldn’t it be possible to resolve the crash at the point where it’s happening – I haven’t reproduced the issue myself, so this might not be possible, but I would like to explore that option.

@TTOzzi
Copy link
Member Author

TTOzzi commented Mar 15, 2025

Maybe one high-level question: Why do we need to perform a preliminary validation pass over the input? Wouldn’t it be possible to resolve the crash at the point where it’s happening – I haven’t reproduced the issue myself, so this might not be possible, but I would like to explore that option.

My understanding is that WhitespaceLinter generates diagnostics by comparing the formatted text, processed by PrettyPrinter, with the original text, specifically focusing on removed whitespace characters.

During linting, it iterates through the text step by step, incrementing the index, and calls contiguousWhitespace to extract non-whitespace portions from both the original and formatted text.
Under normal circumstances, the non-whitespace content of both texts should be identical, meaning no issues should arise. However, in this issue, the original text contains more characters that are not recognized as whitespace compared to the formatted text. As a result, during iteration, formattedIndex eventually exceeds the valid range of formattedText, leading to an out-of-bounds issue.

Given this, I believe the current behavior of the linter is reasonable.

After reviewing your comment and re-examining the behavior, I found that the root cause is the discrepancy in whitespace detection criteria between PrettyPrinter and WhitespaceLinter.
When PrettyPrinter generates formattedText, it removes trailing whitespace using trimmingTrailingWhitespace, which detects whitespace at the String level. This process removes characters like U+2028.
However, WhitespaceLinter converts the string into [UTF8.CodeUnit] and checks for whitespace at the byte level. As a result, it fails to recognize U+2028 as a single whitespace character and instead processes its UTF-8 representation [0xE2, 0x80, 0xA8] as three separate units, failing to filter it out correctly.

IMO, the best approach would be to align the whitespace detection logic between PrettyPrinter and WhitespaceLinter 🤔
Let me revise this again!

@TTOzzi TTOzzi force-pushed the unicode-exception branch from 6bca986 to b80ef05 Compare March 15, 2025 08:19
Comment on lines 343 to 351
// To ensure consistency with PrettyPrinter, the data is converted to a String
// before processing whitespace.
let substring = String(decoding: data[offset...], as: UTF8.self)
var stringIndex = substring.startIndex
while stringIndex < substring.endIndex, substring[stringIndex].isWhitespace {
substring.formIndex(after: &stringIndex)
}
return data[offset..<whitespaceEnd]
let utf8Count = substring.utf8.distance(from: substring.startIndex, to: stringIndex)
return data[offset..<offset + utf8Count]
Copy link
Member Author

@TTOzzi TTOzzi Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the logic to determine whitespace to match the one used in PrettyPrinter.
However... after looking at #915 and running performance tests, I found that it became significantly slower 🫠

.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax

before: 50321785572
after: 4392453228823

I think I'll need to look for an alternative approach.

@TTOzzi TTOzzi force-pushed the unicode-exception branch from b80ef05 to a960372 Compare March 15, 2025 09:06
@TTOzzi
Copy link
Member Author

TTOzzi commented Mar 15, 2025

I modified the existing logic to better align with PrettyPrinter's isWhiteSpace without adding new diagnostics.
Still, it's slightly slower compared to the original implementation.

.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax

before: 50321785572
after: 54372214903

It seems that the trade-off is quite high compared to the problem we’re trying to address, so we might consider a workaround that simply prevents the crash.
Before calling contiguousWhitespace, we could add a guard like this:

guard userIndex < userText.endIndex, formattedIndex < formattedText.endIndex else { break }

This would prevent the crash, although it might result in diagnostics being generated oddly. Alternatively, we could replace the assert with a guard:

guard safeCodeUnit(at: userWhitespace.endIndex, in: userText)
      == safeCodeUnit(at: formattedWhitespace.endIndex, in: formattedText)
else { break }

In this case, if unexpected unicode characters cause a discrepancy between formattedText and userText, the linter would simply not run, working around the crash.

@allevato
Copy link
Member

From some quick testing, it looks like this problem only repros when the U+2028 character is at the end of a line. That looks like it throws off all of the subsequent whitespace calculations by one line, causing it to crash at the end. If it's in the middle of a line, it executes fine. (At least, that's the only condition where I could cause it to crash.)

What's more interesting is that in format mode, something in the formatter is removing that U+2028, because it's not in the output. I didn't track down exactly where it's happening, though.

So I think this might be easier to solve. If we can figure out where the U+2028 is being dropped, perhaps we could just keep it? If someone wants to have one of those in a comment, the compiler doesn't care, so maybe we shouldn't either. Then, all the offsets will line up as expected.

It's likely that there's code somewhere that's trimming "whitespace" (where whitespace is defined in the Unicode sense), causing it to cover more possibilities than what whitespace linter expects. If we make that code only recognize ASCII whitespace, that might fix it.

@TTOzzi
Copy link
Member Author

TTOzzi commented Mar 17, 2025

From some quick testing, it looks like this problem only repros when the U+2028 character is at the end of a line. That looks like it throws off all of the subsequent whitespace calculations by one line, causing it to crash at the end. If it's in the middle of a line, it executes fine. (At least, that's the only condition where I could cause it to crash.)

What's more interesting is that in format mode, something in the formatter is removing that U+2028, because it's not in the output. I didn't track down exactly where it's happening, though.

So I think this might be easier to solve. If we can figure out where the U+2028 is being dropped, perhaps we could just keep it? If someone wants to have one of those in a comment, the compiler doesn't care, so maybe we shouldn't either. Then, all the offsets will line up as expected.

It's likely that there's code somewhere that's trimming "whitespace" (where whitespace is defined in the Unicode sense), causing it to cover more possibilities than what whitespace linter expects. If we make that code only recognize ASCII whitespace, that might fix it.

You're right. The issue is that the formatter and linter have different definitions of whitespace. The formatter considers U+2028 as whitespace and removes it here.

I initially thought the linter should align with the formatter's behavior, so I tried addressing it here. But if, as you mentioned, there's no real need to remove U+2028, then it might make more sense to adjust the formatter to match the linter instead.

I'll revise the approach accordingly. Thanks for sharing your thoughts!

@TTOzzi TTOzzi force-pushed the unicode-exception branch from a960372 to a991ece Compare March 17, 2025 16:33
Comment on lines 22 to +32
func trimmingTrailingWhitespace() -> String {
if isEmpty { return String() }
let scalars = unicodeScalars
var idx = scalars.index(before: scalars.endIndex)
while scalars[idx].properties.isWhitespace {
if idx == scalars.startIndex { return String() }
idx = scalars.index(before: idx)
let utf8Array = Array(utf8)
var idx = utf8Array.endIndex - 1
while utf8Array[idx].isWhitespace {
if idx == utf8Array.startIndex { return String() }
idx -= 1
}
return String(decoding: utf8Array[...idx], as: UTF8.self)
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the formatter's whitespace criteria to match the linter's.
The formatter no longer considers characters like U+2028 as whitespace, so they will no longer be removed.

This change has no significant impact on performance.

.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax

before: 50321785572
after: 49557836028

@TTOzzi TTOzzi changed the title Handle unprocessable whitespace-related unicode characters Unify whitespace handling in linter and formatter Mar 17, 2025
Copy link
Member

@allevato allevato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test or two to WhitespaceLintTests to make sure we don't accidentally regress this in the future? We don't need to worry about the situation where a character like U+2028 would appear outside of a comment (or a string literal I guess) because that would be a compiler error, but it would be good to have a test that puts one at the end of a comment line and make sure it doesn't blow up.

That does mean that if we had a comment like this:

// foo bar baz    <U+2028>

the whitespace after baz wouldn't be recognized as trailing whitespace. I guess that's "fine"; if we ever wanted to do something more Unicode-smart in the future, we'd want to do it holistically through the whole formatter.

@TTOzzi TTOzzi force-pushed the unicode-exception branch from a991ece to 57ec53b Compare March 17, 2025 17:06
@TTOzzi
Copy link
Member Author

TTOzzi commented Mar 17, 2025

Can you add a test or two to WhitespaceLintTests to make sure we don't accidentally regress this in the future? We don't need to worry about the situation where a character like U+2028 would appear outside of a comment (or a string literal I guess) because that would be a compiler error, but it would be good to have a test that puts one at the end of a comment line and make sure it doesn't blow up.

That does mean that if we had a comment like this:

// foo bar baz    <U+2028>

the whitespace after baz wouldn't be recognized as trailing whitespace. I guess that's "fine"; if we ever wanted to do something more Unicode-smart in the future, we'd want to do it holistically through the whole formatter.

Yes, sounds good. I've added a test case to WhitespaceLintTests to demonstrate that the linter does not treat characters like U+2028 as whitespace.

Copy link
Member

@allevato allevato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@allevato allevato enabled auto-merge March 17, 2025 17:10
@allevato allevato merged commit 2d964e4 into swiftlang:main Mar 17, 2025
18 checks passed
Copy link
Member

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thank you @TTOzzi, this is a great solution and the improved performance is an amazing bonus 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash: WhitespaceLinter.contiguousWhitespace(startingAt:in:)
3 participants