Unify whitespace handling in linter and formatter #962
e27cf40 to 6bca986
Maybe one high-level question: Why do we need to perform a preliminary validation pass over the input? Wouldn’t it be possible to resolve the crash at the point where it’s happening – I haven’t reproduced the issue myself, so this might not be possible, but I would like to explore that option.
My understanding is that
During linting, it iterates through the text step by step, incrementing the index, and calls
Given this, I believe the current behavior of the linter is reasonable.
After reviewing your comment and re-examining the behavior, I found that the root cause is the discrepancy in whitespace detection criteria between
IMO, the best approach would be to align the whitespace detection logic between PrettyPrinter and WhitespaceLinter 🤔
6bca986 to b80ef05
+ // To ensure consistency with PrettyPrinter, the data is converted to a String
+ // before processing whitespace.
+ let substring = String(decoding: data[offset...], as: UTF8.self)
+ var stringIndex = substring.startIndex
+ while stringIndex < substring.endIndex, substring[stringIndex].isWhitespace {
+   substring.formIndex(after: &stringIndex)
+ }
- return data[offset..<whitespaceEnd]
+ let utf8Count = substring.utf8.distance(from: substring.startIndex, to: stringIndex)
+ return data[offset..<offset + utf8Count]
I modified the logic to determine whitespace to match the one used in PrettyPrinter.
However... after looking at #915 and running performance tests, I found that it became significantly slower 🫠
.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax
before: 50321785572
after: 4392453228823
I think I'll need to look for an alternative approach.
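For reference, the String-based approach in the diff above can be sketched as a standalone function. This is only an illustration: the function name `leadingWhitespaceByteCount` and the use of `[UInt8]` for the buffer are assumptions, not the project's actual types. Note that it decodes the entire remainder of the buffer on every call, which is one plausible contributor to the slowdown measured above.

```swift
/// Sketch: counts the UTF-8 bytes of Unicode whitespace at the start of `bytes[offset...]`.
/// The name and the `[UInt8]` buffer type are assumptions for illustration.
func leadingWhitespaceByteCount(in bytes: [UInt8], from offset: Int) -> Int {
  // Decode the remainder as a String so the Unicode-aware `Character.isWhitespace` applies.
  let substring = String(decoding: bytes[offset...], as: UTF8.self)
  var index = substring.startIndex
  while index < substring.endIndex, substring[index].isWhitespace {
    substring.formIndex(after: &index)
  }
  // Convert the Character-based position back into a UTF-8 byte count.
  return substring.utf8.distance(from: substring.startIndex, to: index)
}

// Two spaces (1 byte each) followed by U+2028 (3 bytes in UTF-8), then a letter.
let sampleBytes = Array("  \u{2028}x".utf8)
let whitespaceByteCount = leadingWhitespaceByteCount(in: sampleBytes, from: 0)
```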
b80ef05 to a960372
I modified the existing logic to better align with
It seems that the trade-off is quite high compared to the problem we’re trying to address, so we might consider a workaround that simply prevents the crash.
guard userIndex < userText.endIndex, formattedIndex < formattedText.endIndex else { break }
This would prevent the crash, although it might result in diagnostics being generated oddly.
Alternatively, we could replace the
guard safeCodeUnit(at: userWhitespace.endIndex, in: userText)
  == safeCodeUnit(at: formattedWhitespace.endIndex, in: formattedText)
else { break }
In this case, if unexpected unicode characters cause a discrepancy between
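A bounds-checked lookup of the kind named in the second workaround could look like the sketch below. Only the name `safeCodeUnit(at:in:)` comes from the comment; its body, the `[UInt8]` parameter type, and the optional return value are assumptions.

```swift
/// Hypothetical helper: returns the code unit at `index`, or nil when `index` is out of bounds.
/// Only the name appears in the discussion; the signature and body are assumed.
func safeCodeUnit(at index: Int, in text: [UInt8]) -> UInt8? {
  text.indices.contains(index) ? text[index] : nil
}

let userBytes = Array("a \u{2028}".utf8)
let formattedBytes = Array("a".utf8)

// Reading at `endIndex` yields nil instead of trapping.
let pastEnd = safeCodeUnit(at: formattedBytes.endIndex, in: formattedBytes)
// An in-bounds read returns the byte (here, the ASCII space).
let inBounds = safeCodeUnit(at: 1, in: userBytes)
```

Comparing two optionals this way lets the loop exit cleanly when either text runs out, at the cost of possibly odd diagnostics, as noted above.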
From some quick testing, it looks like this problem only repros when the U+2028 character is at the end of a line. That looks like it throws off all of the subsequent whitespace calculations by one line, causing it to crash at the end. If it's in the middle of a line, it executes fine. (At least, that's the only condition where I could cause it to crash.)
What's more interesting is that in format mode, something in the formatter is removing that U+2028, because it's not in the output. I didn't track down exactly where it's happening, though.
So I think this might be easier to solve. If we can figure out where the U+2028 is being dropped, perhaps we could just keep it? If someone wants to have one of those in a comment, the compiler doesn't care, so maybe we shouldn't either. Then, all the offsets will line up as expected.
It's likely that there's code somewhere that's trimming "whitespace" (where whitespace is defined in the Unicode sense), causing it to cover more possibilities than what whitespace linter expects. If we make that code only recognize ASCII whitespace, that might fix it.
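The difference between the two notions of whitespace can be checked directly. In this sketch the ASCII set (space, tab, LF, CR) is an assumption for illustration; the linter's actual set may differ.

```swift
let lineSeparator: Unicode.Scalar = "\u{2028}"

// Unicode sense: U+2028 carries the White_Space property.
let isUnicodeWhitespace = lineSeparator.properties.isWhitespace

// ASCII-only check. The exact set (space, tab, LF, CR) is assumed here.
func isASCIIWhitespace(_ byte: UInt8) -> Bool {
  switch byte {
  case UInt8(ascii: " "), UInt8(ascii: "\t"), UInt8(ascii: "\n"), UInt8(ascii: "\r"):
    return true
  default:
    return false
  }
}

// U+2028 encodes as three non-ASCII bytes, so a byte-wise ASCII check never matches it.
let separatorBytes = Array(String(lineSeparator).utf8)
let matchesASCIIWhitespace = separatorBytes.contains(where: isASCIIWhitespace)
```

Code that trims by the first definition removes the character, while code that scans by the second keeps it, so byte offsets computed on either side stop lining up.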
You're right. The issue is that the formatter and linter have different definitions of whitespace. The formatter considers
I initially thought the linter should align with the formatter's behavior, so I tried addressing it here. But if, as you mentioned, there's no real need to remove
I'll revise the approach accordingly. Thanks for sharing your thoughts!
a960372 to a991ece
   func trimmingTrailingWhitespace() -> String {
     if isEmpty { return String() }
-    let scalars = unicodeScalars
-    var idx = scalars.index(before: scalars.endIndex)
-    while scalars[idx].properties.isWhitespace {
-      if idx == scalars.startIndex { return String() }
-      idx = scalars.index(before: idx)
+    let utf8Array = Array(utf8)
+    var idx = utf8Array.endIndex - 1
+    while utf8Array[idx].isWhitespace {
+      if idx == utf8Array.startIndex { return String() }
+      idx -= 1
     }
+    return String(decoding: utf8Array[...idx], as: UTF8.self)
   }
 }
I have updated the formatter's whitespace criteria to match the linter's.
The formatter no longer considers characters like U+2028 as whitespace, so they will no longer be removed.
This change has no significant impact on performance.
.build/release/swift-format lint --measure-instructions --recursive /tmp/swift-syntax
before: 50321785572
after: 49557836028
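A self-contained sketch of trimming on UTF-8 bytes is shown below. It is not the project's code: the method name `trimmingTrailingASCIIWhitespace`, the helper `isASCIIWhitespaceByte`, and the chosen ASCII set (space, tab, LF, CR) are assumptions made so the example runs on its own.

```swift
// Assumed ASCII whitespace set: space, tab, LF, CR.
func isASCIIWhitespaceByte(_ byte: UInt8) -> Bool {
  byte == 0x20 || byte == 0x09 || byte == 0x0A || byte == 0x0D
}

extension String {
  /// Sketch: removes trailing ASCII whitespace only, leaving characters such as U+2028 in place.
  func trimmingTrailingASCIIWhitespace() -> String {
    let bytes = Array(utf8)
    var end = bytes.endIndex
    // Walk backwards over whole bytes; multi-byte sequences never match the ASCII set.
    while end > bytes.startIndex, isASCIIWhitespaceByte(bytes[end - 1]) {
      end -= 1
    }
    return String(decoding: bytes[..<end], as: UTF8.self)
  }
}

let trimmedPlain = "// foo  \t".trimmingTrailingASCIIWhitespace()
let trimmedWithSeparator = "// foo bar baz \u{2028}".trimmingTrailingASCIIWhitespace()
```

Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, the backward scan stops at U+2028, and any ASCII space before it is kept as well.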
Can you add a test or two to WhitespaceLintTests to make sure we don't accidentally regress this in the future? We don't need to worry about the situation where a character like U+2028 would appear outside of a comment (or a string literal I guess) because that would be a compiler error, but it would be good to have a test that puts one at the end of a comment line and make sure it doesn't blow up.
That does mean that if we had a comment like this:
// foo bar baz <U+2028>
the whitespace after baz wouldn't be recognized as trailing whitespace. I guess that's "fine"; if we ever wanted to do something more Unicode-smart in the future, we'd want to do it holistically through the whole formatter.
a991ece to 57ec53b
Yes, sounds good. I've added a test case to
Thanks!
Nice. Thank you @TTOzzi, this is a great solution and the improved performance is an amazing bonus 🚀
Resolve #960
In Xcode, using certain Unicode characters (such as U+2028) in code causes errors. Following the same logic, I modified the linter so that it no longer treats these characters as valid and instead emits a message prompting their removal. Comments containing these characters will also no longer cause crashes when linting.
To ensure scalability, I introduced a UnicodeException enum that maintains a collection of unprocessable Unicode characters. When adding a new unexpected Unicode character in the future, simply defining a new case and its corresponding utf8Bytes will allow it to be properly handled as an unexpected Unicode character.
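As a rough illustration of the design described here, such an enum might look like the sketch below. Only the names `UnicodeException` and `utf8Bytes` come from the description; the cases, the `CaseIterable` conformance, and the scanning function `unexpectedUnicodeOffsets(in:)` are hypothetical.

```swift
/// Hypothetical sketch; only `UnicodeException` and `utf8Bytes` are named in the description.
enum UnicodeException: CaseIterable {
  case lineSeparator       // U+2028
  case paragraphSeparator  // U+2029 (assumed extra case, for illustration only)

  /// The UTF-8 encoding of the character this case stands for.
  var utf8Bytes: [UInt8] {
    switch self {
    case .lineSeparator: return [0xE2, 0x80, 0xA8]
    case .paragraphSeparator: return [0xE2, 0x80, 0xA9]
    }
  }
}

/// Hypothetical scan: returns the byte offsets at which any listed character starts.
func unexpectedUnicodeOffsets(in text: [UInt8]) -> [Int] {
  var offsets: [Int] = []
  for start in text.indices {
    for exception in UnicodeException.allCases {
      let bytes = exception.utf8Bytes
      if start + bytes.count <= text.count,
        text[start..<start + bytes.count].elementsEqual(bytes)
      {
        offsets.append(start)
      }
    }
  }
  return offsets
}

let sourceBytes = Array("// foo\u{2028}\nlet x = 1".utf8)
let foundOffsets = unexpectedUnicodeOffsets(in: sourceBytes)
```

Under this shape, supporting another character would mean adding one case and its byte sequence, which matches the extensibility goal stated in the description.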