Update 5.9 with state of swift/main #652

milseman · 2023-04-05T13:35:51Z

Atomically load the lowered program (Atomically load the lowered program #610)

Since we're atomically initializing the compiled program in Regex.Program, we need to pair that with an atomic load.

Resolves #609.

Add tests for line start/end word boundary diffs (Add tests for line start/end word boundary diffs #616)

The default and simple word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613.

Add tweaks for Android
Fix documentation typo (Fix documentation typo #615)
Fix abstract for Regex.dotMatchesNewlines(_:). (Fix abstract for Regex.dotMatchesNewlines(_:) #614)

The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it.

Remove RegexConsumer and fix its dependencies (Remove RegexConsumer and fix its dependencies #617)
Remove RegexConsumer and fix its dependencies

This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier.

rdar://102841216

Remove remaining from-end algorithm methods

This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely.

Improve StringProcessing and RegexBuilder documentation (Improve StringProcessing and RegexBuilder documentation #611)

This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples.

Set availability for inverted character class test (Set availability for inverted character class test #621)

This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available.

Add type annotations in RegexBuilder tests

These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression.

Workarounds for rdar://104881395 and rdar://104645543

Workaround for fileprivate array issue

A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type.

Fix an issue where named character classes weren't getting converted in the result builder. rdar://104480703
Stop at end of search string in TwoWaySearcher (Stop at end of search string in TwoWaySearcher #631)

When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a limitedBy: parameter to that index movement, avoiding the invalid movement.

Fixes rdar://105154010

Correct misspelling in DSL renderer (Correct misspelling in DSL renderer #627)

vertial -> vertical

rdar://104602317

Fix output type mismatch with RegexBuilder (Fix output type mismatch with RegexBuilder #626)

Some regex literals (and presumably other Regex instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored.

Regex internals don't make this distinction, however, so there ends up being a mismatch between what a Regex.Match instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because regex is a Regex<Substring> but the match tries to produce a (Substring, number: Substring):

let regex = Regex {
    ZeroOrMore(.whitespace)
    /:(?<number>\d+):/
    ZeroOrMore(.whitespace)
}
let match = try regex.wholeMatch(in: " :21: ")
print(match!.output)

To fix this, we add a new ignoreCapturesInTypedOutput DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to AnyRegexOutput.

let anyOutput = AnyRegexOutput(match)
// anyOutput[1] == "21"
// anyOutput["number"] == Optional("21")

Fixes #625. rdar://104823356

Note: Linux seems to crash on different tests when the two customTest overloads have internal visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue.

Revert "Merge pull request Add type annotations in RegexBuilder tests #628 from apple/result_builder_changes_workaround"

This reverts commit 7e059b7, reversing changes made to 3ca8b13.

Use some syntax in variadics

This supports a type checker fix after the change in how result builder closure parameters are type-checked.

Type checker workaround: adjust test
Further refactor to work around type checker regression
Align availability macro with OS versions (Update OS versions aliases #641)
Speed up general character class matching (Speed up general character class matching #642)

Short-circuit Character.isASCII checks inside built in character class matching.

Also, make benchmark try a few more times before giving up.

Test for \s matching CRLF when scalar matching (Test for \s matching CRLF when scalar matching #648)
General ascii fast paths for character classes (General ascii fast paths for character classes #644)

General ASCII fast-paths for builtin character classes

Remove the unsupported anyScalar case (Remove the unsupported anyScalar case #650)

We decided not to support the anyScalar character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible.

The DSLTree.Atom.CharacterClass enum is left unchanged, since it is marked @_spi(RegexBuilder) public. Any use of that enum case is handled with a fatalError("Unsupported"), and it isn't produced on any code path.

* Atomically load the lowered program (swiftlang#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves swiftlang#609. * Add tests for line start/end word boundary diffs (swiftlang#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue swiftlang#613. * Add tweaks for Android * Fix documentation typo (swiftlang#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (swiftlang#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (swiftlang#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (swiftlang#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (swiftlang#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (swiftlang#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (swiftlang#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (swiftlang#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes swiftlang#625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request swiftlang#628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (swiftlang#641) * Speed up general character class matching (swiftlang#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (swiftlang#648) * General ascii fast paths for character classes (swiftlang#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (swiftlang#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

milseman · 2023-04-05T13:36:13Z

@swift-ci please test

stephentyrone · 2023-04-05T14:54:25Z

Sources/VariadicsGenerator/VariadicsGenerator.swift

@@ -14,7 +14,7 @@
 import ArgumentParser
 #if os(macOS)


Seems like this should be #if os(Darwin), even though we're not running it on any non-mac Darwin targets at present.

(obviously, if we're gonna fix this, fix it on main first)

@natecook1000 thoughts?

stephentyrone · 2023-04-05T14:57:17Z

Sources/_StringProcessing/Unicode/ASCII.swift

+//
+//===----------------------------------------------------------------------===//
+
+private var _lineFeed: UInt8 { 0x0A }


These would maybe be more readable as UInt8(ascii: "\n") etc. Mostly a matter of taste, though.

stephentyrone · 2023-04-05T14:58:25Z

Sources/_StringProcessing/Unicode/ASCII.swift

+  /// Assuming we're ASCII, whether we match `\d`
+  var _asciiIsDigit: Bool {
+    assert(_isASCII)
+    return(_0..._9).contains(self)


Why do we define _isASCIINumber if not to use it here?

Ah, that was a remnant from a prior refactoring. I'll remove _isASCIINumber

stephentyrone · 2023-04-05T15:06:09Z

Sources/_StringProcessing/Unicode/ASCII.swift

+
+  /// Assuming we're ASCII, whether we match `\s`
+  var _asciiIsWhitespace: Bool {
+    assert(_isASCII)


I expect that we'd get better codegen out of something like:

if self > 0x20 { return false } // all whitespace is 0x20 or lower return 0x2e00 &>> self & 1 == 1 // table lookup for whitespace chars

Ideally LLVM would be clever enough to do this for us, but it isn't.

For ASCII character classes, would it be better to have 128-bit bit vectors that we lookup the corresponding ASCII value in?

128b bit vectors are a bit fussier; since there's no 128b bitwise shifts or comparisons on x86 or arm. Where we can get away with 32b or 64b bit vectors (like here), that's usually a win.

Where we need to have all 128b, yeah. It should boil down to shift, indexed load, shift, and. But in source we have to fuss about endianness, which kinda gross. I find this more readable. That said, we'll have to deal with it some places anyway, so we might as well establish a clean pattern for it.

stephentyrone · 2023-04-05T15:25:33Z

Sources/_StringProcessing/Unicode/ASCII.swift

+  /// Assuming we're ASCII, whether we match `[a-zA-Z]`
+  var _asciiIsLetter: Bool {
+    assert(_isASCII)
+    return (_a..._z).contains(self) || (_A..._Z).contains(self)


Check codgen for this; it should be just an AND or OR or BIC, followed by a subtract and compare. If not, we might need to rewrite it that way explicitly.

Would that be better than a bit vector?

These improvements are low priority as we also have such overwhelming inefficiencies elsewhere that time spent inside _quickMatch is close to negligible. However, I will definitely file something to track future improvements here.

Yes, better than a bit vector.

milseman · 2023-04-05T18:25:59Z

Testing passed (swiftlang/swift#64937). @stephentyrone good to merge to 5.9 and do these fixes on main?

milseman requested a review from natecook1000 April 5, 2023 13:35

milseman mentioned this pull request Apr 5, 2023

[5.9] DO NOT MERGE: Testing for Regex swiftlang/swift#64937

Closed

milseman requested a review from stephentyrone April 5, 2023 14:23

stephentyrone reviewed Apr 5, 2023

View reviewed changes

stephentyrone approved these changes Apr 5, 2023

View reviewed changes

milseman merged commit 02f6b71 into swiftlang:swift/release/5.9 Apr 5, 2023

milseman deleted the update_5_9 branch April 5, 2023 21:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update 5.9 with state of swift/main #652

Update 5.9 with state of swift/main #652

Uh oh!

milseman commented Apr 5, 2023

Uh oh!

milseman commented Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023

Uh oh!

milseman Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023

Uh oh!

milseman Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023 •

edited

Loading

Uh oh!

milseman Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023 •

edited

Loading

Uh oh!

stephentyrone Apr 5, 2023

Uh oh!

milseman Apr 5, 2023

Uh oh!

stephentyrone Apr 5, 2023

Uh oh!

milseman commented Apr 5, 2023

Uh oh!

Uh oh!

Update 5.9 with state of swift/main #652

Update 5.9 with state of swift/main #652

Uh oh!

Conversation

milseman commented Apr 5, 2023

Uh oh!

milseman commented Apr 5, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stephentyrone Apr 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stephentyrone Apr 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

milseman commented Apr 5, 2023

Uh oh!

Uh oh!

stephentyrone Apr 5, 2023 •

edited

Loading

stephentyrone Apr 5, 2023 •

edited

Loading