Fix output type mismatch with RegexBuilder #626

Merged: 10 commits, Feb 9, 2023

Conversation

natecook1000
Member

Some regex literals (and presumably other Regex instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored.

Regex internals don't make this distinction, however, so there ends up being a mismatch between what a Regex.Match instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because regex is a Regex<Substring> but the match tries to produce a (Substring, number: Substring):

    let regex = Regex {
        ZeroOrMore(.whitespace)
        /:(?<number>\d+):/
        ZeroOrMore(.whitespace)
    }
    let match = try regex.wholeMatch(in: " :21: ")
    print(match!.output)

To fix this, we add a new ignoreCapturesInTypedOutput DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to AnyRegexOutput.

    let anyOutput = AnyRegexOutput(match!)
    // anyOutput[1] == "21"
    // anyOutput["number"] == Optional("21")
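
As a rough illustration of the idea (a toy model, not the library's actual match storage; `ToyCapture`, `ToyMatch`, and `visibleInTypedOutput` are hypothetical names), each capture can carry a flag saying whether it takes part in the typed output, while the type-erased view still exposes every capture:

```swift
// Toy model only: these types are assumptions for illustration and do not
// mirror the real StringProcessing internals.
struct ToyCapture {
    var name: String?
    var value: Substring
    // False when the builder overload discarded this capture's type.
    var visibleInTypedOutput: Bool
}

struct ToyMatch {
    var whole: Substring
    var captures: [ToyCapture]

    // Elements used to construct the statically typed output.
    var typedOutput: [Substring] {
        [whole] + captures.filter { $0.visibleInTypedOutput }.map { $0.value }
    }

    // Type-erased view: every capture stays reachable by position.
    var erasedOutput: [Substring] {
        [whole] + captures.map { $0.value }
    }

    // Type-erased view: every named capture stays reachable by name.
    subscript(name: String) -> Substring? {
        captures.first { $0.name == name }?.value
    }
}

let input = " :21: "
let digits = input.dropFirst(2).prefix(2)  // the "21" between the colons

let match = ToyMatch(
    whole: Substring(input),
    captures: [ToyCapture(name: "number", value: digits, visibleInTypedOutput: false)]
)

print(match.typedOutput.count)   // 1: only the whole match, like Regex<Substring>
print(match.erasedOutput.count)  // 2: the hidden capture is still stored
```

Keeping the capture in storage, rather than dropping it, is what lets back-references and the type-erased output keep working in this sketch.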

Fixes #625. rdar://104823356

@natecook1000
Member Author

@swift-ci Please test

@@ -571,7 +571,11 @@ extension RegexComponentBuilder {
     accumulated: R0, next: R1
   ) -> Regex<Substring> where R0.RegexOutput == W0 {
     let factory = makeFactory()
-    return factory.accumulate(accumulated, next)
+    if #available(macOS 9999, iOS 9999, watchOS 9999, tvOS 9999, *) {

Contributor

Does if #available(SwiftStdlib 5.8, *) work?

Member Author

No, that isn't allowed in an always-emit function.

@@ -650,7 +680,7 @@ extension DSLTree.Node {
   /// output but forwarding its only child's output.
   var isOutputForwarding: Bool {
     switch self {
-    case .nonCapturingGroup:
+    case .nonCapturingGroup, .ignoreCapturesInTypedOutput:

Contributor

I wonder whether wholeMatchType would ever return the wrong type with this change.

It's probably not output-forwarding because it doesn't always have the same output as the child.

@natecook1000
Member Author

@swift-ci Please test

Member

milseman left a comment

Overall LGTM after a quick review

+      return factory.accumulate(accumulated, factory.ignoreCapturesInTypedOutput(next))
+    } else {
+      return factory.accumulate(accumulated, next)
+    }

Member

How about a helper function that will reduce spew and, more importantly, give a name and place to comment on why we're doing this and how it differs across versions?

E.g.

// comment...
private func dropCaptures(_ next: ...) -> ... {
  // ... comment about old and new behavior
  if #available(...) {
   return  ...
  }
  return ...
}

...

return factory.accumulate(accumulated, dropCaptures(next))

(or some better name than dropCaptures)

Member Author

That's a good call. I only wrote this once, but it is codegen'd an awful lot.

child.regex.root.hasChildNodes
? .init(node: .ignoreCapturesInTypedOutput(child.regex.root))
: .init(node: child.regex.root)
}

Member

Why is this has-child instead of has-capture? Do we need to wrap this around everything, or is this just part of how capture concatenation works?

Member Author

This branch isn't technically necessary, but it reduces the number of times we apply this wrapper node. I didn't want to do a hasCapture check, since that would have to search the whole tree. (But maybe it would be worth it to produce a smaller tree.)
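
The trade-off between the two checks can be sketched with a toy tree (a minimal sketch; `Node`, `wrapIfNeeded`, and the cases other than `ignoreCapturesInTypedOutput` are hypothetical stand-ins, not the real `DSLTree` definitions). `hasChildNodes` only inspects the root case, while `hasCapture` may have to walk the whole tree:

```swift
// Toy stand-in for a regex tree; names and cases are assumptions for illustration.
indirect enum Node: Equatable {
    case atom(Character)
    case capture(Node)
    case concatenation([Node])
    case ignoreCapturesInTypedOutput(Node)

    // Constant time: looks only at the root case.
    var hasChildNodes: Bool {
        switch self {
        case .atom:
            return false
        case .capture, .concatenation, .ignoreCapturesInTypedOutput:
            return true
        }
    }

    // Potentially linear in the tree size: recurses until a capture is found.
    var hasCapture: Bool {
        switch self {
        case .atom:
            return false
        case .capture:
            return true
        case .concatenation(let children):
            return children.contains { $0.hasCapture }
        case .ignoreCapturesInTypedOutput(let child):
            return child.hasCapture
        }
    }
}

// Wraps any root that has children, even if no capture is present below it.
func wrapIfNeeded(_ root: Node) -> Node {
    root.hasChildNodes ? .ignoreCapturesInTypedOutput(root) : root
}

let leaf = Node.atom("a")
let plain = Node.concatenation([.atom("a"), .atom("b")])
let capturing = Node.concatenation([.atom("a"), .capture(.atom("b"))])

print(wrapIfNeeded(leaf) == leaf)  // true: a leaf is left untouched
print(plain.hasCapture)            // false, yet wrapIfNeeded still wraps it
print(capturing.hasCapture)        // true
```

In this sketch the cheap check sometimes adds a wrapper that a full `hasCapture` search would have avoided, which is the "smaller tree" cost mentioned above.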

Member

Hmm... one of the downsides of using enums for trees is that it's really hard to just add a bit to track hasCapture information.

This looks fine to me for now; the extra nodes in the tree should be transient, though we might want to also track the time spent allocating each node.

Following on @rxwei's note about wholeMatchType, I found some more
instances where the DSLTree-generated output type doesn't match what
the builder overloads produce. (In particular, when the "noncompliant"
regex component is the first one in the tree.) This change catches
those as well, and includes some additional tests for those cases.
@natecook1000
Member Author

@swift-ci Please test

@natecook1000
Member Author

@swift-ci Please test macOS platform

Linux seems to crash on different tests when the two customTest
overloads have `internal` visibility or are called. Temporarily skipping
those tests on Linux while I try to reduce the problem further (which
doesn't make much sense to me).
@natecook1000
Member Author

@swift-ci Please test

Seems to be happening on macOS, too.
@natecook1000
Member Author

@swift-ci Please test

We've had availability turned off for RegexBuilderTests (presumably
for convenience), but then we can't test functionality that depends
on availability, like the fix for RegexBuilder output type mismatches.
@natecook1000
Member Author

@swift-ci Please test

This should allow the test to take advantage of availability
@natecook1000
Member Author

@swift-ci Please test

@natecook1000 natecook1000 merged commit 7756942 into swiftlang:main Feb 9, 2023
@natecook1000 natecook1000 deleted the labeled_captures_dsl branch February 9, 2023 19:49
natecook1000 added a commit to natecook1000/swift-experimental-string-processing that referenced this pull request Feb 9, 2023
Note: Linux seems to crash on different tests when the two customTest
overloads have `internal` visibility or are called. Switching one of the
functions to be generic over a RegexComponent works around the issue.
milseman added a commit that referenced this pull request Apr 5, 2023
* Atomically load the lowered program (#610)

Since we're atomically initializing the compiled program in
`Regex.Program`, we need to pair that with an atomic load.

Resolves #609.

* Add tests for line start/end word boundary diffs (#616)

The `default` and `simple` word boundaries have different behaviors
at the start and end of strings/lines. These tests validate that we
have the correct behavior implemented. Related to issue #613.

* Add tweaks for Android

* Fix documentation typo (#615)

* Fix abstract for Regex.dotMatchesNewlines(_:). (#614)

The old version looks like it was accidentally duplicated from
anchorsMatchLineEndings(_:) just below it.

* Remove `RegexConsumer` and fix its dependencies (#617)

* Remove `RegexConsumer` and fix its dependencies

This eliminates the RegexConsumer type and rewrites its users to call
through to other, existing functionality on Regex or in the Algorithms
implementations. RegexConsumer doesn't take account of the dual
subranges required for matching, so it can produce results that are
inconsistent with matches(of:) and ranges(of:), which were rewritten
earlier.

rdar://102841216

* Remove remaining from-end algorithm methods

This removes methods that are left over from when we were considering
from-end algorithms. These aren't tested and may not have the correct
semantics, so it's safer to remove them entirely.

* Improve StringProcessing and RegexBuilder documentation (#611)

This includes documentation improvements for core types/methods,
RegexBuilder types along with their generated variadic initializers,
and adds some curation. It also includes tests of the documentation
code samples.

* Set availability for inverted character class test (#621)

This feature depends on running with a Swift 5.7 stdlib, and fails
when that isn't available.

* Add type annotations in RegexBuilder tests

These changes work around a change to the way result builders are
compiled that removes the ability for result builder closure outputs
to affect the overload resolution elsewhere in an expression.

Workarounds for rdar://104881395 and rdar://104645543

* Workaround for fileprivate array issue

A recent compiler change results in fileprivate arrays sometimes
not keeping their buffers around long enough. This change avoids that
issue by removing the fileprivate annotations from the affected type.

* Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703>

* Stop at end of search string in TwoWaySearcher (#631)

When searching for a substring that doesn't exist, it was possible
for TwoWaySearcher to advance beyond the end of the search string,
causing a crash. This change adds a `limitedBy:` parameter to that
index movement, avoiding the invalid movement.

Fixes rdar://105154010

* Correct misspelling in DSL renderer (#627)

vertial -> vertical

rdar://104602317

* Fix output type mismatch with RegexBuilder (#626)

* Revert "Merge pull request #628 from apple/result_builder_changes_workaround"

This reverts commit 7e059b7, reversing
changes made to 3ca8b13.

* Use `some` syntax in variadics

This supports a type checker fix after the change in how result
builder closure parameters are type-checked.

* Type checker workaround: adjust test

* Further refactor to work around type checker regression

* Align availability macro with OS versions (#641)

* Speed up general character class matching (#642)

Short-circuit Character.isASCII checks inside built in character class matching.

Also, make benchmark try a few more times before giving up.

* Test for \s matching CRLF when scalar matching (#648)

* General ascii fast paths for character classes (#644)

General ASCII fast-paths for builtin character classes

* Remove the unsupported `anyScalar` case (#650)

We decided not to support the `anyScalar` character class, which would
match a single Unicode scalar regardless of matching mode. However,
its representation was still included in the various character class
types in the regex engine, leading to unreachable code and unclear
requirements when changing or adding new code. This change removes
that representation where possible.

The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it
is marked `@_spi(RegexBuilder) public`. Any use of that enum case
is handled with a `fatalError("Unsupported")`, and it isn't produced
on any code path.

---------

Co-authored-by: Nate Cook <natecook@apple.com>
Co-authored-by: Butta <repo@butta.fastem.com>
Co-authored-by: Ole Begemann <ole@oleb.net>
Co-authored-by: Alex Martini <amartini@apple.com>
Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com>
Co-authored-by: David Ewing <dewing@apple.com>
Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>
milseman added a commit to milseman/swift-experimental-string-processing that referenced this pull request Apr 5, 2023
milseman added a commit that referenced this pull request May 25, 2023
* Atomically load the lowered program (#610)

Since we're atomically initializing the compiled program in
`Regex.Program`, we need to pair that with an atomic load.

Resolves #609.

* Add tests for line start/end word boundary diffs (#616)

The `default` and `simple` word boundaries have different behaviors
at the start and end of strings/lines. These tests validate that we
have the correct behavior implemented. Related to issue #613.

* Add tweaks for Android

* Fix documentation typo (#615)

* Fix abstract for Regex.dotMatchesNewlines(_:). (#614)

The old version looks like it was accidentally duplicated from
anchorsMatchLineEndings(_:) just below it.
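
The two options whose abstracts were confused do different things. A minimal sketch of the difference, using only `Regex(_:)`, `dotMatchesNewlines(_:)`, and `anchorsMatchLineEndings(_:)`; the patterns and inputs are made up for illustration:

```swift
// Illustrative patterns; built from strings so no bare-slash flag is needed.
let dot = try! Regex("a.b")
let anchored = try! Regex("^two$")

// dotMatchesNewlines controls whether `.` may match a newline character.
let dotDefault = (try? dot.wholeMatch(in: "a\nb")) != nil
let dotEnabled = (try? dot.dotMatchesNewlines().wholeMatch(in: "a\nb")) != nil

// anchorsMatchLineEndings controls whether `^` and `$` also match
// at the start and end of each line, not only of the whole input.
let anchorDefault = (try? anchored.firstMatch(in: "one\ntwo")) != nil
let anchorEnabled =
    (try? anchored.anchorsMatchLineEndings().firstMatch(in: "one\ntwo")) != nil

print(dotDefault, dotEnabled, anchorDefault, anchorEnabled)
```

Each option changes only its own behavior, which is why sharing one abstract between them was misleading.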

* Remove `RegexConsumer` and fix its dependencies (#617)

* Remove `RegexConsumer` and fix its dependencies

This eliminates the RegexConsumer type and rewrites its users to call
through to other, existing functionality on Regex or in the Algorithms
implementations. RegexConsumer doesn't take account of the dual
subranges required for matching, so it can produce results that are
inconsistent with matches(of:) and ranges(of:), which were rewritten
earlier.

rdar://102841216

* Remove remaining from-end algorithm methods

This removes methods that are left over from when we were considering
from-end algorithms. These aren't tested and may not have the correct
semantics, so it's safer to remove them entirely.

* Improve StringProcessing and RegexBuilder documentation (#611)

This includes documentation improvements for core types/methods,
RegexBuilder types along with their generated variadic initializers,
and adds some curation. It also includes tests of the documentation
code samples.

* Set availability for inverted character class test (#621)

This feature depends on running with a Swift 5.7 stdlib, and fails
when that isn't available.

* Add type annotations in RegexBuilder tests

These changes work around a change to the way result builders are
compiled that removes the ability for result builder closure outputs
to affect the overload resolution elsewhere in an expression.

Workarounds for rdar://104881395 and rdar://104645543

* Workaround for fileprivate array issue

A recent compiler change results in fileprivate arrays sometimes
not keeping their buffers around long enough. This change avoids that
issue by removing the fileprivate annotations from the affected type.

* Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703>

* Stop at end of search string in TwoWaySearcher (#631)

When searching for a substring that doesn't exist, it was possible
for TwoWaySearcher to advance beyond the end of the search string,
causing a crash. This change adds a `limitedBy:` parameter to that
index movement, avoiding the invalid movement.

Fixes rdar://105154010
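
The effect of a `limitedBy:` parameter can be seen with the standard `index(_:offsetBy:limitedBy:)` method. This is only a sketch of the general technique on a made-up string, not the `TwoWaySearcher` code itself:

```swift
// Illustrative only; this is not the TwoWaySearcher implementation.
let text = "abc"

// Unchecked movement past endIndex would trap at runtime:
// let bad = text.index(text.startIndex, offsetBy: 5)

// The limited form returns nil instead of moving past the limit.
let pastEnd = text.index(text.startIndex, offsetBy: 5, limitedBy: text.endIndex)
let inside = text.index(text.startIndex, offsetBy: 2, limitedBy: text.endIndex)

if pastEnd == nil {
    print("offset 5 is past the end, so the search can stop")
}
if let i = inside {
    print(text[i])
}
```

A caller can treat the `nil` result as "no more room to search" rather than crashing on an invalid index.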

* Correct misspelling in DSL renderer (#627)

vertial -> vertical

rdar://104602317

* Fix output type mismatch with RegexBuilder (#626)

Some regex literals (and presumably other `Regex` instances) lose
their output type information when used in a RegexBuilder closure due
to the way the concatenating builder calls are overloaded. In
particular, any output type with labeled tuples or where the sum of
tuple components in the accumulated and new output types is greater
than 10 will be ignored.

Regex internals don't make this distinction, however, so there ends up
being a mismatch between what a `Regex.Match` instance tries to
produce and the output type of the outermost regex. For example, this
code results in a crash, because `regex` is a `Regex<Substring>`
but the match tries to produce a `(Substring, number: Substring)`:

    let regex = Regex {
        ZeroOrMore(.whitespace)
        /:(?<number>\d+):/
        ZeroOrMore(.whitespace)
    }
    let match = try regex.wholeMatch(in: " :21: ")
    print(match!.output)

To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node
to mark situations where the output type is discarded. This status
is propagated through the capture list into the match's storage,
which lets us produce the correct output type. Note that we can't just
drop the capture groups when building the compiled program because
(1) different parts of the regex might reference the capture group
and (2) all capture groups are available if a developer converts the
output to `AnyRegexOutput`.

    let anyOutput = AnyRegexOutput(match)
    // anyOutput[1] == "21"
    // anyOutput["number"] == Optional("21")

Fixes #625. rdar://104823356
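
A sketch of the behavior described above, assuming a toolchain that includes this fix. It uses the extended `#/.../#` literal so that it compiles without the bare-slash regex flag; otherwise it mirrors the example in the description:

```swift
import RegexBuilder

// Same shape as the example above; the literal's labeled capture is
// dropped from the builder's typed output.
let regex = Regex {
    ZeroOrMore(.whitespace)
    #/:(?<number>\d+):/#
    ZeroOrMore(.whitespace)
}

if let match = try regex.wholeMatch(in: " :21: ") {
    // The typed output is only the whole match.
    let whole: Substring = match.output
    print(whole)

    // The capture is still reachable through the type-erased output.
    let anyOutput = AnyRegexOutput(match)
    print(anyOutput[1].substring ?? "nil")
    print(anyOutput["number"]?.substring ?? "nil")
}
```

The typed output and the type-erased output deliberately disagree here: the first reflects what the builder's type promised, the second reflects every capture group in the compiled program.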

Note: Linux seems to crash on different tests when the two customTest
overloads have `internal` visibility or are called. Switching one of the
functions to be generic over a RegexComponent works around the issue.

* Revert "Merge pull request #628 from apple/result_builder_changes_workaround"

This reverts commit 7e059b7, reversing
changes made to 3ca8b13.

* Use `some` syntax in variadics

This supports a type checker fix after the change in how result
builder closure parameters are type-checked.

* Type checker workaround: adjust test

* Further refactor to work around type checker regression

* Align availability macro with OS versions (#641)

* Speed up general character class matching (#642)

Short-circuit `Character.isASCII` checks inside built-in character class matching.

Also, make the benchmark try a few more times before giving up.

* Test for \s matching CRLF when scalar matching (#648)

* General ascii fast paths for character classes (#644)

General ASCII fast-paths for builtin character classes

* Remove the unsupported `anyScalar` case (#650)

We decided not to support the `anyScalar` character class, which would
match a single Unicode scalar regardless of matching mode. However,
its representation was still included in the various character class
types in the regex engine, leading to unreachable code and unclear
requirements when changing or adding new code. This change removes
that representation where possible.

The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it
is marked `@_spi(RegexBuilder) public`. Any use of that enum case
is handled with a `fatalError("Unsupported")`, and it isn't produced
on any code path.

* Fix range-based quantification fast path (#653)

The fast path for quantification incorrectly discards the last save
position when the quantification used up all possible trips, which is
only possible with range-based quantifications (e.g. `{0,3}`). This
bug shows up when a range-based quantifier matches the maximum - 1
repetitions of the preceding pattern.

For example, the regex `/a{0,2}a/` should succeed as a full match for any
of the strings "a", "aa", or "aaa". However, the pattern fails
to match "aa", since the save point allowing a single "a" to match
the first `a{0,2}` part of the regex is discarded.

This change only discards the last save position when advancing the
quantifier fails due to a failure to match, not maxing out the number
of trips.
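
As a quick check of the intended semantics: a whole match of `a{0,2}a` is a run of one to three `a` characters. The sketch below assumes a toolchain whose regex engine includes this fix, and builds the regex from a string so that no bare-slash literal flag is needed:

```swift
// Sketch only: results on an engine without this fix may differ for "aa".
let regex = try! Regex("a{0,2}a")

for input in ["a", "aa", "aaa", "aaaa"] {
    // wholeMatch succeeds only if the entire input is consumed.
    let matched = (try? regex.wholeMatch(in: input)) != nil
    print(input, matched)
}
```

The input "aa" is the interesting case, because it needs the quantifier to give back one repetition so that the trailing `a` can match.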

* Add in ASCII fast-path for anyNonNewline (#654)

* Avoid long expression type checks (#657)

These changes remove several seconds of type-checking time from the
RegexBuilder test cases, bringing all expressions under 150ms (on
the tested computer).

* Processor cleanup (#655)

Clean up and refactor the processor

* Simplify instruction fetching

* Refactor metrics out, and void their storage in release builds

* Put operations onto String

* Fix `firstRange(of:)` search (#656)

Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter
actually use two different string searching algorithms. `ranges(of:)`
uses the "z-searcher" algorithm, while `firstRange(of:)` uses a
two-way search. Since it's better to align on a single path for these
searches, the z-searcher has lower requirements, and the two-way
search implementation has a correctness bug, this change removes
the two-way search algorithm and uses z-search for `firstRange(of:)`.

The correctness bug in `firstRange(of:)` appears only when searching
for the second (or later) occurrence of a substring, which you have
to be fairly deliberate about. In the example below, the substring
at offsets `7..<12` is missed:

    let text = "ADACBADADACBADACB"
    //          =====  -----=====
    let pattern = "ADACB"
    let firstRange = text.firstRange(of: pattern)!
    // firstRange ~= 0..<5
    let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)!
    // secondRange ~= 12..<17 (the occurrence at 7..<12 is skipped)

This change also removes some unrelated, unused code in Split.swift,
in addition to removing an (unused) usage of `TwoWaySearcher`.

rdar://92794248
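
For comparison, the same input can be run through `ranges(of:)` and a repeated `firstRange(of:)` search. This is a sketch that assumes a standard library including this change; the conversion to integer offsets is added only for readability:

```swift
let text = "ADACBADADACBADACB"
let pattern = "ADACB"

// Convert String.Index values to integer offsets for display.
func offset(_ index: String.Index) -> Int {
    text.distance(from: text.startIndex, to: index)
}

// All non-overlapping occurrences, as offset ranges.
let offsets: [Range<Int>] = text.ranges(of: pattern).map { range in
    offset(range.lowerBound)..<offset(range.upperBound)
}
print(offsets)

// Searching again from the end of the first occurrence.
let firstRange = text.firstRange(of: pattern)!
let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)!
print(offset(secondRange.lowerBound), offset(secondRange.upperBound))
```

With both methods on the same search path, the second search is expected to report the occurrence at offsets `7..<12` instead of skipping to `12..<17`.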

* Bug fix and hot path for quantified `.` (#658)

Bug fix in newline hot path, and apply hot path to quantified dot

* Run scalar-semantic benchmark variants (#659)

Run scalar semantic benchmarks

* Refactor operations to be on String (#664)

Finish refactoring logic onto String

* Provide unique generic method parameter names (#669)

The 5.9 compiler warns about this, and it will be an error
starting in Swift 6.

* Enable quantification optimizations for scalar semantics (#671)

* Quantified scalar semantic matching

* Remove redundant test

---------

Co-authored-by: Nate Cook <natecook@apple.com>
Co-authored-by: Butta <repo@butta.fastem.com>
Co-authored-by: Ole Begemann <ole@oleb.net>
Co-authored-by: Alex Martini <amartini@apple.com>
Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com>
Co-authored-by: David Ewing <dewing@apple.com>
Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>
milseman added a commit to milseman/swift-experimental-string-processing that referenced this pull request May 25, 2023
milseman added a commit that referenced this pull request Jun 4, 2023
milseman added a commit that referenced this pull request Dec 15, 2023
* Atomically load the lowered program (#610)

Since we're atomically initializing the compiled program in
`Regex.Program`, we need to pair that with an atomic load.

Resolves #609.

* Add tests for line start/end word boundary diffs (#616)

The `default` and `simple` word boundaries have different behaviors
at the start and end of strings/lines. These tests validate that we
have the correct behavior implemented. Related to issue #613.

* Add tweaks for Android

* Fix documentation typo (#615)

* Fix abstract for Regex.dotMatchesNewlines(_:). (#614)

The old version looks like it was accidentally duplicated from
anchorsMatchLineEndings(_:) just below it.

* Remove `RegexConsumer` and fix its dependencies (#617)

* Remove `RegexConsumer` and fix its dependencies

This eliminates the RegexConsumer type and rewrites its users to call
through to other, existing functionality on Regex or in the Algorithms
implementations. RegexConsumer doesn't take account of the dual
subranges required for matching, so it can produce results that are
inconsistent with matches(of:) and ranges(of:), which were rewritten
earlier.

rdar://102841216

* Remove remaining from-end algorithm methods

This removes methods that are left over from when we were considering
from-end algorithms. These aren't tested and may not have the correct
semantics, so it's safer to remove them entirely.

* Improve StringProcessing and RegexBuilder documentation (#611)

This includes documentation improvements for core types/methods,
RegexBuilder types along with their generated variadic initializers,
and adds some curation. It also includes tests of the documentation
code samples.

* Set availability for inverted character class test (#621)

This feature depends on running with a Swift 5.7 stdlib, and fails
when that isn't available.

* Add type annotations in RegexBuilder tests

These changes work around a change to the way result builders are
compiled that removes the ability for result builder closure outputs
to affect the overload resolution elsewhere in an expression.

Workarounds for rdar://104881395 and rdar://104645543

* Workaround for fileprivate array issue

A recent compiler change results in fileprivate arrays sometimes
not keeping their buffers around long enough. This change avoids that
issue by removing the fileprivate annotations from the affected type.

* Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703>

* Stop at end of search string in TwoWaySearcher (#631)

When searching for a substring that doesn't exist, it was possible
for TwoWaySearcher to advance beyond the end of the search string,
causing a crash. This change adds a `limitedBy:` parameter to that
index movement, avoiding the invalid movement.

Fixes rdar://105154010
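
The bounded index movement described above can be sketched with the standard library's `index(_:offsetBy:limitedBy:)`, which returns `nil` instead of moving past a limit. This is only an illustration, not the TwoWaySearcher code itself; the `advance` helper is hypothetical.

```swift
// Hypothetical helper, not taken from TwoWaySearcher: move `i` forward by
// `n` positions, but never past the end of `text`.
func advance(_ i: String.Index, by n: Int, in text: String) -> String.Index? {
    // Returns nil when the move would go beyond `text.endIndex`.
    text.index(i, offsetBy: n, limitedBy: text.endIndex)
}

let text = "abc"
// Moving by exactly the length lands on endIndex, which is still valid.
let inBounds = advance(text.startIndex, by: 3, in: text)
// Moving one step further yields nil rather than an invalid index.
let outOfBounds = advance(text.startIndex, by: 4, in: text)
print(inBounds == text.endIndex, outOfBounds == nil)
```

A caller that gets `nil` back can stop the search, since no further match is possible.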

* Correct misspelling in DSL renderer (#627)

vertial -> vertical

rdar://104602317

* Fix output type mismatch with RegexBuilder (#626)

Some regex literals (and presumably other `Regex` instances) lose
their output type information when used in a RegexBuilder closure due
to the way the concatenating builder calls are overloaded. In
particular, any output type with labeled tuples or where the sum of
tuple components in the accumulated and new output types is greater
than 10 will be ignored.

Regex internals don't make this distinction, however, so there ends up
being a mismatch between what a `Regex.Match` instance tries to
produce and the output type of the outermost regex. For example, this
code results in a crash, because `regex` is a `Regex<Substring>`
but the match tries to produce a `(Substring, number: Substring)`:

    let regex = Regex {
        ZeroOrMore(.whitespace)
        /:(?<number>\d+):/
        ZeroOrMore(.whitespace)
    }
    let match = try regex.wholeMatch(in: " :21: ")
    print(match!.output)

To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node
to mark situations where the output type is discarded. This status
is propagated through the capture list into the match's storage,
which lets us produce the correct output type. Note that we can't just
drop the capture groups when building the compiled program because
(1) different parts of the regex might reference the capture group
and (2) all capture groups are available if a developer converts the
output to `AnyRegexOutput`.

    let anyOutput = AnyRegexOutput(match!)
    // anyOutput[1].substring == "21"
    // anyOutput["number"]?.substring == "21"

Fixes #625. rdar://104823356

Note: Linux seems to crash on different tests when the two customTest
overloads have `internal` visibility or are called. Switching one of the
functions to be generic over a RegexComponent works around the issue.

* Revert "Merge pull request #628 from apple/result_builder_changes_workaround"

This reverts commit 7e059b7, reversing
changes made to 3ca8b13.

* Use `some` syntax in variadics

This supports a type checker fix after the change in how result
builder closure parameters are type-checked.

* Type checker workaround: adjust test

* Further refactor to work around type checker regression

* Align availability macro with OS versions (#641)

* Speed up general character class matching (#642)

Short-circuit Character.isASCII checks inside built-in character class matching.

Also, make benchmark try a few more times before giving up.

* Test for \s matching CRLF when scalar matching (#648)

* General ASCII fast paths for character classes (#644)

General ASCII fast-paths for builtin character classes

* Remove the unsupported `anyScalar` case (#650)

We decided not to support the `anyScalar` character class, which would
match a single Unicode scalar regardless of matching mode. However,
its representation was still included in the various character class
types in the regex engine, leading to unreachable code and unclear
requirements when changing or adding new code. This change removes
that representation where possible.

The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it
is marked `@_spi(RegexBuilder) public`. Any use of that enum case
is handled with a `fatalError("Unsupported")`, and it isn't produced
on any code path.

* Fix range-based quantification fast path (#653)

The fast path for quantification incorrectly discards the last save
position when the quantification used up all possible trips, which is
only possible with range-based quantifications (e.g. `{0,3}`). This
bug shows up when a range-based quantifier matches the maximum - 1
repetitions of the preceding pattern.

For example, the regex `/a{0,2}a/` should succeed as a full match for
any of the strings "a", "aa", or "aaa". However, the pattern fails
to match "aa", since the save point allowing a single "a" to match
the first `a{0,2}` part of the regex is discarded.

This change only discards the last save position when advancing the
quantifier fails due to a failure to match, not maxing out the number
of trips.
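
A quick way to check this behavior is to whole-match a few inputs against the pattern. The sketch below assumes a toolchain that includes this fix. Since `a{0,2}a` consumes between one and three `a` characters, the first three inputs should match and the last should not. The input list and variable names are only for illustration.

```swift
// `a{0,2}a` matches one to three "a" characters in total.
let regex = #/a{0,2}a/#
let inputs = ["a", "aa", "aaa", "aaaa"]

// Whole-string matching: the entire input must be consumed.
let results = inputs.map { $0.wholeMatch(of: regex) != nil }

for (input, matched) in zip(inputs, results) {
    print(input, matched)
}
```

The "aa" case is the interesting one: the quantifier first takes both characters, hits its maximum of two trips, and then has to back up to the save point with a single repetition.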

* Add in ASCII fast-path for anyNonNewline (#654)

* Avoid long expression type checks (#657)

These changes remove several seconds of type-checking time from the
RegexBuilder test cases, bringing all expressions under 150ms (on
the tested computer).

* Processor cleanup (#655)

Clean up and refactor the processor

* Simplify instruction fetching

* Refactor metrics out, and void their storage in release builds

* Put operations onto String

* Fix `firstRange(of:)` search (#656)

Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter
actually use two different string searching algorithms. `ranges(of:)`
uses the "z-searcher" algorithm, while `firstRange(of:)` uses a
two-way search. Since it's better to align on a single path for these
searches, the z-searcher has lower requirements, and the two-way
search implementation has a correctness bug, this change removes
the two-way search algorithm and uses z-search for `firstRange(of:)`.

The correctness bug in `firstRange(of:)` appears only when searching
for the second (or later) occurrence of a substring, which you have
to be fairly deliberate about. In the example below, the substring
at offsets `7..<12` is missed:

    let text = "ADACBADADACBADACB"
    //          =====  -----=====
    let pattern = "ADACB"
    let firstRange = text.firstRange(of: pattern)!
    // firstRange ~= 0..<5
    let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)!
    // secondRange ~= 12..<17

This change also removes some unrelated, unused code in Split.swift,
in addition to removing an (unused) usage of `TwoWaySearcher`.

rdar://92794248

* Bug fix and hot path for quantified `.` (#658)

Bug fix in newline hot path, and apply hot path to quantified dot

* Run scalar-semantic benchmark variants (#659)

Run scalar semantic benchmarks

* Refactor operations to be on String (#664)

Finish refactoring logic onto String

* Provide unique generic method parameter names (#669)

This produces a warning in the 5.9 compiler and will be an error
starting in Swift 6.

* Enable quantification optimizations for scalar semantics (#671)

* Quantified scalar semantic matching

* Fix doc comment for trimPrefix and trimmingPrefix funcs (#673)

* Update availability for the 5.8 release (#680)

* Optimize search for start-anchored regexes (#682)

When a regex is anchored to the start of a subject, there's no need
to search throughout a string for the pattern when searching for the
first match: a prefix match is sufficient.

This adds a regex compilation-time check about whether a match can
only be found at the start of a subject, and then uses that to
choose whether to defer to `prefixMatch` from within `firstMatch`.
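
The equivalence this optimization relies on can be observed from the public API: for a start-anchored pattern, `firstMatch(of:)` and `prefixMatch(of:)` should agree. The sketch below only demonstrates that observable behavior, not the compilation-time check itself; the pattern and inputs are made up for illustration.

```swift
// A pattern that can only match at the start of the subject.
let anchored = #/^\d+/#

let hit = "123abc"
let miss = "abc123"

// Searching and prefix-matching find the same match.
let firstOutput = hit.firstMatch(of: anchored)?.output
let prefixOutput = hit.prefixMatch(of: anchored)?.output
print(firstOutput == prefixOutput)

// When the start of the subject doesn't match, neither call finds anything,
// so there is nothing to gain from scanning the rest of the string.
let missFirst = miss.firstMatch(of: anchored)
let missPrefix = miss.prefixMatch(of: anchored)
print(missFirst == nil, missPrefix == nil)
```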

* Fix misuse of `XCTSkip()` (#685)

* Handle boundaries when matching in substrings (#675)

* Handle boundaries when matching in substrings

Some of our existing matching routines use the start/endIndex
of the input, which is basically never the right thing to do.

This change revises those checks to use the search bounds, by
either moving the boundary check out of the matching method, or
if the boundary is a part of what needs to be matched (e.g.
word boundaries have different behavior at the start/end than
in the middle of a string) the search bounds are passed into
the matching method.

Testing is currently handled by piggy-backing on the existing
match tests; we should add more tests to handle substring-
specific edge cases.
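
As a small sketch of the intended semantics, anchors should be evaluated against the bounds of the substring being searched rather than those of its base string. The strings and patterns below are illustrative only.

```swift
let string = "123abc456def789"
// A Substring that shares storage with `string` but excludes the digits
// at either end: "abc456def".
let trimmed = string.dropFirst(3).dropLast(3)

let leadingLetters = #/^[a-z]+/#
let trailingLetters = #/[a-z]+$/#

// `^` and `$` are checked against the substring's own bounds.
let leading = trimmed.firstMatch(of: leadingLetters)?.output
let trailing = trimmed.firstMatch(of: trailingLetters)?.output
print(leading ?? "nil", trailing ?? "nil")

// In the full string the same start-anchored pattern finds nothing,
// because the string begins with digits.
let inFullString = string.firstMatch(of: leadingLetters)
print(inFullString == nil)
```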

* Handle sub-character substring boundaries

This change passes the end boundary down into matching methods, and
uses it to find the actual character that is part of the input
substring, even if the substring's end boundary is in the middle of
a grapheme cluster.

Substrings cannot have sub-Unicode scalar boundaries as of Swift
5.7; we can remove a check for this when matching an individual
scalar.

* Overhaul quantification fast-path (#689)

Overhaul quantification save points and fast path logic, for significant wins in simplicity and performance.

* adopt the stdlib’s pattern for atomic lazy references

- avoids reliance on a pointer conversion

* pass a pointer instead of inout conversion

- this function is imported in a way that causes the compiler to not detect it as a C function

* Update Sources/_StringProcessing/Regex/Core.swift

comment spelling fix

* Adds SPI for a NSRE compatibility mode option (#698)

NSRegularExpression matches at the Unicode scalar level, but also
matches `\r\n` sequences with a single `.` when single-line mode is
enabled. This adds a `_nsreCompatibility` property that enables both
of those behaviors, and implements support for the special case
handling of `.`.

* Add ASCII fast-path for ASCII character class matching (#690)

Uses quickASCIICharacter to speed up ASCII character class matching.

2x speedup for EmailLookahead_All and many, many others. 10% regression in AnchoredNotFound_First and related.
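
The general shape of an ASCII fast path can be sketched as follows. This is not the engine's `quickASCIICharacter`; both helpers below are hypothetical, and the `isWholeNumber` fallback is used purely for illustration rather than as the engine's definition of a digit. The idea is to answer from a single UTF-8 byte when the character is known to be one ASCII scalar, and to fall back to full `Character` handling otherwise.

```swift
/// Hypothetical helper: returns the byte at `i` when the Character that
/// starts there is a single ASCII scalar, or nil if the slow path is needed.
func singleASCIIByte(in s: String, at i: String.Index) -> UInt8? {
    let utf8 = s.utf8
    guard i < utf8.endIndex else { return nil }
    let byte = utf8[i]
    guard byte < 0x80 else { return nil }            // non-ASCII: slow path
    let next = utf8.index(after: i)
    guard next < utf8.endIndex else { return byte }  // last byte of the string
    let following = utf8[next]
    // A non-ASCII successor might combine with this scalar, and CR LF forms
    // a single Character, so both cases are left to the slow path.
    if following >= 0x80 || (byte == 0x0D && following == 0x0A) { return nil }
    return byte
}

/// Hypothetical digit test with an ASCII fast path and a Character fallback.
func isDigit(in s: String, at i: String.Index) -> Bool {
    guard i < s.endIndex else { return false }
    if let byte = singleASCIIByte(in: s, at: i) {
        return (UInt8(ascii: "0")...UInt8(ascii: "9")).contains(byte)
    }
    return s[i].isWholeNumber
}

let ascii = "7x"
let arabicIndic = "\u{0663}"   // ARABIC-INDIC DIGIT THREE
let crlf = "\r\n"
let combined = "e\u{0301}"     // "e" followed by a combining acute accent

print(isDigit(in: ascii, at: ascii.startIndex))
print(isDigit(in: arabicIndic, at: arabicIndic.startIndex))
print(singleASCIIByte(in: crlf, at: crlf.startIndex) == nil)
print(singleASCIIByte(in: combined, at: combined.startIndex) == nil)
```

The lookahead at the following byte is what keeps the fast path safe under grapheme-cluster semantics: two adjacent ASCII scalars always form separate characters except for CR LF.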

---------

Co-authored-by: Nate Cook <natecook@apple.com>
Co-authored-by: Butta <repo@butta.fastem.com>
Co-authored-by: Ole Begemann <ole@oleb.net>
Co-authored-by: Alex Martini <amartini@apple.com>
Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com>
Co-authored-by: David Ewing <dewing@apple.com>
Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>
Co-authored-by: Valeriy Van <github@w7software.com>
Co-authored-by: Jonathan Grynspan <grynspan@me.com>
Co-authored-by: Guillaume Lessard <guillaume.lessard@apple.com>
Co-authored-by: Guillaume Lessard <glessard@users.noreply.github.com>