Improve StringProcessing and RegexBuilder documentation #611

natecook1000 · 2022-10-20T22:28:35Z

I think this provides at least basic docs for all the public symbols. Would love help identifying specific places that we need more/better documentation.

I do think we need a document that explains how regex syntax gets turned into a capture type — e.g. /(a+)(b)?/ -> (Substring, Substring, Substring?).

amartini51 · 2022-10-20T23:21:36Z

Sources/RegexBuilder/Anchor.swift

  /// Word boundaries are identified using the Unicode default word boundary
  /// algorithm by default. To specify a different word boundary algorithm,
-  /// see the `RegexComponent.wordBoundaryKind(_:)` method.
+  /// see the `RegexComponent/wordBoundaryKind(_:)` method.


Suggested change

/// see the `RegexComponent/wordBoundaryKind(_:)` method.

/// see the ``RegexComponent/wordBoundaryKind(_:)`` method.

Double backticks to make this a live link. Although I might still have the link syntax off — this looks like it's pointing to a method in the _StringProcessing module.

It doesn't seem like linking to symbols in different modules is supported by DocC yet, so I think this should just be in plain code voice.

Azoy

This is great, thanks Nate!

natecook1000 · 2022-10-21T15:31:01Z

Sources/_StringProcessing/Regex/AnyRegexOutput.swift

+/// The type-erased, dynamic output of a regular expression match.
+///
+/// When you find a match using regular expression that has `AnyRegexOutput`
+/// as its output type, you can find information about matches by iterating


Looks like I need to finish this comment.

Sources/RegexBuilder/DSL.swift

natecook1000 · 2022-10-21T15:31:59Z

Sources/RegexBuilder/DSL.swift

-/// This group opens a local backtracking scope which, upon successful exit,
-/// discards any remaining backtracking points from within the scope.
+/// An atomic group opens a local backtracking scope which, upon successful
+/// exit, discards any remaining backtracking points from within the scope.


Need an example here.

Sources/RegexBuilder/DSL.swift

Co-authored-by: Stefano De Carolis <xalien95@live.it>

milseman

Overall this looks really good, thank you. Were you able to write test cases for the examples given in the docs?

milseman · 2022-10-26T17:58:54Z

Sources/_StringProcessing/Regex/AnyRegexOutput.swift

 @available(SwiftStdlib 5.7, *)
 extension AnyRegexOutput {
-  /// Creates a type-erased regex output from an existing match.
+  /// Creates a dynamic regular expression match output from an existing match.


Are we settling on "dynamic" for this? I thought we were leaning more towards "type-erased" at one point

Oh, right — great question. I made this change because while "type-erased" is the way we look at AnyRegexOutput as implementors, for Regex users the experience of using a Match<AnyRegexOutput> is more like it's a dynamic type that they can query. Typically type erasure results in difficulty accessing the underlying values, which isn't as much the case here due to the different AnyRegexOutput API.

milseman · 2022-10-26T17:59:01Z

Sources/_StringProcessing/Documentation.docc/README.md

@@ -0,0 +1,41 @@
+# ``_StringProcessing/Regex``


What does this file do?

This is for organizing how the different symbols defined in the Regex type are organized on the Regex documentation page. By default, symbols are grouped in sections by kind: "Initializers", "Instance Properties", etc.

It looks like I didn't change this file name — I'll update that.

natecook1000 · 2022-10-26T18:59:31Z

Were you able to write test cases for the examples given in the docs?

Great point! I have these in a playground locally, so I'll make sure I move them into a new test file.

natecook1000 · 2022-10-27T18:59:02Z

@milseman Added tests for all the samples (wow do I wish these were automatic), plus longer docs for Capture and TryCapture — could you take a look at the framing and detail for these two? Also would love any ideas for demonstrating Local…

natecook1000 · 2022-12-14T19:17:52Z

@swift-ci Please test

natecook1000 · 2022-12-14T19:45:47Z

(Note: The macOS test failure is due to testing with a < 5.7 Swift.)

This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples.

* Atomically load the lowered program (#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves #609. * Add tests for line start/end word boundary diffs (#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613. * Add tweaks for Android * Fix documentation typo (#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes #625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request #628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (#641) * Speed up general character class matching (#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (#648) * General ascii fast paths for character classes (#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

* Atomically load the lowered program (swiftlang#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves swiftlang#609. * Add tests for line start/end word boundary diffs (swiftlang#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue swiftlang#613. * Add tweaks for Android * Fix documentation typo (swiftlang#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (swiftlang#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (swiftlang#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (swiftlang#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (swiftlang#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (swiftlang#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (swiftlang#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (swiftlang#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes swiftlang#625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request swiftlang#628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (swiftlang#641) * Speed up general character class matching (swiftlang#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (swiftlang#648) * General ascii fast paths for character classes (swiftlang#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (swiftlang#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

* Atomically load the lowered program (#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves #609. * Add tests for line start/end word boundary diffs (#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613. * Add tweaks for Android * Fix documentation typo (#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes #625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request #628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (#641) * Speed up general character class matching (#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (#648) * General ascii fast paths for character classes (#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

* Atomically load the lowered program (#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves #609. * Add tests for line start/end word boundary diffs (#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613. * Add tweaks for Android * Fix documentation typo (#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes #625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request #628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (#641) * Speed up general character class matching (#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (#648) * General ascii fast paths for character classes (#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. * Fix range-based quantification fast path (#653) The fast path for quantification incorrectly discards the last save position when the quantification used up all possible trips, which is only possible with range-based quantifications (e.g. `{0,3}`). This bug shows up when a range-based quantifier matches the maximum - 1 repetitions of the preceding pattern. For example, the regex `/a{0,2}a/` should succeed as a full match any of the strings "aa", "aaa", or "aaaa". However, the pattern fails to match "aaa", since the save point allowing a single "a" to match the first `a{0,2}` part of the regex is discarded. This change only discards the last save position when advancing the quantifier fails due to a failure to match, not maxing out the number of trips. * Add in ASCII fast-path for anyNonNewline (#654) * Avoid long expression type checks (#657) These changes remove several seconds of type-checking time from the RegexBuilder test cases, bringing all expressions under 150ms (on the tested computer). * Processor cleanup (#655) Clean up and refactor the processor * Simplify instruction fetching * Refactor metrics out, and void their storage in release builds *Put operations onto String * Fix `firstRange(of:)` search (#656) Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter actually use two different string searching algorithms. `ranges(of:)` uses the "z-searcher" algorithm, while `firstRange(of:)` uses a two-way search. Since it's better to align on a single path for these searches, the z-searcher has lower requirements, and the two-way search implementation has a correctness bug, this change removes the two-way search algorithm and uses z-search for `firstRange(of:)`. The correctness bug in `firstRange(of:)` appears only when searching for the second (or later) occurrence of a substring, which you have to be fairly deliberate about. In the example below, the substring at offsets `7..<12` is missed: let text = "ADACBADADACBADACB" // ===== -----===== let pattern = "ADACB" let firstRange = text.firstRange(of: pattern)! // firstRange ~= 0..<5 let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)! // secondRange ~= 12..<17 This change also removes some unrelated, unused code in Split.swift, in addition to removing an (unused) usage of `TwoWaySearcher`. rdar://92794248 * Bug fix and hot path for quantified `.` (#658) Bug fix in newline hot path, and apply hot path to quantified dot * Run scalar-semantic benchmark variants (#659) Run scalar semantic benchmarks * Refactor operations to be on String (#664) Finish refactoring logic onto String * Provide unique generic method parameter names (#669) This is getting warned on in the 5.9 compiler, will be an error starting in Swift 6. * Enable quantification optimizations for scalar semantics (#671) * Quantified scalar semantic matching * Remove redundant test --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

* Atomically load the lowered program (swiftlang#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves swiftlang#609. * Add tests for line start/end word boundary diffs (swiftlang#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue swiftlang#613. * Add tweaks for Android * Fix documentation typo (swiftlang#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (swiftlang#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (swiftlang#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (swiftlang#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (swiftlang#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (swiftlang#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (swiftlang#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (swiftlang#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes swiftlang#625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request swiftlang#628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (swiftlang#641) * Speed up general character class matching (swiftlang#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (swiftlang#648) * General ascii fast paths for character classes (swiftlang#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (swiftlang#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. * Fix range-based quantification fast path (swiftlang#653) The fast path for quantification incorrectly discards the last save position when the quantification used up all possible trips, which is only possible with range-based quantifications (e.g. `{0,3}`). This bug shows up when a range-based quantifier matches the maximum - 1 repetitions of the preceding pattern. For example, the regex `/a{0,2}a/` should succeed as a full match any of the strings "aa", "aaa", or "aaaa". However, the pattern fails to match "aaa", since the save point allowing a single "a" to match the first `a{0,2}` part of the regex is discarded. This change only discards the last save position when advancing the quantifier fails due to a failure to match, not maxing out the number of trips. * Add in ASCII fast-path for anyNonNewline (swiftlang#654) * Avoid long expression type checks (swiftlang#657) These changes remove several seconds of type-checking time from the RegexBuilder test cases, bringing all expressions under 150ms (on the tested computer). * Processor cleanup (swiftlang#655) Clean up and refactor the processor * Simplify instruction fetching * Refactor metrics out, and void their storage in release builds *Put operations onto String * Fix `firstRange(of:)` search (swiftlang#656) Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter actually use two different string searching algorithms. `ranges(of:)` uses the "z-searcher" algorithm, while `firstRange(of:)` uses a two-way search. Since it's better to align on a single path for these searches, the z-searcher has lower requirements, and the two-way search implementation has a correctness bug, this change removes the two-way search algorithm and uses z-search for `firstRange(of:)`. The correctness bug in `firstRange(of:)` appears only when searching for the second (or later) occurrence of a substring, which you have to be fairly deliberate about. In the example below, the substring at offsets `7..<12` is missed: let text = "ADACBADADACBADACB" // ===== -----===== let pattern = "ADACB" let firstRange = text.firstRange(of: pattern)! // firstRange ~= 0..<5 let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)! // secondRange ~= 12..<17 This change also removes some unrelated, unused code in Split.swift, in addition to removing an (unused) usage of `TwoWaySearcher`. rdar://92794248 * Bug fix and hot path for quantified `.` (swiftlang#658) Bug fix in newline hot path, and apply hot path to quantified dot * Run scalar-semantic benchmark variants (swiftlang#659) Run scalar semantic benchmarks * Refactor operations to be on String (swiftlang#664) Finish refactoring logic onto String * Provide unique generic method parameter names (swiftlang#669) This is getting warned on in the 5.9 compiler, will be an error starting in Swift 6. * Enable quantification optimizations for scalar semantics (swiftlang#671) * Quantified scalar semantic matching * Remove redundant test --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

* Atomically load the lowered program (#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves #609. * Add tests for line start/end word boundary diffs (#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613. * Add tweaks for Android * Fix documentation typo (#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes #625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request #628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (#641) * Speed up general character class matching (#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (#648) * General ascii fast paths for character classes (#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. * Fix range-based quantification fast path (#653) The fast path for quantification incorrectly discards the last save position when the quantification used up all possible trips, which is only possible with range-based quantifications (e.g. `{0,3}`). This bug shows up when a range-based quantifier matches the maximum - 1 repetitions of the preceding pattern. For example, the regex `/a{0,2}a/` should succeed as a full match any of the strings "aa", "aaa", or "aaaa". However, the pattern fails to match "aaa", since the save point allowing a single "a" to match the first `a{0,2}` part of the regex is discarded. This change only discards the last save position when advancing the quantifier fails due to a failure to match, not maxing out the number of trips. * Add in ASCII fast-path for anyNonNewline (#654) * Avoid long expression type checks (#657) These changes remove several seconds of type-checking time from the RegexBuilder test cases, bringing all expressions under 150ms (on the tested computer). * Processor cleanup (#655) Clean up and refactor the processor * Simplify instruction fetching * Refactor metrics out, and void their storage in release builds *Put operations onto String * Fix `firstRange(of:)` search (#656) Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter actually use two different string searching algorithms. `ranges(of:)` uses the "z-searcher" algorithm, while `firstRange(of:)` uses a two-way search. Since it's better to align on a single path for these searches, the z-searcher has lower requirements, and the two-way search implementation has a correctness bug, this change removes the two-way search algorithm and uses z-search for `firstRange(of:)`. The correctness bug in `firstRange(of:)` appears only when searching for the second (or later) occurrence of a substring, which you have to be fairly deliberate about. In the example below, the substring at offsets `7..<12` is missed: let text = "ADACBADADACBADACB" // ===== -----===== let pattern = "ADACB" let firstRange = text.firstRange(of: pattern)! // firstRange ~= 0..<5 let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)! // secondRange ~= 12..<17 This change also removes some unrelated, unused code in Split.swift, in addition to removing an (unused) usage of `TwoWaySearcher`. rdar://92794248 * Bug fix and hot path for quantified `.` (#658) Bug fix in newline hot path, and apply hot path to quantified dot * Run scalar-semantic benchmark variants (#659) Run scalar semantic benchmarks * Refactor operations to be on String (#664) Finish refactoring logic onto String * Provide unique generic method parameter names (#669) This is getting warned on in the 5.9 compiler, will be an error starting in Swift 6. * Enable quantification optimizations for scalar semantics (#671) * Quantified scalar semantic matching * Remove redundant test --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>

* Atomically load the lowered program (#610) Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves #609. * Add tests for line start/end word boundary diffs (#616) The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613. * Add tweaks for Android * Fix documentation typo (#615) * Fix abstract for Regex.dotMatchesNewlines(_:). (#614) The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it. * Remove `RegexConsumer` and fix its dependencies (#617) * Remove `RegexConsumer` and fix its dependencies This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216 * Remove remaining from-end algorithm methods This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely. * Improve StringProcessing and RegexBuilder documentation (#611) This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples. * Set availability for inverted character class test (#621) This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available. * Add type annotations in RegexBuilder tests These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543 * Workaround for fileprivate array issue A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type. * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703> * Stop at end of search string in TwoWaySearcher (#631) When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010 * Correct misspelling in DSL renderer (#627) vertial -> vertical rdar://104602317 * Fix output type mismatch with RegexBuilder (#626) Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored. Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`: let regex = Regex { ZeroOrMore(.whitespace) /:(?<number>\d+):/ ZeroOrMore(.whitespace) } let match = try regex.wholeMatch(in: " :21: ") print(match!.output) To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`. let anyOutput = AnyRegexOutput(match) // anyOutput[1] == "21" // anyOutput["number"] == Optional("21") Fixes #625. rdar://104823356 Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue. * Revert "Merge pull request #628 from apple/result_builder_changes_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13. * Use `some` syntax in variadics This supports a type checker fix after the change in how result builder closure parameters are type-checked. * Type checker workaround: adjust test * Further refactor to work around type checker regression * Align availability macro with OS versions (#641) * Speed up general character class matching (#642) Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up. * Test for \s matching CRLF when scalar matching (#648) * General ascii fast paths for character classes (#644) General ASCII fast-paths for builtin character classes * Remove the unsupported `anyScalar` case (#650) We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path. * Fix range-based quantification fast path (#653) The fast path for quantification incorrectly discards the last save position when the quantification used up all possible trips, which is only possible with range-based quantifications (e.g. `{0,3}`). This bug shows up when a range-based quantifier matches the maximum - 1 repetitions of the preceding pattern. For example, the regex `/a{0,2}a/` should succeed as a full match any of the strings "aa", "aaa", or "aaaa". However, the pattern fails to match "aaa", since the save point allowing a single "a" to match the first `a{0,2}` part of the regex is discarded. This change only discards the last save position when advancing the quantifier fails due to a failure to match, not maxing out the number of trips. * Add in ASCII fast-path for anyNonNewline (#654) * Avoid long expression type checks (#657) These changes remove several seconds of type-checking time from the RegexBuilder test cases, bringing all expressions under 150ms (on the tested computer). * Processor cleanup (#655) Clean up and refactor the processor * Simplify instruction fetching * Refactor metrics out, and void their storage in release builds *Put operations onto String * Fix `firstRange(of:)` search (#656) Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter actually use two different string searching algorithms. `ranges(of:)` uses the "z-searcher" algorithm, while `firstRange(of:)` uses a two-way search. Since it's better to align on a single path for these searches, the z-searcher has lower requirements, and the two-way search implementation has a correctness bug, this change removes the two-way search algorithm and uses z-search for `firstRange(of:)`. The correctness bug in `firstRange(of:)` appears only when searching for the second (or later) occurrence of a substring, which you have to be fairly deliberate about. In the example below, the substring at offsets `7..<12` is missed: let text = "ADACBADADACBADACB" // ===== -----===== let pattern = "ADACB" let firstRange = text.firstRange(of: pattern)! // firstRange ~= 0..<5 let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)! // secondRange ~= 12..<17 This change also removes some unrelated, unused code in Split.swift, in addition to removing an (unused) usage of `TwoWaySearcher`. rdar://92794248 * Bug fix and hot path for quantified `.` (#658) Bug fix in newline hot path, and apply hot path to quantified dot * Run scalar-semantic benchmark variants (#659) Run scalar semantic benchmarks * Refactor operations to be on String (#664) Finish refactoring logic onto String * Provide unique generic method parameter names (#669) This is getting warned on in the 5.9 compiler, will be an error starting in Swift 6. * Enable quantification optimizations for scalar semantics (#671) * Quantified scalar semantic matching * Fix doc comment for trimPrefix and trimmingPrefix funcs (#673) * Update availability for the 5.8 release (#680) * Optimize search for start-anchored regexes (#682) When a regex is anchored to the start of a subject, there's no need to search throughout a string for the pattern when searching for the first match: a prefix match is sufficient. This adds a regex compilation-time check about whether a match can only be found at the start of a subject, and then uses that to choose whether to defer to `prefixMatch` from within `firstMatch`. * Fix misuse of `XCTSkip()` (#685) * Handle boundaries when matching in substrings (#675) * Handle boundaries when matching in substrings Some of our existing matching routines use the start/endIndex of the input, which is basically never the right thing to do. This change revises those checks to use the search bounds, by either moving the boundary check out of the matching method, or if the boundary is a part of what needs to be matched (e.g. word boundaries have different behavior at the start/end than in the middle of a string) the search bounds are passed into the matching method. Testing is currently handled by piggy-backing on the existing match tests; we should add more tests to handle substring- specific edge cases. * Handle sub-character substring boundaries This change passes the end boundary down into matching methods, and uses it to find the actual character that is part of the input substring, even if the substring's end boundary is in the middle of a grapheme cluster. Substrings cannot have sub-Unicode scalar boundaries as of Swift 5.7; we can remove a check for this when matching an individual scalar. * Overhaul quantification fast-path (#689) Overhaul quantification save points and fast path logic, for significant wins in simplicity and performance. * adopt the stdlib’s pattern for atomic lazy references - avoids reliance on a pointer conversion * pass a pointer instead of inout conversion - this function is imported in a way that causes the compiler to not detect it as a C function * Update Sources/_StringProcessing/Regex/Core.swift comment spelling fix * Adds SPI for a NSRE compatibility mode option (#698) NSRegularExpression matches at the Unicode scalar level, but also matches `\r\n` sequences with a single `.` when single-line mode is enabled. This adds a `_nsreCompatibility` property that enables both of those behaviors, and implements support for the special case handling of `.`. * Add ASCII fast-path ASCII character class matching (#690) Uses quickASCIICharacter to speed up ASCII character class matching. 2x speedup for EmailLookahead_All and many, many others. 10% regression in AnchoredNotFound_First and related. --------- Co-authored-by: Nate Cook <natecook@apple.com> Co-authored-by: Butta <repo@butta.fastem.com> Co-authored-by: Ole Begemann <ole@oleb.net> Co-authored-by: Alex Martini <amartini@apple.com> Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com> Co-authored-by: David Ewing <dewing@apple.com> Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com> Co-authored-by: Valeriy Van <github@w7software.com> Co-authored-by: Jonathan Grynspan <grynspan@me.com> Co-authored-by: Guillaume Lessard <guillaume.lessard@apple.com> Co-authored-by: Guillaume Lessard <glessard@users.noreply.github.com>

natecook1000 added 3 commits October 20, 2022 17:25

Documentation improvements for core types/methods

585bd21

Documentation improvements for RegexBuilder

9ea3e60

Documentation for generated variadics

4ad3756

natecook1000 requested review from Azoy, amartini51, milseman and rxwei October 20, 2022 22:28

rxwei approved these changes Oct 21, 2022

View reviewed changes

amartini51 reviewed Oct 21, 2022

View reviewed changes

Azoy approved these changes Oct 21, 2022

View reviewed changes

Correct minor errors

adbe1f6

natecook1000 commented Oct 21, 2022

View reviewed changes

xAlien95 reviewed Oct 23, 2022

View reviewed changes

Sources/RegexBuilder/DSL.swift Outdated Show resolved Hide resolved

Update Sources/RegexBuilder/DSL.swift

b7cf000

Co-authored-by: Stefano De Carolis <xalien95@live.it>

milseman reviewed Oct 26, 2022

View reviewed changes

natecook1000 added 4 commits October 27, 2022 13:56

Rename Regex curation extension file

5cb91dc

Update code samples and doc styling

c339396

Add Capture and TryCapture docs

f5589b9

Add tests with documentation code samples

8f5fa6b

Merge branch 'main' into regex_docs

3f55be6

natecook1000 merged commit ed95066 into swiftlang:main Dec 14, 2022

natecook1000 mentioned this pull request Dec 14, 2022

[swift/main] Integrate latest changes #620

Merged

milseman mentioned this pull request Apr 5, 2023

Update 5.9 with state of swift/main #652

Merged

	/// see the `RegexComponent/wordBoundaryKind(_:)` method.
	/// see the ``RegexComponent/wordBoundaryKind(_:)`` method.

Improve StringProcessing and RegexBuilder documentation #611

Improve StringProcessing and RegexBuilder documentation #611

Uh oh!

Conversation

natecook1000 commented Oct 20, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Azoy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

natecook1000 commented Oct 26, 2022

Uh oh!

natecook1000 commented Oct 27, 2022

Uh oh!

natecook1000 commented Dec 14, 2022

Uh oh!

natecook1000 commented Dec 14, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants