[swift/main] Merge performance improvements into swift/main #705

milseman · 2023-12-11T20:29:02Z

No description provided.

Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves swiftlang#609.

The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue swiftlang#613.

The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it.

* Remove `RegexConsumer` and fix its dependencies

  This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier.

  rdar://102841216

* Remove remaining from-end algorithm methods

  This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely.

This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples.

This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available.

Add tweaks for Android

These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543

A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type.

…rkaround Add type annotations in RegexBuilder tests

…in the result builder. <rdar://104480703>

…onversion Fix an issue where named character classes weren't getting converted …

When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010
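The `limitedBy:` idea can be illustrated with the standard library's `index(_:offsetBy:limitedBy:)`. This is a minimal sketch of the bounded index movement only, not the `TwoWaySearcher` code itself; the string and offsets are made up for illustration.

```swift
let text = "abc"

// Unbounded movement, text.index(text.startIndex, offsetBy: 5),
// would trap at run time because it walks past endIndex.

// Bounded movement returns nil instead of producing an invalid index.
let past = text.index(text.startIndex, offsetBy: 5, limitedBy: text.endIndex)

// A movement that stays inside the limit returns a usable index.
let within = text.index(text.startIndex, offsetBy: 2, limitedBy: text.endIndex)

print(past == nil, within.map { text[$0] } ?? "?")
```

A caller can treat the `nil` result as "no further match is possible" and stop searching.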

vertial -> vertical rdar://104602317

Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored.

Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`:

    let regex = Regex {
        ZeroOrMore(.whitespace)
        /:(?<number>\d+):/
        ZeroOrMore(.whitespace)
    }
    let match = try regex.wholeMatch(in: " :21: ")
    print(match!.output)

To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`.

    let anyOutput = AnyRegexOutput(match)
    // anyOutput[1] == "21"
    // anyOutput["number"] == Optional("21")

Fixes swiftlang#625. rdar://104823356

Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue.

…anges_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13.

This supports a type checker fix after the change in how result builder closure parameters are type-checked.

Type checker workaround: adjust test

Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up.

General ASCII fast-paths for builtin character classes

We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible. The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path.

The fast path for quantification incorrectly discards the last save position when the quantification used up all possible trips, which is only possible with range-based quantifications (e.g. `{0,3}`). This bug shows up when a range-based quantifier matches the maximum - 1 repetitions of the preceding pattern. For example, the regex `/a{0,2}a/` should succeed as a full match of any of the strings "a", "aa", or "aaa". However, the pattern fails to match "aa", since the save point allowing a single "a" to match the first `a{0,2}` part of the regex is discarded. This change only discards the last save position when advancing the quantifier fails due to a failure to match, not maxing out the number of trips.
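The expected behavior of this pattern can be checked directly: `a{0,2}a` accepts between one and three `a` characters as a whole match. The sketch below assumes Swift 5.7 or later and a toolchain that includes this fix; it builds the regex from a string so that no bare-slash literal support is needed.

```swift
// A string-based pattern avoids depending on regex literal syntax.
let regex = try! Regex("a{0,2}a")

// Whole-match each candidate. `try?` folds both a thrown error
// and "no match" into nil.
let inputs = ["a", "aa", "aaa", "aaaa"]
let results = inputs.map { input in
    (try? regex.wholeMatch(in: input)) != nil
}

for (input, matched) in zip(inputs, results) {
    print(input, matched ? "matches" : "does not match")
}
```

With a correct engine only the four-character input is rejected, because the quantifier can consume at most two characters before the trailing `a`.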

These changes remove several seconds of type-checking time from the RegexBuilder test cases, bringing all expressions under 150ms (on the tested computer).

Clean up and refactor the processor

* Simplify instruction fetching
* Refactor metrics out, and void their storage in release builds
* Put operations onto String

Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter actually use two different string searching algorithms. `ranges(of:)` uses the "z-searcher" algorithm, while `firstRange(of:)` uses a two-way search. Since it's better to align on a single path for these searches, the z-searcher has lower requirements, and the two-way search implementation has a correctness bug, this change removes the two-way search algorithm and uses z-search for `firstRange(of:)`.

The correctness bug in `firstRange(of:)` appears only when searching for the second (or later) occurrence of a substring, which you have to be fairly deliberate about. In the example below, the substring at offsets `7..<12` is missed:

    let text = "ADACBADADACBADACB"
    //          =====  -----=====
    let pattern = "ADACB"
    let firstRange = text.firstRange(of: pattern)!
    // firstRange ~= 0..<5
    let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)!
    // secondRange ~= 12..<17

This change also removes some unrelated, unused code in Split.swift, in addition to removing an (unused) usage of `TwoWaySearcher`.

rdar://92794248
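For comparison, a correct search over the same input should report three occurrences, including the one at offsets `7..<12`. The sketch below uses `ranges(of:)` (Swift 5.7 or later) and converts the resulting index ranges to integer offsets for readability; the conversion is only an illustration, not part of the change.

```swift
let text = "ADACBADADACBADACB"
let pattern = "ADACB"

// Convert each Range<String.Index> to integer offsets for readability.
let offsets = text.ranges(of: pattern).map { range in
    text.distance(from: text.startIndex, to: range.lowerBound)
        ..< text.distance(from: text.startIndex, to: range.upperBound)
}

print(offsets)
```

The three occurrences do not overlap, so a non-overlapping search finds all of them.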

Bug fix in newline hot path, and apply hot path to quantified dot

Run scalar semantic benchmarks

Finish refactoring logic onto String

This is getting warned on in the 5.9 compiler, will be an error starting in Swift 6.

* Quantified scalar semantic matching

When a regex is anchored to the start of a subject, there's no need to search throughout a string for the pattern when searching for the first match: a prefix match is sufficient. This adds a regex compilation-time check about whether a match can only be found at the start of a subject, and then uses that to choose whether to defer to `prefixMatch` from within `firstMatch`.
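The equivalence this optimization relies on can be observed from the public API. The following is a small sketch, assuming Swift 5.7 or later and made-up subject strings; it does not show the compile-time check itself, only that `firstMatch` and `prefixMatch` agree for a start-anchored pattern.

```swift
let anchored = try! Regex("^abc")

// Subject starts with the pattern: both calls find the same range.
let hit = "abcxx"
let firstHit = try? anchored.firstMatch(in: hit)
let prefixHit = try? anchored.prefixMatch(in: hit)

// Subject contains "abc" only later. A start anchor can never match
// there, so scanning forward finds nothing a prefix match would miss.
let miss = "xxabc"
let firstMiss = try? anchored.firstMatch(in: miss)
let prefixMiss = try? anchored.prefixMatch(in: miss)

print(firstHit?.range == prefixHit?.range, firstMiss == nil, prefixMiss == nil)
```

Because the results are identical, deferring to the prefix match saves the cost of retrying the pattern at every later position in the subject.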

* Handle boundaries when matching in substrings

  Some of our existing matching routines use the start/endIndex of the input, which is basically never the right thing to do. This change revises those checks to use the search bounds, by either moving the boundary check out of the matching method, or if the boundary is a part of what needs to be matched (e.g. word boundaries have different behavior at the start/end than in the middle of a string) the search bounds are passed into the matching method.

  Testing is currently handled by piggy-backing on the existing match tests; we should add more tests to handle substring-specific edge cases.

* Handle sub-character substring boundaries

  This change passes the end boundary down into matching methods, and uses it to find the actual character that is part of the input substring, even if the substring's end boundary is in the middle of a grapheme cluster.

  Substrings cannot have sub-Unicode scalar boundaries as of Swift 5.7; we can remove a check for this when matching an individual scalar.

Overhaul quantification save points and fast path logic, for significant wins in simplicity and performance.

- avoids reliance on a pointer conversion

- this function is imported in a way that causes the compiler to not detect it as a C function

comment spelling fix

…triction [nfc] Avoid pointer conversions

NSRegularExpression matches at the Unicode scalar level, but also matches `\r\n` sequences with a single `.` when single-line mode is enabled. This adds a `_nsreCompatibility` property that enables both of those behaviors, and implements support for the special case handling of `.`.

Uses quickASCIICharacter to speed up ASCII character class matching. 2x speedup for EmailLookahead_All and many, many others. 10% regression in AnchoredNotFound_First and related.

milseman · 2023-12-11T20:30:43Z

@swift-ci please test

stephentyrone · 2023-12-11T20:43:19Z

Tests/RegexTests/MatchTests.swift

-    firstMatchTest(maxExtraTrips, input: String(repeating: "a", count: maxStorable), match: String(repeating: "a", count: maxStorable))
-    firstMatchTest(maxExtraTrips, input: String(repeating: "a", count: maxStorable + 1), match: String(repeating: "a", count: maxStorable))
-    XCTAssertNil(try Regex(maxExtraTrips).wholeMatch(in: String(repeating: "a", count: maxStorable + 1)))
+    let maxmaxExtraTrips = "a{,\(maxStorable)}"


find/replaceo?

natecook1000 and others added 30 commits October 6, 2022 17:25

Atomically load the lowered program (swiftlang#610)

0b38ca9

Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves swiftlang#609.

Add tests for line start/end word boundary diffs (swiftlang#616)

335a0c2

The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue swiftlang#613.

Add tweaks for Android

54ff516

Fix documentation typo (swiftlang#615)

eb7f801

Fix abstract for Regex.dotMatchesNewlines(_:). (swiftlang#614)

c51e8f2

The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it.

Set availability for inverted character class test (swiftlang#621)

c34cea5

This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available.

Merge pull request swiftlang#618 from buttaface/droid

3ca8b13

Add tweaks for Android

Workaround for fileprivate array issue

6c4f291

A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type.

Merge pull request swiftlang#628 from apple/result_builder_changes_wo…

7e059b7

…rkaround Add type annotations in RegexBuilder tests

Fix an issue where named character classes weren't getting converted …

6a4077f

…in the result builder. <rdar://104480703>

Merge pull request swiftlang#629 from apple/dewing/CharacterClassDSLC…

8184fc0

…onversion Fix an issue where named character classes weren't getting converted …

Correct misspelling in DSL renderer (swiftlang#627)

d5a6cec

vertial -> vertical rdar://104602317

Revert "Merge pull request swiftlang#628 from apple/result_builder_ch…

070e0ec

…anges_workaround" This reverts commit 7e059b7, reversing changes made to 3ca8b13.

Use some syntax in variadics

1358fc0

This supports a type checker fix after the change in how result builder closure parameters are type-checked.

Type checker workaround: adjust test

083d32a

Further refactor to work around type checker regression

ca92db7

Merge pull request swiftlang#643 from milseman/typechecker_workaround

336f9c5

Type checker workaround: adjust test

Align availability macro with OS versions (swiftlang#641)

852b890

Speed up general character class matching (swiftlang#642)

236b47c

Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up.

Test for \s matching CRLF when scalar matching (swiftlang#648)

348e6c3

General ascii fast paths for character classes (swiftlang#644)

a7ba701

General ASCII fast-paths for builtin character classes

Add in ASCII fast-path for anyNonNewline (swiftlang#654)

923cf5e

Avoid long expression type checks (swiftlang#657)

9ea9936

These changes remove several seconds of type-checking time from the RegexBuilder test cases, bringing all expressions under 150ms (on the tested computer).

milseman and others added 20 commits April 14, 2023 09:26

Processor cleanup (swiftlang#655)

58626cc

Clean up and refactor the processor

* Simplify instruction fetching
* Refactor metrics out, and void their storage in release builds
* Put operations onto String

Bug fix and hot path for quantified . (swiftlang#658)

57b343d

Bug fix in newline hot path, and apply hot path to quantified dot

Run scalar-semantic benchmark variants (swiftlang#659)

6695027

Run scalar semantic benchmarks

Refactor operations to be on String (swiftlang#664)

8eafd55

Finish refactoring logic onto String

Provide unique generic method parameter names (swiftlang#669)

0354667

This is getting warned on in the 5.9 compiler, will be an error starting in Swift 6.

Enable quantification optimizations for scalar semantics (swiftlang#671)

98d5ddc

* Quantified scalar semantic matching

Fix doc comment for trimPrefix and trimmingPrefix funcs (swiftlang#673)

b4b4315

Update availability for the 5.8 release (swiftlang#680)

d61ba4c

Fix misuse of XCTSkip() (swiftlang#685)

ba6e49d

Overhaul quantification fast-path (swiftlang#689)

185ebd6

Overhaul quantification save points and fast path logic, for significant wins in simplicity and performance.

adopt the stdlib’s pattern for atomic lazy references

d45027b

- avoids reliance on a pointer conversion

pass a pointer instead of inout conversion

74637cc

- this function is imported in a way that causes the compiler to not detect it as a C function

Update Sources/_StringProcessing/Regex/Core.swift

bc70423

comment spelling fix

Merge pull request swiftlang#691 from glessard/pointer-conversion-res…

45fd8ec

…triction [nfc] Avoid pointer conversions

Add ASCII fast-path ASCII character class matching (swiftlang#690)

4e742b4

Uses quickASCIICharacter to speed up ASCII character class matching. 2x speedup for EmailLookahead_All and many, many others. 10% regression in AnchoredNotFound_First and related.

Merge remote-tracking branch 'origin/main' into swift/main

4760abe

milseman requested review from natecook1000 and stephentyrone December 11, 2023 20:29

milseman mentioned this pull request Dec 11, 2023

DO NOT MERGE: Regex testing PR swiftlang/swift#70386

Closed

stephentyrone reviewed Dec 11, 2023


milseman merged commit 646946d into swiftlang:swift/main Dec 15, 2023

milseman deleted the swift_main_perf branch December 15, 2023 00:24

11 participants
