@lorentey
Member

@lorentey lorentey commented Jul 25, 2025

Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as String-level interfaces), so that folks can prototype proper API for these concepts.

  • Fix _wordIndex(after:) to always advance forward. It now requires its input index to be on a word boundary. Remove the @_spi attribute, exposing it as a (hidden, but) public entry point.
  • The old SPIs _wordIndex(before:) and _nearestWordIndex(atOrBelow:) were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public _wordIndex(somewhereAtOrBefore:) entry point.
  • Expose artisanal handcrafted low-level state machines for detecting word boundaries (_WordRecognizer, _RandomAccessWordRecognizer), following the design of _CharacterRecognizer.
  • Add tests to reliably validate that the two state machine flavors always produce consistent results.
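
The contract described in the list above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyWords`, its integer offsets, its simplified boundary rule (a break wherever two adjacent characters are not both letters or numbers), and the `wordRange(containing:)` helper are all hypothetical and are not the standard library implementation or the UAX #29 rule set. It only shows how a forward step that requires a boundary as input can be combined with a random-access entry point that returns some boundary at or before a position, not necessarily the nearest one.

```swift
// Toy model only: a boundary falls between two adjacent characters unless
// both are word-like (letters or numbers). This is NOT the UAX #29 rule set.
struct ToyWords {
  let chars: [Character]

  init(_ text: String) { chars = Array(text) }

  private func isWordLike(_ c: Character) -> Bool { c.isLetter || c.isNumber }

  /// True if offset `i` is a word boundary in the toy model.
  func isBoundary(_ i: Int) -> Bool {
    if i <= 0 || i >= chars.count { return true }
    return !(isWordLike(chars[i - 1]) && isWordLike(chars[i]))
  }

  /// Forward step. `i` must be a boundary before the end of the text;
  /// the result is the next boundary and is always strictly greater than `i`.
  func wordIndex(after i: Int) -> Int {
    precondition(i >= 0 && i < chars.count && isBoundary(i),
                 "i must be on a word boundary before the end")
    var j = i + 1
    while !isBoundary(j) { j += 1 }
    return j
  }

  /// Random access. Backs up to a "safe point" (the start of the text or the
  /// position just after whitespace), which is always a boundary in this toy
  /// model. The result is some boundary at or before `i`, not the nearest one.
  func wordIndex(somewhereAtOrBefore i: Int) -> Int {
    var j = min(max(i, 0), chars.count)
    while j > 0 && !chars[j - 1].isWhitespace { j -= 1 }
    return j
  }

  /// Hypothetical helper: find the word containing offset `i` by starting
  /// from a safe boundary and stepping forward until the word covers `i`.
  func wordRange(containing i: Int) -> Range<Int> {
    precondition(i >= 0 && i < chars.count, "i must address a character")
    var start = wordIndex(somewhereAtOrBefore: i)
    var end = wordIndex(after: start)
    while end <= i {
      start = end
      end = wordIndex(after: start)
    }
    return start..<end
  }
}

let toy = ToyWords("hello, world")
let range = toy.wordRange(containing: 9)
print(String(toy.chars[range]))  // prints "world"
```

The forward loop in `wordRange(containing:)` is why the random-access entry point only needs to promise "somewhere at or before": any earlier boundary is a valid starting state, and stepping forward from it recovers the exact word.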

rdar://155482680

@lorentey lorentey requested a review from a team as a code owner July 25, 2025 00:34
@lorentey
Member Author

@swift-ci test

@lorentey lorentey requested a review from Azoy July 25, 2025 00:36
@lorentey
Member Author

Linux:

stdlib/StringWordBreaking.swift:9:8: error: no such module 'Foundation'

D'oh

@lorentey
Member Author

@swift-ci test

@lorentey
Member Author

lorentey commented Aug 5, 2025

@swift-ci smoke test macOS platform

@lorentey
Member Author

lorentey commented Aug 5, 2025

@swift-ci test macOS platform

@lorentey
Member Author

lorentey commented Aug 5, 2025

Hm

[2025-08-05T22:07:21.662Z] /Users/ec2-user/jenkins/workspace/swift-PR-macos/branch-main/swift/lib/ASTGen/Sources/ASTGen/SourceFile.swift:80:44: error: type 'Parser.ExperimentalFeatures' has no member 'inlineArrayTypeSugar'
[2025-08-05T22:07:21.662Z]  78 |     mapFeature(.OldOwnershipOperatorSpellings, to: .oldOwnershipOperatorSpellings)
[2025-08-05T22:07:21.662Z]  79 |     mapFeature(.KeyPathWithMethodMembers, to: .keypathWithMethodMembers)
[2025-08-05T22:07:21.662Z]  80 |     mapFeature(.InlineArrayTypeSugar, to: .inlineArrayTypeSugar)
[2025-08-05T22:07:21.662Z]     |                                            `- error: type 'Parser.ExperimentalFeatures' has no member 'inlineArrayTypeSugar'
[2025-08-05T22:07:21.662Z]  81 |     mapFeature(.DefaultIsolationPerFile, to: .defaultIsolationPerFile)
[2025-08-05T22:07:21.662Z]  82 |   }

@lorentey
Member Author

lorentey commented Aug 6, 2025

@swift-ci test

…ndaries

Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680
@lorentey lorentey force-pushed the pushing-word-boundaries branch from ab558c2 to bb4e6ea Compare August 6, 2025 03:06
@lorentey
Member Author

lorentey commented Aug 6, 2025

@swift-ci test

@lorentey lorentey force-pushed the pushing-word-boundaries branch from bb4e6ea to 847df72 Compare August 8, 2025 21:02
@lorentey
Member Author

lorentey commented Aug 8, 2025

@swift-ci test

@lorentey lorentey merged commit 14b9b80 into swiftlang:main Aug 11, 2025
5 checks passed
@lorentey lorentey deleted the pushing-word-boundaries branch August 11, 2025 18:28