[stdlib] Fix implementation of Unicode text segmentation for word boundaries #83314

lorentey · 2025-07-25T00:34:55Z

Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose artisanal handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.
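
To make the two contracts above concrete, here is a minimal sketch that is not the stdlib implementation. It assumes a toy boundary rule (runs of letters/digits and runs of whitespace stay together; everything else breaks) in place of the real UAX #29 rules, and every name in it (`ToyClass`, `toyHasBoundary`, `toyWordIndex`) is hypothetical. It only illustrates the shape of a forward step that starts on a boundary and always advances, and of a random-access lookup that returns some boundary at or before a given index.

```swift
// Toy character classes. Real word breaking uses the Unicode Word_Break
// property and many more rules; this is an illustrative assumption only.
enum ToyClass { case wordLike, space, other }

func toyClass(_ c: Character) -> ToyClass {
  if c.isLetter || c.isNumber { return .wordLike }
  if c.isWhitespace { return .space }
  return .other
}

// Hypothetical simplified rule: two adjacent characters are separated by a
// boundary unless both are word-like or both are whitespace.
func toyHasBoundary(between a: Character, and b: Character) -> Bool {
  let (x, y) = (toyClass(a), toyClass(b))
  return !(x == y && x != .other)
}

extension String {
  /// Forward step. `i` is assumed to be on a toy word boundary and below
  /// `endIndex`. The result is always strictly greater than `i`.
  func toyWordIndex(after i: Index) -> Index {
    precondition(i < endIndex, "cannot advance past the end")
    var prev = self[i]
    var j = index(after: i)
    while j < endIndex {
      let next = self[j]
      if toyHasBoundary(between: prev, and: next) { break }
      prev = next
      j = index(after: j)
    }
    return j
  }

  /// Random access. Returns some toy word boundary at or before `i`,
  /// which is assumed to be aligned to a `Character`.
  func toyWordIndex(somewhereAtOrBefore i: Index) -> Index {
    var j = i
    // startIndex and endIndex are always boundaries.
    while j > startIndex, j < endIndex {
      let before = self[index(before: j)]
      if toyHasBoundary(between: before, and: self[j]) { break }
      j = index(before: j)
    }
    return j
  }
}

// Usage: walk forward from the start, recording each boundary offset.
let text = "Hello, world 42"
var boundaries: [Int] = []
var i = text.startIndex
while i < text.endIndex {
  i = text.toyWordIndex(after: i)
  boundaries.append(text.distance(from: text.startIndex, to: i))
}
print(boundaries) // [5, 6, 7, 12, 13, 15]
```

With the toy rule a single character of context is enough to decide a boundary, so the backward search is trivial. The real rules need more context, which is presumably why the PR follows the Unicode recommendation for random access rather than stepping backward one position at a time.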

rdar://155482680

lorentey · 2025-07-25T00:35:05Z

@swift-ci test

lorentey · 2025-07-25T17:19:00Z

Linux:

stdlib/StringWordBreaking.swift:9:8: error: no such module 'Foundation'

D'oh

lorentey · 2025-07-25T17:24:11Z

@swift-ci test

lorentey · 2025-08-05T21:09:32Z

@swift-ci smoke test macOS platform

lorentey · 2025-08-05T21:09:38Z

@swift-ci test macOS platform

lorentey · 2025-08-05T22:12:27Z

Hm

[2025-08-05T22:07:21.662Z] /Users/ec2-user/jenkins/workspace/swift-PR-macos/branch-main/swift/lib/ASTGen/Sources/ASTGen/SourceFile.swift:80:44: error: type 'Parser.ExperimentalFeatures' has no member 'inlineArrayTypeSugar'
[2025-08-05T22:07:21.662Z]  78 |     mapFeature(.OldOwnershipOperatorSpellings, to: .oldOwnershipOperatorSpellings)
[2025-08-05T22:07:21.662Z]  79 |     mapFeature(.KeyPathWithMethodMembers, to: .keypathWithMethodMembers)
[2025-08-05T22:07:21.662Z]  80 |     mapFeature(.InlineArrayTypeSugar, to: .inlineArrayTypeSugar)
[2025-08-05T22:07:21.662Z]     |                                            `- error: type 'Parser.ExperimentalFeatures' has no member 'inlineArrayTypeSugar'
[2025-08-05T22:07:21.662Z]  81 |     mapFeature(.DefaultIsolationPerFile, to: .defaultIsolationPerFile)
[2025-08-05T22:07:21.662Z]  82 |   }

lorentey · 2025-08-06T02:42:52Z

@swift-ci test

…ndaries

Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680

lorentey · 2025-08-06T03:06:28Z

@swift-ci test

lorentey · 2025-08-08T21:02:13Z

@swift-ci test

lorentey requested a review from a team as a code owner July 25, 2025 00:34

lorentey requested a review from Azoy July 25, 2025 00:36

Azoy approved these changes Jul 31, 2025

lorentey force-pushed the pushing-word-boundaries branch from ab558c2 to bb4e6ea Compare August 6, 2025 03:06

[test] Resolve failures detected by CI

847df72

lorentey force-pushed the pushing-word-boundaries branch from bb4e6ea to 847df72 Compare August 8, 2025 21:02

lorentey merged commit 14b9b80 into swiftlang:main Aug 11, 2025
5 checks passed

lorentey deleted the pushing-word-boundaries branch August 11, 2025 18:28

lorentey mentioned this pull request Aug 12, 2025

[test][abi] Fix unusual off-by-one backreference issue in arm64 ABI list #83679

Merged
