Support case-changes to Annotated{String,Char}s #54013

tecosaur · 2024-04-09T19:01:17Z

Arguably an overlooked area, this PR adds specialised methods for some of the functions in unicode.jl, namely the case-changing functions and textwidth. The case-changing functions now all preserve annotations, and the textwidth specialisation makes it about 12x faster in some basic local benchmarks.

See the commit message for (many) more details.

Screenshot

NB: ſ/S and ⱥ/Ⱥ have a different number of codeunits.
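The codeunit difference in the note above can be checked directly. This is a minimal sketch using only Base functions (`uppercase` and `ncodeunits` on `Char`); the loop and the extra `'a'` control case are illustrative and not taken from the PR.

```julia
# Compare the UTF-8 size of each character with that of its uppercase form.
# 'a' is included as a control whose size does not change.
for c in ('ſ', 'ⱥ', 'a')
    u = uppercase(c)
    println(c, " (", ncodeunits(c), " codeunits) -> ",
            u, " (", ncodeunits(u), " codeunits)")
end
```

In UTF-8, 'ſ' (U+017F) takes 2 codeunits while 'S' takes 1, and 'ⱥ' (U+2C65) takes 3 while 'Ⱥ' (U+023A) takes 2, so any annotation range that follows such a character has to shift when the case changes.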

tecosaur · 2024-04-09T19:02:35Z

Oh, I'll add some test cases for this tomorrow.

tecosaur · 2024-04-10T05:29:27Z

There we go, that should be a pretty decent test.

fingolfin

Looks good to me, thank you

base/strings/annotated.jl

base/strings/unicode.jl

fingolfin · 2024-04-10T21:58:38Z

Genuine CI error?

```
Error in testset strings/annotated:
Error During Test at /cache/build/tester-amdci4-12/julialang/julia-master/julia-cd05c112d1/share/julia/test/strings/annotated.jl:111
  Got exception outside of a @test
  MethodError: no method matching (::Base.Unicode.var"#4#5")(::Char, ::@NamedTuple{startword::Bool, state::Base.RefValue{Int32}, c0::Base.AnnotatedChar{Char}, wordsep::ComposedFunction{typeof(!), typeof(isletter)}, strict::Bool})
  The function `#4` exists, but no method is defined for this combination of argument types.
```

Previously, any case changes to Annotated{String,Char} types triggered "fall back to non-annotated type" non-specialised methods. It would be nice to keep the annotations though, and that can be done so long as we keep track of any potential changes to the number of bytes taken by each character on case changes. This is unusual, but can happen with some letters (e.g. the upper case of 'ſ' is 'S').

To handle this, a helper function annotated_chartransform is introduced. This allows for efficient uppercase/lowercase methods (about 50% overhead in managing the annotation ranges, compared to just transforming a String). The {upper,lower}casefirst and titlecase transformations are much more inefficient with this style of implementation, but not prohibitively so. If somebody has a bright idea, or they emerge as an area deserving of more attention, the performance characteristics can be improved.

As a bonus, a specialised textwidth method is implemented to avoid the generic fallback, providing a ~12x performance improvement.

To check that annotated_chartransform is accurate, as are the specialised case-transformations, a few million random collections of strings were pre- and post-annotated and checked to be the same in a fuzzing check performed with Supposition.jl.

```julia
const short_str = Data.Text(Data.Characters(), max_len=20)
const short_strs = Data.Vectors(short_str, max_size=10)
const case_transform_fn = Data.SampledFrom((uppercase, lowercase))

function annot_caseinvariant(f::Function, strs::Vector{String})
    annot_strs =
        map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]),
            enumerate(strs))
    f_annot_strs =
        map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]),
            enumerate(map(f, strs)))
    pre_join = Base.annotated_chartransform(join(annot_strs), f)
    post_join = join(f_annot_strs)
    pre_join == post_join
end

@check max_examples=1_000_000 annot_caseinvariant(case_transform_fn, short_strs)
```

This helped me determine that in annotated_chartransform the "- 1" was needed with offset position calculation, and that in the "findlast" calls that less than *or equal* was the correct equality test.
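The byte-tracking idea in the commit message can be illustrated without any annotated-string types. The sketch below is a simplified stand-in, not the implementation of `annotated_chartransform`: the helper name `transformed_range` is hypothetical, and it only maps a single codeunit range of a plain `String` onto the matching range of the transformed string.

```julia
# Hypothetical helper (not part of Base): given a codeunit range `r` of `s`,
# return the codeunit range covering the same characters in `map(f, s)`.
function transformed_range(f, s::String, r::UnitRange{Int})
    start, stop = 0, 0
    newpos = 1                       # next codeunit index in the transformed string
    for (i, c) in pairs(s)           # i is the codeunit index where c starts in s
        n = ncodeunits(f(c))         # size of the transformed character
        if first(r) <= i <= last(r)  # c starts inside the original range
            start == 0 && (start = newpos)
            stop = newpos + n - 1
        end
        newpos += n
    end
    # An empty range is returned when no character starts inside `r`.
    start == 0 ? (1:0) : (start:stop)
end

s = "aſb"                            # 'ſ' occupies codeunits 2:3
r = transformed_range(uppercase, s, 2:3)
println(r, " => ", uppercase(s)[r])
```

Here the range covering 'ſ' shrinks from two codeunits to one, because its uppercase form 'S' is a single byte; an annotation attached to that range would need the same adjustment.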

tecosaur · 2024-04-11T03:14:52Z

Ah, it looks like the parentheses I put around the do-block arguments are actually unwanted.
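The offending code is not shown in the thread, so the following is only a plausible reading of the failure. In Julia, `do (a, b)` defines a one-argument function that destructures a tuple, whereas `do a, b` defines a two-argument function. The name `call_with_two` is hypothetical and exists only for this illustration.

```julia
# A callee that invokes its function argument with TWO positional arguments,
# similar to how the closure in the CI error was called with (Char, NamedTuple).
call_with_two(f, a, b) = f(a, b)

# Without parentheses: a two-argument anonymous function, which matches the call.
ok = call_with_two('x', 1) do c, n
    string(c, n)
end

# With parentheses: a ONE-argument function that destructures a tuple.
# Calling it with two arguments throws a MethodError.
failed = try
    call_with_two('x', 1) do (c, n)
        string(c, n)
    end
    false
catch err
    err isa MethodError
end

println(ok, " ", failed)
```

The parenthesised form is the right choice only when the callee passes a single tuple (or other destructurable value), for example when mapping over `enumerate(xs)`.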

(cherry picked from commit 38a9725)

Backported PRs:
- [x] #53665
- [x] #53976
- [x] #54005
- [x] #54010
- [x] #54069
- [x] #53750
- [x] #53984
- [x] #54102
- [x] #54070
- [x] #54013
- [x] #53941
- [x] #54137
- [x] #54129
- [x] #54153
- [x] #54143
- [x] #54151
- [x] #54213
- [x] #54222
- [x] #54233
- [x] #54255
- [x] #54259
- [x] #54251
- [x] #54276
- [x] #54248
- [x] #54308
- [x] #54302
- [x] #54243
- [x] #54350
- [x] #54331
- [x] #53509
- [x] #54335
- [x] #54239
- [x] #54288
- [x] #54067
- [x] #53715
- [x] #54289
- [x] #53815
- [x] #54130
- [x] #54428
- [x] #54332
- [x] #53826
- [x] #54465
- [x] #54514
- [x] #54499
- [x] #54210
- [x] #54359

Non-merged PRs with backport label:
- [ ] #54471
- [ ] #54457
- [ ] #54323
- [ ] #54322
- [ ] #54191
- [ ] #53957
- [ ] #53882
- [ ] #53707
- [ ] #53452
- [ ] #53402
- [ ] #53286
- [ ] #52694
- [ ] #51479

tecosaur added the strings ("Strings!") and backport 1.11 ("Change should be backported to release-1.11") labels Apr 9, 2024

tecosaur force-pushed the annotated-case-changes branch 2 times, most recently from 740a230 to 0f4fc12, April 10, 2024 05:29

tecosaur force-pushed the annotated-case-changes branch from 0f4fc12 to a3eded6, April 10, 2024 05:30

tecosaur added the awaiting review ("PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated.") label Apr 10, 2024

fingolfin approved these changes Apr 10, 2024

tecosaur force-pushed the annotated-case-changes branch from a3eded6 to cd05c11, April 10, 2024 17:32

tecosaur force-pushed the annotated-case-changes branch from cd05c11 to 6fb0d5a, April 11, 2024 03:14

tecosaur removed the awaiting review label Apr 11, 2024

KristofferC mentioned this pull request Apr 17, 2024

Backports for 1.11.0-beta2 #54112 (Merged, 59 tasks)

tecosaur merged commit 38a9725 into JuliaLang:master Apr 18, 2024
8 checks passed

tecosaur deleted the annotated-case-changes branch May 2, 2024 09:45

KristofferC removed the backport 1.11 label May 28, 2024
