-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support case-changes to Annotated{String,Char}s #54013
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tecosaur
added
strings
"Strings!"
backport 1.11
Change should be backported to release-1.11
labels
Apr 9, 2024
Oh, I'll add some test cases for this tomorrow. |
tecosaur
force-pushed
the
annotated-case-changes
branch
2 times, most recently
from
April 10, 2024 05:29
740a230
to
0f4fc12
Compare
There we go, that should be a pretty decent test. |
tecosaur
force-pushed
the
annotated-case-changes
branch
from
April 10, 2024 05:30
0f4fc12
to
a3eded6
Compare
tecosaur
added
the
awaiting review
PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated.
label
Apr 10, 2024
fingolfin
approved these changes
Apr 10, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you
tecosaur
force-pushed
the
annotated-case-changes
branch
from
April 10, 2024 17:32
a3eded6
to
cd05c11
Compare
Genuine CI error?
|
Previously, any case changes to Annotated{String,Char} types triggered "fall back to non-annotated type" non-specialised methods. It would be nice to keep the annotations though, and that can be done so long as we keep track of any potential changes to the number of bytes taken by each character on case changes. This is unusual, but can happen with some letters (e.g. the upper case of 'ſ' is 'S'). To handle this, a helper function annotated_chartransform is introduced. This allows for efficient uppercase/lowercase methods (about 50% overhead in managing the annotation ranges, compared to just transforming a String). The {upper,lower}casefirst and titlecase transformations are much more inefficient with this style of implementation, but not prohibitively so. If somebody has a bright idea, or they emerge as an area deserving of more attention, the performance characteristics can be improved. As a bonus, a specialised textwidth method is implemented to avoid the generic fallback, providing a ~12x performance improvement. To check that annotated_chartransform is accurate, as are the specialised case-transformations, a few million random collections of strings were pre- and post-annotated and checked to be the same in a fuzzing check performed with Supposition.jl. const short_str = Data.Text(Data.Characters(), max_len=20) const short_strs = Data.Vectors(short_str, max_size=10) const case_transform_fn = Data.SampledFrom((uppercase, lowercase)) function annot_caseinvariant(f::Function, strs::Vector{String}) annot_strs = map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]), enumerate(strs)) f_annot_strs = map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]), enumerate(map(f, strs))) pre_join = Base.annotated_chartransform(join(annot_strs), f) post_join = join(f_annot_strs) pre_join == post_join end @check max_examples=1_000_000 annot_caseinvariant(case_transform_fn, short_strs) This helped me determine that in annotated_chartransform the "- 1" was needed with offset position calculation, and that in the "findlast" calls that less than *or equal* was the correct equality test.
tecosaur
force-pushed
the
annotated-case-changes
branch
from
April 11, 2024 03:14
cd05c11
to
6fb0d5a
Compare
Ah, looks like the parenthesis I put around the |
tecosaur
removed
the
awaiting review
PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated.
label
Apr 11, 2024
59 tasks
KristofferC
pushed a commit
that referenced
this pull request
Apr 25, 2024
Previously, any case changes to Annotated{String,Char} types triggered "fall back to non-annotated type" non-specialised methods. It would be nice to keep the annotations though, and that can be done so long as we keep track of any potential changes to the number of bytes taken by each character on case changes. This is unusual, but can happen with some letters (e.g. the upper case of 'ſ' is 'S'). To handle this, a helper function annotated_chartransform is introduced. This allows for efficient uppercase/lowercase methods (about 50% overhead in managing the annotation ranges, compared to just transforming a String). The {upper,lower}casefirst and titlecase transformations are much more inefficient with this style of implementation, but not prohibitively so. If somebody has a bright idea, or they emerge as an area deserving of more attention, the performance characteristics can be improved. As a bonus, a specialised textwidth method is implemented to avoid the generic fallback, providing a ~12x performance improvement. To check that annotated_chartransform is accurate, as are the specialised case-transformations, a few million random collections of strings were pre- and post-annotated and checked to be the same in a fuzzing check performed with Supposition.jl. const short_str = Data.Text(Data.Characters(), max_len=20) const short_strs = Data.Vectors(short_str, max_size=10) const case_transform_fn = Data.SampledFrom((uppercase, lowercase)) function annot_caseinvariant(f::Function, strs::Vector{String}) annot_strs = map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]), enumerate(strs)) f_annot_strs = map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]), enumerate(map(f, strs))) pre_join = Base.annotated_chartransform(join(annot_strs), f) post_join = join(f_annot_strs) pre_join == post_join end @check max_examples=1_000_000 annot_caseinvariant(case_transform_fn, short_strs) This helped me determine that in annotated_chartransform the "- 1" was needed with offset position calculation, and that in the "findlast" calls that less than *or equal* was the correct equality test. (cherry picked from commit 38a9725)
KristofferC
added a commit
that referenced
this pull request
May 28, 2024
Backported PRs: - [x] #53665 <!-- use afoldl instead of tail recursion for tuples --> - [x] #53976 <!-- LinearAlgebra: LazyString in interpolated error messages --> - [x] #54005 <!-- make `view(::Memory, ::Colon)` produce a Vector --> - [x] #54010 <!-- Overload `Base.literal_pow` for `AbstractQ` --> - [x] #54069 <!-- Allow PrecompileTools to see MI's inferred by foreign abstract interpreters --> - [x] #53750 <!-- inference correctness: fields and globals can revert to undef --> - [x] #53984 <!-- Profile: fix heap snapshot is valid char check --> - [x] #54102 <!-- Explicitly compute stride in unaliascopy for SubArray --> - [x] #54070 <!-- Fix integer overflow in `skip(s::IOBuffer, typemax(Int64))` --> - [x] #54013 <!-- Support case-changes to Annotated{String,Char}s --> - [x] #53941 <!-- Fix writing of AnnotatedChars to AnnotatedIOBuffer --> - [x] #54137 <!-- Fix typo in docs for `partialsortperm` --> - [x] #54129 <!-- use correct size when creating output data from an IOBuffer --> - [x] #54153 <!-- Fixup IdSet docstring --> - [x] #54143 <!-- Fix `make install` from tarballs --> - [x] #54151 <!-- LinearAlgebra: Correct zero element in `_generic_matvecmul!` for block adj/trans --> - [x] #54213 <!-- Add `public` statement to `Base.GC` --> - [x] #54222 <!-- Utilize correct tbaa when emitting stores of unions. --> - [x] #54233 <!-- set MAX_OS_WRITE on unix --> - [x] #54255 <!-- fix `_checked_mul_dims` in the presence of 0s and overflow. --> - [x] #54259 <!-- Fix typo in `readuntil` --> - [x] #54251 <!-- fix typo in gc_mark_memory8 when chunking a large array --> - [x] #54276 <!-- Fix solve for complex `Hermitian` with non-vanishing imaginary part on diagonal --> - [x] #54248 <!-- ensure package callbacks are invoked when no valid precompile file exists for an "auto loaded" stdlib --> - [x] #54308 <!-- Implement eval-able AnnotatedString 2-arg show --> - [x] #54302 <!-- Specialised substring equality for annotated strs --> - [x] #54243 <!-- prevent `package_callbacks` to run multiple time for a single package --> - [x] #54350 <!-- add a precompile signature to Artifacts code that is used by JLLs --> - [x] #54331 <!-- correctly track freed bytes in jl_genericmemory_to_string --> - [x] #53509 <!-- revert moving "creating packages" from Pkg.jl --> - [x] #54335 <!-- When accessing the data pointer for an array, first decay it to a Derived Pointer --> - [x] #54239 <!-- Make sure `fieldcount` constant-folds for `Tuple{...}` --> - [x] #54288 - [x] #54067 - [x] #53715 <!-- Add read/write specialisation for IOContext{AnnIO} --> - [x] #54289 <!-- Rework annotation ordering/optimisations --> - [x] #53815 <!-- create phantom task for GC threads --> - [x] #54130 <!-- inference: handle `LimitedAccuracy` in `handle_global_assignment!` --> - [x] #54428 <!-- Move ConsoleLogging.jl into Base --> - [x] #54332 <!-- Revert "add unsetindex support to more copyto methods (#51760)" --> - [x] #53826 <!-- Make all command-line options documented in all related files --> - [x] #54465 <!-- typeintersect: conservative typevar subtitution during `finish_unionall` --> - [x] #54514 <!-- typeintersect: followup cleanup for the nothrow path of type instantiation --> - [x] #54499 <!-- make `@doc x` work without REPL loaded --> - [x] #54210 <!-- attach finalizer in `mmap` to the correct object --> - [x] #54359 <!-- Pkg REPL: cache `pkg_mode` lookup --> Non-merged PRs with backport label: - [ ] #54471 <!-- Actually setup jit targets when compiling packageimages instead of targeting only one --> - [ ] #54457 <!-- Make `String(::Memory)` copy --> - [ ] #54323 <!-- inference: fix too conservative effects for recursive cycles --> - [ ] #54322 <!-- effects: add new `@consistent_overlay` macro --> - [ ] #54191 <!-- make `AbstractPipe` public --> - [ ] #53957 <!-- tweak how filtering is done for what packages should be precompiled --> - [ ] #53882 <!-- Warn about cycles in extension precompilation --> - [ ] #53707 <!-- Make ScopedValue public --> - [ ] #53452 <!-- RFC: allow Tuple{Union{}}, returning Union{} --> - [ ] #53402 <!-- Add `jl_getaffinity` and `jl_setaffinity` --> - [ ] #53286 <!-- Raise an error when using `include_dependency` with non-existent file or directory --> - [ ] #52694 <!-- Reinstate similar for AbstractQ for backward compatibility --> - [ ] #51479 <!-- prevent code loading from lookin in the versioned environment when building Julia -->
KristofferC
removed
the
backport 1.11
Change should be backported to release-1.11
label
May 28, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Arguably an overlooked area, this PR adds specialised methods for some of the functions in
unicode.jl
, namely the case-changing functions andtextwidth
. The case-changing functions now all preserve annotations, and thetextwidth
specialisation makes it about ~12x faster in some basic local benchmarks.See the commit message for (many) more details.
Screenshot
NB:
ſ
/S
andⱥ
/Ⱥ
have a different number of codeunits.