Support case-changes to Annotated{String,Char}s #54013

Merged 1 commit on Apr 18, 2024

Commits on Apr 11, 2024

  1. Support case-changes to Annotated{String,Char}s

    Previously, any case change to an Annotated{String,Char} value fell
    back to the non-specialised methods, which return the non-annotated
    type and so drop the annotations. It would be nice to keep the
    annotations, and that can be done so long as we keep track of any
    change in the number of bytes each character takes when its case
    changes. This is unusual, but can happen with some letters (e.g. the
    upper case of 'ſ', two bytes in UTF-8, is 'S', one byte).
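    
    As a quick illustration of the byte-count change (not part of the
    patch itself), the encoded size of a character can be compared before
    and after the case change:

    ```julia
    # Compare the UTF-8 encoded size of a character before and after uppercasing.
    long_s  = 'ſ'                  # U+017F LATIN SMALL LETTER LONG S
    upper_s = uppercase(long_s)    # 'S'

    ncodeunits(long_s)             # 2 bytes in UTF-8
    ncodeunits(upper_s)            # 1 byte in UTF-8

    # The same shrinkage shows up at the string level, which is why byte-based
    # annotation ranges need adjusting after the transformation.
    ncodeunits("ſa"), ncodeunits(uppercase("ſa"))   # (3, 2)
    ```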
    
    To handle this, a helper function annotated_chartransform is introduced.
    This allows for efficient uppercase/lowercase methods (about 50%
    overhead in managing the annotation ranges, compared to just
    transforming a String). The {upper,lower}casefirst and titlecase
    transformations are considerably less efficient with this style of
    implementation, but not prohibitively so. If somebody has a bright
    idea, or if these emerge as an area deserving of more attention, their
    performance characteristics can be improved.
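    
    The bookkeeping involved can be sketched roughly as follows. This is a
    simplified illustration on a plain String, not the Base implementation:
    the name shift_ranges is hypothetical, and ranges are assumed to be
    byte ranges running from the first byte of one character to the last
    byte of another.

    ```julia
    # Hypothetical sketch: apply `f` to every character of `s` and adjust
    # byte ranges for any change in encoded character width.
    function shift_ranges(f, s::String, ranges::Vector{UnitRange{Int}})
        out = IOBuffer()
        bytepos = 0            # bytes of `s` consumed so far
        offsets = [0 => 0]     # input byte position => cumulative byte shift
        for c in s
            oldnb = ncodeunits(c)
            bytepos += oldnb
            nb = write(out, f(c))      # number of bytes written for the new char
            if nb != oldnb
                push!(offsets, bytepos => last(last(offsets)) + nb - oldnb)
            end
        end
        shifted = map(ranges) do r
            start, stop = first(r), last(r)
            # The start moves by the width changes of characters ending before
            # it; the stop also includes a character ending exactly at `stop`.
            start_shift = last(offsets[findlast(o -> first(o) <= start - 1, offsets)])
            stop_shift  = last(offsets[findlast(o -> first(o) <= stop, offsets)])
            (start + start_shift):(stop + stop_shift)
        end
        return String(take!(out)), shifted
    end

    # "ſ" occupies bytes 1:2 and "a" byte 3; after uppercasing, "S" is byte 1
    # and "A" is byte 2.
    shift_ranges(uppercase, "ſa", [1:2, 3:3])   # ("SA", [1:1, 2:2])
    ```

    Only the positions where a character changed width are recorded, so the
    common case where nothing changes width adds a single lookup per range.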
    
    As a bonus, a specialised textwidth method is implemented to avoid the
    generic fallback, providing a ~12x performance improvement.
    
    To check that annotated_chartransform and the specialised
    case-transformations are accurate, a fuzzing check was performed
    with Supposition.jl: a few million random collections of strings
    were annotated either before or after the case transformation, and
    the two results were checked to be the same.
    
        using Supposition
    
        # Generators: short strings, small vectors of them, and a case function.
        const short_str = Data.Text(Data.Characters(), max_len=20)
        const short_strs = Data.Vectors(short_str, max_size=10)
        const case_transform_fn = Data.SampledFrom((uppercase, lowercase))
    
        function annot_caseinvariant(f::Function, strs::Vector{String})
            # Annotate each original string over its full byte range.
            annot_strs =
                map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]),
                    enumerate(strs))
            # Transform each string first, then annotate the result.
            f_annot_strs =
                map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]),
                    enumerate(map(f, strs)))
            # Transforming the joined annotated string must match joining the
            # transformed-then-annotated strings.
            pre_join = Base.annotated_chartransform(join(annot_strs), f)
            post_join = join(f_annot_strs)
            pre_join == post_join
        end
    
        @check max_examples=1_000_000 annot_caseinvariant(case_transform_fn, short_strs)
    
    This helped me determine that in annotated_chartransform the "- 1" was
    needed in the offset position calculation, and that in the "findlast"
    calls less than *or equal* was the correct comparison.
    tecosaur committed Apr 11, 2024
    6fb0d5a