bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558

Merged on Sep 4, 2019 (5 commits)

Commits on Aug 28, 2019

  1. Fix a broken link in a comment in is_normalized.

    This link doesn't work.
    
    Going back through that UAX's history to find the version that was
    current when this code was added in commit 7a0fedf in 2009-04,
    we find that that anchor still works in that version:
      https://www.unicode.org/reports/tr15/tr15-29.html#Annex8
    
    It's a section heading "14. Detecting Normalization Forms".  Happily
    the anchor that the corresponding section heading now offers looks
    much more reasonable -- it's the title of the section -- and so likely
    to be long-term stable.  ("Annex 8" seems like some kind of editing
    error.)  Switch to that.
    gnprice committed Aug 28, 2019 (4025110)
  2. bpo-37966: Fully implement the UAX #15 quick-check algorithm.

    The purpose of the `unicodedata.is_normalized` function is to answer
    the question `str == unicodedata.normalize(form, str)` more
    efficiently than writing just that, by using the "quick check"
    optimization described in the Unicode standard in UAX #15.
    
    However, it turns out the code doesn't implement the full algorithm
    from the standard, and as a result we often miss the optimization and
    end up having to compute the whole normalized string after all.
    
    Implement the standard's algorithm.  This greatly speeds up
    `unicodedata.is_normalized` in many cases where our partial variant
    of quick-check had been returning MAYBE and the standard algorithm
    returns NO.
    
    In a quick test on my desktop, the existing code takes about 4.4 ms/MB
    (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
    has to do the slow normalize-and-compare:
    
      $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
          -- 'unicodedata.is_normalized("NFD", s)'
      50 loops, best of 5: 4.39 msec per loop
    
    With this patch, it gets the answer instantly (58 ns) on the same 1 MB
    string:
    
      $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
          -- 'unicodedata.is_normalized("NFD", s)'
      5000000 loops, best of 5: 58.2 nsec per loop
    gnprice committed Aug 28, 2019 (2a222da)
  3. bpo-37966: Add yes_only flag to is_normalized helper.

    This restores a small optimization that the original version of this
    code had for the `unicodedata.normalize` use case.
    
    With this, that case is actually faster than in master!
    
    $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 561 usec per loop
    
    $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 512 usec per loop
    gnprice committed Aug 28, 2019 (26892d3)
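
For readers who want to see the shape of the quick-check loop described in the second commit above, here is a minimal Python sketch of the UAX #15 algorithm. It is not the C code from this PR. The names `quick_check` and `nfd_allowed` are hypothetical, and because `unicodedata` does not expose the Quick_Check properties directly, `nfd_allowed` approximates NFD_Quick_Check under the assumption that a character is disallowed in NFD exactly when it has a canonical decomposition (Hangul syllables included).

```python
import unicodedata

YES, MAYBE, NO = "YES", "MAYBE", "NO"

# Hangul syllables decompose algorithmically, so unicodedata.decomposition()
# reports nothing for them; treat the whole block as disallowed in NFD.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def nfd_allowed(ch):
    """Approximate NFD_Quick_Check for one character (an assumption)."""
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return NO
    decomp = unicodedata.decomposition(ch)
    # A mapping without a "<tag>" prefix is a canonical decomposition.
    if decomp and not decomp.startswith("<"):
        return NO
    return YES


def quick_check(s, allowed=nfd_allowed):
    """Return YES, NO, or MAYBE following the loop given in UAX #15."""
    last_ccc = 0
    result = YES
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return NO
        check = allowed(ch)
        if check == NO:
            return NO
        if check == MAYBE:
            # Keep scanning: a later character may still turn this into NO.
            result = MAYBE
        last_ccc = ccc
    return result
```

With this sketch, `quick_check("\uf900" * 500000)` returns NO at the first character, which is the early exit the benchmark in the commit message exercises. Only a MAYBE result would require falling back to the full `s == unicodedata.normalize(form, s)` comparison.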

Commits on Aug 29, 2019

  1. 27e8122
  2. Use bool for a boolean.

    gnprice committed Aug 29, 2019 (3762787)