
bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558

Merged: 5 commits merged into python:master on Sep 4, 2019

Conversation

@gnprice (Contributor) commented Aug 28, 2019:

The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up
unicodedata.is_normalized in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
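
In outline, the standard's quickCheck loop looks like this. This is a
sketch rather than the patch itself: `quickcheck_property` and
`combining_class` are hypothetical stand-ins for the module's generated
Unicode-database lookups.

    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Hypothetical per-code-point lookups; the real module reads these
       from the generated Unicode database tables. */
    extern QuickcheckResult quickcheck_property(unsigned int ch, bool nfc, bool k);
    extern unsigned char combining_class(unsigned int ch);

    /* UAX #15 quickCheck: one pass over the string, checking canonical
       ordering of combining marks and each character's Quick_Check
       property.  A single NO settles the question; a MAYBE is only
       remembered, in case a later character yields a definite NO. */
    static QuickcheckResult
    quickcheck(const unsigned int *str, size_t len, bool nfc, bool k)
    {
        QuickcheckResult result = YES;
        unsigned char prev_ccc = 0;
        for (size_t i = 0; i < len; i++) {
            unsigned char ccc = combining_class(str[i]);
            if (ccc != 0 && ccc < prev_ccc)
                return NO;    /* combining marks out of canonical order */
            prev_ccc = ccc;
            switch (quickcheck_property(str[i], nfc, k)) {
            case NO:
                return NO;    /* definitely not in this normalization form */
            case MAYBE:
                result = MAYBE;
                break;
            case YES:
                break;
            }
        }
        return result;
    }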

In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

https://bugs.python.org/issue37966

The UAX #15 anchor link in the code's comment doesn't work.

Going back through that UAX's history to find the version that was
current when this code was added in commit 7a0fedf in 2009-04,
we find that that anchor still works in that version:
  https://www.unicode.org/reports/tr15/tr15-29.html#Annex8

It's a section heading "14. Detecting Normalization Forms".  Happily
the anchor that the corresponding section heading now offers looks
much more reasonable -- it's the title of the section -- and so likely
to be long-term stable.  ("Annex 8" seems like some kind of editing
error.)  Switch to that.

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
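
In outline, the restored fast path amounts to the sketch below: if the
one-pass check already answers YES, `normalize` can hand the input back
unchanged. The helper names here are stand-ins, not the exact patch.

    #include <Python.h>
    #include <stdbool.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Declarations assumed from elsewhere in the module. */
    extern QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             bool nfc, bool k, bool yes_only);
    extern PyObject *nfd_nfkd(PyObject *self, PyObject *input, int k);

    /* Fast path for normalize("NFD", s): if the quick check already
       answers YES, the string is its own normalization, so return the
       input unchanged instead of running the full decomposition. */
    static PyObject *
    normalize_nfd(PyObject *self, PyObject *input)
    {
        if (is_normalized_quickcheck(self, input,
                                     /* nfc */ false, /* k */ false,
                                     /* yes_only */ true) == YES) {
            Py_INCREF(input);
            return input;
        }
        return nfd_nfkd(self, input, 0);
    }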
@benjaminp (Contributor) left a comment:

Thanks. Looks good. Here are a few nitty comments.

    is_normalized(PyObject *self, PyObject *input, int nfc, int k)
    // This needs to match the logic in makeunicodedata.py
    // which constructs the quickcheck data.
    typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Contributor:

Maybe makeunicodedata.py should output this enum (with better name namespacing) then?

Contributor Author:

Ah, good idea!

Do you think it'd be cleaner to add that to this PR, or make a small separate one?

Contributor:

Could be done later, I think.
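
(Illustratively, such generated output might look like the following;
the namespaced names are hypothetical:)

    /* Hypothetical output from makeunicodedata.py, with namespacing. */
    typedef enum {
        UCD_QUICKCHECK_YES = 0,
        UCD_QUICKCHECK_MAYBE = 1,
        UCD_QUICKCHECK_NO = 2,
    } UCD_QuickcheckResult;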

    @@ -208,6 +208,8 @@ def test_issue29456(self):
        self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
        self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

        # For tests of unicodedata.is_normalized / self.db.is_normalized ,
        # see test_normalization.py .
Contributor:

ugh, I would support merging test_normalization into this file for clarity.

Contributor Author:

Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.

A bit of context on someone's thinking in having a separate file can be found above at line 178:

        # The rest can be found in test_normalization.py
        # which requires an external file.

I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)

Happy to send a PR to fold that file in.

Modules/unicodedata.c (outdated review thread, resolved)
    */
    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, int yes_only)
Contributor:

These "boolean int" parameters could be actual bools.

Contributor Author:

Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know.

I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks.

Contributor Author:

Pushed a change to bool for the new parameter.

I was a bit surprised to find that I needed to add an #include <stdbool.h> line to the file! Perhaps that should be in a central include file, in the same way as limits.h and inttypes.h are. That's certainly orthogonal to this PR, though.

Contributor Author:

I made a quick further commit converting nfc and k to bool too (including in the callers, and using true/false rather than 1/0 for literal values).

I'm ambivalent about adding it here. It expands the diff a bit -- was:
    1 file changed, 51 insertions(+), 24 deletions(-)
and becomes:
    1 file changed, 57 insertions(+), 30 deletions(-)
Slightly inclined to leave it out, but if you'd rather include it I'll do that.

Contributor:

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Contributor:

I think we want to avoid polluting public headers with a stdbool.h include—bool is a bit of a name grab in arbitrary mature C codebases. Using stdbool.h where it's needed is also consistent with IWYU.

Contributor Author:

> I think we want to avoid polluting public headers with a stdbool.h include—bool is a bit of a name grab in arbitrary mature C codebases.

Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in Modules/expat/xmltok.c) a sort of manual stdbool.

> Using stdbool.h where it's needed is also consistent with IWYU.

Hmm true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so.

Contributor Author:

> What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Here's the extra commit: gnprice@896c970f5
The lines not otherwise touched are where we declare and then fiddle with the nfc and k locals in unicodedata_UCD_is_normalized_impl.

Contributor:

Okay, we can fix up the rest in a sequel.

    * normalized or certainly not, and MAYBE if quickcheck is unable to
    * tell.
    *
    * If `yes_only` is true, then return MAYBE as soon as we determine
Contributor:

yes_only is a bizarre name for a parameter that causes the function to return MAYBE.

Contributor Author:

Yeah, I'm not really a fan of this name; I'd be glad for a better one.

The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.
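
In terms of the loop sketched in the description, the flag changes
exactly one branch (a sketch, reusing the same hypothetical helpers and
types as that earlier sketch):

    /* With yes_only set, a MAYBE ends the scan immediately: the caller
       falls back to normalize-and-compare for MAYBE and NO alike, so
       scanning on in search of a definite NO would be wasted work. */
    static QuickcheckResult
    quickcheck_ex(const unsigned int *str, size_t len,
                  bool nfc, bool k, bool yes_only)
    {
        QuickcheckResult result = YES;
        unsigned char prev_ccc = 0;
        for (size_t i = 0; i < len; i++) {
            unsigned char ccc = combining_class(str[i]);
            if (ccc != 0 && ccc < prev_ccc)
                return NO;
            prev_ccc = ccc;
            switch (quickcheck_property(str[i], nfc, k)) {
            case NO:
                return NO;
            case MAYBE:
                if (yes_only)
                    return MAYBE;  /* caller treats MAYBE and NO alike */
                result = MAYBE;
                break;
            case YES:
                break;
            }
        }
        return result;
    }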

Contributor:

Maybe invert the sense and call it want_definite meaning "process the string until we're sure it's not normalized or we reach the end"?

Contributor Author:

The algorithm in the standard is the yes_only = false (or want_definite = true) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.

Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the yes_only / !want_definite case, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.

More ideas just now: perhaps maybe_is_no, to mean "a MAYBE is no worse than a NO"?

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for with yes_only, but probably a less paradoxical way of putting it 🙂

Member:

The API also surprised me. I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Contributor Author:

> I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!

I don't like quite these names, though. First, I don't think is_normalized is a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.

What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically it implements one that's defined in the standard, which the standard calls quickCheck in its sample code, and which uses a property in the UCD called Quick_Check.

More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or is_normalized_quickcheck, with no further qualifications, sounds like it will mean the standard's algorithm and it's confusing if it means a different algorithm instead.

Here's another possible set of names to use with the same idea:

  • rename this function to is_normalized_quickcheck_impl
  • is_normalized_quickcheck is the standard's quickcheck (yes_only/yes_or_maybe=false)
  • is_normalized_quickercheck is the variant that's more impatient to return and hand responsibility back to the caller (yes_or_maybe=true)
    • Or is_normalized_quickcheck_quicker

Or if those names feel too long: is_normalized_qc_impl, is_normalized_qc, is_normalized_qc_quicker.

How does that sound?
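
For concreteness, that pair might look like the following (a sketch;
the signatures are assumed to match the current helper, modulo bool):

    #include <Python.h>
    #include <stdbool.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* The renamed worker; signature assumed, with the flag kept last. */
    extern QuickcheckResult
    is_normalized_qc_impl(PyObject *self, PyObject *input,
                          bool nfc, bool k, bool yes_only);

    /* The standard's quickCheck: can answer YES, MAYBE, or NO. */
    static QuickcheckResult
    is_normalized_qc(PyObject *self, PyObject *input, bool nfc, bool k)
    {
        return is_normalized_qc_impl(self, input, nfc, k, false);
    }

    /* The more impatient variant: returns at the first MAYBE, since
       the caller only acts on a definite YES. */
    static QuickcheckResult
    is_normalized_qc_quicker(PyObject *self, PyObject *input, bool nfc, bool k)
    {
        return is_normalized_qc_impl(self, input, nfc, k, true);
    }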

Contributor Author:

> Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE".

Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): yes_vs_maybe.

Contributor:

It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.

So, let's go with the current state.

@benjaminp merged commit 2f09413 into python:master Sep 4, 2019
@miss-islington (Contributor):

Thanks @gnprice for the PR, and @benjaminp for merging it 🌮🎉. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

@bedevere-bot:

GH-15671 is a backport of this pull request to the 3.8 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 4, 2019
…orithm. (pythonGH-15558) (cherry picked from commit 2f09413)
miss-islington added a commit that referenced this pull request Sep 4, 2019
GH-15558) (cherry picked from commit 2f09413)
@gnprice deleted the pr-quickcheck branch September 6, 2019 03:23
@gnprice (Contributor Author) commented Sep 6, 2019:

Thanks for all the reviews and the merge!

@benjaminp , I'll go on and send next some of those followups you suggested.

@gnprice (Contributor Author) commented Sep 6, 2019:

> @benjaminp , I'll go on and send next some of those followups you suggested.

Filed bpo-38043 for that, and sent PRs for two of them.

lisroach pushed a commit to lisroach/cpython that referenced this pull request Sep 10, 2019
…ithm. (pythonGH-15558)
DinoV pushed a commit to DinoV/cpython that referenced this pull request Jan 14, 2020
…ithm. (pythonGH-15558)
websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request Jul 20, 2020
…ithm. (pythonGH-15558)