-
Notifications
You must be signed in to change notification settings - Fork 52
How Does Natsort Work? (1 ‐ Basics)
natsort works by breaking strings into smaller sub-components (numbers or everything else), and returning these components in a tuple. Sorting tuples in Python is well-defined, and this fact is used to sort the input strings properly. But how does one break a string into sub-components? And what does one do to those components once they are split? Below I will explain the algorithm that was chosen for the natsort module, and some of the thinking that went into those design decisions. I will also mention some of the stumbling blocks I ran into because getting sorting right is surprisingly hard.
If you are impatient, you can skip to TL;DR 1 for the algorithm in the simplest case, and TL;DR 2 to see what extra code is needed to handle special cases.
If I want to compare '2 ft 7 in' to '2 ft 11 in', I might do the following
>>> '2 ft 7 in' < '2 ft 11 in'
False
We as humans know that the above should be true, but why does Python think it is false? Here is how it is performing the comparison:
'2' <=> '2' ==> equal, so keep going ' ' <=> ' ' ==> equal, so keep going 'f' <=> 'f' ==> equal, so keep going 't' <=> 't' ==> equal, so keep going ' ' <=> ' ' ==> equal, so keep going '7' <=> '1' ==> different, use result of '7' < '1'
'7' evaluates as greater than '1' so the statement is false. When sorting, if a value is less than another it is placed first, so in our above example '2 ft 11 in' would end up before '2 ft 7 in', which is not correct. What to do?
The best way to handle this is to break the string into sub-components of numbers and non-numbers, and then convert the numeric parts into float() or int() types. This will force Python to actually understand the context of what it is sorting and then "do the right thing." Luckily, it handles sorting lists of strings right out-of-the-box, so the only hard part is actually making this string-to-list transformation and then Python will handle the rest.
'2 ft 7 in' ==> (2, ' ft ', 7, ' in') '2 ft 11 in' ==> (2, ' ft ', 11, ' in')
When Python compares the two, it roughly follows the below logic:
2 <=> 2 ==> equal, so keep going ' ft ' <=> ' ft ' ==> a string is a special type of sequence - evaluate each character individually || --> ' ' <=> ' ' ==> equal, so keep going 'f' <=> 'f' ==> equal, so keep going 't' <=> 't' ==> equal, so keep going ' ' <=> ' ' ==> equal, so keep going <== Back to parent sequence 7 <=> 11 ==> different, use the result of 7 < 11
Clearly, seven is less than eleven, so our comparison is as we expect, and we would get the sorting order we wanted.
At its heart, natsort is simply a tool to break strings into tuples,
turning numbers in strings (i.e. '79'
) into ints and floats as it does this.
The first major hurtle to overcome is to decompose the string into sub-components. Remarkably, this turns out to be the easy part, owing mostly to Python's easy access to regular expressions. Breaking an arbitrary string based on a pattern is pretty straightforward.
>>> import re
>>> re.split(r'(\d+)', '2 ft 11 in')
['', '2', ' ft ', '11', ' in']
Clear (assuming you can read regular expressions) and concise.
The reason I began developing natsort in the first place was because I
needed to handle the natural sorting of strings containing real numbers, not
just unsigned integers as the above example contains. By real numbers, I mean
those like -45.4920E-23
. natsort can handle just about any number
definition; to that end, here are all the regular expressions used in
natsort:
>>> unsigned_int = r'([0-9]+)'
>>> signed_int = r'([-+]?[0-9]+)'
>>> unsigned_float = r'((?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)'
>>> signed_float = r'([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)'
>>> unsigned_float_no_exponent = r'((?:[0-9]+\.?[0-9]*|\.[0-9]+))'
>>> signed_float_no_exponent = r'([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+))'
Note that "inf"
and "nan"
are deliberately omitted from the float
definition because you wouldn't want (for example) "banana"
to be converted
into ['ba', 'nan', 'a']
, Let's see an example:
>>> re.split(signed_float, 'The mass of 3 electrons is 2.732815068E-30 kg')
['The mass of ', '3', ' electrons is ', '2.732815068E-30', ' kg']
NOTE: It is a bit of a lie to say the above are the complete regular expressions. In the actual code there is also handling for non-ASCII unicode characters (such as ⑦), but I will ignore that aspect of natsort in this discussion.
Now, when the user wants to change the definition of a number, it is as easy as changing the pattern supplied to the regular expression engine.
Choosing the right default is hard, though (well, in this case it shouldn't have been but I was rather thick-headed). In retrospect, it should have been obvious that since essentially all the code examples I had/have seen for natural sorting were for unsigned integers, I should have made the default definition of a number an unsigned integer. But, in the brash days of my youth I assumed that since my use case was real numbers, everyone else would be happier sorting by real numbers; so, I made the default definition of a number a signed float with exponent. This astonished a lot of people (and some people aren't very nice when they are astonished). Starting with natsort version 4.0.0 the default number definition was changed to an unsigned integer which satisfies the "least astonishment" principle, and I have not heard a complaint since.
There has been some debate on Stack Overflow as to what method is best to coerce a string to a number if it can be coerced, and leaving it alone otherwise (see this one for coercion and this one for checking for some high traffic questions), but it mostly boils down to two different solutions, shown here:
>>> def coerce_try_except(x):
... try:
... return int(x)
... except ValueError:
... return x
...
>>> def coerce_regex(x):
... # Note that precompiling the regex is more performant,
... # but I do not show that here for clarity's sake.
... return int(x) if re.match(r'[-+]?\d+$', x) else x
...
Here are some timing results run on my machine:
In [0]: numbers = list(map(str, range(100))) # A list of numbers as strings
In [1]: not_numbers = ['banana' + x for x in numbers]
In [2]: %timeit [coerce_try_except(x) for x in numbers]
15.4 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit [coerce_try_except(x) for x in not_numbers]
85.9 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: %timeit [coerce_regex(x) for x in not_numbers]
25.9 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit [coerce_regex(x) for x in numbers]
39.6 µs ± 527 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
What can we learn from this? The try: except
method (arguably the most
"pythonic" of the solutions) is best for numeric input, but performs over 4X
slower for non-numeric input. Conversely, the regular expression method, though
slower than try: except
for both input types, is more efficient for
non-numeric input than for input that can be converted to an int
. Further,
even though the regular expression method is slower for both input types, it is
always at least twice as fast as the worst case for the try: except
.
Why do I care? Shouldn't I just pick a method and not worry about it? Probably.
However, I am very conscious about the performance of natsort, and want
it to be a true drop-in replacement for sorted() without having to incur
a performance penalty. For the purposes of natsort, there is no clear
winner between the two algorithms - the data being passed to this function will
likely be a mix of numeric and non-numeric string content. Do I use the
try: except
method and hope the speed gains on numbers will offset the
non-number performance, or do I use regular expressions and take the more
stable performance?
It turns out that within the context of natsort, some assumptions can be made that make a hybrid approach attractive. Because all strings are pre-split into numeric and non-numeric content before being passed to this coercion function, the assumption can be made that if a string begins with a digit or a sign, it can be coerced into a number.
>>> def coerce_to_int(x):
... if x[0] in '0123456789+-':
... try:
... return int(x)
... except ValueError:
... return x
... else:
... return x
...
So how does this perform compared to the standard coercion methods?
In [6]: %timeit [coerce_to_int(x) for x in numbers]
19.7 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit [coerce_to_int(x) for x in not_numbers]
10.2 µs ± 87.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The hybrid method eliminates most of the time wasted on numbers checking that it is in fact a number before passing to int(), and eliminates the time wasted in the exception stack for input that is not a number.
That's as fast as we can get, right? In pure Python, probably. At least, it's close. But because I am crazy and a glutton for punishment, I decided to see if I could get any faster writing a C extension. It's called fastnumbers and contains a C implementation of the above coercion functions called try_int() that can iterate over the entire list in C code. How does it fair? Pretty well.
In [8]: %timeit try_int(numbers, map=list)
7.07 µs ± 76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit try_int(not_numbers, map=list)
7.85 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
During development of natsort, I wanted to ensure that using it did not get in the way of a user's program by introducing a performance penalty to their code. To that end, I do not feel like my adventures down the rabbit hole of optimization of coercion functions was a waste; I can confidently look users in the eye and say I considered every option in ensuring natsort is as efficient as possible. This is why if fastnumbers is installed it will be used for this step, and otherwise the hybrid method will be used.
NOTE: Modifying the hybrid coercion function for floats is straightforward.
>>> def coerce_to_float(x): ... if x[0] in '.0123456789+-' or x.lower().lstrip()[:3] in ('nan', 'inf'): ... try: ... return float(x) ... except ValueError: ... return x ... else: ... return x ...
At this point, our natsort algorithm is essentially the following:
>>> import re
>>> def natsort_key(x, as_float=False, signed=False):
... if as_float:
... regex = signed_float if signed else unsigned_float
... else:
... regex = signed_int if signed else unsigned_int
... split_input = re.split(regex, x)
... split_input = filter(None, split_input) # removes null strings
... coerce = coerce_to_float if as_float else coerce_to_int
... return tuple(coerce(s) for s in split_input)
...
I have written the above for clarity and not performance. This pretty much matches most natural sort solutions for python on Stack Overflow (except the above includes customization of the definition of a number).
Continue to the second how-it-works page to finish the story!