difflib.py Differ.compare is too slow [for degenerate cases] #119105
Comments
Changed label from "type-bug" to "performance" and "type-feature". In your proposal, please add some high-level comments about the meaning of "_gravity" and what range of values might be useful. The name on its own is just a mystery.
BTW, I agree this class of inputs is worth addressing 😄. Please open a Pull Request? Offhand, it seems to me that this would be a faster and much more obvious way to get a value that's 0 at the ends of the range and largest at the midpoint.
Sure, that will probably do as well; I was just avoiding function calls.
Micro-optimizations like that are increasingly out of style in core Python code. They generally don't help (or even hurt) when running under PyPy, and CPython is making real progress at speeding the common cases too. Relatedly, nobody abuses the default-argument mechanism for speed tricks anymore unless it's truly critical. Else clarity is valued more.
Let's stay away from hard-coded constants, too - especially ones with "mystery names" 😉. Constants frequently bite us later, and are rarely needed. In this case,

```python
lena = ahi - alo
lenb = bhi - blo
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)
```

Then if you have two "midrange bonus" functions that max out at 0.5 for `i` and `j` being at the midpoints of their ranges,

```python
midrange_total_bonus = (midrange_a_bonus + midrange_b_bonus) * smallest_ratio_delta
```

will yield a score bonus less than the difference between any two distinct raw scores (the sum of the two bonuses can't be greater than 1.0).
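For concreteness, a "midrange bonus" with those properties could look roughly like this (a minimal sketch; the function name and signature are assumptions, not code from the thread):

```python
def midrange_a_bonus(i, alo, ahi):
    """0.0 at the ends of range(alo, ahi), at most 0.5, largest near the midpoint."""
    mid = (alo + ahi - 1) / 2
    if mid == alo:                      # degenerate one-element range
        return 0.5
    return 0.5 * (1 - abs(i - mid) / (mid - alo))
```

An analogous `midrange_b_bonus` for `j` over `range(blo, bhi)` would have the same shape.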
"Strength reduction" looks like a worthwhile optimization. Short course: all function calls, and even multiplies, can be removed from the inner loop, by keeping the bonus as a running total, and adjusting it near the top of the loop like so: bonus += a_inc if i <= ihalf else -a_inc where the other names are function-level invariants: ihalf = (alo + ahi - 1) / 2
a_inc = smallest_ratio_delta / lena. So the bonus keeps going up around the loop, until the midpoint is reached, after which it keeps going down. Similarly for the outer loop. Some details may be tricky. Better names could be picked 😉. If you don't want to bother, let me know here, and I'll open a pull request.
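A small runnable sketch of that bookkeeping, under assumed names and toy ranges (none of this is difflib's actual code), just to show the running totals rising toward the midpoints and falling after:

```python
alo, ahi, blo, bhi = 0, 5, 0, 7               # toy index ranges
lena, lenb = ahi - alo, bhi - blo
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)

ihalf = (alo + ahi - 1) / 2
jhalf = (blo + bhi - 1) / 2
a_inc = smallest_ratio_delta / lena
b_inc = smallest_ratio_delta / lenb

best_bonus, best_pair = -1.0, None
b_bonus = 0.0
for j in range(blo, bhi):
    b_bonus += b_inc if j <= jhalf else -b_inc    # adjust near the top of the loop
    a_bonus = 0.0
    for i in range(alo, ahi):
        a_bonus += a_inc if i <= ihalf else -a_inc
        bonus = a_bonus + b_bonus                 # would be added to a raw ratio
        if bonus > best_bonus:
            best_bonus, best_pair = bonus, (i, j)

print(best_pair)   # (2, 3) - the pair closest to both midpoints gets the largest bonus
```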
Unfortunately,

```python
len_a = max(map(len, a))
len_b = max(map(len, b))
smallest_ratio_delta = 2 * 0.99 / (len_a + len_b) ** 2
```

I plan to play around and open a PR. My first one here!
Wow - I was hallucinating badly, wasn't I? It's all yours - I look forward to the PR 😄 Unless I'm still hallucinating, the square in the new expression isn't needed, is it? If the longest elements ever compared have lengths `len_a` and `len_b`, isn't the smallest possible non-zero difference between two ratios simply `2 / (len_a + len_b)`?
No problem: I often skip over large chunks and forget to come back and explain. Given that a ratio has the form `2 * M / T` (integer match count `M`, total length `T`), the difference between two arbitrary ratios is simply `2 * M1 / T1 - 2 * M2 / T2 == 2 * (M1 * T2 - M2 * T1) / (T1 * T2)`. Trying to minimize the absolute value of the above (and requiring it to be non-zero), we discard the details of the numerator except that it is a non-zero integer: thus we take it to be 1. The smallest possible non-zero difference is then `2 / (T1 * T2)`, and since each of `T1` and `T2` is at most `len_a + len_b`, that is at least `2 / (len_a + len_b) ** 2`.
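A quick numerical check of that bound, with numbers chosen here purely for illustration (not taken from the thread):

```python
from fractions import Fraction

# Two ratios 2*M/T whose denominators differ; longest elements have length 5 each.
len_a = len_b = 5
r1 = Fraction(2 * 4, 9)    # M1 = 4, T1 = 9
r2 = Fraction(2 * 4, 10)   # M2 = 4, T2 = 10

print(abs(r1 - r2))                       # 4/45
print(Fraction(2, 9 * 10))                # 1/45, the smallest non-zero delta for these T's
print(Fraction(2, (len_a + len_b) ** 2))  # 1/50 <= 1/45, so the bound holds
```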
Thanks! Got it now. The smallest delta between ratios sharing a common denominator was what I had in mind. Your more careful analysis looks right. Using the square to simplify reasoning is fine. It's not like we're pushing the limits of float precision here 😉.
I am slightly concerned by how small weights can become (but only slightly). We now try to make differences as small as `2 / (len_a + len_b) ** 2`.
If we end up computing differences so small that they end up getting totally lost to rounding when added to the ratio, I don't think it much matters - just a waste of time. The interesting cases are near the midpoints of the index ranges, where your factor is largest. There are alternatives, too.
Unless I'm hallucinating again, there's a simpler way, which doesn't add new floating-point stress, or even require computing "epsilon". Store the ratio instead as a 2-tuple, `(ratio, distsum)`. Tuple comparison does the rest. If the first components differ (the ratios we use today differ), that settles it. Else (ratios the same), the larger is whichever has the larger `distsum`. Quick coding of that worked fine. Strength reduction in the inner loop becomes just

```python
distsum += 1 if i <= ihalf else -1
```

It's fine to use a more long-winded way in the outer loop, though. In real life, the most important line of the inner loop is the
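A tiny illustration of why plain tuple comparison is enough here (the candidate numbers below are made up):

```python
# (ratio, distsum) candidates; distsum only matters when ratios tie exactly.
candidates = [
    (0.8, -3),   # good ratio, far from the midpoints
    (0.8,  0),   # same ratio, closest to the midpoints
    (0.7,  5),   # worse ratio, its distsum is irrelevant
]
print(max(candidates))   # (0.8, 0)
```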
I think I'll back off on "strength reduction". The full-blown
I updated the branch with rolling weights. I think I like it the most; the only subtle point is that the weight is now an integer number instead of a computed float.
I don't even know of a case where "normalization" is helpful 😉. Do you? Suppose we have two sequences of wildly different lengths (so normalization could make the most difference) but all element pairs have the same ratio.
Why not in the code? Checking for 0 is cheap. Note a subtlety with "rolling weights": in the innermost loop, the adjustment has to be done very near the top of the loop. Else there's a conditional (very rarely executed) `continue` that would skip the adjustment.
FYI, this is the cheapest way I've thought of to get a non-rolling weight:

```python
# Function-level invariants.
midi = (alo + ahi - 1) / 2
midj = (blo + bhi - 1) / 2

# Outer loop.
b_weight = -abs(j - midj)
...

# Inner loop.
weight = b_weight - abs(i - midi)
```

It's a float (but with fractional part 0 or 0.5). If there are an odd number of possible indices, it's exactly 0 at the middle one, and gets smaller (more negative) at the same rate for indices moving away from the midpoint. If there are an even number of indices, the two center ones get the largest weight, -0.5. I don't think signs, or possible fractions of 0.5, matter. As in your original post, "convex" is what matters: max out at the central indices, and get smaller the farther from the center.
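A quick look at those weights over made-up small ranges, just to visualize the shape:

```python
alo, ahi = 0, 5                      # odd number of indices
midi = (alo + ahi - 1) / 2           # 2.0
print([-abs(i - midi) for i in range(alo, ahi)])
# [-2.0, -1.0, 0.0, -1.0, -2.0]  -> largest (0.0) at the center, falling off evenly

alo, ahi = 0, 6                      # even number of indices
midi = (alo + ahi - 1) / 2           # 2.5
print([-abs(i - midi) for i in range(alo, ahi)])
# [-2.5, -1.5, -0.5, -0.5, -1.5, -2.5]  -> the two center indices share the max, -0.5
```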
I think reasoning rules out that normalization can do any good. Consider what happens without it, and assuming weights are "convex". The only interesting cases are where the weights make a difference, at identical raw ratios. Convexity is scale-invariant, so normalization doesn't hurt this: we can multiply the weights by any positive constant without changing which pair wins. So don't bother with multiplying at all 😄.
I agree: there is no normalization needed in this case. But suppose we have just two competitors with indices
This one also caught my attention! But this is not always the case, as I hopefully explained above. And this may not be the end of the story. After some thinking I came up with this statement: deep enough into the recursion tree we are guaranteed to have comparable (by size) sub-sequences (provided a suitable choice of the weight function).
And it is particularly difficult to find an example where the weight function
On a related note, there might be some room for further improvement in the above. I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth. Technically, we should find the longest strictly growing sequence of best-scoring `(i, j)` pairs. Even more, we could avoid recursive calls completely by computing all pairs above the cutoff. Yet, I believe this PR is good enough.
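For what it's worth, a "longest strictly growing sequence of pairs" can be found with a standard longest-increasing-subsequence pass; a small sketch on made-up pairs (not code from the thread or any PR):

```python
from bisect import bisect_left

# Candidate (i, j) pairs, already sorted by i (strictly increasing).
pairs = [(0, 3), (1, 1), (2, 2), (3, 4), (4, 0)]

tails = []          # tails[k] = smallest j that can end a chain of length k + 1
for _, j in pairs:
    k = bisect_left(tails, j)
    if k == len(tails):
        tails.append(j)
    else:
        tails[k] = j

# Length only; recovering the actual chain needs a bit more bookkeeping.
print(len(tails))   # 3 - e.g. (1, 1), (2, 2), (3, 4) grows strictly in both indices
```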
You're right again 😄. I over-generalized from your specific test case, assuming the full Cartesian product of the index ranges would be in play in bad cases. Well, maybe they are in worst current cases. The sparser the subset of the full product in play, the harder it looks to contrive bad cases. I'll settle for a massive improvement in the worst cases.
Absolutely so. The recursive calls are doing purely redundant work, finding the same max ratio at (a subset of) the same line pairs over & over & over again. I knew that at the time, but an obvious quick way to exploit it didn't come to mind. As is, short-circuiting so that only a brand new max could make it through the gauntlet of "if" tests paid off a lot. Relaxing that to also find pairs that equal the current max was unattractive - and "bad cases" never showed up in real life. But, at the time, the library was overwhelmingly used only to compare files of Python and C code, and human-written prose. If you'd like to pursue it, have at it!
No question that it's a huge improvement as is.
Something like this (wholly untested) might do the trick, preserving short-circuiting as much as possible without missing an "equals the best" case.

```python
from operator import gt, ge

max_pairs = []  # index pairs achieving best_ratio
maxi = -1       # `i` index of last pair in max_pairs
for j in range_b:
    searching = True
    for i in range_a:
        cmp = ge if searching and i > maxi else gt
        if cmp(real_quick_ratio, best_ratio) and ...:
            searching = False
            if ratio > best_ratio:
                max_pairs.clear()
                best_ratio = ratio
            max_pairs.append((i, j))
            maxi = i
# The indices in max_pairs all have the same, maximal ratio,
# and are strictly ascending in both positions.
```

It's still a "greedy" approach. In your original test case, I expect
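Here's a self-contained toy version of that bookkeeping with a stand-in scoring function, just to show the `ge`/`gt` switch keeping the kept pairs strictly ascending (everything below is illustrative, not difflib code):

```python
from operator import gt, ge

def score(x, y):                      # stand-in for SequenceMatcher.ratio()
    return 1.0 if x == y else 0.0

a = ["apple", "banana", "cherry", "banana"]
b = ["banana", "cherry"]

best_ratio = -1.0
max_pairs = []                        # pairs achieving best_ratio
maxi = -1                             # `i` of the last pair appended
for j, bj in enumerate(b):
    searching = True
    for i, ai in enumerate(a):
        cmp = ge if searching and i > maxi else gt
        s = score(ai, bj)
        if cmp(s, best_ratio):
            searching = False
            if s > best_ratio:
                max_pairs.clear()
                best_ratio = s
            max_pairs.append((i, j))
            maxi = i

print(best_ratio, max_pairs)   # 1.0 [(1, 0), (2, 1)] - ties kept, strictly ascending
```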
Code from https://github.com/pulkin, in PR #119131 Greatly speeds `Differ` when there are many identically scoring pairs, by splitting the recursion near the inputs' midpoints instead of degenerating (as now) into just peeling off the first two lines. Co-authored-by: Tim Peters <tim.peters@gmail.com>
I'm not sanguine about that. The "synch pair" establishes a very high wall, with no cross-talk between the "before" and "after" segments of the inputs. This is mostly driven by the linear nature of diff output, restricted to "matches", "insert", "delete", and "replace", from first to last lines of both inputs. For example, suppose we have two thousand-line inputs. The first line of
Any ratio less than the unique very best is irrelevant to that outcome. If the synch pair occurred more toward the middle of the inputs, the recursive calls could do a lot of redundant work, but again ratios between the "before" elements of one input and the "after" elements of the other are useless after the synch pair is picked. There's also that difflib intends never to create a data structure whose size is proportional to the product of the input lengths.

(*) Of course that's arguable. Could be that it's the "very close match" that should be discounted, in favor of producing markup for the 999 "pretty close" matches. But that would be a Big Change, and this isn't Hard Science™.
All true. I think it won't be irrelevant to study other diff libs. Just to mention, the code merged does change some diff outputs. For example, this one will find only one match, instead of two (before).
vs
If we want to restore it back (and to keep the recursion under control) we can take the approach above for tied ratios. I, too, have concerns about considering all pairs above the cutoff.
I'm not determined to keep all results the same, but your example bothers me enough that I'm going to re-open this issue.

The fundamental goal of this library was to create diffs that, to the extent possible, "make sense" to human eyes. Because, at the time, all Unix-y diff programs too often produced diffs for Python's C code that were more baffling than helpful (e.g., synching up on garbage like blank lines and lone curly braces far apart in the inputs). Locality seems to matter most to people, so the core approach builds on finding long stretches of consecutive matching lines.

Another thing that matters to people is "top to bottom". It just makes "no sense" to my eyes that, given two blocks of closely matching lines, the program would match the first line of one input's block with the last line of the other's. "Well, that's because they're both closest to the center of their index ranges" sounds like a geek making up technobabble excuses for absurd output 😉. I believe the pseudocode I gave would restore "first to first, second to second, ..." results, so I'm keen now to pursue it.

About other diff libraries, over the years I haven't found that studying their details was helpful. At a high level, if I had it to do over again I'd start with git's "histogram" diff. Another thing that catches eyes is rare lines, and "histogram" addresses that directly. I think difflib's "junk" heuristics were a mostly-failed attempt toward the same end, by ignoring "too common" elements instead of identifying the rare ones. People generally don't know to use it, and even I rarely made the effort required. But "histogram" isn't a full diff algorithm either, just an early pass for identifying high-value "synch points". Producing diffs between synch points has to be done by some other algorithm. And for some tasks, like comparing genome sequences, "histogram" is useless (it won't find any rare elements). Then again, difflib is mostly useless for that too (and completely useless unless "autojunk" is disabled - in which case its generally quadratic runtime is too slow for that task).
I opened a PR implementing the "track all optimal pairs" idea.
I did some automated testing on various orderings of unique and similar groups of strings and your code looks equivalent to the packaged one output-wise. I have a small request (rather related to the original implementation):

```python
best_ratio = 1
for j in ...:
    for i in ...:
        if ai == bj:
            if best_ratio != 1:
                continue  # we already have a close match; ignore the exact one
            ratio = 1
        else:
            set_seq(...)
            _cutoff = cutoff if best_ratio == 1 else best_ratio
            if crqr() < _cutoff or rqr() < _cutoff or (ratio := qr()) < _cutoff:
                continue  # ratio is smaller
        if ratio != best_ratio:
            ...  # start max_pairs over
        else:
            ...  # do i, j checks
        max_pairs.append((i, j))
```

There is an obvious subtlety with
Thanks! That was the intent, and I think it's pretty clear it "should" act the same.
I'm also keen to keep making a distinction between ">" and ">=". Because, using the lines from your example:

```python
>>> from difflib import SequenceMatcher as S
>>> m = S(None, "0123456789\n", "01234a56789\n")
>>> m.ratio()
0.9565217391304348
>>> m.quick_ratio()
0.9565217391304348
>>> m.real_quick_ratio()
0.9565217391304348
```

That is, the "quick upper bounds" are in fact good to the last bit.
I do. Those were basically pulled out of a hat. Interline markup can get more annoying than helpful unless the lines are "quite similar", and the cutoffs were adjusted by eyeball to quantify "quite". They're neither touchy nor critical. But, for backward compatibility, they can never be changed 😉. Since what's in the PR now has been tested and seems to work fine, I'll merge it later tonight. If you or I want to pursue other ideas beyond it, that's fine - just make another PR.
I should add that's a fuzzy "average case" claim. In your test case, a million pairs are compared, but
…s] (#119376) Track all pairs achieving the best ratio in Differ(). This repairs the "very deep recursion and cubic time" bad cases in a way that preserves previous output.
Here's a different case that's still cubic-time, and blows the recursion stack. The first line pair has a unique best ratio, and the same is true at each recursion level:

```python
import difflib

a = []
b = []
N = 1000
for i in range(N):
    s = "0" * (N - i)
    s2 = s + "x"
    a.append(s + "\n")
    b.append(s2 + "\n")

list(difflib.Differ().compare(a, b))
```

I don't have a slick idea to get around that while preserving current results (well, for this specific input, sure - I mean "in general"). Half-baked: for each

Wouldn't always eliminate non-trivial recursive calls. In this test case, and in the original, it would. How it may affect current results is clear as mud to me 😄.

EDIT: why not build the list with strictly increasing
Yes, I was aware of this one. It is special in the sense of arbitrarily long strings: if you limit the string length then asymptotically it is still
Over my head 😉. I don't know what "ONP" means (*), or "keeping string comparison intact". In general, I don't know of a worst-case sub-quadratic algorithm for "longest common subsequence" problems. Suffix trees can, in theory, solve "longest contiguous common subsequence" problems in linear time, but are exceedingly delicate with high constant factors.

That said, it's easy to believe difflib isn't suitable for your application. It was aimed at human-friendly output, pretty much regardless of cost. "Shortest edit script" sucks at the "human-friendly" part. BTW, you can avoid

(*) As in "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, & Myers?
And it seems possible that's what you really wanted all along. If you're thinking of implementing a Myers diff algorithm, then you don't care about intraline markup. That can make a huge difference. Here's your original test, but with a million lines in each input:

```python
>>> import difflib
>>> a = ["0123456789\n"] * 1_000_000
>>> b = ["01234a56789\n"] * 1_000_000
>>> s = difflib.SequenceMatcher(None, a, b)  # appears instantaneous
>>> list(s.get_matching_blocks())  # also very fast
[Match(a=1000000, b=1000000, size=0)]  # see the docs - this always has a size-0 dummy at the end
>>> for x in s.get_opcodes():  # and here's your "edit script"
...     print(x)
('replace', 0, 1000000, 0, 1000000)
>>>
```

Any kind of "bad case" you make up for

Under the covers,

One final "speed hint": PyPy generally runs difflib functions much faster than CPython does. In your original test, using difflib's original code, PyPy finishes in about 5 seconds on my box, without overflowing the stack. I don't know all the reasons for that (in part, function inlining cuts recursion depth).
FYI, in the latest PR,
Yes: I was thinking about using that, but the straightforward implementation that you showed is not something very useful in my case: I do care about close but not exact matches. Roughly speaking, I would like to simplify it to this:

```python
import difflib
from dataclasses import dataclass
from typing import Any

@dataclass
class CloseIsEqualWrapper:
    token: Any

    def __eq__(self, other):
        assert isinstance(other, CloseIsEqualWrapper)
        return difflib.SequenceMatcher(a=self.token, b=other.token).ratio() >= .75

a = list(map(CloseIsEqualWrapper, a))
b = list(map(CloseIsEqualWrapper, b))
print(list(difflib.SequenceMatcher(None, a, b).get_matching_blocks()))
```

Then, if
Alas, I doubt that can be made to work. Dicts are critical to how matching works, and defining `__eq__` that way doesn't play well with hashing. There's also that "close to" doesn't define an equivalence relation, because it's not transitive. That A is close to B, and B is close to C, doesn't imply that A is close to C. Any two strings of sufficient length can be transformed to the other via such a chain. And there's also - as the docs note - that the elements must be hashable. I expect you'd do better by not trying to trick `SequenceMatcher`.

Of course if you're coding a Myers-style algorithm from scratch, none of the above matters. Then I'd recommend the classic Levenshtein distance (which is symmetric) as a measure of similarity (or a wrapper around
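To make the non-transitivity point concrete, a tiny demonstration (the strings and the 0.75 cutoff here are chosen just for illustration):

```python
import difflib

def close(x, y, cutoff=0.75):
    return difflib.SequenceMatcher(a=x, b=y).ratio() >= cutoff

a, b, c = "abcdefgh", "abcdefxy", "zzcdefxy"
print(close(a, b))   # True
print(close(b, c))   # True
print(close(a, c))   # False - "close to" is not transitive
```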
``_fancy_replace()`` is no longer recursive, and a single call does a worst-case linear number of ratio() computations instead of quadratic. This renders toothless a universe of pathological cases. Some inputs may produce different output, but that's rare, and I didn't find a case where the final diff appeared to be of materially worse quality. To the contrary, by refusing to even consider synching on lines "far apart", there was more easy-to-digest locality in the output.
Bug report
Bug description:
The case is pathological in the sense of many lines with the same exact diff / ratio.
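A reproduction consistent with the test case discussed in the comments above (1,000 identical lines per input; the exact original snippet may have differed slightly) looks like:

```python
import difflib

a = ["0123456789\n"] * 1_000
b = ["01234a56789\n"] * 1_000

# Before the fix: very deep recursion in _fancy_replace and roughly cubic work.
result = list(difflib.Differ().compare(a, b))
```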
The issue is that in the current implementation `_fancy_replace` will take the first pair of lines (with the same ratio) as a split point and will call itself recursively for all lines starting at 2, then 3, 4, etc. This repeats `1_000` times, resulting in a massive recursion depth and `O(N^3)` complexity scaling. For an average random case it should split anywhere in the range of `1_000`, with a complexity scaling of `O(N^2 log N)`.

I personally encountered this in diffing csv files where one of the files has a column added, which, apparently, results in all-same ratios for every line in the file.
Proposal
Fixing this is not so hard by adding some heuristics (WIP) pulkin@31e1ed0
The idea is very straightforward: while doing the `_fancy_replace` magic, if you see many diffs with the same exact ratio, pick the one closest to the middle of the chunks rather than the first one (which can be the worst possible choice). This is done by promoting the ratios of those pairs of lines that are closer to the middle of the chunks.

The `_drag_to_center` function turns a line number into the weight added to the ratio (twice: for a and for b). The weight is zero at both ends of a chunk and maximal in the middle (a quadratic poly was chosen for simplicity). The magnitude of the weight, "`_gravity`", is small enough to only affect ratios that are exactly the same: it relies on the assumption that we probably have fewer than 500k symbols in a line, such that the steps in the `ratio` are greater than 1e-6. If this assumption fails, some diffs may become different (not necessarily worse).

Performance impact for non-pathological cases is probably minimal.
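For illustration, a weight with the properties described (zero at the chunk ends, maximal in the middle, quadratic, scaled by a tiny `_gravity`) could look roughly like this; only the names `_drag_to_center` and `_gravity` come from the proposal, and the value and body below are guesses:

```python
_gravity = 1e-6

def _drag_to_center(index, lo, hi):
    """Tiny bonus: 0 at both ends of range(lo, hi), peaking at _gravity in the middle."""
    span = hi - lo - 1
    if span <= 0:
        return 0.0
    x = (index - lo) / span            # 0.0 at lo, 1.0 at hi - 1
    return 4 * x * (1 - x) * _gravity  # quadratic, so ties nearest the center win
```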
CPython versions tested on:
3.9, 3.12
Operating systems tested on:
Linux
Linked PRs