Description
Bug report
Bug description:
import difflib
a = ["0123456789\n"] * 1_000
b = ["01234a56789\n"] * 1_000
list(difflib.Differ().compare(a, b)) # very slow and will probably hit max recursion depth
The case is pathological in the sense that it contains many lines with exactly the same diff / ratio.
The issue is that, in the current implementation, _fancy_replace takes the first pair of lines (among those with the same ratio) as the split point and calls itself recursively for all remaining lines, starting at line 2, then 3, 4, and so on. This repeats 1_000 times, resulting in a massive recursion depth and O(N^3) complexity scaling.
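For illustration, the split-point selection can be paraphrased roughly like this (a simplified sketch, not the actual CPython source; the real method also has quick-ratio pre-checks, junk handling and a cutoff around 0.75):

import difflib

# Simplified paraphrase of the split-point search in difflib.Differ._fancy_replace.
# The comparison is a strict ">", so among many pairs with the *same* ratio the
# very first pair (alo, blo) wins, and the recursion peels off one line per level.
def split_point(a, alo, ahi, b, blo, bhi, cutoff=0.75):
    cruncher = difflib.SequenceMatcher(None)
    best_ratio, best_i, best_j = cutoff - 0.01, None, None
    for j in range(blo, bhi):
        cruncher.set_seq2(b[j])
        for i in range(alo, ahi):
            cruncher.set_seq1(a[i])
            if cruncher.ratio() > best_ratio:   # ties keep the earlier pair
                best_ratio, best_i, best_j = cruncher.ratio(), i, j
    return best_i, best_j

a = ["0123456789\n"] * 5
b = ["01234a56789\n"] * 5
print(split_point(a, 0, len(a), b, 0, len(b)))  # (0, 0): the worst possible split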
For an average random case the split point would land anywhere in the 1_000-line range, giving a complexity scaling of O(N^2 log N).
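The scaling is easy to see by timing smaller inputs (numbers below are machine-dependent and kept small to stay clear of the recursion limit):

import difflib
import time

# Doubling the number of identical-diff lines multiplies the runtime by
# roughly 8x, consistent with O(N^3) scaling.
for n in (50, 100, 200):
    a = ["0123456789\n"] * n
    b = ["01234a56789\n"] * n
    start = time.perf_counter()
    list(difflib.Differ().compare(a, b))
    print(f"n={n}: {time.perf_counter() - start:.3f}s")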
I personally encountered this while diffing CSV files where one of the files had a column added, which, apparently, results in identical ratios for every line in the file.
Proposal
Fixing this is not that hard by adding a heuristic (WIP): pulkin@31e1ed0
The idea is very straightforward: while doing the _fancy_replace magic, if many diffs share the same exact ratio, pick the one closest to the middle of the chunks rather than the first one (which can be the worst possible choice). This is done by promoting the ratios of those pairs of lines that are closer to the middle of the chunks.
The _drag_to_center function turns a line number into a weight that is added to the ratio (twice: once for a and once for b). The weight is zero at both ends of a chunk and maximal in the middle (a quadratic polynomial was chosen for simplicity). The magnitude of the weight, "_gravity", is small enough to only affect ratios that are exactly equal: it relies on the assumption that a line has fewer than roughly 500k symbols, so that the steps in the ratio are greater than 1e-6. If this assumption fails, some diffs may come out different (not necessarily worse).
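For reference, here is a minimal sketch of what such a weight function could look like, reconstructed from the description above rather than copied from the WIP commit (the names come from the proposal, but the constant and the exact polynomial are assumptions):

_gravity = 1e-6  # assumed magnitude; kept below the smallest possible ratio step

def _drag_to_center(index, lo, hi):
    """Tiny bonus for positions near the middle of the chunk [lo, hi)."""
    span = hi - lo - 1
    if span <= 0:
        return 0.0
    x = (index - lo) / span                # normalized position in [0, 1]
    return _gravity * 4.0 * x * (1.0 - x)  # zero at the ends, _gravity at x = 0.5

# Inside _fancy_replace the bonus would be added twice, once per side:
#   weighted_ratio = cruncher.ratio() \
#       + _drag_to_center(i, alo, ahi) + _drag_to_center(j, blo, bhi)
# so among pairs with exactly the same ratio, the pair nearest the middle of
# both chunks becomes the split point.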
The performance impact on non-pathological cases should be minimal.
CPython versions tested on:
3.9, 3.12
Operating systems tested on:
Linux