Skip to content

difflib.py Differ.compare is too slow [for degenerate cases] #119105

Closed
@pulkin

Description

@pulkin

Bug report

Bug description:

import difflib

a = ["0123456789\n"] * 1_000
b = ["01234a56789\n"] * 1_000
list(difflib.Differ().compare(a, b))  # very slow and will probably hit max recursion depth

The case is pathological in the sense of many lines with the same exact diff / ratio.

The issue is that in the current implementation _fancy_replace will take the first pair of lines (with the same ratio) as a split point and will call itself recursively for all lines starting at 2, then 3, 4, etc. This repeats 1_000 times resulting in a massive recursion depth and O(N^3) complexity scaling.

For an average random case it should split anywhere in the range of 1_000 with a complexity scaling of O(N^2 log N).

I personally encountered this in diffing csv files where one of the files has a column added which, apparently, results in all-same ratios for every line in the file.

Proposal

Fixing this is not so hard by adding some heuristics (WIP) pulkin@31e1ed0

The idea is very straightforward: while doing the _fancy_replace magic, if you see many diffs with the same exact ratio, pick the one closest to the middle of chunks rather than the first one (which can be the worst possible choice). This is done by promoting the ratios of those pairs of lines that are closer to the middle of the chunks.

The _drag_to_center function turns line number into the weight added to the ratio (twice: for a and for b). The weight is zero for both ends of chunks and maximal in the middle (quadratic poly was chosen for simplicity). The magnitude of the weight "_gravity" is small enough to only affect ratios that are exactly the same: it relies on the assumption that we probably have less than 500k symbols in the line such that the steps in the ratio are greater than 1e-6. If this assumption fails some diffs may become different (not necessarily worse).

Performance impact for non-pathological cases is probably minimal.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagetype-featureA feature request or enhancement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions