Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882

anadon · 2023-07-26T01:39:52Z

Current problem

Currently, the following is not caught as duplicate code. This is sanitized from a production codebase I work on.

if cond:
    stmt_1
    ...
    stmt_n
    return
stmt_1
...
stmt_n

Desired solution

I would expect R0801 to be flagged for these two code segments.

Additional context

Distinct but possibly related to #1457 . This seems to be an issue with the fact that for comparing code fragments, it does so using the absolute indentation level, and not converting each code segment to use a relative indentation level. Now, in order to make this change it does cause a considerable increase in algorithmic complexity and would require the current hash based approach to actually use a number of different hashes per line, one per indentation level. However, I have found it much easier to argue for better coding when a tool says that code should be changed than when I say it.

The text was updated successfully, but these errors were encountered:

Pierre-Sassoulas · 2023-07-26T04:50:09Z

Using a hash without the leading indentation spaces could lead to false positives, multiplying the number of check on a line by the number of indentation level would be disastrous performance wise especially on large code base where the base algo already compare every line to every other line which is inherentely costly and scale poorly.

I added the design proposal needed label because I think the enhancement makes sense but details matter and jumping to implementation on this one will go especially badly. Imo a state of the art of what other duplicate finders do is required, probably a scientific litterature review too.

anadon · 2023-07-26T13:15:48Z

I'm not saying ignore indentation. That would produce garbage. What I am saying is to use relative indentation.

…

On Wed, Jul 26, 2023, 00:50 Pierre Sassoulas ***@***.***> wrote: Using a hash without the leading indentation spaces could lead to false positives, multiplying the number of check on a line by the number of indentation level would be *disastrous* performance wise especially on large code base where the base algo already compare every line to every other line which is inherentely costly and scale poorly. I added the design proposal needed label because I think the enhancement makes sense but details matter and jumping to implementation on this one will go especially badly. Imo a state of the art of what other duplicate finders do is required, probably a scientific litterature review too. — Reply to this email directly, view it on GitHub <#8882 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXYEBPT5SOY5GT34CCTP2TXSCOYXANCNFSM6AAAAAA2X3T3I4> . You are receiving this because you authored the thread.Message ID: ***@***.***>

anadon · 2023-07-26T13:17:11Z

If the algorithm were to be changed from using a hash based approach to a suffix array or suffix tree, that could be workable. On Wed, Jul 26, 2023, 09:14 Josh Marshall ***@***.***> wrote:

…

I'm not saying ignore indentation. That would produce garbage. What I am saying is to use relative indentation. On Wed, Jul 26, 2023, 00:50 Pierre Sassoulas ***@***.***> wrote: > Using a hash without the leading indentation spaces could lead to false > positives, multiplying the number of check on a line by the number of > indentation level would be *disastrous* performance wise especially on > large code base where the base algo already compare every line to every > other line which is inherentely costly and scale poorly. > > I added the design proposal needed label because I think the enhancement > makes sense but details matter and jumping to implementation on this one > will go especially badly. Imo a state of the art of what other duplicate > finders do is required, probably a scientific litterature review too. > > — > Reply to this email directly, view it on GitHub > <#8882 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAXYEBPT5SOY5GT34CCTP2TXSCOYXANCNFSM6AAAAAA2X3T3I4> > . > You are receiving this because you authored the thread.Message ID: > ***@***.***> >

Pierre-Sassoulas · 2023-07-26T13:38:36Z

Could you give an example of what you mean by relative indentation @anadon ? (Also note that we have option to ignore some kind of things like docstrings, function signatures, imports, etc.)

anadon · 2023-07-26T13:59:22Z

With regards to my initial comment,

0: if cond:
1:     stmt_1
2:     ...
3:     stmt_n
4:     return
5: stmt_1
6: ...
7: stmt_n

0 -> if cond:
1 -> stmt_1, stmt_1
2 -> ..., ...
3 -> stmt_n, stmt_n
5 -> stmt_1
6 -> ...
7 -> stmt_n

lines 1 to 3 at depth 1 match lines 5 to 7 at depth 0. If these were to be added to a suffix tree or suffix array in a tagged manner, all matches could be detected in O(n) time or O(nlog(n)) time and O(n) space. It just so happens that O(nlog(n)) time implementations tend to be faster for reasons I've not been able to study to sufficient depth. I quick search reveals that the project landscape has changed considerable, so library selection would require some research and testing.

anadon · 2023-07-26T14:34:18Z

After work I'll give a bigger and more clear example of input and how it could be treated.

anadon · 2023-07-26T21:17:45Z

VAR = None

def func_1(*args):
    if not args: 
        stmt_1
        ...
        stmt_n
        return
    else:
        ...
        if args[0]:
            ...
            stmt_1
            ...
            stmt_n
    ...
    stmt_1
    ...
    stmt_n
    ...
    return

stmt_1
...
stmt_n

Can be taken as

VAR = None

def func_1(*args):
    if not args: 
        stmt_1
        ...
        stmt_n
        return
    else:
        ...
        if args[0]:
            ...
            stmt_1
            ...
            stmt_n
    ...
    stmt_1
    ...
    stmt_n
    ...
    return

stmt_1
...
stmt_n

if not args: 
    stmt_1
    ...
    stmt_n
    return
else:
    ...
    if args[0]:
        ...
        stmt_1
        ...
        stmt_n
...
stmt_1
...
stmt_n
...
return

stmt_1
...
stmt_n
return

...
if args[0]:
    ...
    stmt_1
    ...
    stmt_n

...
stmt_1
...
stmt_n

As can be seen by example, each block able to share the same level of contiguous indentation is represented as that continuous block in the same way the top level file is. These kind of virtual/mapped files can then be used directly to detect repeated code blocks independent of absolute level of indentation while preserving matching blocks with respect to their relative indentation. This does increase resource usage, but I think it is worth it given how much more effective duplicate code detection would be.

Pierre-Sassoulas · 2023-07-30T08:49:40Z

If these were to be added to a suffix tree or suffix array in a tagged manner, all matches could be detected in O(n) time or O(nlog(n)) time and O(n) space

Without looking too deep into the the practical aspect (and in particular the python specific treatment we do) this sounds acceptable, however this also sounds like a complete overhaul that is going to be a big time commitment (and perf benchmark are lacking for the duplicate-code check so it's also going to involves a lot of testing).

anadon added the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Jul 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882

Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882

anadon commented Jul 26, 2023

Pierre-Sassoulas commented Jul 26, 2023

anadon commented Jul 26, 2023 via email

anadon commented Jul 26, 2023 via email

Pierre-Sassoulas commented Jul 26, 2023

anadon commented Jul 26, 2023

anadon commented Jul 26, 2023

anadon commented Jul 26, 2023 •

edited

Loading

Pierre-Sassoulas commented Jul 30, 2023 •

edited

Loading

Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882

Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882

Comments

anadon commented Jul 26, 2023

Current problem

Desired solution

Additional context

Pierre-Sassoulas commented Jul 26, 2023

anadon commented Jul 26, 2023 via email

anadon commented Jul 26, 2023 via email

Pierre-Sassoulas commented Jul 26, 2023

anadon commented Jul 26, 2023

anadon commented Jul 26, 2023

anadon commented Jul 26, 2023 • edited Loading

Pierre-Sassoulas commented Jul 30, 2023 • edited Loading

anadon commented Jul 26, 2023 •

edited

Loading

Pierre-Sassoulas commented Jul 30, 2023 •

edited

Loading