Improve similarities checker by considering relative indentation of code blocks rather than absolute indentation #8882
Comments
Using a hash without the leading indentation spaces could lead to false positives, and multiplying the number of checks on a line by the number of indentation levels would be disastrous performance-wise, especially on large code bases, where the base algorithm already compares every line to every other line, which is inherently costly and scales poorly. I added the "design proposal needed" label because I think the enhancement makes sense, but details matter, and jumping to implementation on this one will go especially badly. IMO a state-of-the-art survey of what other duplicate finders do is required, and probably a scientific literature review too.
I'm not saying ignore indentation. That would produce garbage. What I am
saying is to use relative indentation.
(Sent by email in reply to Pierre Sassoulas's comment of Wed, Jul 26, 2023, 00:50, quoted in full above.)
If the algorithm were to be changed from using a hash based approach to a
suffix array or suffix tree, that could be workable.
(Sent by email in reply to Josh Marshall's message of Wed, Jul 26, 2023, 09:14, quoted above.)
Could you give an example of what you mean by relative indentation, @anadon? (Also note that we have options to ignore certain kinds of things, such as docstrings, function signatures, and imports.)
With regard to my initial comment,
0 -> lines 1 to 3 at depth 1 match lines 5 to 7 at depth 0. If these were added to a suffix tree or suffix array in a tagged manner, all matches could be detected in O(n) or O(n log(n)) time and O(n) space. It just so happens that O(n log(n)) time implementations tend to be faster in practice, for reasons I've not been able to study in sufficient depth. A quick search reveals that the project landscape has changed considerably, so library selection would require some research and testing.
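To make the suffix-array idea a little more concrete, here is a minimal sketch in Python. It is not pylint's code and not a proposed implementation: the function name `longest_repeated_run` is hypothetical, the input is assumed to be a list of already-normalized lines, and the suffix array is built naively by sorting suffixes, which is far slower than the O(n) or O(n log(n)) constructions mentioned above. It only illustrates how adjacent suffixes in sorted order expose repeated runs of lines.

```python
def longest_repeated_run(tokens):
    """Return (length, (first_start, second_start)) of the longest repeated run.

    `tokens` is a list of normalized lines. The suffix array is built naively
    here; a real implementation would use a dedicated construction algorithm.
    """
    n = len(tokens)
    # Naive suffix array: indices of suffixes in lexicographic order.
    suffix_array = sorted(range(n), key=lambda i: tokens[i:])

    best_len, best_pair = 0, None
    # Suffixes that share a long prefix end up next to each other once sorted,
    # so only adjacent pairs need to be compared.
    for a, b in zip(suffix_array, suffix_array[1:]):
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > best_len:
            best_len, best_pair = k, (min(a, b), max(a, b))
    return best_len, best_pair


lines = ["a = 1", "b = 2", "c = 3", "print(a)", "a = 1", "b = 2", "c = 3"]
print(longest_repeated_run(lines))  # (3, (0, 4))
```

Note that this sketch does not exclude overlapping repeats and reports only the single longest match; a duplicate-code checker would need to handle both.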
After work I'll give a bigger and clearer example of input and how it could be treated.
Can be taken as
As the example shows, each block that shares the same level of contiguous indentation is represented as its own continuous block, in the same way the top-level file is. These kinds of virtual/mapped files can then be used directly to detect repeated code blocks independent of their absolute level of indentation, while preserving matching blocks with respect to their relative indentation. This does increase resource usage, but I think it is worth it given how much more effective duplicate code detection would be.
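One possible reading of the virtual/mapped files described here can be sketched in Python as follows. This is an assumption about the intent, not the author's implementation: the helper names `indent_of` and `virtual_blocks` are hypothetical, indentation is assumed to use spaces only, and blank lines are ignored when deciding where a block starts or ends.

```python
def indent_of(line):
    """Number of leading spaces on a line (tabs are not handled in this sketch)."""
    return len(line) - len(line.lstrip(" "))


def virtual_blocks(lines):
    """Return (start_index, dedented_lines) for the file and each nested block."""
    blocks = []
    open_blocks = []  # stack of (indent, start_index) for blocks still open

    def close(indent, start, end):
        # Dedent the block by its own indentation so it looks like a top-level file.
        body = [ln[indent:] if ln.strip() else "" for ln in lines[start:end]]
        while body and not body[-1]:
            body.pop()  # drop trailing blank lines
        blocks.append((start, body))

    for i, line in enumerate(lines):
        if not line.strip():
            continue
        depth = indent_of(line)
        # A shallower line ends every deeper block that is still open.
        while open_blocks and open_blocks[-1][0] > depth:
            indent, start = open_blocks.pop()
            close(indent, start, i)
        # A deeper line opens a new block.
        if not open_blocks or open_blocks[-1][0] < depth:
            open_blocks.append((depth, i))
    while open_blocks:
        indent, start = open_blocks.pop()
        close(indent, start, len(lines))
    return sorted(blocks)


source = [
    "def f():",
    "    a = 1",
    "    if a:",
    "        b = 2",
    "    return a",
    "x = 0",
]
for start, body in virtual_blocks(source):
    print(start, body)
```

Each returned block could then be fed to the duplicate detector as if it were a file of its own, which is where the extra resource usage mentioned above comes from: every line is seen once per enclosing block.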
Without looking too deeply into the practical aspects (in particular the Python-specific treatment we do), this sounds acceptable. However, it also sounds like a complete overhaul that is going to be a big time commitment (and perf benchmarks are lacking for the
Current problem
Currently, the following is not caught as duplicate code. This is sanitized from a production codebase I work on.
Desired solution
I would expect R0801 to be flagged for these two code segments.
Additional context
Distinct from, but possibly related to, #1457. The issue seems to be that code fragments are compared using their absolute indentation level, rather than each segment first being converted to a relative indentation level. Making this change would considerably increase algorithmic complexity and would require the current hash-based approach to use several hashes per line, one per indentation level. However, I have found it much easier to argue for better coding when a tool says that code should be changed than when I say it.
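As an illustration of the difference between absolute and relative indentation, here is a small Python sketch. It is not how pylint's similarity checker works internally; the helpers `relative_indent_lines` and `block_hash` are hypothetical, the sample blocks are made up, and indentation is assumed to use spaces only. The block is dedented by its smallest indentation before hashing, so the same code matches at any nesting depth, while code whose internal structure differs still hashes differently.

```python
import hashlib


def relative_indent_lines(lines):
    """Remove the block's common leading indentation, keeping relative indentation."""
    non_blank = [ln for ln in lines if ln.strip()]
    if not non_blank:
        return ["" for _ in lines]
    base = min(len(ln) - len(ln.lstrip(" ")) for ln in non_blank)
    return [ln[base:].rstrip() if ln.strip() else "" for ln in lines]


def block_hash(lines):
    """Hash a block after converting it to relative indentation."""
    normalized = "\n".join(relative_indent_lines(lines))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


top_level = [
    "for item in items:",
    "    total += item.price",
    "    count += 1",
]
nested = [
    "        for item in items:",
    "            total += item.price",
    "            count += 1",
]
flattened = [
    "for item in items:",
    "total += item.price",
    "count += 1",
]

print(block_hash(top_level) == block_hash(nested))     # True
print(block_hash(top_level) == block_hash(flattened))  # False
```

The second comparison shows why indentation cannot simply be ignored: stripping it entirely would make `flattened` collide with the other two blocks even though its structure is different.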