Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Suggest similar table/column/indexes names on missing errors #685

Merged
merged 4 commits into from
Apr 23, 2019

Conversation

juanjux
Copy link
Contributor

@juanjux juanjux commented Apr 19, 2019

Fixes #684

Signed-off-by: Juanjo Alvarez juanjo@sourced.tech

@juanjux juanjux self-assigned this Apr 19, 2019
Signed-off-by: Juanjo Alvarez <juanjo@sourced.tech>
@kuba--
Copy link
Contributor

kuba-- commented Apr 19, 2019

First of all I thought we won't pull any new dependencies. Distance function is just a one simple function. For instance here is sift4 (less general, but should be faster than Levenstein):

// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1||!s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2||!s2.length) {
        return s1.length;
    }

    var l1=s1.length;
    var l2=s2.length;

    var c1 = 0;  //cursor for string 1
    var c2 = 0;  //cursor for string 2
    var lcss = 0;  //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0;  //number of transpositions ('ab' vs 'ba')
    var offset_arr=[];  //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans=false;
            //see if current match is a transposition
            var i=0;
            while (i<offset_arr.length) {
                var ofs=offset_arr[i];
                if (c1<=ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans=Math.abs(c2-c1)>=Math.abs(ofs.c2-ofs.c1);
                    if (isTrans)
                    {
                        trans++;
                    } else
                    {
                        if (!ofs.trans) {
                            ofs.trans=true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1>ofs.c2 && c2>ofs.c1) {
                        offset_arr.splice(i,1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1:c1,
                c2:c2,
                trans:isTrans
            });
        } else {
            lcss+=local_cs;
            local_cs=0;
            if (c1!=c2) {
                c1=c2=Math.min(c1,c2);  //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches 
            for (var i = 0; i < maxOffset && (c1+i<l1 || c2+i<l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1+= i-1; 
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2+= i-1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (maxDistance)
        {
            var temporaryDistance=Math.max(c1,c2)-lcss+trans;
            if (temporaryDistance>=maxDistance) return Math.round(temporaryDistance);
        }
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss+=local_cs;
            local_cs=0;
            c1=c2=Math.min(c1,c2);
        }
    }
    lcss+=local_cs;
    return Math.round(Math.max(l1,l2)- lcss +trans); //add the cost of transpositions to the final result
}

The second thought I had is maxDistance parameter. In other words if the distance is > 3 it doesn't make sense to suggest anything because most likely it's a garbage.
I would suggest just a minimum < 3 and if you have more values with the same distance suggest all of them. For instance someone thought about commits or ref_commits

Signed-off-by: Juanjo Alvarez <juanjo@sourced.tech>
@juanjux
Copy link
Contributor Author

juanjux commented Apr 19, 2019

@kuba-- you're right, as we spoke on Slack, this could be a pretty shallow dependency, used a simplified and working on strings version of the Levenshtein function instead.

Signed-off-by: Juanjo Alvarez <juanjo@sourced.tech>
@juanjux
Copy link
Contributor Author

juanjux commented Apr 19, 2019

@kuba-- also implemented your other suggestion of returning 0 to many results depending on the distance.

@juanjux
Copy link
Contributor Author

juanjux commented Apr 19, 2019

Looks like the Mono test in the CI matrix is failing for some missing package in some repository.

@erizocosmico
Copy link
Contributor

Ah, yes, the C# thing. Ignore it and don't worry.

ARCHITECTURE.md Show resolved Hide resolved
internal/text_distance/text_distance.go Outdated Show resolved Hide resolved
Signed-off-by: Juanjo Alvarez <juanjo@sourced.tech>
@ajnavarro ajnavarro merged commit e98fa12 into src-d:master Apr 23, 2019
@juanjux juanjux deleted the suggest-names branch April 23, 2019 09:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Suggest similar table/column names when one is not found in a query
4 participants