
Mitigate #6044 and don't call suggestions for small identifiers #6049

Closed
wants to merge 4 commits

Conversation

@forki (Contributor) commented Dec 21, 2018

No description provided.

@forki (Contributor, Author) commented Dec 21, 2018

What's the current trick to run the ci_part2 tests locally? I basically want to accept all the changes as a new baseline.

@AviAvni (Contributor) commented Dec 21, 2018

Build.cmd release ci_part2

@majocha (Contributor) commented Dec 21, 2018

https://github.com/Microsoft/visualfsharp/blob/fead0aac540485683f694524eadad79983ec28d9/src/fsharp/ErrorResolutionHints.fs#L55

I wonder if some memoization would make sense. Let's say the user types the same identifier more than once, as it happens. Each time it would be the same sequence of uppercaseText values and suggestedText would repeat a lot too?

@forki (Contributor, Author) commented Dec 21, 2018

@AviAvni can you please run it for me and PR against my branch? I would really appreciate that, because I can't get it running.

@abelbraaksma (Contributor)

@majocha, I think that's a good idea. In fact, memoization of, let's say the last 100 or so used identifiers (FIFO order) could be (very) beneficial, since most people will do a lot of editing in one file or function, and so will likely re-type certain identifiers multiple times.
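
A minimal sketch of such a bounded cache, assuming the expensive step is the per-identifier normalization (the type and its names are hypothetical, not existing compiler API):

```fsharp
open System.Collections.Generic

/// Memoizes an expensive per-identifier computation, evicting the
/// oldest entry (FIFO) once `capacity` distinct keys have been seen.
type BoundedMemo<'K, 'V when 'K : equality>(capacity: int, compute: 'K -> 'V) =
    let cache = Dictionary<'K, 'V>(capacity)
    let order = Queue<'K>(capacity)
    member x.Get(key: 'K) : 'V =
        match cache.TryGetValue key with
        | true, v -> v
        | false, _ ->
            if cache.Count >= capacity then
                // evict the oldest cached key to stay within capacity
                cache.Remove(order.Dequeue()) |> ignore
            let v = compute key
            cache.[key] <- v
            order.Enqueue key
            v

// Hypothetical use: cache the upper-cased form of the last 100 identifiers.
let upperMemo = BoundedMemo<string, string>(100, fun s -> s.ToUpperInvariant())
```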

@forki forki changed the title [WIP] Mitigate #6044 and don't call suggestions for small identifiers Mitigate #6044 and don't call suggestions for small identifiers Dec 21, 2018
@forki (Contributor, Author) commented Dec 21, 2018

I don't see how memoization would help here. It's always different identifiers and suggestions.

@forki (Contributor, Author) commented Dec 22, 2018

It's green

@cartermp (Contributor) commented Jan 7, 2019

I'm not really convinced this will solve the problem:

  • It won't show suggestions for mistyped identifiers where the actual fix is <4 chars, and barring a statistical analysis of multiple codebases and what suggestions get generated, there's no telling what magic number "feels right"
  • It will still incur the cost of going through the full list (and generating that list) no matter the length of the mistyped identifier
  • Tying the tooling to error-handling introduces coupling that is difficult to maintain or evolve

Fundamentally, this feature is simply not built with IDE tooling in mind. It's great for a batch compile job/command line builds, but for a long-running process like a language server it's simply calculating and re-calculating too many things. The feature needs two "modes":

  • The current one for batch compile jobs
  • An out-of-process spellchecker routine that uses a tree (such as Roslyn's which uses a BKTree) to traverse a populated set of symbols independently of the in-proc work

The latter is not something that FCS can do yet. We're going to pursue this sort of architecture in the future, but not right now.
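
For illustration, the BK-tree approach mentioned above could be sketched like this (a generic tree over Levenshtein distance; this is only a rough sketch of the data structure, not Roslyn's actual implementation):

```fsharp
open System.Collections.Generic

/// Plain dynamic-programming Levenshtein distance.
let levenshtein (a: string) (b: string) =
    let d = Array2D.create (a.Length + 1) (b.Length + 1) 0
    for i in 0 .. a.Length do d.[i, 0] <- i
    for j in 0 .. b.Length do d.[0, j] <- j
    for i in 1 .. a.Length do
        for j in 1 .. b.Length do
            let cost = if a.[i-1] = b.[j-1] then 0 else 1
            d.[i, j] <- min (min (d.[i-1, j] + 1) (d.[i, j-1] + 1)) (d.[i-1, j-1] + cost)
    d.[a.Length, b.Length]

/// BK-tree node: children are indexed by their distance to this node's word.
type BkNode = { Word: string; Children: Dictionary<int, BkNode> }

let rec insert node word =
    let dist = levenshtein node.Word word
    if dist = 0 then () else
    match node.Children.TryGetValue dist with
    | true, child -> insert child word
    | false, _ -> node.Children.[dist] <- { Word = word; Children = Dictionary() }

/// Collect all words within `tolerance` of `query`; the triangle
/// inequality lets us skip child subtrees outside [d - tol, d + tol].
let rec search node query tolerance acc =
    let d = levenshtein node.Word query
    let acc = if d <= tolerance then node.Word :: acc else acc
    (acc, node.Children)
    ||> Seq.fold (fun acc (KeyValue(childDist, child)) ->
        if abs (childDist - d) <= tolerance then search child query tolerance acc
        else acc)
```

The point of the structure is that a lookup only visits a small fraction of the stored words, so the per-keystroke cost no longer scales with the full candidate list.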

@forki (Contributor, Author) commented Jan 7, 2019 via email

@Rickasaurus (Contributor) commented Jan 8, 2019

If you're thresholding you can predict a cutoff based on the first few chars which may be helpful. If you're willing to do that you can build a lookup table ahead of time with potential matches based on the first N characters.
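
A rough sketch of that idea (all names hypothetical): bucket the candidates once by their first N characters, then only score the matching bucket. Note that an exact-prefix key misses typos that occur within the first N characters, so a real implementation would also consult neighbouring buckets.

```fsharp
open System.Collections.Generic

/// Builds buckets keyed by the (upper-cased) first `n` characters,
/// so scoring only needs to touch a small slice of the candidates.
let buildPrefixIndex (n: int) (candidates: seq<string>) =
    let index = Dictionary<string, ResizeArray<string>>()
    for c in candidates do
        let key = c.Substring(0, min n c.Length).ToUpperInvariant()
        match index.TryGetValue key with
        | true, bucket -> bucket.Add c
        | false, _ -> index.[key] <- ResizeArray [c]
    index

/// Returns only the candidates sharing the typed identifier's prefix.
let lookup (index: Dictionary<string, ResizeArray<string>>) (n: int) (typed: string) =
    let key = typed.Substring(0, min n typed.Length).ToUpperInvariant()
    match index.TryGetValue key with
    | true, bucket -> bucket :> seq<string>
    | false, _ -> Seq.empty
```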

@abelbraaksma (Contributor)

About switching it off: while it would be good to have a setting, I think it's great in the C# IDE. If you type a partial word, it gives you completion items that are more useful than only matching the start of a string. This should run on its own thread and not block other processes; if it's fast enough, good, but otherwise it shouldn't degrade typing speed and experience.

I'm wondering why it's not a good idea to keep a dictionary of uppercased identifiers; after all, the list of in-scope identifiers only changes occasionally (and updating it can itself be done asynchronously).

@Rickasaurus (Contributor) commented Jan 8, 2019

Another idea: if you can shorten the length of the strings you're comparing against, it will reduce the time by a large amount. If you knew the string lengths ahead of time, I think you could use static, reusable, fixed-length arrays.

I'd like to help more, but I don't understand what's being compared in the code well enough to contribute anything deeper.
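
For what it's worth, one standard way to exploit known lengths and reuse fixed-size arrays is the two-row formulation of Levenshtein distance, which needs only two arrays of length |b| + 1 instead of a full matrix. A sketch:

```fsharp
/// Levenshtein distance using two reusable rows of length (|b| + 1)
/// instead of an (|a|+1) x (|b|+1) matrix.
let levenshteinTwoRow (a: string) (b: string) =
    let mutable prev = Array.init (b.Length + 1) id   // DP row 0
    let mutable curr = Array.zeroCreate (b.Length + 1)
    for i in 1 .. a.Length do
        curr.[0] <- i
        for j in 1 .. b.Length do
            let cost = if a.[i-1] = b.[j-1] then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j-1] + 1)) (prev.[j-1] + cost)
        // swap the rows so `curr` can be overwritten next iteration
        let tmp = prev
        prev <- curr
        curr <- tmp
    prev.[b.Length]
```

With the candidate lengths known up front, the two arrays could be allocated once per length class and reused across comparisons.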

@forki (Contributor, Author) commented Jan 9, 2019

@abelbraaksma we only suggest identifiers that are valid in the current location. So the dictionary would need to track the position as well. And how would we invalidate it when the name resolution environment changes at that point?

@abelbraaksma (Contributor) commented Jan 9, 2019

@forki, that makes sense, though the list of local-scope identifiers is almost always small. The list of globally available identifiers (in the current scope, truly global, or scoped to the current file, module, or class) changes much less frequently, if only because most users can't type in different files or classes at the same time, and switching scope, even at a fast editing pace, often takes seconds. That time could be used to build up the locally scoped library of identifiers. Make that a kind of dictionary with a key that includes the scope, and there's a big chance of not having to convert thousands of identifiers again and again.

I don't know if it's easy or hard, but it should release some strain off the GC and the CPU. But maybe I'm too optimistic, or maybe the effort isn't worth it.
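
As a sketch of that idea (the scope key and its granularity are entirely hypothetical; the hard part forki raises, invalidating entries when the name resolution environment changes, is not solved here):

```fsharp
open System.Collections.Generic

/// Hypothetical scope key: file plus a coarse position (e.g. the
/// enclosing module or member), so edits elsewhere don't touch the entry.
type ScopeKey = { File: string; EnclosingScope: string }

/// Per-scope cache of pre-normalized (upper-cased) identifiers.
let scopeCache = Dictionary<ScopeKey, string[]>()

let uppercasedIdentifiers (key: ScopeKey) (compute: unit -> seq<string>) =
    match scopeCache.TryGetValue key with
    | true, cached -> cached
    | false, _ ->
        // convert the in-scope identifiers once and remember the result
        let arr = compute () |> Seq.map (fun s -> s.ToUpperInvariant()) |> Seq.toArray
        scopeCache.[key] <- arr
        arr
```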

@forki (Contributor, Author) commented Jan 10, 2019

I would say it's pretty hard in the current architecture ;-)

@gerardtoconnor

I was looking at this a while back and came up with a few possibilities for improving perf:

  1. The suggestions function lazily loads a large seq of collected/filtered strings into hashsets, then through more Seq ops, then tests everything into an array and sorts it, picking the best 5 or so matches. If we use a specialised priority queue (a 5-element priority insertion sort over an array), the producers of results can test matches as they are created, adding them to the priority queue (or dropping them if not in the top 5). This would change the suggestion Seqs into consumer pipelines and allow a much smaller memory footprint.

  2. With Levenshtein distance it is possible to do incremental calculations on subsequent char additions using the matrix. The radius overlap in Jaro-Winkler makes it a little more complicated, but it should be possible to generate a char lookup tree from the scoped suggestions (each char key has a list of (pos, word)), such that as a char for the current word (env) comes in (case-insensitively checked), the next array cursor slot fills with the collection of valid branches, and we can even filter the best performers (with some wiggle room) to quickly hone in on results. The one-way traversal (rather than both against each other) should be possible by keeping track of upper- and lower-bound checks at each cursor position. The key is that the state of the traversal is kept and built on for each new char coming in, or rewound on backspace, so we are not re-computing things. I sketched this algo out on a notepad somewhere but cannot find it; there were a few rough edges to iron out, but utilizing something like it could be very efficient for repeated typing attempts in the same scope, especially when everything is currently recomputed for each char added.
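
The incremental part of point 2 can be illustrated for plain Levenshtein distance: each typed character appends one row to the DP matrix and leaves earlier rows untouched, and backspace just drops the last row. A sketch against a single candidate word (not the Jaro-Winkler variant discussed above; names hypothetical):

```fsharp
/// Keeps the Levenshtein DP rows for `candidate` vs. the text typed so
/// far; each keystroke costs O(|candidate|) instead of a full recompute.
type IncrementalLevenshtein(candidate: string) =
    // rows.[i] is the DP row after the first i typed characters
    let rows = ResizeArray<int[]>()
    do rows.Add(Array.init (candidate.Length + 1) id)
    member x.TypeChar(c: char) =
        let prev = rows.[rows.Count - 1]
        let curr = Array.zeroCreate (candidate.Length + 1)
        curr.[0] <- rows.Count
        for j in 1 .. candidate.Length do
            let cost = if candidate.[j-1] = c then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j-1] + 1)) (prev.[j-1] + cost)
        rows.Add curr
    member x.Backspace() =
        if rows.Count > 1 then rows.RemoveAt(rows.Count - 1)
    member x.Distance = rows.[rows.Count - 1].[candidate.Length]
```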

@forki (Contributor, Author) commented Jan 18, 2019 via email

@gerardtoconnor

… may not have made much sense, apologies. I will post code and clearer algo details for point 2 when I find my workings. For point 1, a crude queue implementation (missing IEnumerable etc.), a modified insertion sort, is as follows:

open System.Collections.Generic

/// Keeps the `max` largest items seen so far, in ascending key order,
/// using insertion-sort-style shifting instead of a full sort.
type MaxBuffer<'K,'V when 'K : comparison>(max:int) =
    let data = Array.zeroCreate<KeyValuePair<'K,'V>>(max)
    // shift data.[i+1 .. j] down one slot, into data.[i .. j-1]
    let rec backup i j =
        if i < j then
            data.[i] <- data.[i + 1]
            backup (i + 1) j
    let rec find i k v =
        let kr = data.[i].Key // only want the key, not the whole KVP
        if k < kr then
            if i = 0 then false // smaller than everything kept: reject
            else
                backup 0 (i-1) // drop the current minimum at the head
                data.[i-1] <- KeyValuePair<_,_>(k,v)
                true
        else
            if i = max - 1 then // largest so far: drop head, append at tail
                backup 0 i
                data.[i] <- KeyValuePair<_,_>(k,v)
                true
            else find (i + 1) k v
    member x.Insert (k,v) = find 0 k v
    member x.Result () = data

The reason it's a lot faster than adding hundreds of results to a re-growing array (or even a known large array) for a sort is that quicksort etc. will sort everything in approximately n log n time, with the number of recursive pivots increasing with the number of items. The elegance of this approach is that it can quickly discard results that are not greater than the lower bound of the captured `max` items, so the compares and shifts are a (potentially tiny) subset of that work.
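
For illustration, the buffer could be driven from a scoring loop like this (the score function and candidate list are hypothetical). Note that `Array.zeroCreate` seeds the buffer with default keys, so this sketch assumes scores are non-negative:

```fsharp
// Keep the 5 best-scoring suggestions while streaming over candidates;
// Result() returns them in ascending score order.
let pickTop5 (score: string -> float) (candidates: seq<string>) =
    let best = MaxBuffer<float, string>(5)
    for c in candidates do
        best.Insert(score c, c) |> ignore // false means: not in the top 5
    best.Result()
```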

@forki (Contributor, Author) commented Jun 27, 2019

Closing in favor of #7060

@forki forki closed this Jun 27, 2019
@forki forki deleted the fastersuggestions branch June 27, 2019 13:04
7 participants