
Mitigate #6044 and don't call suggestions for small identifiers #6049

Closed
wants to merge 4 commits

Conversation

@forki (Contributor) commented Dec 21, 2018

No description provided.

@forki (Contributor, Author) commented Dec 21, 2018

What's the current trick to run the ci_part2 tests locally? I basically want to accept all the changes as a new baseline.

@AviAvni (Contributor) commented Dec 21, 2018

Build.cmd release ci_part2

@majocha (Contributor) commented Dec 21, 2018

https://github.com/Microsoft/visualfsharp/blob/fead0aac540485683f694524eadad79983ec28d9/src/fsharp/ErrorResolutionHints.fs#L55

I wonder if some memoization would make sense. Let's say the user types the same identifier more than once, as it happens. Each time it would be the same sequence of uppercaseText values and suggestedText would repeat a lot too?

@forki (Contributor, Author) commented Dec 21, 2018

@AviAvni can you please run it for me and PR against my branch? I would really appreciate that, because I can't get it running.

@abelbraaksma (Contributor)

@majocha, I think that's a good idea. In fact, memoization of, let's say the last 100 or so used identifiers (FIFO order) could be (very) beneficial, since most people will do a lot of editing in one file or function, and so will likely re-type certain identifiers multiple times.
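
A minimal sketch of such a bounded cache, assuming the expensive step is the per-identifier normalization (the type and its names are hypothetical, not existing compiler API):

```fsharp
open System.Collections.Generic

/// Memoizes an expensive per-identifier computation, evicting the
/// oldest entry (FIFO) once `capacity` distinct keys have been seen.
type BoundedMemo<'K, 'V when 'K : equality>(capacity: int, compute: 'K -> 'V) =
    let cache = Dictionary<'K, 'V>(capacity)
    let order = Queue<'K>(capacity)
    member x.Get(key: 'K) : 'V =
        match cache.TryGetValue key with
        | true, v -> v
        | false, _ ->
            if cache.Count >= capacity then
                // evict the oldest cached key to stay within capacity
                cache.Remove(order.Dequeue()) |> ignore
            let v = compute key
            cache.[key] <- v
            order.Enqueue key
            v

// Hypothetical use: cache the upper-cased form of the last 100 identifiers.
let upperMemo = BoundedMemo<string, string>(100, fun s -> s.ToUpperInvariant())
```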

@forki forki changed the title [WIP] Mitigate #6044 and don't call suggestions for small identifiers Mitigate #6044 and don't call suggestions for small identifiers Dec 21, 2018
@forki (Contributor, Author) commented Dec 21, 2018

I don't see how memoization would help here. It's always different identifiers and suggestions.

@forki (Contributor, Author) commented Dec 22, 2018

It's green

@cartermp (Contributor) commented Jan 7, 2019

I'm not really convinced this will solve the problem:

  • It won't show suggestions for mistyped identifiers where the actual fix is <4 chars, and barring a statistical analysis of multiple codebases and what suggestions get generated, there's no telling what magic number "feels right"
  • It will still incur the cost of going through the full list (and generating that list) no matter the length of the mistyped identifier
  • Tying the tooling to error-handling introduces coupling that is difficult to maintain or evolve

Fundamentally, this feature is simply not built with IDE tooling in mind. It's great for a batch compile job/command line builds, but for a long-running process like a language server it's simply calculating and re-calculating too many things. The feature needs two "modes":

  • The current one for batch compile jobs
  • An out-of-process spellchecker routine that uses a tree (such as Roslyn's which uses a BKTree) to traverse a populated set of symbols independently of the in-proc work

The latter is not something that FCS can do yet. We're going to pursue this sort of architecture in the future, but not right now.
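
For illustration, the BK-tree approach mentioned above could be sketched like this (a generic tree over Levenshtein distance; this is only a rough sketch of the data structure, not Roslyn's actual implementation):

```fsharp
open System.Collections.Generic

/// Plain dynamic-programming Levenshtein distance.
let levenshtein (a: string) (b: string) =
    let d = Array2D.create (a.Length + 1) (b.Length + 1) 0
    for i in 0 .. a.Length do d.[i, 0] <- i
    for j in 0 .. b.Length do d.[0, j] <- j
    for i in 1 .. a.Length do
        for j in 1 .. b.Length do
            let cost = if a.[i-1] = b.[j-1] then 0 else 1
            d.[i, j] <- min (min (d.[i-1, j] + 1) (d.[i, j-1] + 1)) (d.[i-1, j-1] + cost)
    d.[a.Length, b.Length]

/// BK-tree node: children are indexed by their distance to this node's word.
type BkNode = { Word: string; Children: Dictionary<int, BkNode> }

let rec insert node word =
    let dist = levenshtein node.Word word
    if dist = 0 then () else
    match node.Children.TryGetValue dist with
    | true, child -> insert child word
    | false, _ -> node.Children.[dist] <- { Word = word; Children = Dictionary() }

/// Collect all words within `tolerance` of `query`; the triangle
/// inequality lets us skip child subtrees outside [d - tol, d + tol].
let rec search node query tolerance acc =
    let d = levenshtein node.Word query
    let acc = if d <= tolerance then node.Word :: acc else acc
    (acc, node.Children)
    ||> Seq.fold (fun acc (KeyValue(childDist, child)) ->
        if abs (childDist - d) <= tolerance then search child query tolerance acc
        else acc)
```

The point of the structure is that a lookup only visits a small fraction of the stored words, so the per-keystroke cost no longer scales with the full candidate list.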

@forki (Contributor, Author) commented Jan 7, 2019 via email

@Rickasaurus (Contributor) commented Jan 8, 2019

If you're thresholding you can predict a cutoff based on the first few chars which may be helpful. If you're willing to do that you can build a lookup table ahead of time with potential matches based on the first N characters.
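
A rough sketch of that idea (all names hypothetical): bucket the candidates once by their first N characters, then only score the matching bucket. Note that an exact-prefix key misses typos that occur within the first N characters, so a real implementation would also consult neighbouring buckets.

```fsharp
open System.Collections.Generic

/// Builds buckets keyed by the (upper-cased) first `n` characters,
/// so scoring only needs to touch a small slice of the candidates.
let buildPrefixIndex (n: int) (candidates: seq<string>) =
    let index = Dictionary<string, ResizeArray<string>>()
    for c in candidates do
        let key = c.Substring(0, min n c.Length).ToUpperInvariant()
        match index.TryGetValue key with
        | true, bucket -> bucket.Add c
        | false, _ -> index.[key] <- ResizeArray [c]
    index

/// Returns only the candidates sharing the typed identifier's prefix.
let lookup (index: Dictionary<string, ResizeArray<string>>) (n: int) (typed: string) =
    let key = typed.Substring(0, min n typed.Length).ToUpperInvariant()
    match index.TryGetValue key with
    | true, bucket -> bucket :> seq<string>
    | false, _ -> Seq.empty
```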

@abelbraaksma (Contributor)

About switching it off: while it would be good to have a setting, I think it's great in the C# IDE. If you type a partial word, it gives you completion items that are more useful than only matching the start of a string. This should run on its own thread and not block other processes; if it's fast enough, good, but otherwise it shouldn't degrade typing speed and experience.

I'm wondering why it's not a good idea to keep a dictionary of uppercased identifiers; after all, the list of in-scope identifiers only changes occasionally (and updating it can itself be done asynchronously).

@Rickasaurus (Contributor) commented Jan 8, 2019

Another idea: if you can shorten the length of the strings you're comparing against, it will reduce the time by a large amount. If you knew the string lengths ahead of time, I think you could use static, reusable, fixed-length arrays.

I'd like to help more, but I don't understand what's being compared in the code well enough to contribute anything deeper.
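
For what it's worth, one standard way to exploit known lengths and reuse fixed-size arrays is the two-row formulation of Levenshtein distance, which needs only two arrays of length |b| + 1 instead of a full matrix. A sketch:

```fsharp
/// Levenshtein distance using two reusable rows of length (|b| + 1)
/// instead of an (|a|+1) x (|b|+1) matrix.
let levenshteinTwoRow (a: string) (b: string) =
    let mutable prev = Array.init (b.Length + 1) id   // DP row 0
    let mutable curr = Array.zeroCreate (b.Length + 1)
    for i in 1 .. a.Length do
        curr.[0] <- i
        for j in 1 .. b.Length do
            let cost = if a.[i-1] = b.[j-1] then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j-1] + 1)) (prev.[j-1] + cost)
        // swap the rows so `curr` can be overwritten next iteration
        let tmp = prev
        prev <- curr
        curr <- tmp
    prev.[b.Length]
```

With the candidate lengths known up front, the two arrays could be allocated once per length class and reused across comparisons.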

@forki (Contributor, Author) commented Jan 9, 2019

@abelbraaksma we only suggest identifiers that are valid in the current location. So the dictionary would need to track the position as well. And how would we invalidate it when the name resolution environment changes at that point?

@abelbraaksma (Contributor) commented Jan 9, 2019

@forki, that makes sense, though the list of local-scope identifiers is almost always small. The list of globally available identifiers (in the current scope, truly global, or scoped to the current file, module, or class) changes much less frequently, if only because most users can't type in different files or classes at the same time, and switching scope, even at a fast editing pace, often takes seconds. That time could be used to build up the locally scoped library of identifiers. Make that a kind of dictionary with a key that includes the scope, and there's a big chance of not having to convert thousands of identifiers again and again.

I don't know if it's easy or hard, but it should release some strain off the GC and the CPU. But maybe I'm too optimistic, or maybe the effort isn't worth it.
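
As a sketch of that idea (the scope key and its granularity are entirely hypothetical; the hard part forki raises, invalidating entries when the name resolution environment changes, is not solved here):

```fsharp
open System.Collections.Generic

/// Hypothetical scope key: file plus a coarse position (e.g. the
/// enclosing module or member), so edits elsewhere don't touch the entry.
type ScopeKey = { File: string; EnclosingScope: string }

/// Per-scope cache of pre-normalized (upper-cased) identifiers.
let scopeCache = Dictionary<ScopeKey, string[]>()

let uppercasedIdentifiers (key: ScopeKey) (compute: unit -> seq<string>) =
    match scopeCache.TryGetValue key with
    | true, cached -> cached
    | false, _ ->
        // convert the in-scope identifiers once and remember the result
        let arr = compute () |> Seq.map (fun s -> s.ToUpperInvariant()) |> Seq.toArray
        scopeCache.[key] <- arr
        arr
```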

@forki (Contributor, Author) commented Jan 10, 2019

I would say it's pretty hard in the current architecture ;-)

@gerardtoconnor

I was looking at this a while back and came up with a few possibilities for improving perf:

  1. The suggestions function lazily loads a large seq of collected/filtered strings into hashsets, then through more Seq ops, then tests everything into an array and sorts it, picking the best 5 or so matches. If we use a specialised priority queue (a 5-element priority insertion sort over an array), the producers of results can test matches as they are created, adding them to the priority queue (or dropping them if not in the top 5). This would change the suggestion Seqs into consumer pipelines and allow a much smaller memory footprint.

  2. With Levenshtein distance it is possible to do incremental calculations on subsequent char additions using the matrix. The radius overlap in Jaro-Winkler makes it a little more complicated, but it should be possible to generate a char lookup tree from the scoped suggestions (each char key has a list of (pos, word)), such that as a char for the current word (env) comes in (case-insensitively checked), the next array cursor slot fills with the collection of valid branches, and we can even filter the best performers (with some wiggle room) to quickly hone in on results. The one-way traversal (rather than both against each other) should be possible by keeping track of upper- and lower-bound checks at each cursor position. The key is that the state of the traversal is kept and built on for each new char coming in, or rewound on backspace, so we are not re-computing things. I sketched this algo out on a notepad somewhere but cannot find it; there were a few rough edges to iron out, but utilizing something like it could be very efficient for repeated typing attempts in the same scope, especially when everything is currently recomputed for each char added.
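
The incremental part of point 2 can be illustrated for plain Levenshtein distance: each typed character appends one row to the DP matrix and leaves earlier rows untouched, and backspace just drops the last row. A sketch against a single candidate word (not the Jaro-Winkler variant discussed above; names hypothetical):

```fsharp
/// Keeps the Levenshtein DP rows for `candidate` vs. the text typed so
/// far; each keystroke costs O(|candidate|) instead of a full recompute.
type IncrementalLevenshtein(candidate: string) =
    // rows.[i] is the DP row after the first i typed characters
    let rows = ResizeArray<int[]>()
    do rows.Add(Array.init (candidate.Length + 1) id)
    member x.TypeChar(c: char) =
        let prev = rows.[rows.Count - 1]
        let curr = Array.zeroCreate (candidate.Length + 1)
        curr.[0] <- rows.Count
        for j in 1 .. candidate.Length do
            let cost = if candidate.[j-1] = c then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j-1] + 1)) (prev.[j-1] + cost)
        rows.Add curr
    member x.Backspace() =
        if rows.Count > 1 then rows.RemoveAt(rows.Count - 1)
    member x.Distance = rows.[rows.Count - 1].[candidate.Length]
```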

@forki (Contributor, Author) commented Jan 18, 2019 via email

@gerardtoconnor

… may not have made much sense, apologies. I will post code and clearer algo details for point 2 when I find my workings. For point 1, a crude queue implementation (missing IEnumerable etc.), a modified insertion sort, is as follows:

open System.Collections.Generic

/// Keeps the `max` largest items seen so far, in ascending key order,
/// using insertion-sort-style shifting instead of a full sort.
type MaxBuffer<'K,'V when 'K : comparison>(max:int) =
    let data = Array.zeroCreate<KeyValuePair<'K,'V>>(max)
    // shift data.[i+1 .. j] down one slot, into data.[i .. j-1]
    let rec backup i j =
        if i < j then
            data.[i] <- data.[i + 1]
            backup (i + 1) j
    let rec find i k v =
        let kr = data.[i].Key // only want the key, not the whole KVP
        if k < kr then
            if i = 0 then false // smaller than everything kept: reject
            else
                backup 0 (i-1) // drop the current minimum at the head
                data.[i-1] <- KeyValuePair<_,_>(k,v)
                true
        else
            if i = max - 1 then // largest so far: drop head, append at tail
                backup 0 i
                data.[i] <- KeyValuePair<_,_>(k,v)
                true
            else find (i + 1) k v
    member x.Insert (k,v) = find 0 k v
    member x.Result () = data

The reason it's a lot faster than adding hundreds of results to a re-growing array (or even a known large array) for a sort is that quicksort etc. will sort everything in approximately n log n time, with the number of recursive pivots increasing with the number of items. The elegance of this approach is that it can quickly discard results that are not greater than the lower bound of the captured `max` items, so the compares and shifts are a (potentially tiny) subset of that work.
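
For illustration, the buffer could be driven from a scoring loop like this (the score function and candidate list are hypothetical). Note that `Array.zeroCreate` seeds the buffer with default keys, so this sketch assumes scores are non-negative:

```fsharp
// Keep the 5 best-scoring suggestions while streaming over candidates;
// Result() returns them in ascending score order.
let pickTop5 (score: string -> float) (candidates: seq<string>) =
    let best = MaxBuffer<float, string>(5)
    for c in candidates do
        best.Insert(score c, c) |> ignore // false means: not in the top 5
    best.Result()
```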

@forki (Contributor, Author) commented Jun 27, 2019

Closing in favor of #7060

@forki forki closed this Jun 27, 2019
@forki forki deleted the fastersuggestions branch June 27, 2019 13:04
7 participants