
Automatic classification: more aggressive debounce and perf considerations #129607

Open · 2 of 4 tasks
isidorn opened this issue Jul 27, 2021 · 16 comments
Labels: feature-request (Request for new features or functionality), languages-guessing (Language guessing issues)

isidorn (Contributor) commented Jul 27, 2021

Testing #129436

On my machine the language classification takes around 300ms to compute. However, some of our users have slower machines, so here are some ideas for making sure we do not slow things down for users who are just scribbling in an untitled editor.

  • Do not classify when fewer than 20 characters are present (unless we need to reset the mode because the file was already auto-detected) - Done in microsoft/vscode-languagedetection@b024b29
  • Increase the debounce from 300ms to 600ms (this will still not be a noticeable lag when the user stops typing or pastes) - Done in Improve performance of language detection #130006
  • If the classification has already run n times and was unsuccessful every time, maybe stop trying and only re-run on larger changes
  • If a language has already been auto-detected - great! Do not be so aggressive then; only re-run if a big change to the file content happened

Both 3 and 4 require some notion of what counts as a bigger change to a file. We can figure something out. Maybe @bpasero has ideas. A rough sketch of how these ideas could fit together follows below.
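
A minimal sketch of how the gating ideas above could combine, assuming hypothetical names (none of these come from the actual implementation):

// Hypothetical gating logic for when to run (or skip) language detection.
// All names here are made up for illustration; only the thresholds mirror the list above.
interface DetectionState {
	failedAttempts: number;         // consecutive runs without a confident result
	detectedLanguage?: string;      // language we already auto-detected, if any
	contentLengthAtLastRun: number; // buffer size when detection last ran
}

const MIN_CONTENT_LENGTH = 20;  // idea 1: skip tiny buffers
const MAX_FAILED_ATTEMPTS = 5;  // idea 3: back off after repeated misses
const BIG_CHANGE_RATIO = 0.5;   // ideas 3/4: "big change" = content size changed by 50% or more

function shouldRunDetection(state: DetectionState, contentLength: number): boolean {
	// Idea 1: not enough text to classify (unless we need to reset an earlier guess).
	if (contentLength < MIN_CONTENT_LENGTH) {
		return state.detectedLanguage !== undefined;
	}

	const delta = Math.abs(contentLength - state.contentLengthAtLastRun);
	const isBigChange = delta >= state.contentLengthAtLastRun * BIG_CHANGE_RATIO;

	// Idea 3: repeatedly unsuccessful -> only retry on a big change.
	if (state.failedAttempts >= MAX_FAILED_ATTEMPTS && !isBigChange) {
		return false;
	}

	// Idea 4: a language was already detected -> only re-run on a big change.
	if (state.detectedLanguage !== undefined && !isBigChange) {
		return false;
	}

	return true;
}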

TylerLeonhardt added the feature-request (Request for new features or functionality) and languages-guessing (Language guessing issues) labels Jul 27, 2021
TylerLeonhardt added this to the August 2021 milestone Jul 27, 2021
bpasero added the important (Issue identified as high-priority) label Aug 2, 2021
bpasero (Member) commented Aug 2, 2021

💯

We have to be very careful about running an expensive operation on the same thread the user is typing in (see #27378 for example, to get some inspiration). At a minimum we should debounce in a way that continuous typing keeps pushing the computation out until the user stops typing for a moment.

This becomes even more important if we decide to make language detection enabled by default.
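
A minimal sketch of that kind of resettable debounce, assuming a plain setTimeout-based helper (not the actual VS Code scheduler):

// Sketch: every content change re-arms the timer, so the expensive detection
// only runs once typing has paused for `delayMs`. Hypothetical helper class.
class TypingDebouncer {
	private handle: ReturnType<typeof setTimeout> | undefined;

	constructor(private readonly delayMs: number) { }

	// Cancel any pending run and schedule a new one.
	schedule(run: () => void): void {
		if (this.handle !== undefined) {
			clearTimeout(this.handle);
		}
		this.handle = setTimeout(() => {
			this.handle = undefined;
			run();
		}, this.delayMs);
	}

	dispose(): void {
		if (this.handle !== undefined) {
			clearTimeout(this.handle);
		}
	}
}

// Usage (placeholder event and function names): detection fires 600ms after the last edit.
// const debouncer = new TypingDebouncer(600);
// model.onDidChangeContent(() => debouncer.schedule(() => detectLanguage(model)));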

Further thoughts:

  • Would it make sense to move the language detection code into the untitled text editor model itself to make the integration tighter? For example, I find the setMode change dubious with the fragile setExplicitly option, which would probably benefit from being handled internally. It is also weird to register a core workbench.editor.untitled.experimentalLanguageDetection setting that is only used from a workbench contribution, imho
  • We seem to be creating a full string of the text model each time we detect the language. This can be very expensive for large models and should, at the very minimum, support stream-based detection via the text model factory (see for example how backups are created for text files, where we convert a snapshot of the text model to a byte stream)
  • Can we explore moving the language detection into a worker and off the renderer process? I believe we do something similar when computing the diff of a file (@alexdima could possibly chime in on that matter, or maybe @rebornix for how notebooks do this in src/vs/workbench/contrib/notebook/common/services/notebookWorkerServiceImpl.ts). A rough sketch of the worker idea follows this list
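
A very rough sketch of the worker idea, using the standard Worker/postMessage API with a placeholder message shape and a placeholder detect() call (not the actual VS Code editor-worker protocol):

// Main thread side: forward the existing text buffer chunks to a worker instead of
// building one big string on the renderer. The chunks are still copied across the
// message channel, but the renderer never has to concatenate them.
const worker = new Worker('languageDetectionWorker.js');

function requestDetection(chunks: string[]): Promise<string | undefined> {
	return new Promise(resolve => {
		worker.onmessage = e => resolve(e.data.languageId);
		worker.postMessage({ kind: 'detect', chunks });
	});
}

// Worker side (languageDetectionWorker.ts), shown as comments to keep this in one file:
// self.onmessage = async e => {
// 	if (e.data.kind === 'detect') {
// 		const content = e.data.chunks.join(''); // or feed chunk-by-chunk if the library ever supports it
// 		const languageId = await detect(content); // placeholder for the detection library call
// 		self.postMessage({ languageId });
// 	}
// };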

bpasero changed the title from "Automatic classification: more aggressive debounce" to "Automatic classification: more aggressive debounce and perf considerations" Aug 2, 2021
rebornix (Member) commented Aug 2, 2021

Thanks @bpasero for bringing this to my attention. I don't think running it on the main thread scales in terms of performance, especially now that we are also adding bracket matching on typing. Adding a few ms per feature doesn't seem to hurt typing, but if everyone does that, a single keystroke can easily exceed 16ms and we can't finish rendering within a single frame.

Note that we already have tokenization and multiple event listeners (auto save trigger), which take a couple of ms each.

isidorn (Contributor, Author) commented Aug 2, 2021

All good points, and I think we should explore all our options here.

  1. We should definitely not do this on every keystroke but have a smart debounce strategy (some ideas are mentioned in my initial comment)
  2. Moving this to a worker sounds reasonable. @rebornix is there a good example of worker-side computation you could share?
  3. @bpasero the model can only handle strings of up to 100 000 characters. So @TylerLeonhardt we should just take a subset of the whole content and not pull the full string into memory. fyi @yoeo for the idea of having the model handle streams, not only strings

rebornix (Member) commented Aug 2, 2021

I can chat with @TylerLeonhardt on how to move it to a worker, we already have workers for word and link computation.

btw, some data on the perf of typing now with latest Insiders, which we should improve

[screenshot: typing performance data from the latest Insiders build]

So @TylerLeonhardt we should just take a subset of the whole content, and not take the full string in memory

@TylerLeonhardt note that the text model supports fetching a string within a range, which is more efficient than getting the whole string.

bpasero (Member) commented Aug 2, 2021

Yeah, this is an unfortunate consequence that is specific to untitled files, where we use the content of the untitled buffer as the title for the tab. If you want to file an issue for it where we track offenders, we can look into that individually 👍

I think it is limited to changes to the first line only though, not any line after that, so it might not be an issue for typical usages.

TylerLeonhardt (Member) commented

Updated what's done and what's not

bpasero (Member) commented Aug 7, 2021

Nice. I think from my end the biggest ask is to support streaming, to prevent producing a string of the entire editor contents, and maybe cancellation support, though I am not entirely sure that is even possible in the library itself...
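
For the cancellation part, a minimal sketch of the kind of check a batched detection run could make between chunks (the token shape and runBatch step are hypothetical, not the vscode-languagedetection API):

// Hypothetical cooperative cancellation: if detection is split into batches,
// check a token between batches and bail out once a newer edit supersedes this run.
interface CancellationToken {
	readonly isCancellationRequested: boolean;
}

async function detectWithCancellation(
	chunks: string[],
	runBatch: (chunk: string) => Promise<void>, // placeholder for the model inference step
	token: CancellationToken
): Promise<boolean> {
	for (const chunk of chunks) {
		if (token.isCancellationRequested) {
			return false; // abandoned: the result is no longer needed
		}
		await runBatch(chunk);
	}
	return true; // all batches completed
}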

TylerLeonhardt (Member) commented

I'm trying to understand how streaming would work if the work is done in a Worker.

Wouldn't the data need to be serialized to be transferred to the worker, losing the benefit of streaming?

Maybe I'm misunderstanding.

bpasero (Member) commented Aug 7, 2021

Yeah you are right, that suggestion is from the time before we even thought about workers...

Still, even when IPC is involved, you can avoid creating a string from the contents by simply sending over the chunks that you get from the editor itself. The editor contents are partitioned into chunks (not sure if line by line or of a fixed size), so you could send these existing chunks over to the worker and pass them on to the detection library.

bpasero (Member) commented Aug 7, 2021

The encoding library we use works with that model:

// the encoder is created once for the target encoding...
const iconv = await import('iconv-lite-umd');
const encoder = iconv.getEncoder(toNodeEncoding(encoding), options);

...

// ...and then fed one chunk at a time, so no full string of the contents is ever built
return VSBuffer.wrap(encoder.write(chunk));

TylerLeonhardt removed the important (Issue identified as high-priority) label Aug 9, 2021
TylerLeonhardt (Member) commented

Still, even when IPC is involved you can avoid creating a string from the contents by simply sending over the chunks that you get from the editor itself.

I'm using the same code that the EditorWorker uses, which sends over change events that look like this:

export interface IModelContentChange {
	/**
	 * The range that got replaced.
	 */
	readonly range: IRange;
	/**
	 * The offset of the range that got replaced.
	 */
	readonly rangeOffset: number;
	/**
	 * The length of the range that got replaced.
	 */
	readonly rangeLength: number;
	/**
	 * The new text for the range.
	 */
	readonly text: string;
}

And the MirrorModel handles syncing resources.
https://github.com/microsoft/vscode/blob/c2f6aa497f7462c293e1f6301707bf0c091b1712/src/vs/editor/common/model/mirrorTextModel.ts

So we're already only sending chunks over to the worker when the content changes... though not the chunks you're referring to, I think. If we're talking about optimizing that further, it should probably be done for all of our editor worker code.
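
For illustration, applying one of these change events to a mirrored copy of the text boils down to an offset-based splice. This is a simplified sketch, not the actual MirrorModel code (which keeps the text as lines plus a line-start index rather than one flat string):

// Replace `rangeLength` characters starting at `rangeOffset` with the new text.
function applyContentChange(
	text: string,
	change: { rangeOffset: number; rangeLength: number; text: string }
): string {
	return text.slice(0, change.rangeOffset)
		+ change.text
		+ text.slice(change.rangeOffset + change.rangeLength);
}

// Example: applyContentChange('hello world', { rangeOffset: 6, rangeLength: 5, text: 'there' })
// returns 'hello there'.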

and pass them on into the detection library.

The model doesn't have any internal concept of what the contents of a file look like (in other words, it doesn't keep state), so we need to send it as much content as we can so that we have the best chance of guessing correctly. I worry that running the model on anything smaller than the whole contents of the file will come with a steep decrease in confidence.

bpasero (Member) commented Aug 9, 2021

Oh yeah, that is good, I wasn't aware you also benefit from the mirror model code.

And yeah, agreed the library needs to run detection on the entire contents for a confident decision, but it should probably have an upper limit on how large those contents can be. That way you could avoid allocating a string of the entire file (which can be GBs) and only use up to the maximum supported chunk size.

TylerLeonhardt (Member) commented

but should probably have an upper limit of how large these contents can be

It does :) 100000 characters:
https://github.com/microsoft/vscode-languagedetection/blob/main/lib/index.ts#L82

because after that, tfjs throws an error: #129597

Ideally I would only create a 100000 character string instead of a string of the entire file... but the file is represented as lines, so I'd have to roll my own "get the first 100000 characters of this string" function... which I haven't done yet.

bpasero (Member) commented Aug 10, 2021

Yeah, the closest I see is getValueInRange. @rebornix is there a way to get the value of the text buffer up to a maximum length without having to allocate the entire buffer into a string?

I guess one solution could be to call model.snapshot and then iterate over the chunks until the max length is reached, but I am not sure these APIs exist in the mirror model.
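
Roughly what that snapshot-based approach could look like, assuming an ITextSnapshot-style read() that returns one chunk per call and null at the end (whether the mirror model exposes such an API is exactly the open question above):

// Read chunks from a snapshot until maxLength characters have been collected,
// so only the needed prefix is ever materialised.
interface ITextSnapshot {
	read(): string | null;
}

function readUpTo(snapshot: ITextSnapshot, maxLength: number): string {
	const parts: string[] = [];
	let length = 0;
	let chunk: string | null;
	while (length < maxLength && (chunk = snapshot.read()) !== null) {
		const remaining = maxLength - length;
		const piece = chunk.length > remaining ? chunk.substring(0, remaining) : chunk;
		parts.push(piece);
		length += piece.length;
	}
	return parts.join('');
}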

TylerLeonhardt modified the milestone from August 2021 to Backlog, Aug 23, 2021
TylerLeonhardt (Member) commented

Moving the other things to the backlog as I don't think we need to do them right this second. I'm going to have a TPI (test plan item) for this so that testers can try out the perf.

TylerLeonhardt (Member) commented

@rebornix gave me the solution:

// Grab the first 10000 characters
const end = model.positionAt(10000);
const content = model.getValueInRange({
	startColumn: 1,
	startLineNumber: 1,
	endColumn: end.column,
	endLineNumber: end.lineNumber
});

Added in f28e845 (with a code clean up commit 430e98b)
