Language detection #119325

Closed

@isidorn (Contributor) commented Mar 19, 2021

Creating this PR just as a sketch. After some discussions with teammates from Zurich, here are some thoughts:

  • Shipping this functionality as an extension defeats its purpose: it is meant to help new users, and nobody will discover the extension.

Here are the things I think we need to do in order to have this in the core:

  • We need to convert the model to a TensorFlow.js model so that it has no Node dependency. We need this in order to run in the browser. Without it, even VS Code desktop would have to run the classification in the shared process, since we are moving towards Node-free renderers, and putting additional work in the shared process is not good.
  • We should look into reducing the size of the model so we do not increase the install size too much. @TylerLeonhardt I think you already started looking into this
  • I believe this work is best suited for a new service in VS Code, for example ILanguageDetectionService. I created a skeleton for this and can look further into integrating it on the vscode side @isidorn
  • Figure out how often to call the service (for example, after every n characters) and what confidence threshold to use
  • We need to measure time performance for loading the model and classification
  • We need to measure memory footprint of model loading
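
To make the "how often" and "confidence" items concrete, here is a minimal sketch, not the actual implementation. Everything except the ILanguageDetectionService shape is an assumption made for illustration: the LanguageGuess result type, the Classifier stand-in for the model, the two constants and their values, and the ThrottledLanguageDetectionService class name.

```typescript
// Hypothetical result shape; the PR does not say what the model returns.
interface LanguageGuess {
	languageId: string;
	confidence: number; // assumed to lie in [0, 1]
}

// Assumed values; the PR leaves both "n characters" and the confidence open.
const MIN_CHANGED_CHARS = 20;
const MIN_CONFIDENCE = 0.6;

// Re-run the classifier only when the content length moved by at least `minChangedChars`.
function shouldReclassify(lastLength: number, currentLength: number, minChangedChars: number = MIN_CHANGED_CHARS): boolean {
	return Math.abs(currentLength - lastLength) >= minChangedChars;
}

// Return the most confident guess, or undefined when nothing clears the threshold.
function pickLanguage(guesses: readonly LanguageGuess[], minConfidence: number = MIN_CONFIDENCE): string | undefined {
	let best: LanguageGuess | undefined;
	for (const guess of guesses) {
		if (best === undefined || guess.confidence > best.confidence) {
			best = guess;
		}
	}
	return best !== undefined && best.confidence >= minConfidence ? best.languageId : undefined;
}

// Mirrors the interface in this PR, with the parameter spelled `content`.
interface ILanguageDetectionService {
	readonly _serviceBrand: undefined;
	detectLanguage(content: string): Promise<string | undefined>;
}

// Stand-in for the real model call (hypothetical).
type Classifier = (content: string) => Promise<LanguageGuess[]>;

class ThrottledLanguageDetectionService implements ILanguageDetectionService {
	readonly _serviceBrand: undefined = undefined;
	private readonly classify: Classifier;
	private lastLength = 0;
	private lastResult: string | undefined = undefined;

	constructor(classify: Classifier) {
		this.classify = classify;
	}

	async detectLanguage(content: string): Promise<string | undefined> {
		if (!shouldReclassify(this.lastLength, content.length)) {
			return this.lastResult; // too little changed, reuse the previous answer
		}
		this.lastLength = content.length;
		this.lastResult = pickLanguage(await this.classify(content));
		return this.lastResult;
	}
}
```

Keeping the two decisions in small pure functions means the threshold values can be tuned and tested without loading a model.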

If we get the above working in a good way then we need to:

  • Look into supporting JSON, XML and other languages that are not yet supported
  • Try to improve corner case classification - like JS/TS confusion

If all of this proves to be too much overhead we can look into some simpler heuristic to detect a language and not use machine learning.
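
For illustration, a simpler heuristic could look roughly like the sketch below. This is only an assumption about what such a fallback might be, not something the PR implements: the function name detectByHeuristic and the three rules (shebang line, parseable JSON, XML declaration) are hypothetical, and the returned strings are VS Code language identifiers.

```typescript
// Hypothetical rule-based fallback; returns a VS Code language id or undefined.
function detectByHeuristic(content: string): string | undefined {
	const text = content.trim();
	if (text.length === 0) {
		return undefined;
	}

	// Rule 1: a shebang on the first line names the interpreter.
	const firstLine = text.split('\n', 1)[0];
	if (firstLine.startsWith('#!')) {
		if (/\b(bash|zsh|sh)\b/.test(firstLine)) {
			return 'shellscript';
		}
		if (/\bpython[\d.]*\b/.test(firstLine)) {
			return 'python';
		}
		if (/\bnode\b/.test(firstLine)) {
			return 'javascript';
		}
	}

	// Rule 2: content that starts like an object or array and parses is JSON.
	if (text.startsWith('{') || text.startsWith('[')) {
		try {
			JSON.parse(text);
			return 'json';
		} catch {
			// Not valid JSON (yet); fall through to the remaining rules.
		}
	}

	// Rule 3: only an explicit XML declaration counts, so HTML is not mislabeled.
	if (/^<\?xml\b/.test(text)) {
		return 'xml';
	}

	return undefined; // no rule matched; leave the language unset
}
```

Rules like these are cheap to run on every change, but they only cover cases with an unambiguous marker, so they would not help with something like the JS/TS confusion.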

@TylerLeonhardt let me know what you think. Feel free to add your name to the items you are interested in, and to add any items I might have forgotten.

fixes #118455

@isidorn isidorn added the under-discussion Issue is under discussion for relevance, priority, approach label Mar 19, 2021
@isidorn isidorn added this to the April 2021 milestone Mar 19, 2021
@isidorn isidorn marked this pull request as draft March 19, 2021 13:31
@isidorn isidorn modified the milestones: April 2021, May 2021 Mar 29, 2021
export interface ILanguageDetectionService {
    readonly _serviceBrand: undefined;

    detectLanguage(contet: string): Promise<string | undefined>;

Review comment (Contributor):

Typo: contet -> content

@TylerLeonhardt TylerLeonhardt modified the milestones: May 2021, June 2021 May 28, 2021
@TylerLeonhardt TylerLeonhardt modified the milestones: June 2021, July 2021 Jun 16, 2021
@TylerLeonhardt (Member) commented

Closing this in favor of #128708

@github-actions github-actions bot locked and limited conversation to collaborators Sep 7, 2021
@isidorn isidorn deleted the isidorn/languageDetection branch November 2, 2023 14:17
Labels: under-discussion (Issue is under discussion for relevance, priority, approach)
Issues that may be closed by merging this pull request: Automatic language classification for Untitled files
3 participants