Feature request: Treat the Chinese text as a Chinese sequence when using Ctrl+Left/Right #50045

Closed
imhuay opened this issue May 17, 2018 · 16 comments
Labels: editor-wordnav (Editor word navigation issues), feature-request (Request for new features or functionality)

@imhuay
Contributor

imhuay commented May 17, 2018

Currently, VS Code treats a long run of Chinese text as a single “word”. Each Ctrl+Left/Right moves the cursor all the way to the beginning or end of the run.

The request is to treat Chinese text as a sequence of characters, so that each Ctrl+Left/Right moves the cursor by one character. This is the default behavior of the system's own text controls.

Example: (use | as the cursor )

|本文的学习公式
// Ctrl+Right
本文的学习公式|

Expected:

|本文的学习公式
// Ctrl+Right
本|文的学习公式
// Ctrl+Right
本文|的学习公式
// Ctrl+Right
本文的|学习公式

(Of course, it would be even better if it could support word segmentation.)

@vscodebot vscodebot bot added editor editor-core Editor basic functionality labels May 17, 2018
@imhuay imhuay changed the title from "Feature request: Split the Chinese text into one word once Ctrl+Left/Right" to "Feature request: Treat the Chinese text as a Chinese sequence when using Ctrl+Left/Right" May 17, 2018
@gdh1995

gdh1995 commented May 18, 2018

It would be better if VS Code supported word segmentation the way Chrome does, although I know this requires a large dictionary and increases the package size considerably. If it is not going to segment words, I suggest that it keep moving the cursor one sentence at a time rather than one character at a time. Personally, I think jumping a single Chinese character per <Ctrl+Right> is too slow.

@imhuay
Contributor Author

imhuay commented May 18, 2018

However, it doesn't move one sentence at a time either. In fact, it doesn't recognize Chinese punctuation at all.

Examples:

|output gate 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output| gate 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output gate| 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output gate 会影响结果,因此该模型有两个版本,分别为是否使用|

@imhuay imhuay closed this as completed May 18, 2018
@imhuay imhuay reopened this May 18, 2018
@rebornix rebornix added the feature-request Request for new features or functionality label Sep 6, 2018
@smikitky

smikitky commented Jan 15, 2019

This is a longstanding problem which virtually all East-Asian developers will notice once they start editing natural sentences (say, in Markdown) on vscode. I think this is fundamentally a problem of wrong word-splitting for CJK languages (and perhaps Thai, too), which use no spaces to delimit words. A similar problem happens when you double-click a word in a line (the whole line will be selected instead of the target word) and when you trigger an autocompletion using Ctrl+Space (a whole line will be shown as a candidate).

Ideally, dictionary-based word segmentation is desirable (this is available on MS Word, Google Chrome browser, etc), but it's not 100% correct, and I'm not sure if it is really necessary for a code editor. Another practical approach that works at least in Japanese is to split words based on character types, because a typical Japanese text is a mixture of kanji, hiragana and katakana (This algorithm is implemented on most domestic text editors and even MS Notepad.exe). Character types can be easily determined via Unicode code points.

Example:

(1) 吾輩は猫である。名前はまだない。
(2) 吾輩|は|猫|で|ある|。|名前|は|まだ|ない|。
(3) 吾輩|は|猫|である|。|名前|はまだない|。

(1): Natural Japanese text with two sentences (。 is the Japanese period).
(2): Dictionary-based word boundaries (|), available in MS Word, Chrome, etc.
(3): Codepoint-based kana-kanji boundaries, available in Firefox, Notepad.exe, etc.
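
For illustration, approach (3) can be sketched in a few lines of JavaScript. This is only a sketch under assumptions: the function names and the exact set of character classes are made up here, it relies on Unicode script property escapes in regular expressions, and it is not how any particular editor implements the feature. Characters whose script is "Common" (for example the prolonged sound mark ー) fall into a generic class here and would need special handling in practice.

```javascript
// Classify one character (code point) into a coarse type.
// The class names are hypothetical labels used only in this sketch.
function charClass(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "han";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  if (/[\p{L}\p{N}_]/u.test(ch)) return "alnum";
  if (/\s/u.test(ch)) return "space";
  return "other"; // punctuation and symbols, including full-width ones
}

// Split text wherever the character type changes.
function splitByCharClass(text) {
  const chunks = [];
  let prev = null;
  for (const ch of text) { // for...of iterates by code point
    const cls = charClass(ch);
    if (cls === prev) {
      chunks[chunks.length - 1] += ch;
    } else {
      chunks.push(ch);
    }
    prev = cls;
  }
  return chunks;
}

console.log(splitByCharClass("吾輩は猫である。名前はまだない。").join("|"));
// prints: 吾輩|は|猫|である|。|名前|はまだない|。
```

The result reproduces line (3) of the example, without any dictionary.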

There is already a popular extension that does (3) above for Japanese text. Unfortunately, it works on Ctrl+←/→ but nowhere else. It does not work on double-clicks, Ctrl+D, autocompletion, text search, and so on.

Personally, I think (3) should be implemented as part of the basic functionality of VSCode, considering that it's available in any other decent text editor. The dictionary-based solution (2) may be too costly for the main vscode repository, but I hope there is a way to allow extension developers to override the word-boundary detection algorithm or the double-click behavior.


By the way, in the meantime, you can alleviate this problem by tweaking the "editor.wordSeparators" setting and adding multibyte punctuation marks such as 。 and 、. With this, the cursor will at least stop at (double-byte) periods and commas when using Ctrl + ←/→.
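
A settings.json fragment for this workaround might look like the following. Treat it as a sketch: the ASCII part of the string is the default value of "editor.wordSeparators" as far as I recall it (check the Settings UI for your version), and the three full-width marks appended at the end are only examples.

```jsonc
{
  // Assumed default separators, followed by full-width 。 、 and ,
  "editor.wordSeparators": "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?。、,"
}
```

This only makes the cursor stop at the listed punctuation marks; it does not split the text between them into words.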

@smikitky

smikitky commented Jan 15, 2019

So I searched related issues regarding CJK text navigations. I learned that "selection/navigation via double-click/keyboards" and "extracting words for autocompletion" are technically two different fields, but they are conceptually related anyway.

Keyboard navigation & Double click:

Word extraction for autocompletion:

So in conclusion, IMHO vscode should (by default, regardless of the language) assume there is a word boundary whenever the character type changes between "Latin alphabet/number", "CJK unified ideograph (hanja/kanji)", "punctuation marks (incl. multibyte ones)", "Japanese hiragana" and "Japanese katakana", even if there is no space. In addition, when Ctrl+Right is pressed inside a sequence of multiple "CJK unified ideographs", Chinese users (seem to) want the cursor to move by one character, whereas Japanese users usually want the cursor to move to the end of the sequence, as described by (3) above. This may have to be configurable with locale-based default values.

// Japanese
これは日|本語の文章
// ctrl + right
これは日本語|の文章

// Chinese
本文|的学习公式
// ctrl + right
本文的|学习公式
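
The two behaviors inside a run of ideographs could be sketched as below. The function name and the `hanStep` option are hypothetical, invented for this sketch; they are not VS Code settings. Offsets count code points rather than UTF-16 units, to keep the example short.

```javascript
const isHan = (ch) => /\p{Script=Han}/u.test(ch);

// Where Ctrl+Right would stop when the cursor sits inside a run of ideographs.
// hanStep (hypothetical option): "char" = move one character (Chinese preference),
//                                "run"  = move to the end of the run (Japanese preference).
function nextStopInHanRun(text, offset, hanStep = "run") {
  const chars = Array.from(text); // code points, so offsets count characters
  if (offset >= chars.length || !isHan(chars[offset])) {
    return offset; // not inside a Han run; other rules would apply here
  }
  if (hanStep === "char") return offset + 1;
  let end = offset;
  while (end < chars.length && isHan(chars[end])) end++;
  return end;
}

console.log(nextStopInHanRun("これは日本語の文章", 4, "run")); // 6: これは日本語|の文章
console.log(nextStopInHanRun("本文的学习公式", 2, "char"));     // 3: 本文的|学习公式
```

A locale-based default would then only need to choose the value of `hanStep`.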

@rebornix
Member

@smikitky thanks for your detailed investigation ;) IMO word navigation should work seamlessly with CJKV, as ASCII word separators can't handle CJKV words. I do have a prototype that delegates word segmentation to the browser instead of handling it ourselves, and I will work on that in the near term, stay tuned.

@rebornix rebornix added this to the On Deck milestone Jan 22, 2019
@WangLeto

WangLeto commented Feb 8, 2019

I'd like to remind you of Ctrl + Delete, which I think shares the same logic as Ctrl + Arrow and is even more upsetting, because you can easily delete too many characters by accident.

@smikitky

@rebornix This issue was once included in iteration plans, but I'm seeing no recent activity. Since we're nearing the end of the housekeeping iteration, can I ask if you have any update on this?

@rebornix
Member

Let's see if we can have time for it during holiday time.

@yuboona

yuboona commented Mar 22, 2021

Is there any progress now, guys?

@yuboona

yuboona commented Mar 26, 2021

I found a vscode extension, CJK word handler. Will it be officially adopted? @kieferrm @rebornix

@rebornix rebornix modified the milestones: On Deck, Backlog Nov 5, 2021
@alexdima alexdima added editor-wordnav Editor word navigation issues and removed editor-core Editor basic functionality labels Dec 10, 2022
@simonmysun

Any updates?

Actually, every Chromium-based application bundles a proper word segmentation library. You can try it in, e.g., the file-renaming input box with the usual key bindings. However, this does not seem to be available through the JavaScript interface: https://bugs.chromium.org/p/chromium/issues/detail?id=129706 .

While I am appreciative of japanese-word-handler by @sgryjp and CJK word handler by @SharzyL, it seems a bit redundant to include another segmentation JavaScript library, especially when we already have a much faster one in C++. I wonder if an alternative workaround would work: copy the current line to a hidden input box and synchronize the cursor movement.

@yume-chan
Contributor

The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

For example:

console.log(
  JSON.stringify(
    Array.from(
      new Intl.Segmenter("en", { granularity: "word" }).segment(
        "本文的学习公式",
      ),
    ),
    undefined,
    4,
  ),
);

(In my test, it can segment any CJK language no matter which locale is specified in the constructor)

outputs

[
    {
        "segment": "本文",
        "index": 0,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "的",
        "index": 2,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "学习",
        "index": 3,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "公式",
        "index": 5,
        "input": "本文的学习公式",
        "isWordLike": true
    }
]
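
Building on the output above, a Ctrl+Right stop could be derived from the segment list roughly as follows. This is a sketch, not the approach any editor is confirmed to use: the function name is made up, and the exact word boundaries for CJK text depend on the segmentation data shipped with the JavaScript runtime.

```javascript
// Return the offset (in UTF-16 units) where the cursor would stop after
// Ctrl+Right: the end of the first word-like segment that ends past `offset`.
// Spaces and punctuation (isWordLike === false) are skipped over.
function nextWordStop(text, offset, locales) {
  const segmenter = new Intl.Segmenter(locales, { granularity: "word" });
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    const end = index + segment.length;
    if (isWordLike && end > offset) return end;
  }
  return text.length; // no further word: go to the end of the text
}

console.log(nextWordStop("hello world", 0, "en")); // 5
console.log(nextWordStop("hello world", 5, "en")); // 11
console.log(nextWordStop("本文的学习公式", 0, "zh")); // end of the first segmented word
```

Ctrl+Left would be the mirror image: the start of the last word-like segment that begins before the cursor.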

@yutotnh
Contributor

yutotnh commented Jan 21, 2024

The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

Just recently I developed an extension that takes advantage of it.

I really wanted to make a pull request to the main body of VS Code, but my technical capabilities were limited to releasing it as an extension.

[Image: the extension in operation]

@yutotnh
Contributor

yutotnh commented Jan 27, 2024

I was able to add functionality to VS Code itself, so I created a pull request.
The pull request created is #203605.

Now that it can be integrated within a process, it can do more than just be an extension, such as being able to select words with a double-click.

@rebornix rebornix assigned alexdima and unassigned rebornix Jan 29, 2024
@rinzwind5

rinzwind5 commented Feb 18, 2024

I was surprised, since VS Code is based on browser technology and browsers handle this stuff great.
There is also an issue with Thai:
คนไทยที่นับถือศาสนาพุทธเกินห้าสิบเปอร์เซ็นต์
Should be broken down as คน ไทย ที่ นับถือ ศาสนา พุทธ เกิน ห้า สิบ เปอร์เซ็นต์
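
One way to check what a JavaScript runtime does with this sentence is to run it through Intl.Segmenter with a Thai locale. This is only a quick sketch; the resulting split depends on the dictionary data bundled with the runtime, so it may differ in places from the breakdown above.

```javascript
const thai = "คนไทยที่นับถือศาสนาพุทธเกินห้าสิบเปอร์เซ็นต์";

// Word-granularity segmentation with the "th" locale.
const segmenter = new Intl.Segmenter("th", { granularity: "word" });
const words = Array.from(segmenter.segment(thai), (s) => s.segment);

// Print the words separated by spaces for comparison with the expected split.
console.log(words.join(" "));
```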

https://fuqua.io/thai-word-split/browser/

At the very least, this makes VS Code unusable as a generic text editor for these languages. Notepad works fine, by the way (so Windows has native support, as expected).

alexdima added a commit that referenced this issue Mar 15, 2024
…203605)

* Add support for recognizing word locales in word operations (#50045)

* Move intlSegmenterLocales in the WordCharacterClassifier class

* Rerun compiler

* Renames

* Avoid duplicating code

---------

Co-authored-by: Alex Dima <alexdima@microsoft.com>
@alexdima
Member

Thanks to #203605, it is now possible to configure editor.wordSegmenterLocales to define the locales to be used for word segmentation.
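
As a hedged example, a settings.json entry could look like this. The value format shown (an array of BCP 47 language tags) is my understanding of the setting, and the particular locales are only examples; check the setting's description in your VS Code version.

```jsonc
{
  // Locales used for word segmentation in word navigation and related operations
  "editor.wordSegmenterLocales": ["ja", "zh-CN"]
}
```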
