Feature request: Treat the Chinese text as a Chinese sequence when using `Ctrl+Left/Right` #50045

imhuay · 2018-05-17T13:45:53Z

VS Code currently treats a long run of Chinese text as a single "word", so each Ctrl+Left/Right moves the cursor all the way to the beginning or end of the run.

The request is to treat Chinese text as a sequence of characters, so that each Ctrl+Left/Right moves the cursor by just one step. This is the default behavior of the system's text programs.

Example (| marks the cursor):

|本文的学习公式
// Ctrl+Right
本文的学习公式|

Expected:

|本文的学习公式
// Ctrl+Right
本|文的学习公式
// Ctrl+Right
本文|的学习公式
// Ctrl+Right
本文的|学习公式

(Of course, it would be even better if it could support word segmentation.)

gdh1995 · 2018-05-18T05:24:44Z

It would be better if VS Code supported word segmentation the way Chrome does, although I know this requires a large dictionary and increases the package size considerably. But if it is not going to segment words, I suggest it keep moving the cursor one sentence at a time rather than one character at a time. Personally, I think jumping a single Chinese character per <Ctrl+Right> is too slow.

imhuay · 2018-05-18T06:48:06Z

However, it doesn't move one sentence at a time. In fact, it doesn't recognize Chinese punctuation either.

Examples:

|output gate 会影响结果，因此该模型有两个版本，分别为是否使用
// <Ctrl+Right>
output| gate 会影响结果，因此该模型有两个版本，分别为是否使用
// <Ctrl+Right>
output gate| 会影响结果，因此该模型有两个版本，分别为是否使用
// <Ctrl+Right>
output gate 会影响结果，因此该模型有两个版本，分别为是否使用|

smikitky · 2019-01-15T07:24:50Z

This is a longstanding problem which virtually all East-Asian developers will notice once they start editing natural sentences (say, in Markdown) on vscode. I think this is fundamentally a problem of wrong word-splitting for CJK languages (and perhaps Thai, too), which use no spaces to delimit words. A similar problem happens when you double-click a word in a line (the whole line will be selected instead of the target word) and when you trigger an autocompletion using Ctrl+Space (a whole line will be shown as a candidate).

Ideally, dictionary-based word segmentation is desirable (this is available on MS Word, Google Chrome browser, etc), but it's not 100% correct, and I'm not sure if it is really necessary for a code editor. Another practical approach that works at least in Japanese is to split words based on character types, because a typical Japanese text is a mixture of kanji, hiragana and katakana (This algorithm is implemented on most domestic text editors and even MS Notepad.exe). Character types can be easily determined via Unicode code points.

Example:

(1) 吾輩は猫である。名前はまだない。
(2) 吾輩|は|猫|で|ある|。|名前|は|まだ|ない|。
(3) 吾輩|は|猫|である|。|名前|はまだない|。

(1): Natural Japanese text with two sentences. 。 is a Japanese period.
(2): Dictionary-based word boundaries (|), available on MS Word, Chrome, etc.
(3): Codepoint-based kana-kanji boundaries, available on Firefox, Notepad.exe, etc.
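The character-type approach (3) can be sketched in a few lines of JavaScript. This is a minimal illustration, not the algorithm of any particular editor or extension: the helper names `charType` and `splitByCharType` are hypothetical, and the classes are derived from Unicode script properties through regular-expression property escapes.

```javascript
// Classify one character into a coarse class using Unicode property escapes.
// The set of classes is an assumption chosen for this illustration.
function charType(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "han";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  if (/[\p{L}\p{N}_]/u.test(ch)) return "alnum";
  if (/\s/u.test(ch)) return "space";
  return "punct";
}

// Split text into runs of consecutive characters that share a class.
function splitByCharType(text) {
  const runs = [];
  let prevType = null;
  for (const ch of text) {
    const type = charType(ch);
    if (type === prevType) {
      runs[runs.length - 1] += ch; // same class: extend the current run
    } else {
      runs.push(ch); // class changed: start a new run
    }
    prevType = type;
  }
  return runs;
}

console.log(splitByCharType("吾輩は猫である。名前はまだない。").join("|"));
// prints: 吾輩|は|猫|である|。|名前|はまだない|。
```

The output reproduces line (3) above. A real implementation would need extra care for characters that Unicode assigns to the Common script, such as the prolonged sound mark ー (U+30FC), which this sketch would not group with the surrounding katakana.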

There is already a popular extension that does (3) above for Japanese text. Unfortunately, it works on Ctrl+ ←/→ but nowhere else. It does not work on double-clicks, Ctrl+D, autocompletion, text search, and so on.

Personally, I think (3) should be implemented as part of the basic functionality of VSCode, considering the fact that it's available on any other decent text editors. Dictionary-based solution (2) may be too costly within the main vscode repository, but I hope there is a way to allow extension developers to override word-boundary detection algorithm or the double-click behavior.

By the way, in the meantime, you can alleviate this problem by adding multibyte punctuation marks such as 。 to the "editor.wordSeparators" setting. With this, the cursor will at least stop at (double-byte) periods and commas when you use Ctrl + ←/→.
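As a rough sketch of that workaround, the snippet below builds a settings.json fragment that appends full-width punctuation to the separator list. The first string is what is believed to be the editor's default value for "editor.wordSeparators"; the particular CJK marks appended are an assumption and should be adjusted to the text being edited.

```javascript
// Believed default value of editor.wordSeparators (verify against your VS Code version).
const defaultSeparators = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

// Full-width punctuation to add; this particular selection is an assumption.
const cjkPunctuation = "。、，！？：；（）「」『』【】";

const settings = {
  "editor.wordSeparators": defaultSeparators + cjkPunctuation,
};

// JSON.stringify takes care of escaping the backslash and the double quote.
console.log(JSON.stringify(settings, null, 2));
```

The printed object can be merged into a user or workspace settings.json. As noted above, this only makes the cursor stop at punctuation; it does not split the words between two marks.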

smikitky · 2019-01-15T09:53:45Z

So I searched related issues regarding CJK text navigations. I learned that "selection/navigation via double-click/keyboards" and "extracting words for autocompletion" are technically two different fields, but they are conceptually related anyway.

Keyboard navigation & Double click:

Double click to select word that don't recognize Chinese punctuation #27017: Probably shares the same root cause as this. Suggests the use of wordSeparators config, which is better than nothing, but not ideal for the aforementioned reason. Obviously there are usually many words between two commas/periods.
Moving cursor using Ctrl+(left/right arrow) in Chinese and English mixed text #25208: Is very similar to this, except that #25208 is about separation between hanja and English but this is mainly about separation between two hanjas (or between a hanja and a punctuation mark). Anyway, these are all something wordSeparators cannot handle.

Word extraction for autocompletion:

Suggestions for wordSeparators #37202: Not working even after #15177 was marked as fixed.
autocomplete doesn't honor full width period. #15177: This was recently marked as "fixed". I confirmed in the latest Insiders that words are extracted taking commas/periods into consideration, but its usefulness is limited because there are usually many words between them. And why is this markdown-only? I think something like this should be enabled in plaintext.

So in conclusion, IMHO vscode should (by default, regardless of the language) assume there is a word boundary when the character type changes between "Latin alphabet/number", "CJK unified ideograph (hanja/kanji)", "Punctuation marks (incl. multibyte ones)", "Japanese hiragana" and "Japanese katakana", even if there is no space. In addition, when Ctrl+Right is pressed inside a sequence of multiple "CJK unified ideographs", Chinese users (seem to) want the cursor to move by one character, whereas Japanese users usually want the cursor to move to the end of the sequence, as described by (3) above. This may have to be configurable with locale-based default values.

// Japanese
これは日|本語の文章
// ctrl + right
これは日本語|の文章

// Chinese
本文|的学习公式
// ctrl + right
本文的|学习公式

rebornix · 2019-01-22T18:08:43Z

@smikitky thanks for your detailed investigation ;) IMO word navigation should work seamlessly with CJKV, as ASCII word separators can't handle CJKV words. I do have a prototype that delegates word segmentation to the browser instead of dealing with that ourselves, and I will work on it in the near term, stay tuned.

WangLeto · 2019-02-08T10:36:00Z

I'd like to remind you of Ctrl + Delete, which I think shares the same logic as Ctrl + Arrow and is even more frustrating, because you can easily delete too many characters by accident.

smikitky · 2019-10-30T06:43:41Z

@rebornix This issue was once included in iteration plans, but I'm seeing no recent activity. Since we're nearing the end of the housekeeping iteration, can I ask if you have any update on this?

rebornix · 2019-11-19T16:30:40Z

Let's see if we can have time for it during holiday time.

yuboona · 2021-03-22T07:33:07Z

Is there any progress now, guys?

yuboona · 2021-03-26T04:53:55Z

I found a VS Code extension, CJK word handler. Will it be officially adopted? @kieferrm @rebornix

simonmysun · 2024-01-10T01:04:46Z

Any updates?

Actually, any Chromium-based application bundles a proper word segmentation library. You can try it in, e.g., a file-renaming input box with the common key bindings. However, this does not seem to be available through the JavaScript interface: https://bugs.chromium.org/p/chromium/issues/detail?id=129706 .

While I am appreciative of japanese-word-handler by @sgryjp and CJK word handler by @SharzyL, it seems a bit redundant to include another segmentation JavaScript library, especially when we already have a much faster one in C++. I wonder if an alternative workaround would work: copy the current line to a hidden input box and synchronize the cursor movement.

yume-chan · 2024-01-10T03:07:17Z

The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

For example:

console.log(
  JSON.stringify(
    Array.from(
      new Intl.Segmenter("en", { granularity: "word" }).segment(
        "本文的学习公式",
      ),
    ),
    undefined,
    4,
  ),
);

(In my test, it can segment any CJK language no matter which locale is specified in the constructor)

outputs

[
    {
        "segment": "本文",
        "index": 0,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "的",
        "index": 2,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "学习",
        "index": 3,
        "input": "本文的学习公式",
        "isWordLike": true
    },
    {
        "segment": "公式",
        "index": 5,
        "input": "本文的学习公式",
        "isWordLike": true
    }
]
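
Building on the snippet above, a Ctrl+Right style stop can be derived from the segment list. This is a minimal sketch under stated assumptions: `nextWordEnd` is a hypothetical helper, not part of VS Code, and offsets are UTF-16 code units because that is how `Intl.Segmenter` reports `index`.

```javascript
// Return the end offset of the first word-like segment that ends after `offset`.
// Falls back to the end of the text when no such segment exists.
function nextWordEnd(text, offset, locales) {
  const segmenter = new Intl.Segmenter(locales, { granularity: "word" });
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    const end = index + segment.length;
    if (isWordLike && end > offset) return end;
  }
  return text.length;
}

// Walk through the sample sentence one Ctrl+Right at a time.
const sample = "本文的学习公式";
const stops = [];
let cursor = 0;
while (cursor < sample.length) {
  cursor = nextWordEnd(sample, cursor, "zh");
  stops.push(cursor);
}
console.log(stops);
```

If the runtime segments the sentence as in the output listed above, the stops are 2, 3, 5 and 7. The exact boundaries depend on the dictionary data shipped with the JavaScript engine.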

yutotnh · 2024-01-21T06:04:30Z

The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

Just recently I developed an extension that takes advantage of it.

I really wanted to make a pull request to the main body of VS Code, but my technical capabilities were limited to releasing it as an extension.

Marketplace: Word Divider
Repository: yutotnh/word-divider

yutotnh · 2024-01-27T12:21:04Z

I was able to add functionality to VS Code itself, so I created a pull request.
The pull request created is #203605.

Now that it can be integrated within a process, it can do more than just be an extension, such as being able to select words with a double-click.

rinzwind5 · 2024-02-18T03:48:57Z

I was surprised, since VS Code is based on browser technology and browsers handle this stuff great.
There is also an issue with Thai:
คนไทยที่นับถือศาสนาพุทธเกินห้าสิบเปอร์เซ็นต์
Should be broken down as คน ไทย ที่ นับถือ ศาสนา พุทธ เกิน ห้า สิบ เปอร์เซ็นต์

https://fuqua.io/thai-word-split/browser/

This makes VS Code, at the very least, unusable as a generic text editor for these languages. Notepad does work fine, by the way (so Windows has native support, as expected).

…203605) * Add support for recognizing word locales in word operations (#50045) * Move intlSegmenterLocales in the WordCharacterClassifier class * Rerun compiler * Renames * Avoid duplicating code --------- Co-authored-by: Alex Dima <alexdima@microsoft.com>

alexdima · 2024-06-12T20:17:03Z

Thanks to #203605, it is now possible to configure editor.wordSegmenterLocales to define the locales to be used for word segmentation.
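
A possible way to prepare that setting is sketched below. The value shape (a list of BCP 47 locale tags) and the choice of locales are assumptions made for illustration; check the setting's description in your VS Code version. `Intl.Segmenter.supportedLocalesOf` is a standard method that filters the list down to locales the runtime can segment.

```javascript
// Locales to request; which ones to list is an assumption for illustration.
const requested = ["ja", "zh-CN", "th"];

// Keep only the locales this JavaScript runtime's segmenter supports.
const supported = Intl.Segmenter.supportedLocalesOf(requested);

// Assumed shape: the setting takes a list of locale tags.
const settings = { "editor.wordSegmenterLocales": supported };

console.log(JSON.stringify(settings, null, 2));
```

The printed fragment can be merged into settings.json. Running the check in a Chromium or Node.js console gives a rough idea of which locales the editor's own runtime is likely to support.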

vscodebot bot added editor editor-core Editor basic functionality labels May 17, 2018

imhuay changed the title ~~Feature request: Split the Chinese text into one word once Ctrl+Left/Right~~ Feature request: Treat the Chinese text as a Chinese sequence when using Ctrl+Left/Right May 17, 2018

mjbvz assigned rebornix May 17, 2018

imhuay closed this as completed May 18, 2018

imhuay reopened this May 18, 2018

rebornix mentioned this issue Sep 6, 2018

Optimize word selection for wide character #42005

Closed

rebornix added the feature-request Request for new features or functionality label Sep 6, 2018

alexdima removed editor labels Sep 20, 2018

kieferrm mentioned this issue Nov 14, 2018

Iteration Plan for November 2018 #62876

Closed

45 tasks

kieferrm mentioned this issue Dec 21, 2018

Iteration Plan for January 2019 #65570

Closed

42 tasks

rebornix added this to the On Deck milestone Jan 22, 2019

sgryjp mentioned this issue Aug 12, 2019

Is there better way to handle double click selection? sgryjp/japanese-word-handler#4

Open

HiroKws mentioned this issue Sep 6, 2021

Adjust auto-highlighting of the cursor position to the setting at select. #130869

Open

rebornix modified the milestones: On Deck, Backlog Nov 5, 2021

alexdima added editor-wordnav Editor word navigation issues and removed editor-core Editor basic functionality labels Dec 10, 2022

yutotnh added a commit to yutotnh/vscode that referenced this issue Jan 27, 2024

Add support for recognizing word locales in word operations (microsof…

bc776b1

…t#50045)

yutotnh mentioned this issue Jan 27, 2024

Add support for recognizing word locales in word operations (#50045) #203605

Merged

rebornix assigned alexdima and unassigned rebornix Jan 29, 2024

This was referenced Apr 15, 2024

VSCode Intl usage is problem for all of us? CodinGame/monaco-vscode-api#397

Closed

Browsers support coverage for Intl.Segmenter #210577

Closed

WellWells mentioned this issue Apr 26, 2024

Enhancing VSCode with Word-Level Navigation for Multilingual Support #210123

Closed

alexdima closed this as completed Jun 12, 2024

alexdima modified the milestones: Backlog, April 2024 Jun 12, 2024

Ninglo mentioned this issue Jul 26, 2024

editor.wordSegmenterLocales configuration don't take effect in simpleWidget editors (like chat or SCM input Editor) #223920

Open

vs-code-engineering bot locked and limited conversation to collaborators Jul 27, 2024
