Feature request: Treat the Chinese text as a Chinese sequence when using Ctrl+Left/Right
#50045
Comments
Ctrl+Left/Right
Ctrl+Left/Right
It would be better if VS Code supported word segmentation the way Chrome does, although I know this requires a large dictionary and increases the package size considerably. If word segmentation is not an option, I suggest that it keep moving the cursor one sentence at a time instead of one character at a time; personally, I think jumping one Chinese character at a time is too slow on |
However, it does not move one sentence at a time. In fact, it does not recognize Chinese punctuation either. Examples:
|
This is a longstanding problem that virtually all East Asian developers notice once they start editing natural sentences (say, in Markdown) in vscode. I think this is fundamentally a problem of wrong word splitting for CJK languages (and perhaps Thai, too), which use no spaces to delimit words. A similar problem happens when you double-click a word in a line (the whole line is selected instead of the target word) and when you trigger autocompletion using Ctrl+Space (a whole line is shown as a candidate).
Ideally, dictionary-based word segmentation is desirable (it is available in MS Word, the Google Chrome browser, etc.), but it is not 100% correct, and I'm not sure it is really necessary for a code editor. Another practical approach that works at least in Japanese is to split words based on character types, because typical Japanese text is a mixture of kanji, hiragana and katakana. (This algorithm is implemented in most domestic text editors and even in MS Notepad.exe.) Character types can be easily determined via Unicode code points.
Example:
(1) 吾輩は猫である。名前はまだない。
(2) 吾輩|は|猫|で|ある|。|名前|は|まだ|ない|。
(3) 吾輩|は|猫|である|。|名前|はまだない|。
(1): Natural Japanese text with two sentences. (2): Dictionary-based word segmentation. (3): Splitting by character type.
There is already a popular extension that does (3) above for Japanese text. Unfortunately, it works on Ctrl+←/→ but nowhere else. It does not work on double-clicks, Ctrl+D, autocompletion, text search, and so on.
Personally, I think (3) should be implemented as part of the basic functionality of VSCode, considering that it is available in any other decent text editor. The dictionary-based solution (2) may be too costly within the main vscode repository, but I hope there is a way to allow extension developers to override the word-boundary detection algorithm or the double-click behavior. By the way, in the meantime, you can alleviate this problem by tweaking |
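A minimal sketch of the character-type approach described in the comment above, in plain JavaScript. The helper names `charClass` and `splitByCharClass` are hypothetical, and the code point ranges are a deliberate simplification (main hiragana, katakana and CJK ideograph blocks only); this is not how any particular editor or extension implements it.

```javascript
// Hypothetical helper: map one character to a coarse class.
// The ranges below cover only the main Unicode blocks (a simplification).
function charClass(ch) {
  const cp = ch.codePointAt(0);
  if (/\s/.test(ch)) return "space"; // also matches U+3000 ideographic space
  if (cp >= 0x3040 && cp <= 0x309f) return "hiragana";
  if (cp >= 0x30a0 && cp <= 0x30ff) return "katakana";
  if ((cp >= 0x3400 && cp <= 0x4dbf) || (cp >= 0x4e00 && cp <= 0x9fff)) {
    return "ideograph"; // CJK unified ideographs (kanji/hanzi/hanja)
  }
  if (/[\p{P}\p{S}]/u.test(ch)) return "punctuation"; // ASCII and multibyte
  return "other"; // Latin letters, digits, and anything not covered above
}

// Split text into runs of characters that share the same class.
function splitByCharClass(text) {
  const runs = [];
  let prevClass = null;
  for (const ch of text) { // iterates by code point, not UTF-16 unit
    const cls = charClass(ch);
    if (cls === prevClass) {
      runs[runs.length - 1] += ch;
    } else {
      runs.push(ch);
    }
    prevClass = cls;
  }
  return runs;
}

console.log(splitByCharClass("吾輩は猫である。名前はまだない。").join("|"));
// prints: 吾輩|は|猫|である|。|名前|はまだない|。
```

On the sample sentence this reproduces split (3): no dictionary is involved, so inflected forms such as である stay in one run.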
I searched for related issues about CJK text navigation. I learned that "selection/navigation via double-click/keyboard" and "extracting words for autocompletion" are technically two different areas, but they are conceptually related anyway. Keyboard navigation & double-click:
Word extraction for autocompletion:
In conclusion, IMHO vscode should (by default, regardless of the language) assume a word boundary wherever the character type changes between "Latin letters/digits", "CJK unified ideographs (hanja/kanji)", "punctuation marks (incl. multibyte ones)", "Japanese hiragana" and "Japanese katakana", even if there is no space. In addition, when Ctrl+Right is pressed inside a sequence of multiple CJK unified ideographs, Chinese users (seem to) want the cursor to move by one character, whereas Japanese users usually want the cursor to move to the end of the sequence, as described by (3) above. This may have to be configurable, with locale-based default values.
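The proposal above could be sketched roughly as follows. The function `nextWordBoundary` and its `ideographStep` option ("char" for the behavior Chinese users seem to want, "run" for the Japanese one) are hypothetical names for illustration, not an existing VS Code setting, and the class ranges are simplified to the Basic Multilingual Plane.

```javascript
// Coarse classes from the proposal; simplified, BMP-only ranges.
function classify(ch) {
  if (/\s/.test(ch)) return "space";
  if (/[\u3040-\u309f]/.test(ch)) return "hiragana";
  if (/[\u30a0-\u30ff]/.test(ch)) return "katakana";
  if (/[\u3400-\u4dbf\u4e00-\u9fff]/.test(ch)) return "ideograph";
  if (/[\p{P}\p{S}]/u.test(ch)) return "punctuation";
  return "other"; // Latin letters, digits, everything else
}

// Offset (in UTF-16 code units) where a Ctrl+Right style move would stop.
// ideographStep is an assumed option: "char" = one ideograph per move,
// "run" = jump to the end of the ideograph sequence.
function nextWordBoundary(text, offset, ideographStep = "run") {
  let pos = offset;
  // Skip any whitespace directly after the cursor.
  while (pos < text.length && classify(text[pos]) === "space") pos++;
  if (pos >= text.length) return text.length;

  const cls = classify(text[pos]);
  if (cls === "ideograph" && ideographStep === "char") return pos + 1;

  // Otherwise advance until the character class changes.
  while (pos < text.length && classify(text[pos]) === cls) pos++;
  return pos;
}

const sample = "本文的学习公式abc";
console.log(nextWordBoundary(sample, 0, "char")); // 1
console.log(nextWordBoundary(sample, 0, "run"));  // 7
```

A locale-based default would then only need to choose the initial value of the option; the boundary logic itself stays language-independent.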
|
@smikitky thanks for your detailed investigation ;) IMO word navigation should work seamlessly with CJKV, as ASCII word separators can't handle CJKV words. I do have a prototype that delegates word segmentation to the browser instead of dealing with it ourselves, and I will work on that in the near term, stay tuned. |
I'd like to remind you of Ctrl + Delete, which I think may share the same logic as Ctrl + Arrow and is even more upsetting, because you can easily delete too many characters by accident. |
@rebornix This issue was once included in iteration plans, but I'm seeing no recent activity. Since we're nearing the end of the housekeeping iteration, can I ask if you have any update on this? |
Let's see if we can have time for it during holiday time. |
Is there any progress now, guys? |
Any updates? Actually, any Chromium-based application bundles a proper word segmentation library. You can try it, e.g., in the file-renaming input box with the common key bindings. It does not, however, seem to be available to the JavaScript interface: https://bugs.chromium.org/p/chromium/issues/detail?id=129706 . While I appreciate japanese-word-handler by @sgryjp and CJK word handler by @SharzyL, it seems a bit redundant to include another JavaScript segmentation library, especially when we already have a much faster one in C++. I wonder if an alternative workaround would work: copy the current line to a hidden input box and synchronize the cursor movement. |
The segmenter is available in JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
For example:
console.log(
JSON.stringify(
Array.from(
new Intl.Segmenter("en", { granularity: "word" }).segment(
"本文的学习公式",
),
),
undefined,
4,
),
);
(In my test, it can segment any CJK language no matter which locale is specified in the constructor.)
This outputs:
[
{
"segment": "本文",
"index": 0,
"input": "本文的学习公式",
"isWordLike": true
},
{
"segment": "的",
"index": 2,
"input": "本文的学习公式",
"isWordLike": true
},
{
"segment": "学习",
"index": 3,
"input": "本文的学习公式",
"isWordLike": true
},
{
"segment": "公式",
"index": 5,
"input": "本文的学习公式",
"isWordLike": true
}
] |
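As a rough illustration of how the segments shown above could drive cursor movement, the sketch below finds where a Ctrl+Right style move would stop. `nextWordEnd` is a hypothetical helper, not VS Code's implementation; it relies only on `Intl.Segmenter`, `segment()` and `containing()`, and the actual boundaries for CJK text depend on the ICU dictionary data shipped with the runtime.

```javascript
// Hypothetical helper: offset of the end of the next word-like segment
// at or after `offset`, using the runtime's built-in segmenter.
function nextWordEnd(text, offset, locale = "zh") {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  let pos = offset;
  while (pos < text.length) {
    // containing() returns the segment that covers the given index.
    const seg = segments.containing(pos);
    pos = seg.index + seg.segment.length;
    // Spaces and punctuation are not word-like; keep going past them.
    if (seg.isWordLike) break;
  }
  return pos;
}

// Boundaries for Chinese text come from ICU's dictionary, so the exact
// result may differ between runtimes.
console.log(nextWordEnd("本文的学习公式", 0));
console.log(nextWordEnd("hello world", 0)); // 5
```

Moving left would work the same way in reverse, using `seg.index` of the segment that contains `offset - 1`.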
Just recently I developed an extension that takes advantage of it. I really wanted to make a pull request to the main body of VS Code, but my technical capabilities were limited to releasing it as an extension.
|
I was able to add functionality to VS Code itself, so I created a pull request. Now that it can be integrated within a process, it can do more than just be an extension, such as being able to select words with a double-click. |
I was surprised, since VS Code is based on browser technology and browsers handle this stuff great. https://fuqua.io/thai-word-split/browser/ This makes VS Code unusable, at the very least as a generic text editor, for these languages. Notepad does work fine btw (so Windows has native support, as expected). |
…t#50045) (microsoft#203605) * Add support for recognizing word locales in word operations (microsoft#50045) * Move intlSegmenterLocales in the WordCharacterClassifier class * Rerun compiler * Renames * Avoid duplicating code --------- Co-authored-by: Alex Dima <alexdima@microsoft.com>
Thanks to #203605, it is now possible to configure |
Currently, VSCode treats a long run of Chinese text as one “word”. Each press of Ctrl+Left/Right moves the cursor to the beginning or end of the run.
The feature request is to treat Chinese text as a Chinese sequence, so that each press of Ctrl+Left/Right moves the cursor just one step. This is the default behavior of the system's text programs.
Example (using | as the cursor):
Expected:
(Of course, it would be better if it could support word segmentation.)