Add new option: search_word_boundary #2898
Open
+44
−1
Summary
New attempt at #2896 in a single commit.
The default / current word boundary regex
^|\\b|\\s
is remarkably bad at actually finding word boundaries in languages that use non-ASCII characters. A word boundary is basically detected after any non-ASCII character (e.g. ü, å, ø and æ, to mention just a few of MANY), because `\b` only treats ASCII letters, digits and the underscore as word characters. I've looked into the possibilities, and unfortunately there doesn't seem to be any way to get decent word-boundary detection for anything except ASCII in JavaScript's RegExp implementation without either using a third-party library or including some 4k+ characters in the string.
Therefore, I don't see any way to reliably detect word boundaries with any pre-set, hardcoded regex.
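A minimal sketch of the false positive described above. The helper `matchesAtBoundary` is hypothetical and only for illustration; it is not the library's code, and it assumes the search term contains no regex metacharacters.

```javascript
// Default boundary pattern quoted in this PR, as a RegExp source string.
const DEFAULT_BOUNDARY = "^|\\b|\\s";

// Hypothetical helper (illustration only): does `term` occur right after
// something the boundary pattern accepts as a word boundary?
// Assumes `term` contains no regex metacharacters.
function matchesAtBoundary(term, text, boundary = DEFAULT_BOUNDARY) {
  return new RegExp("(?:" + boundary + ")" + term, "i").test(text);
}

// "l" does not start a word in "al": no boundary between two ASCII letters.
console.log(matchesAtBoundary("l", "al")); // false

// "l" does not start a word in "øl" either, but \b fires between "ø" and "l"
// because "ø" is not an ASCII word character.
console.log(matchesAtBoundary("l", "øl")); // true
```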
Turning it into an option means that people can at least set something appropriate for their individual language and / or use case if they care about word boundaries being detected in "weird" places.
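As a rough illustration of what a language-specific setting could look like, the sketch below builds a search regex from a caller-supplied boundary pattern. `buildSearchRegex` and `danishBoundary` are hypothetical names, and the Danish pattern is just an assumed example value; how the option is actually wired in is defined by the diff in this PR, not by this snippet.

```javascript
// Hypothetical helper (not the library's code): build a search regex from a
// caller-supplied boundary pattern, the role search_word_boundary would play.
function buildSearchRegex(term, boundary) {
  // Escape regex metacharacters in the user's search term.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("(?:" + boundary + ")" + escaped, "i");
}

// Assumed example value for Danish text: start of string, or any character
// that is not an ASCII word character and not one of æ, ø, å.
const danishBoundary = "^|[^A-Za-z0-9_æøåÆØÅ]";

// Default pattern: "l" is wrongly found at a "word start" inside "øl".
console.log(buildSearchRegex("l", "^|\\b|\\s").test("godt øl")); // true

// Custom pattern: "ø" counts as a letter, so "l" no longer matches...
console.log(buildSearchRegex("l", danishBoundary).test("godt øl")); // false

// ...while the real word start is still found.
console.log(buildSearchRegex("øl", danishBoundary).test("godt øl")); // true
```

Note that this example pattern consumes the character before the term, which is fine for a yes/no test but would matter if the match position were used for highlighting.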
Please double-check that:
package.json
References
First partial PR from: #2894
Should solve this issue: #2862