Right now anythingButString is implemented in a very non-ideal way (see #58). The plan is to replace the existing function, and potentially add one more.
anythingButString('aeiou') will produce output like:
// non-capturing group containing a negative lookahead for the exact string, followed by any character repeated inputString.length times
/(?:(?!aeiou).{5})/
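To make the proposed output concrete, here is a minimal sketch of how such a pattern could be built and how it behaves. `anythingButStringSketch` and `escapeRegExp` are hypothetical helpers written for illustration only, not the library's actual implementation.

```javascript
// Hypothetical helper: escape regex syntax characters in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical sketch of the proposed output: a non-capturing group holding a
// negative lookahead for the exact string, then `length` arbitrary characters.
function anythingButStringSketch(inputString) {
  const escaped = escapeRegExp(inputString);
  return new RegExp(`(?:(?!${escaped}).{${inputString.length}})`);
}

const re = anythingButStringSketch('aeiou');
console.log(re.source); // (?:(?!aeiou).{5})

// Anchored, the pattern accepts any 5 characters except the exact string.
const anchored = new RegExp(`^${re.source}$`);
console.log(anchored.test('hello')); // true
console.log(anchored.test('aeiou')); // false

// Unanchored, the engine simply slides past the forbidden position.
const found = re.exec('aeiouxyz');
console.log(found.index, found[0]); // 1 eioux
```

Note that without anchors the pattern still matches inside a string containing `aeiou`, just starting one position later, so callers would most likely combine it with other constraints.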
This implementation will only work predictably for ASCII-type strings, because length actually counts UTF-16 code units, not characters. On top of that, the same Unicode character can be encoded in multiple distinct ways (for example precomposed, or as a base character plus combining marks), because strings are not normalised automatically.
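A short illustration of both problems, using only standard JavaScript string behaviour (the variable names are just for this example):

```javascript
// The same visible character "é" in two encodings.
const precomposed = '\u00E9';  // single code point U+00E9
const decomposed = 'e\u0301';  // "e" followed by combining acute accent U+0301

console.log(precomposed.length); // 1
console.log(decomposed.length);  // 2
console.log(precomposed === decomposed); // false

// Normalising to a common form makes them comparable.
console.log(precomposed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === precomposed); // true

// Characters outside the Basic Multilingual Plane take two UTF-16 code units.
const emoji = '\u{1F600}';
console.log(emoji.length);      // 2 (code units)
console.log([...emoji].length); // 1 (code points)
```

So `.{inputString.length}` can ask for more "characters" than a reader would count by eye.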
To provide an API that can also deal with Unicode strings, something like anythingButStringUnicode(inputString, numCharactersToMatch) could be added. In this case, the user would be expected to provide the actual number of characters that should be matched after the lookahead. This is fraught in itself because of normalisation, and because the string you actually want to match in its place may not have that number of code points anyway.
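A rough sketch of what that could look like, assuming the generated regex uses the `u` flag so that `.` consumes whole code points. `anythingButStringUnicodeSketch` and `escapeRegExp` are hypothetical names for illustration, and the patterns are anchored here only to keep the comparison simple.

```javascript
// Hypothetical helper: escape regex syntax characters in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical sketch: the caller supplies the character count explicitly,
// and the `u` flag makes `.` match a full code point.
function anythingButStringUnicodeSketch(inputString, numCharactersToMatch) {
  const escaped = escapeRegExp(inputString);
  return new RegExp(`^(?:(?!${escaped}).{${numCharactersToMatch}})$`, 'u');
}

const banned = '\u{1F600}\u{1F600}'; // two emoji, but four UTF-16 code units

// Length-based version: demands 4 code units after the lookahead.
const byLength = new RegExp(`^(?:(?!${escapeRegExp(banned)}).{${banned.length}})$`);
// Count-based version: demands 2 code points after the lookahead.
const byCount = anythingButStringUnicodeSketch(banned, 2);

console.log(byLength.test('ab'));                // false
console.log(byCount.test('ab'));                 // true
console.log(byCount.test(banned));               // false
console.log(byCount.test('\u{1F603}\u{1F603}')); // true
```

This still counts code points rather than user-perceived characters, so a decomposed sequence would need a different count than its precomposed form.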
I imagine that this API would still cause confusion for users, both those looking explicitly to match Unicode strings, and those who assume they should use this version of the function because why wouldn't you use Unicode? In that case, it may be better to skip it altogether, and let the user build the equivalent manually with the group/assertAhead/anyChar/exactly APIs. Though in that case, it still might be worth adding an anyDataUnit as a low-level API for Unicode matching.