Maybe it simplifies things to just disable breaking for all of these when they're adjacent to non-ideographic characters? Although opening/closing punctuation might be a decent breaking point in most Western text, you'd also usually expect to have a space before or after the punctuation...
FWIW, we can query the Unicode character properties table in our code using ICU, but it'll pull in a 35 KB data dependency if we do.
Intuitively, I’d expect most of these punctuation marks (the non-ideographic ones) to behave just like the ASCII parentheses that we’ve special-cased. If we’ve special-cased ASCII parentheses for the ideographic case specifically, then I agree that we should only treat them as breaking when they’re in the middle of text that doesn’t use spaces as word separators (particularly Chinese, Japanese, and Thai).
In the absence of word separators, probably all the charHasNeutralVerticalOrientation() characters are breakable, but the brackets are breakable only on one side. So I guess I’d consider that left/right bias to be the criterion for special-casing.
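To make the left/right bias concrete, here is a minimal sketch of a one-sided break penalty. The function name `bracketBreakPenalty`, the penalty value, and the two character sets are hypothetical and hand-picked for illustration; they are not taken from the existing line-breaking code.

```javascript
// Hypothetical, hand-picked sets for illustration only. A real list would be
// derived from the Unicode properties discussed in this issue.
const OPENING = new Set(['(', '[', '{', '<', '（', '「', '『', '【']);
const CLOSING = new Set([')', ']', '}', '>', '）', '」', '』', '】']);

// Returns a penalty for placing a line break between `prev` and `next`.
// An opening bracket binds to what follows it, and a closing bracket binds
// to what precedes it, so each is penalized on one side only.
function bracketBreakPenalty(prev, next, penalty = 50) {
    // Breaking right after an opening bracket would strand it at the end of a line.
    if (OPENING.has(prev)) return penalty;
    // Breaking right before a closing bracket would strand it at the start of a line.
    if (CLOSING.has(next)) return penalty;
    return 0;
}
```

With this shape, a break before `（` or after `）` stays free, while a break on the inner side of either bracket is discouraged.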
I don’t think it’d be necessary to query the entire UCD at runtime. We could manually build a list of qualifying characters based on these properties, just as we did in script_detection.js. Alternatively, we could automate that process in a build step. Either way, we’d only pull in the specific characters we want to assign special weights to.
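As a rough sketch of what the automated build step could look like, the script below enumerates code points by General_Category using JavaScript's Unicode property escapes, so no UCD data has to ship at runtime. It assumes a JavaScript engine with ES2018 regular expressions (for example Node.js 10 or later); `CATEGORIES`, `collectBracketCodePoints`, and the default cutoff at the Basic Multilingual Plane are hypothetical choices, not existing project code.

```javascript
// One regular expression per General_Category value of interest.
// No `g` flag, so `test()` keeps no state between calls.
const CATEGORIES = {
    open: /\p{Ps}/u,
    close: /\p{Pe}/u,
    initialQuote: /\p{Pi}/u,
    finalQuote: /\p{Pf}/u
};

// Scans code points from 0 to `maxCodePoint` and groups the matching ones by category.
// The default cutoff (the Basic Multilingual Plane) is an assumption for this sketch.
function collectBracketCodePoints(maxCodePoint = 0xFFFF) {
    const result = {open: [], close: [], initialQuote: [], finalQuote: []};
    for (let cp = 0; cp <= maxCodePoint; cp++) {
        // Surrogate code points are not characters; skip them.
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;
        const ch = String.fromCodePoint(cp);
        for (const name of Object.keys(CATEGORIES)) {
            if (CATEGORIES[name].test(ch)) {
                result[name].push(cp);
                break;
            }
        }
    }
    return result;
}

// A build step could serialize this table into a generated source file.
const table = collectBracketCodePoints();
```

Note that `<` and `>` are classified as math symbols rather than punctuation, so they would still need to be added by hand if we want to weight them.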
Per #3743 (comment), we should weight all left and right brackets, not only ASCII parentheses, to avoid breaking on the wrong side of them. A comprehensive list of such brackets can be found by querying the Unicode Character Database for the following properties:
- Ps (Punctuation, open)
- Pe (Punctuation, close)
- Pi (Punctuation, initial quote)
- Pf (Punctuation, final quote)

This table may be a good starting point.
Of interest to major Western languages are the following brackets:

()[]{}<>«»‹›
These quotation marks may be problematic because they bind on different sides depending on the language. I think we should penalize them only when surrounded by ideographic characters:
Of interest to CJK are the above, plus:
/ref #3505
/cc @ChrisLoer @nickidlugash