Avoid breaking on wrong side of brackets #3811

1ec5 · 2016-12-15T10:41:27Z

Per #3743 (comment), we should weight all left and right brackets, not only ASCII parentheses, to avoid breaking on the wrong side of them. A comprehensive list of such brackets can be found by querying the Unicode Character Database for the following properties:

Ps (Punctuation, open)
Pe (Punctuation, close)
Pi (Punctuation, initial quote)
Pf (Punctuation, final quote)

This table may be a good starting point.
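
As a rough illustration of the four categories above, here is a minimal sketch that classifies a single character by side. It assumes a JavaScript runtime with ES2018 Unicode property escapes (`\p{…}` with the `u` flag), which this issue does not rely on. `bracketSide` is a hypothetical helper name, not an existing function in the codebase.

```javascript
// Hypothetical helper, for illustration only.
// Ps/Pi are treated as "opening" and Pe/Pf as "closing". For quotation
// marks (Pi/Pf) this is a simplification, since their side can depend
// on the language.
const OPENING = /^[\p{Ps}\p{Pi}]$/u;
const CLOSING = /^[\p{Pe}\p{Pf}]$/u;

function bracketSide(char) {
    if (OPENING.test(char)) return 'open';
    if (CLOSING.test(char)) return 'close';
    return null; // not in any of the four categories
}
```

One caveat: ASCII `<` and `>` (and their fullwidth forms) are categorized as math symbols (Sm), so a query on these four properties alone would not pick them up. They would have to be listed by hand.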

Of interest to major Western languages are the following brackets:

()[]{}<>«»‹›

These quotation marks may be problematic because they bind on different sides depending on the language. I think we should penalize them only when surrounded by ideographic characters:

“”‘’„

Of interest to CJK are the above, plus:

（）｛｝〔〕〘〙【】《》〈〉〖〗＜＞［］｟｠「」『』｢｣
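
To make the proposal concrete, here is a minimal sketch of how a break penalty could be assigned using the characters listed above. Everything named here (`wrongSidePenalty`, `PENALTY`, the `isIdeographic` test) is hypothetical and is not the existing line-breaking code. The ideographic check is reduced to Han, Hiragana, and Katakana for brevity, and the penalty value is a placeholder.

```javascript
// Assumed character sets, taken from the lists in this issue.
const OPEN_BRACKETS = new Set('([{<«‹（｛〔〘【《〈〖＜［｟「『｢');
const CLOSE_BRACKETS = new Set(')]}>»›）｝〕〙】》〉〗＞］｠」』｣');
// Quotation marks that bind differently per language; treated as
// opening/closing only when surrounded by ideographic characters.
const OPEN_QUOTES = new Set('“‘„');
const CLOSE_QUOTES = new Set('”’');

const PENALTY = 1; // placeholder weight, not a tuned value

// Simplified stand-in for a real ideographic check.
const IDEOGRAPHIC = /^[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]$/u;
const isIdeographic = (ch) => ch !== undefined && IDEOGRAPHIC.test(ch);

function surroundedByIdeographs(chars, i) {
    return isIdeographic(chars[i - 1]) && isIdeographic(chars[i + 1]);
}

// Penalty for breaking between chars[index - 1] and chars[index].
function wrongSidePenalty(chars, index) {
    const prev = chars[index - 1];
    const next = chars[index];
    let penalty = 0;
    // Breaking right after an opening mark strands it at the end of a line.
    if (OPEN_BRACKETS.has(prev) ||
        (OPEN_QUOTES.has(prev) && surroundedByIdeographs(chars, index - 1))) {
        penalty += PENALTY;
    }
    // Breaking right before a closing mark strands it at the start of a line.
    if (CLOSE_BRACKETS.has(next) ||
        (CLOSE_QUOTES.has(next) && surroundedByIdeographs(chars, index))) {
        penalty += PENALTY;
    }
    return penalty;
}
```

For example, with `Array.from('a (b) c')` the break after `(` and the break before `)` are penalized, while the breaks on the outer sides of the parentheses are not.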

/ref #3505
/cc @ChrisLoer @nickidlugash


ChrisLoer · 2016-12-16T00:17:58Z

Maybe it simplifies things to just disable breaking for all of these when they're adjacent to non-ideographic characters? Although opening/closing punctuation might be a decent breaking point in most Western text, you'd also usually expect to have a space before or after the punctuation...

FWIW, we can query the Unicode character properties table in our code using ICU, but doing so would pull in a 35 KB data dependency.

1ec5 · 2016-12-16T02:34:00Z

Intuitively, I’d expect most of these punctuation marks (the non-ideographic ones) to behave just like the ASCII parentheses that we’ve special-cased. If we’ve special-cased ASCII parentheses for the ideographic case specifically, then I agree that we should only treat them as breaking when they’re in the middle of text that doesn’t use spaces as word separators (particularly Chinese, Japanese, and Thai).

In the absence of word separators, probably all the charHasNeutralVerticalOrientation() characters are breakable, but the brackets are breakable only on one side. So I guess I’d consider that left/right bias to be the criterion for special-casing.

I don’t think it’d be necessary to query the entire UCD at runtime. We could manually build a list of qualifying characters based on these properties, just as we did in script_detection.js. Alternatively, we could automate that process in a build step. Either way, we’d only pull in the specific characters we want to assign special weights to.
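
A build step along those lines might look like the following sketch. It uses the JavaScript engine's own Unicode property escapes as a stand-in for querying the UCD directly, so the result depends on the Unicode version the engine ships with. `collectCodePoints` and `toHexList` are hypothetical names, and the scan is limited to the Basic Multilingual Plane by default.

```javascript
// Hypothetical build-time helper: list every code point below `limit`
// whose single-character string matches `pattern`.
function collectCodePoints(pattern, limit = 0x10000) {
    const result = [];
    for (let cp = 0; cp < limit; cp++) {
        if (pattern.test(String.fromCodePoint(cp))) result.push(cp);
    }
    return result;
}

// Format code points as hex literals suitable for pasting into a source file.
function toHexList(codePoints) {
    return codePoints.map(
        (cp) => '0x' + cp.toString(16).toUpperCase().padStart(4, '0')
    );
}

// Only the characters we want to weight end up in the generated lists.
const opening = collectCodePoints(/^\p{Ps}$/u);
const closing = collectCodePoints(/^\p{Pe}$/u);
```

The output of `toHexList(opening)` could then be written into a generated module, so nothing beyond the final list ships at runtime.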

1ec5 added the feature 🍏 label Dec 15, 2016

This was referenced Dec 15, 2016:

Fix for issue #3658: improve line breaking #3743 (Merged)
Halfwidth punctuation prevents ideographic line breaking #3658 (Closed)
