Segmenting Brahmi-derived scripts involving conjuncts #53
Here's a little more background on the situation.

Consonant clusters in Indic scripts like Devanagari, Bengali, and Tamil kill the inherent vowel with a virama. A consonant cluster (i.e. more than one consonant without intervening vowels) can be rendered in one of two ways: as a conjunct (the virama is hidden), or with a visible virama between the consonants. The former is very common for Devanagari and Bengali. The latter is the default for modern Tamil, apart from a small number of exceptions.

If there's no visible virama, the consonant cluster must not be broken. For example, a vowel-sign that is placed before the base, and which is pronounced after all the consonants, must appear before the first consonant in the whole cluster. If there is a visible virama, the consonant cluster can be broken, and a pre-base vowel-sign will appear between the consonants, just before the last. There are 3-consonant, and occasionally 4-consonant, clusters, and the same rules apply to them.

So Chrome is treating the conjuncts as a single unit, which is great. But it's also treating sequences with a visible virama as a single unit, which isn't as great. This is a difficult problem to solve, because (crucially) the sequence of code points for clusters that are conjuncts and those that aren't (i.e. have a visible virama) is exactly identical. The difference is only produced as the font does its rendering and decides whether or not to hide the virama. And some fonts have more conjunct glyphs than others, so it varies by font (see an example at https://r12a.github.io/scripts/devanagari/#visiblevirama). So unless we can tell, for the particular font being used in that instance, how it renders the cluster, we can't decide whether to keep everything as a single unit or allow it to break after the virama.

For a brief introduction with pictures see Typographic character units in complex scripts.
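To make the "identical code points" point concrete, here is a tiny Python illustration (the Devanagari cluster "kta" is just one example; nothing here is browser code):

```python
# Devanagari "kta": ka + virama + ta. These three code points are all
# the text contains, whether the font shows the conjunct ligature or
# renders ka with a visible virama followed by ta.
KA, VIRAMA, TA = "\u0915", "\u094D", "\u0924"

cluster = KA + VIRAMA + TA
codes = [f"U+{ord(c):04X}" for c in cluster]
print(codes)  # ['U+0915', 'U+094D', 'U+0924'] in both renderings
```

The text alone gives a segmenter nothing to decide with; the choice between conjunct and visible virama lives entirely in the font's substitution tables.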
And finally, here are some thoughts from @kojiishi:
I second the suggestion that, in the absence of cleverer methods, the segmentation used by Chrome is a better default behaviour.

I can also suggest that if the code point sequence contains a virama character followed by a ZWNJ, that would indicate an explicit virama that is not an accident of a particular font's behaviour, so that would suggest a safe place to segment. If, for example, I wanted to apply CSS styling in a way that would only affect the first consonant + virama in a Tamil sequence, I could add ZWNJ to make clear that the cluster can be segmented at that point.

[As an aside: I have been leaning for some time towards the opinion that all the Brahmi-derived scripts in Unicode should be given explicit virama characters that are independent of the graphical conjunct-forming characters. I know there are all sorts of reasons why this is unlikely to happen.]

OpenType Layout shaping engines track outcomes of some glyph substitution features and, in particular, the presence of an explicit virama in a string after the orthographic unit shaping feature block has been processed, because this information is necessary to the reordering of ikar and reph (in modern convention). So the logic for determining whether a semi-processed glyph string contains an explicit virama exists and is well defined. The question, I suppose, is whether browsers could apply or access that logic during segmentation?
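The virama + ZWNJ heuristic could be sketched roughly as follows. This is only an illustration of the suggestion, not any engine's actual logic; the function name and the restriction to the Devanagari virama are my own simplifications:

```python
VIRAMA = "\u094D"  # Devanagari virama only, for illustration;
                   # a real tool would cover each script's virama
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

def safe_break_points(text):
    """Return offsets at which segmentation is safe: an author-supplied
    virama + ZWNJ pair signals the cluster may be broken there."""
    return [i + 2 for i in range(len(text) - 1)
            if text[i] == VIRAMA and text[i + 1] == ZWNJ]

# ka + virama + ZWNJ + ta: an explicit request for a visible virama
print(safe_break_points("\u0915\u094D\u200C\u0924"))  # [3]
# ka + virama + ta: the same cluster without ZWNJ — no safe break
print(safe_break_points("\u0915\u094D\u0924"))        # []
```

Without the ZWNJ, the sequence stays ambiguous and falls back to whatever default the engine chooses.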
Thank you @tiroj for the comment, this is very much appreciated. I work on a browser layout engine (Blink), but let me admit that I'm not very familiar with Indic scripts nor with the internals of shaping engines.

First, probably a novice question, but is "explicit virama" the same as "visible virama"? @r12a told me that we want to split the cluster only when there's a "visible virama", and you said shaping engines know whether an "explicit virama" exists or not. If these two terminologies are the same, I think you're right: it's not a matter of the spec, but of shaping engines exposing that information to clients. I don't think browsers can access that logic today, but we can probably move this discussion to HarfBuzz to enable it. /cc @behdad
Explicit virama and visible virama are often the same thing, but not always. I suspect there is a better term than either explicit or visible, but I am not sure yet what it is.

By explicit virama, I mean a unique glyph in the run that singularly represents the virama character. This is what is tracked by the shaping engine, and hence plays a role in the reordering stage of Indic layout.

The reason I do not like the term visible virama is that it is possible for a visible virama sign to appear in text without it being a unique glyph singularly representing that character. So, for example, when we make fonts that support the older […] Now, in a […]

So things get complicated for you, because at the font level this is all about what happens at the reordering stage of Indic layout, which might be helpful to you in determining how and where to apply some kinds of CSS text display, but won't necessarily be. The shaping engine will be tracking the presence of an explicit, unique glyph that singularly represents the virama character, but there may legitimately be situations in which a virama is visible but not tracked, because it isn't explicit, unique, etc.
Btw, just some extra background (talking about code points rather than glyphs now)... There are three types of vowel-killer, distinguished by Unicode properties and linked to specific characters in particular scripts by https://www.unicode.org/Public/13.0.0/ucd/IndicSyllabicCategory.txt. These are:

- Pure Killers, which are always visible and never produce conjuncts. We have no beef with these characters.
- Invisible Stackers, which produce conjuncts, but which are always invisible. We have no beef with these characters either.
- Viramas, which may sometimes be invisible and other times visible. These are the problematic characters.

It's useful to have a list of the 27 scripts where we need to invoke special behaviours. However, some of those scripts (like Tamil) will rarely use the virama to generate conjuncts in modern text, and need a different default approach from others (like Newa) that will use it to make conjuncts for most consonant clusters.
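As a rough sketch of how one might consume that data file, the Python below classifies vowel-killers from a few lines copied out of IndicSyllabicCategory.txt (abbreviated; a real tool would parse the whole file and would also handle code point ranges such as `0900..0902`, which this sketch does not):

```python
# A few real lines from the UCD IndicSyllabicCategory.txt file
# (whitespace compressed); the full file lists many more.
EXCERPT = """\
094D ; Virama # Mn DEVANAGARI SIGN VIRAMA
09CD ; Virama # Mn BENGALI SIGN VIRAMA
0BCD ; Virama # Mn TAMIL SIGN VIRAMA
1039 ; Invisible_Stacker # Mn MYANMAR SIGN VIRAMA
AAF6 ; Invisible_Stacker # Mn MEETEI MAYEK VIRAMA
"""

def vowel_killers(ucd_text):
    """Group code points by the two conjunct-forming killer categories."""
    by_category = {"Virama": [], "Invisible_Stacker": []}
    for line in ucd_text.splitlines():
        fields = line.split(";")
        if len(fields) < 2:
            continue
        code = fields[0].strip()
        category = fields[1].split("#")[0].strip()  # drop trailing comment
        if category in by_category:
            by_category[category].append(int(code, 16))
    return by_category

print(vowel_killers(EXCERPT))
```

A segmenter could load such a table once and consult it per script to decide which default (split-friendly or conjunct-preserving) to apply.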
‘Always invisible’ is an interesting intention. In my experience, this is not always the case in practice. I have been looking at the traditional Meetei Mayek orthography recently, which is encoded in Unicode with an invisible stacker, U+AAF6 MEETEI MAYEK VIRAMA.
In fonts, such characters tend to be represented as shown in the Unicode glyph charts, with a small subscript + sign. Having a visible representation of the character can aid both in making the font—especially in a graphical GSUB interface such as VOLT—and in editing text: it is nice to have visual feedback on what you are typing, even if the glyph is subsequently swallowed in some form of conjunct display (whether ligature formation, below-base form substitution, or some combination of methods in the font).

But what if it isn’t swallowed? What if the font does not contain conjunct shaping for a particular sequence of consonant + invisible stacker virama + consonant? Some scripts form conjuncts in systematic ways, e.g. using subscribed below-base or post-base forms for secondary conjuncts, and these generally are handled okay with an invisible stacker virama, as the name suggests. But Meetei Mayek is an example of a script in which conjunct formation is not systematic; instead it used an evolving set of conventional ligature forms for conjuncts, which differed across time and locale.

These conventions are not well documented, and I suspect that making a traditional Meetei Mayek font is currently impossible: the necessary information about conjunct ligature sets is not available, and will require significant research involving original manuscripts in Manipur and collections elsewhere. But even if one were to document these and make a font that represented some standardised form of the traditional orthography, Unicode text may still present U+AAF6 in a sequence that the font cannot represent graphically as a conjunct.
This is, fundamentally, an issue of the script: the traditional Meetei Mayek orthography had no secondary method to represent conjuncts, no visible virama option (the modern, reformed orthography introduced a distinctive visible virama mechanism, which is exclusively used to indicate conjuncts now, but this cannot be incorporated in the traditional orthography). All of which is a long and not especially helpful way to say that ‘always invisible’ may be the intention, but the reality is that these characters can easily show up as visible entities in text.
One of the key issues for typographic handling of Indian scripts on the Web is how to handle conjuncts, since they don't map to grapheme clusters. But the situation is complicated by the fact that the exact same underlying sequence of code points may need to be handled differently, depending on what the font does with it.
I was discussing privately with @kojiishi but figured that it would be useful to open the discussion to this group. I'll include some things here from that conversation. The discussion was mostly related to initial-letter selection and letter-spacing.
I'll include some explanations in the next comment, but try to capture the issue in a nutshell here.
Currently for Devanagari and Bengali, ::first-letter in Blink and Webkit selects the whole consonant cluster (plus combining characters) as a unit, whereas Gecko instead selects the initial grapheme cluster for most conjuncts (in particular, half forms). The Blink/Webkit approach is great for handling conjuncts, but doesn't allow first-letter or letter-spacing to separate the consonants in a cluster when they don't form a conjunct. The opposite applies to Gecko's approach.
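To make the two strategies concrete, here is a rough Python approximation (emphatically not the engines' actual code): "grapheme cluster" is simplified to a base character plus trailing combining marks, and the Blink/Webkit-style behaviour is modelled by joining grapheme clusters across a Devanagari virama.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari virama only, to keep the sketch small

def first_grapheme(text):
    """Gecko-like selection: base character plus trailing marks (Mn/Mc/Me)."""
    end = 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:end]

def first_orthographic_cluster(text):
    """Blink/Webkit-like selection for Devanagari: keep consuming grapheme
    clusters as long as the selection so far ends in a virama, so the
    whole conjunct (plus combining characters) is taken as one unit."""
    end = len(first_grapheme(text))
    while end < len(text) and text[end - 1] == VIRAMA:
        end += len(first_grapheme(text[end:]))
    return text[:end]

word = "\u0915\u094D\u0924\u093E"        # ka + virama + ta + aa-matra
print(first_grapheme(word))              # ka + virama only
print(first_orthographic_cluster(word))  # the whole cluster
```

The sketch shows why the two results diverge exactly when a cluster contains a virama, and why neither is right for both the conjunct and visible-virama renderings of the same code points.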
Tamil, however, is treated slightly differently. (Modern Tamil has very few conjuncts (2~3) but plenty of clusters, and uses a visible virama as the default.) Although Blink/Webkit do just select the initial grapheme cluster for Tamil, this time they break the few conjuncts that should be kept together. Gecko's initial grapheme cluster selection works well for Tamil, and this time it also manages to recognise and keep together the Tamil conjuncts.
I can't see a way of resolving this problem by focusing on code points. It could only be resolved by interrogating what the font is doing.
However, in the meantime, until we have a clever fix, the code points are all we have. It seems to me that, in the interim, perhaps less harm is done by preventing virama-visible clusters from splitting in certain scripts than by allowing conjuncts to split. So the Blink/Webkit approach for Devanagari/Bengali seems a useful default, as an interim approach, despite its side-effects. I'd be interested to hear whether others agree.
Yesterday I created some gap-analysis content related to initial-letter selection, which links to tests and describes the results:
First-letter:
Devanagari/Bengali: https://www.w3.org/TR/deva-gap/#issue94_initials and https://www.w3.org/TR/beng-gap/#issue115_initials
Tamil: https://www.w3.org/TR/taml-gap/#issue116_initials
Letter-spacing:
Devanagari/Bengali: https://www.w3.org/TR/deva-gap/#issue117_spacing and https://www.w3.org/TR/beng-gap/#issue117_spacing
(Tamil works fine, except for a bug with one form of the shri conjunct.)
It's clear from the results that the Blink/Webkit engine uses different algorithms for first-letter and letter-spacing.