-
-
Notifications
You must be signed in to change notification settings - Fork 718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Unicode combining characters #2388
Conversation
THANKS! looks pretty straight forward. |
Yes, it's a specific Unicode blocks for combining diacritics. I initially tried to infer this from negative font metrics, which resulted in some comical results such as: That's because the cursive "f" has a tail that hangs into negative space. Perhaps what I should have done is check for metrics.advance==0 but I'm increasingly convinced that the safest thing for this narrow issue to explicitly test for characters that we know for absolute certain are supposed to be handled in this way. |
Based on unicode inspector, that text doesn't have any combining diacriticals. I'll run an overpass query and see if we can find some other examples. |
Overpass query for unicode combining characters:
I'm having trouble locating RTL examples (at least on objects that Americana renders). |
Interesting! Thanks for the info. |
As far as I know in Hebrew these common characters are not used, and I also think Arabic has different characters as well (tick above, below, a 9 on the side), so it would make sense you won't find them, also they are usually omitted when writing in both Arabic and Hebrew (they are not "mandatory", so they kinda get neglected, I think). |
Yep, I found the list of Hebrew and Arabic combiners. They're called something different in unicode (accents, points, etc) but the concept seems the same. This approach doesn't work out of the box with the RTL code points on initial testing, so I think I need to add some RTL detection and flip the direction. |
Feel free to split this into two if it makes sense from your end - the split can be without RTL support and with RTL support (assuming not supporting RTL doesn't make things worse). |
function isCombiningCharacter(codePoint: number): boolean { | ||
return codePoint >= 0x0300 && codePoint <= 0x036F; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid these magic numbers by checking whether the codepoint is in one of these ranges:
const unicodeBlockLookup: UnicodeBlockLookup = { |
As discussed above, this is only one of the Unicode blocks dedicated to combining characters, but other combining characters are scattered elsewhere in other blocks. To detect a combining character more reliably, check whether the character’s general property in the Unicode character database is Mark or a subcategory thereof. Note that some marks are spacing, unlike the characters we’ve considered so far.
If this library’s minimum supported browser configuration supports the Unicode flag in regular expressions, you can simply match on a property escape:
function isCombiningCharacter(codePoint: number): boolean { | |
return codePoint >= 0x0300 && codePoint <= 0x036F; | |
} | |
function isCombiningCharacter(codePoint: number): boolean { | |
return !/\P{M}/u.test(String.fromCodePoint(codePoint)); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, move this utility function to src/util/script_detection.ts, which has lots of similar character range tests. I’m pretty sure that file can be simplified considerably if we’re able to use the Unicode flag in regular expressions. When Mapbox first implemented some of this functionality, we still had to support some older browsers that lacked support for this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This captures both LTR and RTL combing characters, but this PR only works for LTR so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could combine this check with an RTL check if necessary:
maplibre-gl-js/src/util/script_detection.ts
Lines 278 to 279 in e50a264
export function charInRTLScript(char: number) { | |
// Main blocks for Hebrew, Arabic, Thaana and other RTL scripts |
With property escapes, this method can probably also be simplified to test for \p{Bidi_Class=Right_To_Left}
. But either way, you’d have to test the surrounding text, not just the character in question, because some characters are used in both LTR and RTL text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With property escapes, this method can probably also be simplified to test for
\p{Bidi_Class=Right_To_Left}
.
Apparently Bidi_Class
is not exposed in ECMAScript yet, even though there are some other simplification opportunities along these lines: #4541.
src/symbol/shaping.ts
Outdated
let retreat = 0; | ||
|
||
if (isCombiningCharacter(codePoint)) { | ||
retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line should retreat if left
is negative, even if the character is non-combining. This is useful for spacing characters that overshoot the standard bounds. For example, in many fonts, the letter “f” has a descender that extends to the left of the logical origin. Without an affordance for this case, the kerning gets wonky in italic text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left
should also affect the current glyph’s position, not just the line’s advance. I haven’t tried it myself, but I’m guessing the before screenshot above looks OK because there’s enough letter spacing to obscure the poor kerning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the after screen shot above, left
only affects the glyph's position, not the line's advance. The advance is taken directly from font metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The after screenshot above doesn’t make sense to me. Why would adding a negative left
to the x position shift the glyph to the right? I agree that the rest of this conditional block would be irrelevant to a non-combining character; prevMetrics
wouldn’t matter in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my current (uncommitted) code that's causing this effect:
if (!vertical) {
let yPos = y + baselineOffset;
let xPos = x;
if (isCombiningCharacter(codePoint)) {
const retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width;
xPos += retreat * section.scale - spacing;
yPos += section.scale * (metrics.height - prevMetrics?.height ?? metrics.height);
} else {
xPos += metrics.left * section.scale;
}
positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
if (metrics.advance > 0) {
x += metrics.advance * section.scale + spacing;
}
} else {
} | ||
|
||
const xPos = x + retreat * section.scale; | ||
positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect}); | ||
x += metrics.advance * section.scale + spacing; | ||
} else { | ||
shaping.verticalizable = true; | ||
positionedGlyphs.push({glyph: codePoint, imageName, x, y: y + baselineOffset, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A negative offset matters in vertical text too. Depending whether the previous character has been rotated, this character needs to be offset either up or to the left.
|
||
if (isCombiningCharacter(codePoint)) { | ||
retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width - spacing; | ||
yPos += section.scale * (metrics.height - prevMetrics?.height ?? metrics.height); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think we can assume that a combining character is always on top of the base letter by the height of the diacritic. Many diacritics go under the base letter or overlap it, and this font (Noto) places its superscript diacritics closer to the base letter.
Fonts usually come with information about where to place diacritics relative to a given base letter glyph. If we need to keep superscript diacritics from touching the base letter, we might have to hard-code the specific ones to shift upward and by how much…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I'm seeing for glyph information.
Preceeding character:
{"rect":{"x":144,"y":165,"w":29,"h":22},"metrics":{"width":21,"height":14,"left":0,"top":-6,"advance":20}}
Combiner character:
{"rect":{"x":195,"y":165,"w":14,"h":15},"metrics":{"width":6,"height":7,"left":-3,"top":-2,"advance":0}}
Is there enough information here to position the combiner? If so, what should the logic be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrueType and OpenType fonts come with anchors on both the base letters and the combining diacritics, which presumably are used to join up the glyphs intelligently. I don’t know much more offhand about how that works. Here’s FontForge showing U+006D and U+0301 in Noto Sans Regular:
The anchor positions vary by glyph and by font. Unfortunately, it doesn’t look like the glyph PBF format preserves any of this information. It would be essential for #50 but possibly out of scope for a quick fix for combining diacritics. Even so, in my opinion, allowing the diacritic to run into the base letter is still better than spacing it out, which looks a lot rawer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a JSON of codepoint 0x0313 "combining comma above". Any idea where the anchor data is?
font0313.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This JSON file contains TrueType and OpenType tables. Most of the tables are the usual things like glyph data and horizontal metrics that we’re already making use of, but GPOS
contains advanced glyph positioning metrics. It looks like the key to accessing the attachment points, not only for combining diacritics but also for some complex text forms, such as in Urdu. Also, OS/2
contains various character metrics that Windows uses but other platforms do not, which might be useful for reconstructing the metrics we need, but aren’t guaranteed to be present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, something with some &^%$#@! documentation.
const xPos = x + retreat * section.scale; | ||
positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect}); | ||
x += metrics.advance * section.scale; | ||
if (!isCombiningCharacter(codePoint)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code might need to be updated to reflect the changes you’re making to the advance. Since it doesn’t seem to be affecting the glyph layout in your examples, I’m guessing it’s only used for line justification/line wrapping, but that does play into symbol collision:
maplibre-gl-js/src/symbol/shaping.ts
Lines 374 to 377 in 6d55d5b
const positions = glyphMap[section.fontStack]; | |
const glyph = positions && positions[codePoint]; | |
if (!glyph) return 0; | |
return glyph.metrics.advance * section.scale + spacing; |
Don't forget to add render tests with all the great examples here - these are great for ATDD (Acceptance test driven development tests)! :-) |
Don't get too excited until we actually get this thing working 😅 |
Fair enough :-) |
Just an update that I'm still working on this. I've got some test code that is successfully parsing a font file using the OpenType spec in order to extract just the shaping information (using Google Noto regular). I would then produce a JSON file that we could parse in order to do shaping. In theory this approach would work "perfectly" if you happen to use Google Noto regular but be slightly off for any other font series (but should still be better than present). So far this approach seems feasible (but still not a replacement for proper font support). I would like to ask if there is any issue from a licensing perspective from extracting font metric information from these font stacks and incorporating them into this repository. |
That’s a good question. The architecture around fontstacks is overkill compared to leveraging Web fonts on each platform, but I assume the design limits the impact of licensing issues. In the U.S., typefaces and their characters themselves are ineligible for copyright protection, but a scalable font is copyrightable. So if you have enough rights to use the font yourself but not redistribute it, then you could apply the font on the server side and distribute a screenshot of each glyph that the client would only be responsible for piecing together. On the other hand, if the PBFs were to contain the vector data of each glyph’s paths, then the font’s license would be enforceable. Your question might boil down to whether distributing the font metrics similarly ventures into scalable font territory. It’s probably a good idea to limit the metrics to just what’s necessary, if not for licensing reasons then for performance. I don’t know if anyone is taking advantage of this workaround in the fontstack architecture, given the popularity of fonts like Noto that allow redistribution under the SIL Open Font License. |
Is there a plan to push this PR forward? |
Unfortunately this ended up being more complicated than I expected and it's going to take more work to investigate and do it properly. I don't think it makes sense to implement what's essentially a hack that will work for some combiners but be wrong for others. So I'll close this for now. |
Fixes #2384
This PR implements unicode combining characters so that diacritic characters in the Unicode combiner block (0x300-0x36F) are rendered with preceding characters. This code tests to see if a character is combining, and if so, it inspects the dimensions of the prior character and renders it in position accordingly.
Posting as draft for collaboration. I'm not sure if this introduces side effects and I haven't look into unit testing options.
Note that this does not address #50 which requires a more comprehensive solution.
Samples:
name: sməqʷəʔelə həw̓aləm̓ew̓txʷ (OSM location)
Before:
After:
Location: https://zelonewolf.github.io/openstreetmap-americana/#map=18/49.02475/-123.10267&language=mul