Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Unicode combining characters #2388

Closed
wants to merge 4 commits into from

Conversation

ZeLonewolf
Copy link
Contributor

Fixes #2384

This PR implements unicode combining characters so that diacritic characters in the Unicode combiner block (0x300-0x36F) are rendered with preceding characters. This code tests to see if a character is combining, and if so, it inspects the dimensions of the prior character and renders it in position accordingly.

Posting as draft for collaboration. I'm not sure if this introduces side effects and I haven't look into unit testing options.

Note that this does not address #50 which requires a more comprehensive solution.

Samples:

name: sməqʷəʔelə həw̓aləm̓ew̓txʷ (OSM location)

Before:
image

After:
image

Location: https://zelonewolf.github.io/openstreetmap-americana/#map=18/49.02475/-123.10267&language=mul

@HarelM
Copy link
Collaborator

HarelM commented Apr 13, 2023

THANKS! looks pretty straight forward.
Are the combining characters value is a convention (saw ilthat you added it as hard coded 300 to 36f)?
Other than that, a unit test and a render test would be needed here to make sure we keep this behavior correct in the future.
Would also be interested to see what happens for non straight text and RTL text such as:
שָׁלוֹם.
Regardless, I think this is a good step forward even if it doesn't cover all cases.

@ZeLonewolf
Copy link
Contributor Author

Are the combining characters value is a convention (saw ilthat you added it as hard coded 300 to 36f)?

Yes, it's a specific Unicode blocks for combining diacritics. I initially tried to infer this from negative font metrics, which resulted in some comical results such as:

image

That's because the cursive "f" has a tail that hangs into negative space.

Perhaps what I should have done is check for metrics.advance==0 but I'm increasingly convinced that the safest thing for this narrow issue to explicitly test for characters that we know for absolute certain are supposed to be handled in this way.

@ZeLonewolf
Copy link
Contributor Author

ZeLonewolf commented Apr 13, 2023

שָׁלוֹם

Based on unicode inspector, that text doesn't have any combining diacriticals. I'll run an overpass query and see if we can find some other examples.

@ZeLonewolf
Copy link
Contributor Author

Overpass query for unicode combining characters:

[out:csv(::id, name)][timeout:2500];
node[name~"[\u0300-\u036E]"];
out;

I'm having trouble locating RTL examples (at least on objects that Americana renders).

@HarelM
Copy link
Collaborator

HarelM commented Apr 13, 2023

Interesting! Thanks for the info.
The "nikud" in Hebrew definitely falls into that category, but it's not common as it is related to Hebrew, which is not a commonly used language, I guess.
Would be interesting to see if this range can be configured so that it can support other characters and have a default range maybe, IDK, just a thought...

@HarelM
Copy link
Collaborator

HarelM commented Apr 13, 2023

As far as I know in Hebrew these common characters are not used, and I also think Arabic has different characters as well (tick above, below, a 9 on the side), so it would make sense you won't find them, also they are usually omitted when writing in both Arabic and Hebrew (they are not "mandatory", so they kinda get neglected, I think).
For example, the above word is usually written as שלום,
Without the extra nikud characters.

@ZeLonewolf
Copy link
Contributor Author

Yep, I found the list of Hebrew and Arabic combiners. They're called something different in unicode (accents, points, etc) but the concept seems the same. This approach doesn't work out of the box with the RTL code points on initial testing, so I think I need to add some RTL detection and flip the direction.

@HarelM
Copy link
Collaborator

HarelM commented Apr 13, 2023

Feel free to split this into two if it makes sense from your end - the split can be without RTL support and with RTL support (assuming not supporting RTL doesn't make things worse).
Thanks!!

Comment on lines +586 to +588
function isCombiningCharacter(codePoint: number): boolean {
return codePoint >= 0x0300 && codePoint <= 0x036F;
}
Copy link
Contributor

@1ec5 1ec5 Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid these magic numbers by checking whether the codepoint is in one of these ranges:

const unicodeBlockLookup: UnicodeBlockLookup = {

As discussed above, this is only one of the Unicode blocks dedicated to combining characters, but other combining characters are scattered elsewhere in other blocks. To detect a combining character more reliably, check whether the character’s general property in the Unicode character database is Mark or a subcategory thereof. Note that some marks are spacing, unlike the characters we’ve considered so far.

If this library’s minimum supported browser configuration supports the Unicode flag in regular expressions, you can simply match on a property escape:

Suggested change
function isCombiningCharacter(codePoint: number): boolean {
return codePoint >= 0x0300 && codePoint <= 0x036F;
}
function isCombiningCharacter(codePoint: number): boolean {
return !/\P{M}/u.test(String.fromCodePoint(codePoint));
}

Copy link
Contributor

@1ec5 1ec5 Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, move this utility function to src/util/script_detection.ts, which has lots of similar character range tests. I’m pretty sure that file can be simplified considerably if we’re able to use the Unicode flag in regular expressions. When Mapbox first implemented some of this functionality, we still had to support some older browsers that lacked support for this flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This captures both LTR and RTL combing characters, but this PR only works for LTR so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could combine this check with an RTL check if necessary:

export function charInRTLScript(char: number) {
// Main blocks for Hebrew, Arabic, Thaana and other RTL scripts

With property escapes, this method can probably also be simplified to test for \p{Bidi_Class=Right_To_Left}. But either way, you’d have to test the surrounding text, not just the character in question, because some characters are used in both LTR and RTL text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With property escapes, this method can probably also be simplified to test for \p{Bidi_Class=Right_To_Left}.

Apparently Bidi_Class is not exposed in ECMAScript yet, even though there are some other simplification opportunities along these lines: #4541.

src/symbol/shaping.ts Outdated Show resolved Hide resolved
let retreat = 0;

if (isCombiningCharacter(codePoint)) {
retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line should retreat if left is negative, even if the character is non-combining. This is useful for spacing characters that overshoot the standard bounds. For example, in many fonts, the letter “f” has a descender that extends to the left of the logical origin. Without an affordance for this case, the kerning gets wonky in italic text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds logical but it doesn't produce the expected result.
Before/After:
image
image

Copy link
Contributor

@1ec5 1ec5 Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left should also affect the current glyph’s position, not just the line’s advance. I haven’t tried it myself, but I’m guessing the before screenshot above looks OK because there’s enough letter spacing to obscure the poor kerning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the after screen shot above, left only affects the glyph's position, not the line's advance. The advance is taken directly from font metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The after screenshot above doesn’t make sense to me. Why would adding a negative left to the x position shift the glyph to the right? I agree that the rest of this conditional block would be irrelevant to a non-combining character; prevMetrics wouldn’t matter in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my current (uncommitted) code that's causing this effect:

            if (!vertical) {
                let yPos = y + baselineOffset;
                let xPos = x;

                if (isCombiningCharacter(codePoint)) {
                    const retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width;
                    xPos += retreat * section.scale - spacing;
                    yPos += section.scale * (metrics.height - prevMetrics?.height ?? metrics.height);
                } else {
                    xPos += metrics.left * section.scale;
                }

                positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
                if (metrics.advance > 0) {
                    x += metrics.advance * section.scale + spacing;
                }
            } else {

}

const xPos = x + retreat * section.scale;
positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
x += metrics.advance * section.scale + spacing;
} else {
shaping.verticalizable = true;
positionedGlyphs.push({glyph: codePoint, imageName, x, y: y + baselineOffset, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A negative offset matters in vertical text too. Depending whether the previous character has been rotated, this character needs to be offset either up or to the left.


if (isCombiningCharacter(codePoint)) {
retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width - spacing;
yPos += section.scale * (metrics.height - prevMetrics?.height ?? metrics.height);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we can assume that a combining character is always on top of the base letter by the height of the diacritic. Many diacritics go under the base letter or overlap it, and this font (Noto) places its superscript diacritics closer to the base letter.

Fonts usually come with information about where to place diacritics relative to a given base letter glyph. If we need to keep superscript diacritics from touching the base letter, we might have to hard-code the specific ones to shift upward and by how much…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I'm seeing for glyph information.

Preceeding character:

{"rect":{"x":144,"y":165,"w":29,"h":22},"metrics":{"width":21,"height":14,"left":0,"top":-6,"advance":20}}

Combiner character:

{"rect":{"x":195,"y":165,"w":14,"h":15},"metrics":{"width":6,"height":7,"left":-3,"top":-2,"advance":0}}

Is there enough information here to position the combiner? If so, what should the logic be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrueType and OpenType fonts come with anchors on both the base letters and the combining diacritics, which presumably are used to join up the glyphs intelligently. I don’t know much more offhand about how that works. Here’s FontForge showing U+006D and U+0301 in Noto Sans Regular:

U+006D

U+0301

The anchor positions vary by glyph and by font. Unfortunately, it doesn’t look like the glyph PBF format preserves any of this information. It would be essential for #50 but possibly out of scope for a quick fix for combining diacritics. Even so, in my opinion, allowing the diacritic to run into the base letter is still better than spacing it out, which looks a lot rawer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a JSON of codepoint 0x0313 "combining comma above". Any idea where the anchor data is?
font0313.txt

Copy link
Contributor

@1ec5 1ec5 Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This JSON file contains TrueType and OpenType tables. Most of the tables are the usual things like glyph data and horizontal metrics that we’re already making use of, but GPOS contains advanced glyph positioning metrics. It looks like the key to accessing the attachment points, not only for combining diacritics but also for some complex text forms, such as in Urdu. Also, OS/2 contains various character metrics that Windows uses but other platforms do not, which might be useful for reconstructing the metrics we need, but aren’t guaranteed to be present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, something with some &^%$#@! documentation.

image

const xPos = x + retreat * section.scale;
positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
x += metrics.advance * section.scale;
if (!isCombiningCharacter(codePoint)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code might need to be updated to reflect the changes you’re making to the advance. Since it doesn’t seem to be affecting the glyph layout in your examples, I’m guessing it’s only used for line justification/line wrapping, but that does play into symbol collision:

const positions = glyphMap[section.fontStack];
const glyph = positions && positions[codePoint];
if (!glyph) return 0;
return glyph.metrics.advance * section.scale + spacing;

@HarelM
Copy link
Collaborator

HarelM commented Apr 14, 2023

Don't forget to add render tests with all the great examples here - these are great for ATDD (Acceptance test driven development tests)! :-)

@ZeLonewolf
Copy link
Contributor Author

Don't forget to add render tests with all the great examples here - these are great for ATDD (Acceptance test driven development tests)! :-)

Don't get too excited until we actually get this thing working 😅

@HarelM
Copy link
Collaborator

HarelM commented Apr 14, 2023

Fair enough :-)

@ZeLonewolf
Copy link
Contributor Author

ZeLonewolf commented May 2, 2023

Just an update that I'm still working on this. I've got some test code that is successfully parsing a font file using the OpenType spec in order to extract just the shaping information (using Google Noto regular). I would then produce a JSON file that we could parse in order to do shaping. In theory this approach would work "perfectly" if you happen to use Google Noto regular but be slightly off for any other font series (but should still be better than present). So far this approach seems feasible (but still not a replacement for proper font support). I would like to ask if there is any issue from a licensing perspective from extracting font metric information from these font stacks and incorporating them into this repository.

@1ec5
Copy link
Contributor

1ec5 commented May 2, 2023

I would like to ask if there is any issue from a licensing perspective from extracting font metric information from these font stacks and incorporating them into this repository.

That’s a good question. The architecture around fontstacks is overkill compared to leveraging Web fonts on each platform, but I assume the design limits the impact of licensing issues. In the U.S., typefaces and their characters themselves are ineligible for copyright protection, but a scalable font is copyrightable. So if you have enough rights to use the font yourself but not redistribute it, then you could apply the font on the server side and distribute a screenshot of each glyph that the client would only be responsible for piecing together. On the other hand, if the PBFs were to contain the vector data of each glyph’s paths, then the font’s license would be enforceable. Your question might boil down to whether distributing the font metrics similarly ventures into scalable font territory. It’s probably a good idea to limit the metrics to just what’s necessary, if not for licensing reasons then for performance.

I don’t know if anyone is taking advantage of this workaround in the fontstack architecture, given the popularity of fonts like Noto that allow redistribution under the SIL Open Font License.

@HarelM
Copy link
Collaborator

HarelM commented Jun 28, 2023

Is there a plan to push this PR forward?
I would like to clean up the PR list so I can concentrate on those the require my attention.
You can always open a new PR if needed.

@ZeLonewolf
Copy link
Contributor Author

Unfortunately this ended up being more complicated than I expected and it's going to take more work to investigate and do it properly. I don't think it makes sense to implement what's essentially a hack that will work for some combiners but be wrong for others. So I'll close this for now.

@ZeLonewolf ZeLonewolf closed this Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Combining characters rendering in wrong place
3 participants