Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode math properties #2425

Merged
merged 10 commits into from
Oct 10, 2024
Merged

Unicode math properties #2425

merged 10 commits into from
Oct 10, 2024

Conversation

brucemiller
Copy link
Owner

Here's an interesting, perhaps controversial, PR: It enhances decodeMathChar to also lookup math properties (role, meaning,...) for the decoded glyph from the Unicode module. This provides some parity between symbols defined "semantically" by bindings and low-level \mathchar, in the simple cases. This covers most of what TeX's plain defines, as it defines them!

Copy link
Collaborator

@dginev dginev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually quite like the direction of the PR. Maybe I don't yet understand enough to register the controversial piece?

I left a few minor comments, and I can't say I understand the full details of the TeX interplay yet, but the grammar-near pieces seem encouraging.

I especially like the changes in the tests, where the math XML markup has become significantly more enhanced.

# When $mathglyph is provided, it is the unicode corresponding to the \mathchar of $value
sub new {
my ($class, $cs, $value, $mathglyph, %traits) = @_;
my ($class, $cs, $mode, $value) = @_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make a note of this in the Changefile for anyone who may have used CharDef in an external binding. This is breaking change to its method signature.
Could you actually add such a note to this PR? Would avoid forgetting about it later (which is easy to do...)

Which is fine, I think we still need the ability to make such changes.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; I started a new section of Changes, as we'll need it eventually.

($local ? Tokens(T_CS('\char'), $value->revert, T_CS('\relax')) : $$self{cs})); }
else { # Else math mode, mathDecode!
my ($glyph, $f, $rev, %props) = LaTeXML::Package::decodeMathChar($nvalue);
if (!defined $props{name}) { # TEMPORARY HACK ?????
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything to add to that comment? E.g. what is missing for the hack to go away?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified; thanks!

$whatsit->setProperty(font => LookupValue('font')->specialize($glyph)) if $glyph;
my ($glyph, $f, $rev, %props) = decodeMathChar($n);
$whatsit->setProperty(glyph => $glyph) if $glyph;
$whatsit->setProperties(%props);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but you could also guard that setter call:

$whatsit->setProperties(%props) if %props;

DefConstructor('\mathclose Digested', "?#1(<ltx:XMWrap role='CLOSE'>#1</ltx:XMWrap>)()", bounded => 1);
DefConstructor('\mathpunct Digested', "?#1(<ltx:XMWrap role='PUNCT'>#1</ltx:XMWrap>)()", bounded => 1);
DefConstructor('\mathinner Digested', "?#1(<ltx:XMWrap role='ATOM'>#1</ltx:XMWrap>)()", bounded => 1);
our %mathclass_subclass = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested hash has the feeling of a mini-grammar... Interesting.

How about three other ideas:

  • PUNCT => { PERIOD => 1 }
  • ID => { NUMBER => 1 }
  • BIGOP => { SUMOP => 1, INTOP => 1, LIMITOP => 1, DIFFOP => 1 }

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!, yep, that's the idea!

my $font = $curfont->merge(%$fontinfo);
$fontinfo = $defn && $defn->isFontDef;
if ($$basefontinfo{size} != $curfont->getSize) { # If we've gotten an explicit font SIZE change; Adjust!
$fontinfo = {%$fontinfo}; $$fontinfo{size} = $curfont->getSize; } }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small theoretical chance $fontinfo is undef here, so maybe default it to {} ?

 $fontinfo = ($defn && $defn->isFontDef) || {};

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

$fontinfo = {%$fontinfo}; $$fontinfo{size} = $curfont->getSize; } }
my $font = $curfont->merge(%$fontinfo);
if ($downsize > 0) { $font = $curfont->merge(scripted => 1); }
if ($downsize > 1) { $font = $curfont->merge(scripted => 1); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These merge calls have the same value? Do you intend it to do the same merge twice when >1 ?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, twice for scriptscript

#======================================================================
UTF(0x5C) => { role => 'ADDOP', meaning => 'set-minus' }, # \backslash
UTF(0xAC) => { role => 'BIGOP', meaning => 'not' }, # \lnot
UTF(0xAC) => { role => 'BIGOP', meaning => 'not' }, # \neg
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0xAC is given 3 separate times here? Unsure how that happened. There may be other cases, I see 0x5C is also there twice.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

editing mistake. fixed.

"ker" => { role => 'OPFUNCTION', meaning => 'kernel' }, # \ker #
"lg" => { role => 'OPFUNCTION' }, # \lg #
"lim" => { role => 'LIMITOP', meaning => 'limit', need_scriptpos => 1 }, # \lim #
"lim inf" => { role => 'LIMITOP', meaning => 'limit-infimum', need_scriptpos => 1 }, # \liminf #
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, first I assume single characters, then single words (no spaces), but I see some entries even have a space.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, these don't actually get fired, yet; it would need to be hooked into the math ligature that combines "s,i,n" into "sin"; and then looked up! I suspect "lim inf" will likely fail that test. I kinda left them all in as a reminder...

@brucemiller brucemiller merged commit 9f07a49 into master Oct 10, 2024
26 checks passed
@brucemiller brucemiller deleted the unicode-math-properties branch October 10, 2024 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants