Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove variation selector workaround #682

Merged
merged 6 commits into from
Jul 26, 2017

Conversation

rolandwalker
Copy link
Contributor

@rolandwalker rolandwalker commented Jul 21, 2017

Remove workaround for VARIATION SELECTOR-16 which was added in #679, and move failing part of test to a TODO. This is accomplished in part by adding a --subshell flag to test_case.

About variation selectors in general:

  • a VS changes the character/s before it, like a combining character
  • but a combining character can be composed with any other character, and each possible VS modification is a rule defined separately and explicitly in Unicode
  • for the case of a successful modification it should be clear that from the POV of unicode_width() the VS "disappears" and has width zero
  • when a VS is ineffective, the Unicode spec also defines it as having width zero (http://unicode.org/faq/unsup_char.html#3)
  • due to the complexity of VS sequences, unicode_width() simply can't get all of the cases correct (and tig shouldn't worry much about that). This is b/c unicode_width() statelessly considers a single codepoint, but VS sequences are of varying length
  • example complex VS sequence: PERSON WITH BLOND HAIR EMOJI MODIFIER FITZPATRICK TYPE-5 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16, which forms the single emoji 👱🏾‍♂️
  • yes, it is nuts that a glyph can be specified by a variable-length sequence of codepoints, and that the codepoints themselves may be given by variable-length UTF-8 encoding

edit: the reason that an approach likewcwidth() can still get the job done is that sequences are artfully defined: in the one above only 2 of the codepoints have width, and the final glyph is meant to correspond to the sum of the widths. But inevitably there are edge cases and rendering issues.

@rolandwalker rolandwalker force-pushed the rm-var-sel-workaround branch from c4bbbde to 76b0934 Compare July 21, 2017 10:42
@jonas
Copy link
Owner

jonas commented Jul 26, 2017

So one fix might be to keep a context between calls to unicode_width()? I agree that it is better to remove the workaround and keep the test failure as a future improvement.

@jonas jonas merged commit 7e9afc7 into jonas:master Jul 26, 2017
@rolandwalker
Copy link
Contributor Author

The bullets were meant to argue against chasing perfection. unicode_width is efficient. If deeper knowledge about Unicode strings was really needed a library would be better.

And as to this bug, I suspect that the locus of the issue will turn out to be in glibc or similar.

@rolandwalker rolandwalker deleted the rm-var-sel-workaround branch July 26, 2017 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants