unicode character seems to swallow other characters during a round-trip conversion from HTML to RTF and back #8264

Closed
huyz opened this issue Aug 31, 2022 · 7 comments

Comments


huyz commented Aug 31, 2022

Explain the problem.

When HTML containing a non-breaking space is converted to RTF and then back to HTML, the character immediately after the non-breaking space is swallowed:

printf '<span>\xC2\xA0</span>curious\n' | pandoc -f html -t rtf | pandoc -f rtf -t html
<p> urious</p>

The c of curious should not be lost during this round trip.

Pandoc version?
pandoc 2.19.2
macOS 12.5.1 (Monterey) Apple M1 Max (ARM)

@huyz huyz added the bug label Aug 31, 2022
Owner

jgm commented Aug 31, 2022

Simple repro:

% pandoc -f rtf -t html
{\pard \ql \f0 \sa180 \li0 \fi0 a\u160?c\par}
<p>a </p>

The c disappears.

Owner

jgm commented Aug 31, 2022

Relevant part of tokenization:

Tok (line 1, column 33) (UnformattedText "a")
Tok (line 1, column 34) (ControlWord "u" (Just 160))
Tok (line 1, column 40) (UnformattedText "c")
Tok (line 1, column 41) (ControlWord "par" Nothing)

Owner

jgm commented Aug 31, 2022

Relevant parts of the spec:

\uN
This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.

\ucN
This keyword represents the number of bytes corresponding to a given \uN Unicode character. This keyword may be used at any time, and values are scoped like character properties. That is, a \ucN keyword applies only to text following the keyword, and within the same (or deeper) nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a stack of counts seen and use the most recent one to skip the appropriate number of characters when it encounters a \uN keyword. When leaving an RTF group which specified a \uc value, the reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword has been seen in the current or outer scopes.

A common practice is to emit no ANSI representation for Unicode characters within a Unicode destination context (that is, inside a \ud destination.). Typically, the destination will contain a \uc0 control sequence. There is no need to reset the count on leaving the \ud destination as the scoping rules will ensure the previous value is restored.
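
The skipping rules quoted above can be sketched in a few lines of Python. This is only an illustration, not pandoc's reader: the token tuples ("ControlWord", "UnformattedText", "GroupOpen", "GroupClose") and the function name decode are assumptions made for this example, and \bin and control symbols are not handled.

```python
def decode(tokens):
    # Render a token list to plain text, applying the \uN / \ucN rules
    # from the spec excerpt. Token shapes are hypothetical, for illustration.
    uc_stack = [1]   # default skip count is 1 if no \ucN has been seen
    to_skip = 0      # characters still to be skipped after the last \uN
    out = []
    for tok in tokens:
        kind = tok[0]
        if kind == "GroupOpen":
            to_skip = 0                    # a brace ends skippable data
            uc_stack.append(uc_stack[-1])  # new scope inherits the current count
        elif kind == "GroupClose":
            to_skip = 0
            if len(uc_stack) > 1:
                uc_stack.pop()             # restore the previous \uc value
        elif kind == "ControlWord":
            _, name, param = tok
            if to_skip > 0:
                to_skip -= 1               # a control word counts as one character
            elif name == "uc" and param is not None:
                uc_stack[-1] = param
            elif name == "u" and param is not None:
                # Arguments are signed 16-bit, so large values arrive negative.
                code = param + 65536 if param < 0 else param
                out.append(chr(code))
                to_skip = uc_stack[-1]
            # all other control words are ignored in this sketch
        elif kind == "UnformattedText":
            text = tok[1]
            dropped = min(to_skip, len(text))
            to_skip -= dropped
            out.append(text[dropped:])
    return "".join(out)
```

With a text token "?c" after \u160, the fallback "?" is skipped and the "c" survives. If the tokenizer has already dropped the "?", the same logic skips the "c" instead, which matches the reported output.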

Owner

jgm commented Aug 31, 2022

In this case sEatChars is being set to 1.

Owner

jgm commented Aug 31, 2022

So it's eating the UnformattedText "c".
I think we should have an UnformattedText "?" in there for it to eat. So the problem may be in tokenization.
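
To make the tokenization hypothesis concrete, here is a minimal Python sketch of a tokenizer that keeps the "?" as text. It assumes the general RTF rule that only a space is consumed as part of a control word, while any other non-letter, non-digit character ends the control word and remains in the stream. The function name tokenize and the tuple shapes are made up for this example; it handles only brace-free fragments without control symbols and is not pandoc's code.

```python
import re

# A control word: backslash, letters, optional signed number, optional single space.
# Only the space is consumed as a delimiter; "?" and the like are left alone.
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")

def tokenize(rtf):
    # Split a brace-free RTF fragment into control words and text runs.
    tokens = []
    pos = 0
    while pos < len(rtf):
        match = CONTROL_WORD.match(rtf, pos)
        if match:
            name, param = match.group(1), match.group(2)
            value = int(param) if param is not None else None
            tokens.append(("ControlWord", name, value))
            pos = match.end()
        else:
            # Text runs until the next backslash (or the end of the input).
            end = rtf.find("\\", pos + 1)
            if end == -1:
                end = len(rtf)
            tokens.append(("UnformattedText", rtf[pos:end]))
            pos = end
    return tokens
```

Under these assumptions, both `a\u160?c\par` and `a\u160 ?c\par` produce a text token "?c" after the \u160 control word, so there is a "?" available to be eaten.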

Owner

jgm commented Aug 31, 2022

Maybe it's an RTF writer issue? The parameter (160) is supposed to have a delimiter, which is a space or a nonalphabetic, nonnumeric character. Here that's going to be the '?', which I think is actually meant to stand in for the Unicode character when a reader can't render it. Putting a space before the ? in the RTF code fixes the issue.
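
For comparison, here is a small Python sketch of a writer-side escape that can emit either spelling discussed in this comment: `\u160?` or `\u160 ?`. The function rtf_escape and its space_delimiter flag are hypothetical names for this example. It is not pandoc's RTF writer and says nothing about what the fixing commit changed.

```python
def rtf_escape(text, space_delimiter=True):
    # Escape text for RTF, writing non-ASCII characters as \uN plus a "?" fallback.
    # With space_delimiter=True a space terminates the \uN control word explicitly.
    delimiter = " " if space_delimiter else ""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)          # escape RTF's special characters
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF works in 16-bit units, so encode to UTF-16 and emit each unit.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536          # arguments are signed 16-bit numbers
                out.append("\\u%d%s?" % (unit, delimiter))
    return "".join(out)
```

Called on "a", a non-breaking space, and "c", this returns `a\u160 ?c` by default and `a\u160?c` with space_delimiter=False.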

@jgm jgm closed this as completed in b133a8b Aug 31, 2022
Author

huyz commented Sep 1, 2022

Wow that was quick. Thanks!
