unicode character seems to swallow other characters during a round-trip conversion from HTML to RTF and back #8264

huyz · 2022-08-31T14:44:45Z

Explain the problem.

When I have a non-breaking space in HTML and convert to RTF and then back to HTML, this causes adjoining characters to be swallowed:

❯ printf '<span>\xC2\xA0</span>curious\n' | pandoc -f html -t rtf | pandoc -f rtf -t html
<p> urious</p>

I don't think the c of curious should be lost during this round-trip.

Pandoc version?
pandoc 2.19.2
macOS 12.5.1 (Monterey) Apple M1 Max (ARM)

jgm · 2022-08-31T15:19:08Z

Simple repro:

% pandoc -f rtf -t html
{\pard \ql \f0 \sa180 \li0 \fi0 a\u160?c\par}
<p>a </p>

The c disappears.

jgm · 2022-08-31T15:32:58Z

Relevant part of tokenization:

Tok (line 1, column 33) (UnformattedText "a")
Tok (line 1, column 34) (ControlWord "u" (Just 160))
Tok (line 1, column 40) (UnformattedText "c")
Tok (line 1, column 41) (ControlWord "par" Nothing)

jgm · 2022-08-31T15:44:19Z

Relevant parts of the spec:

\uN
This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.

\ucN
This keyword represents the number of bytes corresponding to a given \uN Unicode character. This keyword may be used at any time, and values are scoped like character properties. That is, a \ucN keyword applies only to text following the keyword, and within the same (or deeper) nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a stack of counts seen and use the most recent one to skip the appropriate number of characters when it encounters a \uN keyword. When leaving an RTF group which specified a \uc value, the reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword has been seen in the current or outer scopes.

A common practice is to emit no ANSI representation for Unicode characters within a Unicode destination context (that is, inside a \ud destination). Typically, the destination will contain a \uc0 control sequence. There is no need to reset the count on leaving the \ud destination as the scoping rules will ensure the previous value is restored.
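
To make the skipping rules above concrete, here is a rough Python sketch of a reader loop. It is not pandoc's implementation (which is in Haskell); the tuple-based token format and the `render` name are made up for illustration, and `\bin` data and surrogate pairs are not handled.

```python
def render(tokens):
    # Tokens are hypothetical tuples for this sketch:
    #   ("text", "abc"), ("ctrl", "u", 160), ("open",), ("close",)
    uc_stack = [1]   # default skip count is 1 when no \uc has been seen
    to_skip = 0      # fallback characters still to be ignored after a \uN
    out = []
    for tok in tokens:
        kind = tok[0]
        if kind == "open":
            to_skip = 0                    # a brace ends skippable data
            uc_stack.append(uc_stack[-1])  # inner group inherits the count
        elif kind == "close":
            to_skip = 0
            if len(uc_stack) > 1:
                uc_stack.pop()             # restore the outer \uc value
        elif kind == "ctrl":
            _, name, param = tok
            if to_skip > 0:
                to_skip -= 1               # a control word counts as one character
            elif name == "uc" and param is not None:
                uc_stack[-1] = param
            elif name == "u" and param is not None:
                # Parameters are signed 16-bit, so negatives wrap around.
                code = param + 65536 if param < 0 else param
                out.append(chr(code))
                to_skip = uc_stack[-1]
        elif kind == "text":
            text = tok[1]
            eaten = min(to_skip, len(text))
            to_skip -= eaten
            out.append(text[eaten:])
    return "".join(out)
```

With a "?" text token after the `\u160` token, the "?" is skipped and the "c" survives; without it, the "c" is what gets skipped.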

jgm · 2022-08-31T15:48:08Z

In this case sEatChars is being set to 1.

jgm · 2022-08-31T15:49:19Z

So it's eating the UnformattedText "c".
I think we should have an UnformattedText "?" in there for it to eat. So the problem may be in tokenization.
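
For illustration, a minimal tokenizer sketch in Python (not pandoc's Haskell code) that follows the delimiter rule from the RTF spec: a single space after a control word is consumed as part of it, while any other non-alphanumeric delimiter stays in the input as text. The token tuples and the `tokenize` name are invented for this example; hex escapes, `\bin`, and line-ending handling are left out.

```python
import re

# Backslash, ASCII letters, optional signed decimal parameter, and at most
# one trailing space. Any other delimiter is NOT consumed.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")

def tokenize(rtf):
    tokens = []
    pos = 0
    while pos < len(rtf):
        m = CONTROL_WORD.match(rtf, pos)
        if m:
            param = int(m.group(2)) if m.group(2) is not None else None
            tokens.append(("ctrl", m.group(1), param))
            pos = m.end()
        elif rtf[pos] in "{}":
            tokens.append(("open",) if rtf[pos] == "{" else ("close",))
            pos += 1
        elif rtf[pos] == "\\" and pos + 1 < len(rtf):
            # Control symbol such as \{ or \\ (kept as a raw symbol here).
            tokens.append(("sym", rtf[pos + 1]))
            pos += 2
        else:
            # Plain text runs until the next backslash or brace.
            end = pos + 1
            while end < len(rtf) and rtf[end] not in "\\{}":
                end += 1
            tokens.append(("text", rtf[pos:end]))
            pos = end
    return tokens
```

Under this rule, `a\u160?c\par` and `a\u160 ?c\par` produce the same tokens, and in both the "?" is available as text for the reader to skip.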

jgm · 2022-08-31T15:56:07Z

Maybe it's an RTF writer issue? The parameter (160) is supposed to have a delimiter, which is a space or a nonalphabetic, nonnumeric character. Here that's going to be the '?', which I think is actually meant to stand in for the character if the reader can't render the Unicode character. Putting a space before the ? in the RTF code fixes the issue.
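
On the writer side, a hedged Python sketch of what emitting `\uN` with an explicit delimiter space could look like. The `escape_rtf` function and its parameters are hypothetical, not pandoc's API. It assumes `\uc1` is in effect (one fallback character per `\uN`) and, as a further assumption, writes characters outside the BMP as two UTF-16 code units.

```python
def escape_rtf(text, fallback="?", delimiter=" "):
    # delimiter=" " emits the keyword-terminating space the spec allows;
    # delimiter="" reproduces the compact \u160? form from this issue.
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)          # escape RTF special characters
        elif ord(ch) < 128:
            out.append(ch)                 # plain ASCII passes through
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "big")
                if unit > 32767:
                    unit -= 65536          # parameters are signed 16-bit
                out.append(f"\\u{unit}{delimiter}{fallback}")
    return "".join(out)
```

Both output forms are valid per the spec text quoted above, so a reader should accept either; the space just makes the control word boundary unambiguous.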

huyz · 2022-09-01T02:34:41Z

Wow that was quick. Thanks!

huyz added the bug label Aug 31, 2022

jgm added format:RTF reader labels Aug 31, 2022

jgm closed this as completed in b133a8b Aug 31, 2022

d0rianb mentioned this issue Apr 27, 2024

target two-character string modification d0rianb/rtf-parser#14

Closed
