html output break utf-8 #275

Dushistov · 2023-05-05T20:43:49Z

I have lines like this in my codebase:

    if !matches!(it.next(), Some('A'..='Z') | Some('А'..='Я')) {
        return false;
    }

I generate the HTML report with this command: `cargo llvm-cov --html test`.

For the letter 'Я', the resulting HTML looks like this:

=&apos;\320<span class='tooltip-content'>5</span></div>\257&apos;

So the HTML generator splits the valid UTF-8 sequence for 'Я' (\320\257) into two parts and inserts HTML markup between them, which makes the output invalid UTF-8.
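To make the byte split concrete, here is a minimal Rust sketch that prints the UTF-8 bytes of a character as octal escapes and checks whether a byte offset is a safe place to cut a string. The helper names `utf8_octal_escapes` and `is_safe_split` are hypothetical and exist only for illustration; they are not part of cargo-llvm-cov or llvm-cov.

```rust
/// Hypothetical helper: formats the UTF-8 bytes of `c` as octal escapes,
/// the same notation that appears in the broken HTML above.
fn utf8_octal_escapes(c: char) -> String {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf)
        .bytes()
        .map(|b| format!("\\{:o}", b))
        .collect()
}

/// Hypothetical helper: true if `index` (a byte offset) falls between two
/// characters of `s`, so markup inserted there cannot split a UTF-8 sequence.
fn is_safe_split(s: &str, index: usize) -> bool {
    s.is_char_boundary(index)
}
```

For 'Я' (U+042F) the first helper yields `\320\257`. Byte offset 1 of the string "Я" is not a character boundary, and that is the position where the tooltip markup ended up in the report.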


coverage: `llvm-cov` expects column numbers to be bytes, not code points

Normally the compiler emits column numbers as a 1-based number of Unicode code points.

But when we embed coverage mappings for `-Cinstrument-coverage`, those mappings will ultimately be read by the `llvm-cov` tool. That tool assumes that column numbers are 1-based numbers of *bytes*, and relies on that assumption when slicing up source code to apply highlighting (in HTML reports, and in text-based reports with colour).

For the very common case of all-ASCII source code, bytes and code points are the same, so the difference isn't noticeable. But for code that contains non-ASCII characters, emitting column numbers as code points will result in `llvm-cov` slicing strings in the wrong places, producing mangled output or fatal errors.

(See taiki-e/cargo-llvm-cov#275 as an example of what can go wrong.)
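The difference between the two column conventions can be sketched in a few lines of Rust. This is only an illustration of the mismatch described above, not the actual change made in rust-lang/rust#119033; the function name `char_col_to_byte_col` is hypothetical.

```rust
/// Hypothetical helper: converts a 1-based column counted in Unicode code
/// points into a 1-based column counted in UTF-8 bytes for one source line.
/// Returns `None` for column 0 or for a column more than one past the end.
fn char_col_to_byte_col(line: &str, char_col: usize) -> Option<usize> {
    // Columns are 1-based, so column 0 is never valid.
    let target = char_col.checked_sub(1)?;
    line.char_indices()
        .map(|(byte_offset, _)| byte_offset)
        // Allow the "one past the end" column, which marks the end of a span.
        .chain(std::iter::once(line.len()))
        .nth(target)
        .map(|byte_offset| byte_offset + 1)
}
```

For an all-ASCII line the two numbers agree. For a fragment such as `'А'..='Я'`, the closing quote after 'Я' is code point column 9 but byte column 11, so a tool that treats 9 as a byte column cuts the line in the middle of 'Я'.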

Rollup merge of rust-lang#119033 - Zalathar:unicode, r=davidtwco

Zalathar · 2024-01-09T02:25:27Z

This should now be fixed upstream in nightly as of rust-lang/rust#119033.

taiki-e · 2024-01-09T02:27:55Z

@Zalathar Great! Thanks for fixing this!

taiki-e added the C-upstream-bug Category: This is a bug of compiler or dependencies (the fix may require action in the upstream) label May 5, 2023

Zalathar mentioned this issue Dec 17, 2023

coverage: llvm-cov expects column numbers to be bytes, not code points rust-lang/rust#119033

Merged

taiki-e closed this as completed Jan 9, 2024
