Skip to content

Add libhtml #13896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

Add libhtml #13896

wants to merge 6 commits into from

Conversation

lilyball
Copy link
Contributor

@lilyball lilyball commented May 2, 2014

libhtml provides escaping/unescaping of HTML entities. It matches the
HTML5 parsing rules as closely as possible. It provides convenience
functions to escape/unescape, helpers to perform the escaping/unescaping
during Show, and Writers that can escape/unescape in a streaming manner.

References:
http://www.w3.org/html/wg/drafts/html/CR/syntax.html

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

This is a rewrite of PR #13831.

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

This is not quite ready to be merged yet. It needs more tests. But it passes all the ones it currently has, as well as make check on my machine.

fn bench_unescape(b: &mut test::Bencher) {
let s = "<script src="evil.domain?foo&" type='baz'>";
b.iter(|| unescape(s));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had added a couple benches for checking worse case scenarios, I'm curious how they fared with your rewrite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I grabbed the latest test_unescape tests from your code but didn't check if the others were updated.

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

@seanmonstar I just pulled in your benchmarks.

test tests::bench_escape                 ... bench:      1020 ns/iter (+/- 17)
test tests::bench_longest_entity         ... bench:      1116 ns/iter (+/- 66)
test tests::bench_longest_non_entity     ... bench:      1249 ns/iter (+/- 98)
test tests::bench_short_entity_long_tail ... bench:      1051 ns/iter (+/- 17)
test tests::bench_unescape               ... bench:      3199 ns/iter (+/- 48)

@seanmonstar
Copy link
Contributor

Excellent, the worst-case scenarios are mostly indistinguishable. Something I had done for the bench_unescape was to add in some "fast common" cases. Basically, check the 5 entities that escape outputs, before resorting to a bsearch. It got that bench down to ~1300ns.

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

I can't do that with my approach, because I start walking the ENTITIES list the moment I find the first alphabetic character (e.g. as soon as I've seen "&a"). Unfortunately I have to do a linear walk of the list for each subsequent character. But I don't know how much time is actually spent in that part of the code.

I also don't know how comparable our numbers are, because we're on different machines. It would also be worth having benchmarks of various different types of unescapes, from ones using just the 5 basic characters, to ones using esoteric named characters, to ones using numeric escapes.

@seanmonstar
Copy link
Contributor

For the numbers: your bench_escape is the same as it was on my machine. The worst-case scenarios were seeing ~4000ns, because of the rewinding.

It doesn't look like there are any entities that are shorter than 2 characters. Could it be possible to delay the lookup until the second character, and special case if the chars are am,lt,gt,ap,qu? Just a thought for speeding up the most common escapes.

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

Hmm, interesting suggestion. It will complicate the code a little (obviously), but it might improve the speed. I'll give it a shot later.

@seanmonstar
Copy link
Contributor

Also, here's Python 3's html entities test case: http://hg.python.org/cpython/file/82caec3865e3/Lib/test/test_html.py

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

Thanks. Stealing tests from other entity libraries is a good idea.

@lilyball
Copy link
Contributor Author

lilyball commented May 2, 2014

Ugh, why did my local make check not hit the make tidy errors?

@lilyball
Copy link
Contributor Author

lilyball commented May 3, 2014

@seanmonstar I just ported the python tests, and they uncovered two bugs (now fixed).

@seanmonstar
Copy link
Contributor

\o/

pub enum EscapeMode {
/// The general-purpose mode. Escapes ``&<>"'`.
EscapeDefault,
/// Escapes characters for text nodes. Escapes `&<>`.,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errant comma at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye.

@seanmonstar
Copy link
Contributor

@kballard what more do you think this needs? More tests? Or can this be reviewed and merged?

@lilyball
Copy link
Contributor Author

lilyball commented May 7, 2014

@seanmonstar At this point I think it just needs more tests. I've been trying to find the time to write some, but I've been busy trying to shove my other PRs through the pipe. I'm not comfortable committing this until it at least has tests that exercise the different modes.

@seanmonstar
Copy link
Contributor

By modes, you mean the EscapeModes? That's functionality you were mentioning servo could want, right? I can try to write some tests; what's the best way to get them to you? A gist?

@lilyball
Copy link
Contributor Author

lilyball commented May 7, 2014

Yeah, the EscapeModes. I also want a test for with_allowed_char(), although since that's largely a no-op when used correctly (e.g. when used with >), the only real way to test it is by giving it a character that's normally valid.

Gist would work, or you could just make a commit on top of this branch, push it to your own repo, and comment with the URL to your branch. I can pull it into this (and you'll retain authorship of the commit).

@chris-morgan
Copy link
Member

I am opposed to this being called html if it only does escaping. htmlescape?

I would prefer something like https://github.com/kmcallister/html5 to be able to use the name html.

@seanmonstar
Copy link
Contributor

Precedent in other languages is that "html" provides escaping. "xml" tends
to be the place where parsing exists, with "dom" and "sax" modules.

@lilyball
Copy link
Contributor Author

@chris-morgan My thought was that once we try to incorporate full html5 parsing it could also go in the html module. Hopefully it would coexist with the existing APIs.

@lilyball
Copy link
Contributor Author

I've rebased on top of latest master (to include the fix for the match bug and the change to core::fmt). Still need to write tests for the various escaping modes.

@lilyball
Copy link
Contributor Author

Incidentally, it turns out most HTML libraries use &#x27; instead of &apos; because &apos; didn't exist in HTML4. It showed up in XHTML 1.0 (because it comes from XML), and then was added to HTML5.

I suppose this means we should probably encode using &#x27;, because even though we are an HTML5 entity decoder, using &#x27; is more generally useful.

@lilyball
Copy link
Contributor Author

On another note, it also turns out that my attempt to allow you to encode all non-ASCII characters has a serious problem. HTML5 maps a bunch of characters from 0x80-0x9F to other codepoints, which means there's no way to encode U+0080 as an HTML entity that will decode back into U+0080.

3 possible approaches:

  1. Redefine EscapeAll mode to not try and encode these characters. This means it is no longer guaranteed to produce ASCII output, which is unfortunate.
  2. Emit &#xFFFD; for these. If we can't round-trip, turning into U+FFFD may be the next best thing.
  3. Fail. And by that I mean return an IoError if one of these characters is encountered. That kind of sucks, though, and there's no way to customize the error (we'd have to use OtherIoError and put a textual description in the desc field).

We could also just get rid of EscapeAll, but it is potentially useful.

Incidentally, in python, "\x80".encode("ascii", "xmlcharrefreplace") produces &#128;, but it's using XML rules instead of HTML5 rules.

@seanmonstar Do you have any bright ideas here? Or know of any precedent where an HTML escaping library will escape non-ASCII characters (and what it does with e.g. "\x80")?

@lilyball
Copy link
Contributor Author

I believe this is now ready.

@emberian
Copy link
Member

cc @kmcallister

@lilyball
Copy link
Contributor Author

Regarding @kmcallister's html5 work, I'm hoping that, once that's ready, it should be able to take over the libhtml name. Hopefully it should not collide with the existing functionality here, but renaming some API as a [breaking-change] should be doable if necessary.

Worst case, if we stabilize this for Rust 1.0 before html5 is ready, it could just become libhtml5. But hopefully if we get to the point of stabilizing this, we can discuss with @kmcallister any API renaming to be done to ensure we won't have a problem later.

@huonw
Copy link
Member

huonw commented May 17, 2014

Presumably we can just not stabilize this library for 1.0.

@kmcallister
Copy link
Contributor

For reference, here's my entity parsing code, and the procedural macro which generates the table of entities as a Rust-PHF map. I'd be happy to work together on this stuff.

@lilyball
Copy link
Contributor Author

@kmcallister I would be happy to work together on this stuff. I'm concerned that trying to port this existing libhtml PR on top of your existing work is, well, a lot of work. Besides the effort of porting it, this would also make it more difficult for you to modify that code if you determine that you need to make changes.

Unless your html5 library is ready for merging in the very near future, my inclination is to say we should go ahead with this libhtml implementation of entity parsing as-is. As long as it conforms to the HTML5 entity parsing algorithm (and it was designed to do just that), we should be able to rewrite the existing API on top of your code at such time as html5 is ready to be merged without breaking any clients.

So I guess I have two questions for you about this PR that I would like you to comment on:

  1. Is the API ok? As in, can it be made to work with your code in the future? Note that I'm not marking this API as #[stable], so we reserve the right to change it in the future. But the general approach here should be fine, I assume?
  2. Is the algorithm actually correct? I believe it is, and I have a limited test suite. The only pre-existing test suite I included was a port of the one from python's html library. I think that, plus the hand-written tests, exercises the library sufficiently to be comfortable saying it's correct.

If you're comfortable with the API, and you don't spot any obvious flaws with the algorithm, then my preference is to merge it as-is, and worry about porting it on top of your code later.

lilyball added 6 commits May 18, 2014 17:26
libhtml provides escaping/unescaping of HTML entities. It matches the
HTML5 parsing rules as closely as possible. It provides convenience
functions to escape/unescape, helpers to perform the escaping/unescaping
during Show, and Writers that can escape/unescape in a streaming manner.

References:
http://www.w3.org/html/wg/drafts/html/CR/syntax.html
Add a bunch of tests taken from cpython's html module, along with a
couple of other homegrown ones.

Add support for &#X123; entities, with the capital X, which was
forgotten before.
When a named entity is aborted, we need to backtrack to the longest
prefix that doesn't require a semicolon.
EscapeAll tries to escape all non-ASCII characters. Unfortunately, HTML5
numeric entities can't represent most codepoints between U+0080 and
U+009F. The only way to handle those is to use XML entity rules, but
this is an HTML5 entity library.

Also change &apos; to &rust-lang#39;. It turns out &apos; isn't part of HTML
until HTML5, so using &rust-lang#39; is more compatible with pre-HTML5 parsers.
@kmcallister
Copy link
Contributor

Is the API ok? As in, can it be made to work with your code in the future?

I took a quick look and it seems fine. Is there anything particularly tricky you'd like me to consider?

Is the algorithm actually correct? I believe it is, and I have a limited test suite.

I'm using the html5lib tokenizer tests. Some of those files are specifically about character entities. Or you could take all the files and filter for the tests that just contain character tokens.

my preference is to merge it as-is, and worry about porting it on top of your code later.

That's fine with me.

@lilyball
Copy link
Contributor Author

@kmcallister Nothing in particular. I just wanted to make sure you're ok with it before I push to get it merged.

@alexcrichton
Copy link
Member

Closing due to inactivity, but this seems like a nice library to have!

bors pushed a commit to rust-lang-ci/rust that referenced this pull request Jan 28, 2025
I opened rust-lang/rust-clippy#13896 before.
However, I found that there're more cases where Clippy suggests to use
modules that belong to the `std` crate even in a `no_std` environment.
Therefore, this PR include the changes I've made in rust-lang#13896 and new
changes to fix cases I found this time to prevent wrong suggestions in
`no_std` environments as well.

changelog: [`redundant_closure`]: correct suggestion in `no_std`
changelog: [`repeat_vec_with_capacity`]: correct suggestion in `no_std`
changelog: [`single_range_in_vec_init`]: don't emit suggestion to use
`Vec` in `no_std`
changelog: [`drain_collect`]: correct suggestion in `no_std`
changelog: [`map_with_unused_argument_over_ranges`]: correct suggestion
in `no_std`

also close rust-lang#13895
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants