Add libhtml #13896

lilyball · 2014-05-02T16:50:10Z

libhtml provides escaping/unescaping of HTML entities. It matches the
HTML5 parsing rules as closely as possible. It provides convenience
functions to escape/unescape, helpers to perform the escaping/unescaping
during Show, and Writers that can escape/unescape in a streaming manner.

References:
http://www.w3.org/html/wg/drafts/html/CR/syntax.html

lilyball · 2014-05-02T16:50:20Z

This is a rewrite of PR #13831.

lilyball · 2014-05-02T16:51:21Z

This is not quite ready to be merged yet. It needs more tests. But it passes all the ones it currently has, as well as make check on my machine.

seanmonstar · 2014-05-02T16:51:53Z

src/libhtml/lib.rs

+    fn bench_unescape(b: &mut test::Bencher) {
+        let s = "&lt;script src=&quot;evil.domain?foo&amp;&quot; type=&#39;baz&#39;&gt;";
+        b.iter(|| unescape(s));
+    }


I had added a couple of benches for checking worst-case scenarios; I'm curious how they fared with your rewrite.

Good question. I grabbed the latest test_unescape tests from your code but didn't check if the others were updated.

lilyball · 2014-05-02T16:58:37Z

@seanmonstar I just pulled in your benchmarks.

test tests::bench_escape                 ... bench:      1020 ns/iter (+/- 17)
test tests::bench_longest_entity         ... bench:      1116 ns/iter (+/- 66)
test tests::bench_longest_non_entity     ... bench:      1249 ns/iter (+/- 98)
test tests::bench_short_entity_long_tail ... bench:      1051 ns/iter (+/- 17)
test tests::bench_unescape               ... bench:      3199 ns/iter (+/- 48)

seanmonstar · 2014-05-02T17:05:35Z

Excellent, the worst-case scenarios are mostly indistinguishable. Something I had done for the bench_unescape was to add in some "fast common" cases. Basically, check the 5 entities that escape outputs, before resorting to a bsearch. It got that bench down to ~1300ns.
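
A rough sketch of that fast-path idea, for illustration only. It is not the code from either PR: the `ENTITIES` excerpt and the `lookup_entity` name are assumptions, and the real table is far larger.

```rust
/// Hypothetical, sorted excerpt standing in for the full entity table.
const ENTITIES: &[(&str, char)] = &[
    ("aacute", '\u{e1}'),
    ("amp", '&'),
    ("apos", '\''),
    ("copy", '\u{a9}'),
    ("gt", '>'),
    ("lt", '<'),
    ("nbsp", '\u{a0}'),
    ("quot", '"'),
];

/// Look up an entity name (without `&` and `;`).
fn lookup_entity(name: &str) -> Option<char> {
    // Fast path: the five names mentioned above as the ones escape outputs.
    match name {
        "amp" => return Some('&'),
        "lt" => return Some('<'),
        "gt" => return Some('>'),
        "quot" => return Some('"'),
        "apos" => return Some('\''),
        _ => {}
    }
    // Slow path: binary search over the sorted table.
    ENTITIES
        .binary_search_by(|&(candidate, _)| candidate.cmp(name))
        .ok()
        .map(|index| ENTITIES[index].1)
}
```

Whether the fast path pays off depends on how often input contains only the common five, which is what a benchmark like bench_unescape measures.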

lilyball · 2014-05-02T17:12:29Z

I can't do that with my approach, because I start walking the ENTITIES list the moment I find the first alphabetic character (e.g. as soon as I've seen "&a"). Unfortunately I have to do a linear walk of the list for each subsequent character. But I don't know how much time is actually spent in that part of the code.

I also don't know how comparable our numbers are, because we're on different machines. It would also be worth having benchmarks of various different types of unescapes, from ones using just the 5 basic characters, to ones using esoteric named characters, to ones using numeric escapes.

seanmonstar · 2014-05-02T17:22:33Z

For the numbers: your bench_escape is the same as it was on my machine. The worst-case scenarios were seeing ~4000ns, because of the rewinding.

It doesn't look like there are any entities that are shorter than 2 characters. Could it be possible to delay the lookup until the second character, and special case if the chars are am,lt,gt,ap,qu? Just a thought for speeding up the most common escapes.
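
One way the two-character special case could look, as a sketch under assumptions: `common_candidate` and `try_common_entity` are hypothetical names, and this version only accepts the semicolon-terminated form, leaving everything else to the general lookup.

```rust
/// Map the first two letters after `&` to one of the five common entities.
fn common_candidate(first: char, second: char) -> Option<(&'static str, char)> {
    match (first, second) {
        ('a', 'm') => Some(("amp", '&')),
        ('l', 't') => Some(("lt", '<')),
        ('g', 't') => Some(("gt", '>')),
        ('a', 'p') => Some(("apos", '\'')),
        ('q', 'u') => Some(("quot", '"')),
        _ => None,
    }
}

/// `rest` is the input immediately after `&`.
/// Returns the decoded char and the number of bytes consumed (name plus `;`),
/// or `None` when the caller should fall back to the full table walk.
fn try_common_entity(rest: &str) -> Option<(char, usize)> {
    let mut chars = rest.chars();
    let first = chars.next()?;
    let second = chars.next()?;
    let (name, decoded) = common_candidate(first, second)?;
    // Other entities share these prefixes (e.g. names starting with "lt"),
    // so insist on the exact name followed by a semicolon.
    let tail = rest.strip_prefix(name)?;
    if tail.starts_with(';') {
        Some((decoded, name.len() + 1))
    } else {
        None
    }
}
```

A `None` here is not an error; it only means the fast path did not apply.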

lilyball · 2014-05-02T17:24:53Z

Hmm, interesting suggestion. It will complicate the code a little (obviously), but it might improve the speed. I'll give it a shot later.

seanmonstar · 2014-05-02T17:50:39Z

Also, here's Python 3's html entities test case: http://hg.python.org/cpython/file/82caec3865e3/Lib/test/test_html.py

lilyball · 2014-05-02T17:54:33Z

Thanks. Stealing tests from other entity libraries is a good idea.

lilyball · 2014-05-02T17:55:21Z

Ugh, why did my local make check not hit the make tidy errors?

lilyball · 2014-05-03T07:07:06Z

@seanmonstar I just ported the python tests, and they uncovered two bugs (now fixed).

seanmonstar · 2014-05-04T04:04:44Z

\o/

richo · 2014-05-07T07:59:28Z

src/libhtml/escape.rs

+pub enum EscapeMode {
+    /// The general-purpose mode. Escapes ``&<>"'`.
+    EscapeDefault,
+    /// Escapes characters for text nodes. Escapes `&<>`.,


Errant comma at the end?

seanmonstar · 2014-05-07T17:09:36Z

@kballard what more do you think this needs? More tests? Or can this be reviewed and merged?

lilyball · 2014-05-07T17:10:47Z

@seanmonstar At this point I think it just needs more tests. I've been trying to find the time to write some, but I've been busy trying to shove my other PRs through the pipe. I'm not comfortable committing this until it at least has tests that exercise the different modes.

seanmonstar · 2014-05-07T17:22:49Z

By modes, you mean the EscapeModes? That's functionality you were mentioning servo could want, right? I can try to write some tests; what's the best way to get them to you? A gist?

lilyball · 2014-05-07T17:38:22Z

Yeah, the EscapeModes. I also want a test for with_allowed_char(), although since that's largely a no-op when used correctly (e.g. when used with >), the only real way to test it is by giving it a character that's normally valid.

Gist would work, or you could just make a commit on top of this branch, push it to your own repo, and comment with the URL to your branch. I can pull it into this (and you'll retain authorship of the commit).

chris-morgan · 2014-05-10T08:15:23Z

I am opposed to this being called html if it only does escaping. htmlescape?

I would prefer something like https://github.com/kmcallister/html5 to be able to use the name html.

seanmonstar · 2014-05-10T14:53:03Z

Precedent in other languages is that "html" provides escaping. "xml" tends
to be the place where parsing exists, with "dom" and "sax" modules.

lilyball · 2014-05-10T19:37:38Z

@chris-morgan My thought was that once we try to incorporate full html5 parsing it could also go in the html module. Hopefully it would coexist with the existing APIs.

lilyball · 2014-05-16T23:47:06Z

I've rebased on top of latest master (to include the fix for the match bug and the change to core::fmt). Still need to write tests for the various escaping modes.

lilyball · 2014-05-17T08:18:16Z

Incidentally, it turns out most HTML libraries use `&#39;` instead of `&apos;` because `&apos;` didn't exist in HTML4. It showed up in XHTML 1.0 (because it comes from XML), and then was added to HTML5.

I suppose this means we should probably encode using `&#39;`, because even though we are an HTML5 entity decoder, using `&#39;` is more generally useful.
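
A minimal sketch of an escaper that makes this choice. The function name `escape_default` is an assumption borrowed from the `EscapeDefault` mode name; it is not the PR's actual API.

```rust
/// Escape the five characters covered by the general-purpose mode,
/// using the numeric form for the apostrophe so that pre-HTML5
/// parsers also understand the output.
fn escape_default(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(ch),
        }
    }
    out
}
```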

lilyball · 2014-05-17T08:23:49Z

On another note, it also turns out that my attempt to allow you to encode all non-ASCII characters has a serious problem. HTML5 maps a bunch of characters from 0x80-0x9F to other codepoints, which means there's no way to encode U+0080 as an HTML entity that will decode back into U+0080.

3 possible approaches:

1. Redefine EscapeAll mode to not try and encode these characters. This means it is no longer guaranteed to produce ASCII output, which is unfortunate.
2. Emit `&#xFFFD;` for these. If we can't round-trip, turning into U+FFFD may be the next best thing.
3. Fail. And by that I mean return an IoError if one of these characters is encountered. That kind of sucks, though, and there's no way to customize the error (we'd have to use OtherIoError and put a textual description in the desc field).

We could also just get rid of EscapeAll, but it is potentially useful.

Incidentally, in python, `"\x80".encode("ascii", "xmlcharrefreplace")` produces `&#128;`, but it's using XML rules instead of HTML5 rules.

@seanmonstar Do you have any bright ideas here? Or know of any precedent where an HTML escaping library will escape non-ASCII characters (and what it does with e.g. "\x80")?
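
To make the round-trip problem concrete, here is a sketch of HTML5-style numeric reference decoding. The name `decode_numeric` is hypothetical, and the table is the windows-1252 remapping that HTML5 applies to 0x80-0x9F, written from memory of the spec rather than taken from the PR.

```rust
/// Decode the number from a reference such as `&#128;` the way an
/// HTML5 parser would, including the 0x80-0x9F remapping.
fn decode_numeric(code: u32) -> char {
    let remapped = match code {
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
        // NUL is replaced rather than passed through.
        0x00 => 0xFFFD,
        other => other,
    };
    // Surrogates and out-of-range values also become U+FFFD.
    char::from_u32(remapped).unwrap_or('\u{FFFD}')
}
```

Under these rules, escaping U+0080 as `&#128;` decodes to U+20AC (the euro sign), so the original character cannot be recovered.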

lilyball · 2014-05-17T21:25:58Z

I believe this is now ready.

emberian · 2014-05-17T21:27:20Z

cc @kmcallister

lilyball · 2014-05-17T21:36:43Z

Regarding @kmcallister's html5 work, I'm hoping that, once that's ready, it should be able to take over the libhtml name. Hopefully it should not collide with the existing functionality here, but renaming some API as a [breaking-change] should be doable if necessary.

Worst case, if we stabilize this for Rust 1.0 before html5 is ready, it could just become libhtml5. But hopefully if we get to the point of stabilizing this, we can discuss with @kmcallister any API renaming to be done to ensure we won't have a problem later.

huonw · 2014-05-17T23:35:20Z

Presumably we can just not stabilize this library for 1.0.

kmcallister · 2014-05-18T21:37:30Z

For reference, here's my entity parsing code, and the procedural macro which generates the table of entities as a Rust-PHF map. I'd be happy to work together on this stuff.

lilyball · 2014-05-18T23:22:30Z

@kmcallister I would be happy to work together on this stuff. I'm concerned that trying to port this existing libhtml PR on top of your existing work is, well, a lot of work. Besides the effort of porting it, this would also make it more difficult for you to modify that code if you determine that you need to make changes.

Unless your html5 library is ready for merging in the very near future, my inclination is to say we should go ahead with this libhtml implementation of entity parsing as-is. As long as it conforms to the HTML5 entity parsing algorithm (and it was designed to do just that), we should be able to rewrite the existing API on top of your code at such time as html5 is ready to be merged without breaking any clients.

So I guess I have two questions for you about this PR that I would like you to comment on:

1. Is the API ok? As in, can it be made to work with your code in the future? Note that I'm not marking this API as #[stable], so we reserve the right to change it in the future. But the general approach here should be fine, I assume?
2. Is the algorithm actually correct? I believe it is, and I have a limited test suite. The only pre-existing test suite I included was a port of the one from python's html library. I think that, plus the hand-written tests, exercises the library sufficiently to be comfortable saying it's correct.

If you're comfortable with the API, and you don't spot any obvious flaws with the algorithm, then my preference is to merge it as-is, and worry about porting it on top of your code later.

Add a bunch of tests taken from cpython's html module, along with a couple of other homegrown ones. Add support for &#X123; entities, with the capital X, which was forgotten before.

When a named entity is aborted, we need to backtrack to the longest prefix that doesn't require a semicolon.
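
The effect of that backtracking can be sketched as a longest-prefix match. This is an illustration only, not the PR's streaming implementation; `NAMES` is a tiny hypothetical excerpt and `longest_entity` is an assumed name. As in the HTML5 table, names that may appear without a semicolon are listed in both forms.

```rust
/// Hypothetical excerpt of the entity table.
const NAMES: &[(&str, char)] = &[
    ("amp", '&'),
    ("amp;", '&'),
    ("not", '\u{ac}'),
    ("not;", '\u{ac}'),
    ("notin;", '\u{2209}'),
];

/// `rest` is the input immediately after `&`.
/// Returns the decoded char and the number of bytes consumed.
fn longest_entity(rest: &str) -> Option<(char, usize)> {
    NAMES
        .iter()
        .filter(|entry| rest.starts_with(entry.0))
        .max_by_key(|entry| entry.0.len())
        .map(|entry| (entry.1, entry.0.len()))
}
```

With this table, `notin;` is consumed whole, while `notit;` falls back to `not` and leaves `it;` in the output as plain text.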

EscapeAll tries to escape all non-ASCII characters. Unfortunately, HTML5 numeric entities can't represent most codepoints between U+0080 and U+009F. The only way to handle those is to use XML entity rules, but this is an HTML5 entity library. Also change `&apos;` to `&#39;`. It turns out `&apos;` isn't part of HTML until HTML5, so using `&#39;` is more compatible with pre-HTML5 parsers.

kmcallister · 2014-05-22T00:39:19Z

Is the API ok? As in, can it be made to work with your code in the future?

I took a quick look and it seems fine. Is there anything particularly tricky you'd like me to consider?

Is the algorithm actually correct? I believe it is, and I have a limited test suite.

I'm using the html5lib tokenizer tests. Some of those files are specifically about character entities. Or you could take all the files and filter for the tests that just contain character tokens.

my preference is to merge it as-is, and worry about porting it on top of your code later.

That's fine with me.

lilyball · 2014-05-22T00:39:58Z

@kmcallister Nothing in particular. I just wanted to make sure you're ok with it before I push to get it merged.

alexcrichton · 2014-06-16T06:55:23Z

Closing due to inactivity, but this seems like a nice library to have!

lilyball mentioned this pull request May 2, 2014

Wrong matching with enums and overlapping ranges #13867

Closed

lilyball added 6 commits May 18, 2014 17:26

libhtml: Add a few more benchmarks

24b73d7

libhtml: Add tests, support &#X

4ad5be4

libhtml: Fix edge case in entity parsing

63d3b2f

Add tests for the various escaping modes

fff0b8e

japaric mentioned this pull request Jun 2, 2014

Rust Playpen Integration rust-lang/rust-by-example#71

Closed

alexcrichton closed this Jun 16, 2014
