Add libhtml #13896
This is a rewrite of PR #13831.
This is not quite ready to be merged yet. It needs more tests. But it passes all the ones it currently has, as well as
fn bench_unescape(b: &mut test::Bencher) {
    let s = "<script src="evil.domain?foo&" type='baz'>";
    b.iter(|| unescape(s));
}
I had added a couple of benches for checking worst-case scenarios; I'm curious how they fared with your rewrite.
Good question. I grabbed the latest test_unescape tests from your code but didn't check if the others were updated.
@seanmonstar I just pulled in your benchmarks.
Excellent, the worst-case scenarios are mostly indistinguishable. Something I had done for the
I can't do that with my approach, because I start walking the
I also don't know how comparable our numbers are, because we're on different machines. It would also be worth having benchmarks of various different types of unescapes, from ones using just the 5 basic characters, to ones using esoteric named characters, to ones using numeric escapes.
For the numbers: your
It doesn't look like there are any entities that are shorter than 2 characters. Could it be possible to delay the lookup until the second character, and special case if the chars are
Hmm, interesting suggestion. It will complicate the code a little (obviously), but it might improve the speed. I'll give it a shot later.
Also, here's Python 3's html entities test case: http://hg.python.org/cpython/file/82caec3865e3/Lib/test/test_html.py
Thanks. Stealing tests from other entity libraries is a good idea.
Ugh, why did my local
@seanmonstar I just ported the python tests, and they uncovered two bugs (now fixed).
\o/
pub enum EscapeMode {
    /// The general-purpose mode. Escapes ``&<>"'`.
    EscapeDefault,
    /// Escapes characters for text nodes. Escapes `&<>`.,
Errant comma at the end?
Good eye.
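For anyone skimming the thread, here is a rough sketch of what the two modes in the snippet under review amount to, going only by the doc comments: the default mode escapes `&<>"'` and the text-node mode escapes `&<>`. It is written against current Rust rather than the pre-1.0 Rust of this PR, and the names `Mode`, `Mode::Default`, `Mode::Text`, and `escape` are made up for illustration; they are not the PR's API. Emitting `&#39;` for the apostrophe follows the choice discussed elsewhere in this conversation.

```rust
/// Hypothetical stand-in for the PR's escape modes (names are assumptions).
#[derive(Clone, Copy)]
enum Mode {
    /// General-purpose: escapes & < > " '
    Default,
    /// Text nodes only: escapes & < >
    Text,
}

/// Escape `input` according to `mode`, returning a new String.
fn escape(input: &str, mode: Mode) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match (c, mode) {
            // These three are escaped in both modes.
            ('&', _) => out.push_str("&amp;"),
            ('<', _) => out.push_str("&lt;"),
            ('>', _) => out.push_str("&gt;"),
            // Quotes only matter when the output may land in an attribute.
            ('"', Mode::Default) => out.push_str("&quot;"),
            ('\'', Mode::Default) => out.push_str("&#39;"),
            // Everything else passes through unchanged.
            _ => out.push(c),
        }
    }
    out
}
```

Calling `escape(s, Mode::Text)` leaves quotes alone, which is only safe when the result is placed in a text node and never inside an attribute value.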
@kballard what more do you think this needs? More tests? Or can this be reviewed and merged?
@seanmonstar At this point I think it just needs more tests. I've been trying to find the time to write some, but I've been busy trying to shove my other PRs through the pipe. I'm not comfortable committing this until it at least has tests that exercise the different modes.
By modes, you mean the
Yeah, the Gist would work, or you could just make a commit on top of this branch, push it to your own repo, and comment with the URL to your branch. I can pull it into this (and you'll retain authorship of the commit).
I am opposed to this being called I would prefer something like https://github.com/kmcallister/html5 to be able to use the name
Precedent in other languages is that "html" provides escaping. "xml" tends |
@chris-morgan My thought was that once we try to incorporate full html5 parsing it could also go in the
I've rebased on top of latest master (to include the fix for the match bug and the change to
Incidentally, it turns out most HTML libraries use I suppose this means we should probably encode using
On another note, it also turns out that my attempt to allow you to encode all non-ASCII characters has a serious problem. HTML5 maps a bunch of characters from 3 possible approaches:
We could also just get rid of Incidentally, in python, @seanmonstar Do you have any bright ideas here? Or know of any precedent where an HTML escaping library will escape non-ASCII characters (and what it does with e.g.
I believe this is now ready.
cc @kmcallister
Regarding @kmcallister's html5 work, I'm hoping that, once that's ready, it should be able to take over the Worst case, if we stabilize this for Rust 1.0 before html5 is ready, it could just become libhtml5. But hopefully if we get to the point of stabilizing this, we can discuss with @kmcallister any API renaming to be done to ensure we won't have a problem later.
Presumably we can just not stabilize this library for 1.0.
For reference, here's my entity parsing code, and the procedural macro which generates the table of entities as a Rust-PHF map. I'd be happy to work together on this stuff.
@kmcallister I would be happy to work together on this stuff. I'm concerned that trying to port this existing libhtml PR on top of your existing work is, well, a lot of work. Besides the effort of porting it, this would also make it more difficult for you to modify that code if you determine that you need to make changes. Unless your html5 library is ready for merging in the very near future, my inclination is to say we should go ahead with this libhtml implementation of entity parsing as-is. As long as it conforms to the HTML5 entity parsing algorithm (and it was designed to do just that), we should be able to rewrite the existing API on top of your code at such time as html5 is ready to be merged without breaking any clients. So I guess I have two questions for you about this PR that I would like you to comment on:
If you're comfortable with the API, and you don't spot any obvious flaws with the algorithm, then my preference is to merge it as-is, and worry about porting it on top of your code later.
libhtml provides escaping/unescaping of HTML entities. It matches the HTML5 parsing rules as closely as possible. It provides convenience functions to escape/unescape, helpers to perform the escaping/unescaping during Show, and Writers that can escape/unescape in a streaming manner. References: http://www.w3.org/html/wg/drafts/html/CR/syntax.html
Add a bunch of tests taken from cpython's html module, along with a couple of other homegrown ones. Add support for &#X123; entities, with the capital X, which was forgotten before.
When a named entity is aborted, we need to backtrack to the longest prefix that doesn't require a semicolon.
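To make the backtracking rule above concrete, here is a minimal sketch in current Rust. It is not the PR's implementation: it uses a tiny hand-picked entity table and a plain linear scan for the longest matching name, where names stored without a trailing `;` stand for the legacy entities that HTML5 allows without a semicolon. The names `ENTITIES`, `longest_entity`, and `unescape_named` are assumptions made for this example, and the HTML5 special case for entities inside attribute values is not handled.

```rust
/// Illustrative subset of named character references (an assumption, not
/// the full HTML5 table). Entries without ';' are legacy names that may be
/// written without the semicolon.
const ENTITIES: &[(&str, &str)] = &[
    ("amp;", "&"),
    ("amp", "&"),
    ("lt;", "<"),
    ("lt", "<"),
    ("not;", "\u{ac}"),
    ("not", "\u{ac}"),
    ("notin;", "\u{2209}"),
];

/// Given the text right after '&', return the replacement text and the
/// number of bytes consumed by the longest matching entity name, if any.
fn longest_entity(rest: &str) -> Option<(&'static str, usize)> {
    ENTITIES
        .iter()
        .filter(|(name, _)| rest.starts_with(*name))
        .max_by_key(|(name, _)| name.len())
        .map(|&(name, value)| (value, name.len()))
}

/// Replace named entities in `input`, falling back to the longest
/// semicolon-free prefix when a longer name does not complete.
fn unescape_named(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 1..];
        match longest_entity(after) {
            Some((value, len)) => {
                out.push_str(value);
                rest = &after[len..];
            }
            None => {
                // Not an entity we know: keep the '&' literally.
                out.push('&');
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}
```

With this table, `&notin;` resolves to the single character U+2209, while `&notit;` starts down the same path, fails to complete, and falls back to the legacy `not` prefix, leaving `it;` as ordinary text.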
EscapeAll tries to escape all non-ASCII characters. Unfortunately, HTML5 numeric entities can't represent most codepoints between U+0080 and U+009F. The only way to handle those is to use XML entity rules, but this is an HTML5 entity library. Also change &apos; to &#39;. It turns out &apos; isn't part of HTML until HTML5, so using &#39; is more compatible with pre-HTML5 parsers.
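The U+0080 to U+009F problem comes from how HTML5 decodes numeric references: most values in that range are reinterpreted as Windows-1252 bytes, so writing `&#x80;` yields the euro sign rather than U+0080. The sketch below shows that decoding step in current Rust. The function name `decode_numeric` and the table name `C1_REMAP` are hypothetical and this is not code from the PR; the table itself is the standard Windows-1252 mapping, with the five unassigned positions left as identity.

```rust
/// Replacement code points for numeric references 0x80..=0x9F.
/// 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no Windows-1252 assignment and
/// map to themselves.
const C1_REMAP: [u32; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80..=0x87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F, // 0x88..=0x8F
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90..=0x97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178, // 0x98..=0x9F
];

/// Decode the numeric value of a character reference the way an HTML5
/// parser would (hypothetical helper for illustration).
fn decode_numeric(code: u32) -> char {
    let mapped = match code {
        // NUL is replaced by U+FFFD.
        0 => 0xFFFD,
        // The C1 range is remapped through the Windows-1252 table.
        0x80..=0x9F => C1_REMAP[(code - 0x80) as usize],
        _ => code,
    };
    // Surrogates and values above U+10FFFF are not valid chars.
    char::from_u32(mapped).unwrap_or('\u{FFFD}')
}
```

Because 27 of the 32 values decode to something other than themselves, an escaper that emitted `&#x85;` for U+0085 would not round-trip through an HTML5 parser, which is the issue the commit message describes.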
I took a quick look and it seems fine. Is there anything particularly tricky you'd like me to consider?
I'm using the html5lib tokenizer tests. Some of those files are specifically about character entities. Or you could take all the files and filter for the tests that just contain character tokens.
That's fine with me.
@kmcallister Nothing in particular. I just wanted to make sure you're ok with it before I push to get it merged.
Closing due to inactivity, but this seems like a nice library to have!