-
Notifications
You must be signed in to change notification settings - Fork 13.4k
add libhtml #13831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add libhtml #13831
Conversation
I would expect a library named |
@@ -0,0 +1,145 @@ | |||
// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's new code. It should just be Copyright 2014.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I copied this header from src/librustdoc/html/escape.rs
. The Escape
code was lifted directly from that file. So I figured it was 2013-2014.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it was? Ok then, I guess that's fine.
@kballard i don't see a problem with the small scope of the library. There is much smaller libraries: libflate, libfourcc, etc. Plus, the reference libraries (Go, Python) only both with escaping in their html libs. Other html related things could go here, as well. |
|
||
|
||
try!(write!(fmt.buf, "{}", esc)); | ||
last = i + n; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to behave unexpectedly if you have malformed entity that looks something like &a&b;
. Where "behave unexpectedly" will likely manifest as a failure on pile.o_bits.slice(last, i)
where last > i
.
Instead, you should ditch the for
loop and spin the iterator manually. You can then just pass the iterator directly to unescape_entity()
instead of creating a new one, and it will consume everything up to (and including) the trailing ;
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would look something like
loop {
let (i,b) = match iter.next() {
None => break,
Some(x) => x
};
if b == '&' as u8 {
if last < i { try!(fmt.buf.write(pile_o_bits.slice(last, i).as_bytes())); }
try!(fmt.buf.write(unescape_entity(iter.by_ref()).as_slice()));
}
}
(where unescape_entity()
is changed to just return MaybeOwned
).
Edit: Changed the call to unescape_entity()
to take iter.by_ref()
. We can't move our iterator into the function without losing it. The alternative is to have unescape_entity()
take its iterator by-reference, but I don't think that's necessary.
Edit 2: This code doesn't handle last
properly. It would need some other way of signifying that last
needs to be set on the next byte. Either that or the iterator should be wrapped in Peekable
. But I haven't explored this as I've already proposed another approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I noted below, I've now decided unescape_entity()
should just be given a &str
instead of an iterator, to avoid unnecessary allocation. We still need the manual loop so we can spin the iterator again. It would look something like
loop {
let (i,b) = match iter.next() {
None => break,
Some(x) => x
};
if b == '&' as u8 {
if last < i { try!(fmt.buf.write(pile_o_bits.slice(last, i))); }
let start = i;
let end = match iter.find(|&(i,b)| b == ';' as u8) {
None => break, // Entity has no close
Some((i,_)) => i
};
try!(fmt.buf.write(unescape_entity(pile_o_bits.slice(start+1, end))));
last = end + 1;
}
}
Where unescape_entity()
is now
fn unescape_entity(s: &str) -> MaybeOwned
and the string argument does not contain the leading &
or trailing ;
.
@seanmonstar Eh, I withdraw my objection about calling this |
} | ||
} | ||
|
||
fn unescape_entity<I: Iterator<u8>>(iter: I) -> (MaybeOwned, uint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per my suggestion above, this should just return MaybeOwned
. The caller will give it the same iterator it's using and therefore the bytes consumed here will be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon thinking about this some more, perhaps unescape_entity()
should actually just take a &str
, containing the full entity. Let the caller figure out where the entity ends. This way you can avoid allocating a StrBuf
internally and instead just slice the source string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let the caller figure out? so, Unescape::fmt
? That'd put a lot more information about entities inside that fmt
method...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already knows that &
starts an entity, it just has to learn that ;
ends the entity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are entities that don't end in ;
, but I'll still try this way out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No there aren't. Admittedly, I'm a bit confused at the table, http://www.w3.org/TR/html5/syntax.html#named-character-references does in fact list entity names with and without semicolons. However, it is a parse error to not have a ;
after a named character reference.
That said, from reading http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references, it turns out there's cases where the ampersand does not actually start a named character reference. Notably, there's a few characters that may follow the &
that indicate it's not a character reference. And there's also some weird backwards-compatibility thing about parsing inside attributes, but I don't know of any html entity encoder/decoder routine that handles that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've provided some sample code above for how this modified unescape_entity()
should be called. With that approach, this function is given the string containing just the entity name, without the &
or ;
, so it will be something like "apos"
or "#39"
.
I supposed it could also make sense as |
} | ||
} | ||
} | ||
return (Owned(from_byte(x)), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire function needs to be rewritten if you take my proposed approach, but I just wanted to note here that parsing of the number can be simplified using std::num::from_str_radix()
. Assuming we have an arg s: &str
that contains e.g. "#39"
it would look like
if s.starts_with("#") {
let s = s.slice_from(1);
let n: Option<u32> = if s.starts_with("x") {
// hex
num::from_str_radix(s.slice_from(1), 16)
} else {
num::from_str_radix(s, 10)
};
let n = match n {
None => return Slice(""), // elide it completely for a parse error, I guess
Some(n) => n
};
let c = match char::from_u32(n) {
None => return Slice("\uFFFD"), // required according to HTML5 spec
Some(c) => c
};
return str::from_char(c);
}
Also note that this code handles non-ASCII numeric character references, whereas yours didn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I found num::from_str_radix
a few minutes ago, been rewriting to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, apparently it doesn't handle numeric character references quite right. http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references has a table of values (covering 0
and most of 0x80..0x9F
) that are considered parse errors and are mapped to specific characters (I'm not quite sure why it's a parse error if it has a character you're supposed to return, I guess to give the user agent the opportunity of aborting if it wants to). So that should actually be handled.
I've changed my mind about Given that, and given the desire to avoid allocation, I would suggest that Alternatively, it could take the Iterator again, but use a fixed-size stack buffer that is large enough to contain the longest valid named character reference. The iterator approach is rather elegant, except that in the parse error case it needs to avoid consuming characters, which is awkward (We can kind of do it with |
After considering this some more, I believe I know why the For our purposes, I think this can be simplified by removing all the duplicates from the |
//! ```rust | ||
//! use html; | ||
//! | ||
//! fn main() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these two examples could be merged to be like
use html;
let original = "<p>Dr. Jekyll & Mr. Hyde</p>";
let escaped = html::escape(original)
assert_eq!(escaped.as_slice(), "<p>Dr. Jekyll & Mr. Hyde<p>");
let unescaped = html::unescape(escaped);
assert_eq!(unescaped.as_slice(), original);
@kballard I believe I've addressed all your comments. No longer allocating improved the unescape bench from ~3000ns to ~1300ns. \o/ One of your comments: |
(travis appears to have a tidy error) |
@@ -86,6 +86,7 @@ DEPS_workcache := std serialize collections log | |||
DEPS_log := std sync | |||
DEPS_regex := std collections | |||
DEPS_regex_macros = syntax std regex | |||
DEPS_html := std collections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you may be able to update this dependency (the collections dependency is gone)
@seanmonstar Ah hah, after reading more of the parsing docs, if the "consume character reference" step returns nothing, the parser emits a |
try!(fmt.buf.write(pile_o_bits.slice(last, i).as_bytes())); | ||
} | ||
let (esc, n) = unescape_entity(s.slice_from(i + 1)); | ||
try!(fmt.buf.write_str(esc.as_slice())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my most recent comment, if esc.is_empty()
then this should write a single &
character.
|
use entity; | ||
|
||
/// Wrapper struct which will emit the HTML-escaped version of the contained | ||
/// string when passed to a format string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc comment suggests this wraps a string. It should clarify that it wraps any Show
type.
Your parser still doesn't handle the |
Oh dang, I didn't notice that. So, with |
Escape will escape the 5 chars <, >, &, ', and ". Unescape will unescape a lot more characters. - fn html::escape(&str) -> ~str - fn html::unescape(&str) -> ~str - struct html::fmt::Escape(&str) - struct html::fmt::Unescape(&str) References: http://golang.org/pkg/html/ https://docs.python.org/3/library/html.html
got the |
I added an |
@seanmonstar As an experiment, I just constructed a basic trie from the entity list. Each node had to have an array of 62 children, to allow for [a-zA-Z0-9]. The resulting trie ended up taking roughly 3.9MB of memory. For comparison, the This is rather unfortunate. A basic trie structure would be really easy to work with, because you could trivially tell when you've found a character that doesn't fit. A radix trie (which is what So I guess the static sorted array is the best solution. Perhaps the code could keep a "cursor" into the array, pointing to the first element prefixed by the entity so far, and each new character would cause the "cursor" to walk forward until it finds a prefix-matching value, or hits an element that's sorted too high. |
It occurs to me that unescaping might actually makes more sense as a |
Along the reader/writer vein, I was thinking it'd be nice to add a
|
I'm not familiar with nodejs Transform streams, but from reading the API documentation, it seems that they are objects that are both |
assert_eq!(unescape(UnTest("&am", "p;")), "&".to_owned()); | ||
assert_eq!(unescape("&fakentity"), "&fakentity".to_owned()); | ||
assert_eq!(unescape("ætest"), "ætest".to_owned()); | ||
assert_eq!(unescape("�abc"), "abc".to_owned()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is wrong. The result should be "\uFFFDabc"
.
I ended up rewriting your code entirely, because of the difficulty I've had in expressing some things about the HTML parsing. The rewritten version is in my repo. It passes all the existing tests, but we really do need to throw a whole bunch more tests at it to make sure it behaves properly. I haven't tried running How do you want to proceed? You could push my code to your PR, or I could open a new PR and let others decide which implementation they prefer ;). And I apologize for effectively hijacking your project. [1] It turns out the allowed additional character doesn't actually change the decoding behavior, because we don't error out on parse errors, but it is important if we ever need to add the ability to track parse errors. |
Awesome. I don't care how we proceed. Is it just easiest if I close this and you open a new one? I don't care about any hijacking. I worked on this because: 1) I'm liking Rust, and felt it could use an |
@seanmonstar Glad to hear it! I'll go ahead and submit as a separate PR and you can close this one. At the very least, that will get rid of all the clutter of old review comments :) |
Also, thanks for all the reviewing you've done. \o/ |
…t, r=Veykril fix(completion): remove bound insert of type in trait Fixed rust-lang/rust-analyzer#13819
Escape will escape the 5 chars <, >, &, ', and ".
Unescape will unescape a lot more characters.
References:
http://golang.org/pkg/html/
https://docs.python.org/3/library/html.html
The wiki claims this is desired.