add libhtml #13831

seanmonstar · 2014-04-28T23:14:01Z

Escape will escape the 5 chars <, >, &, ', and ".
Unescape will unescape a lot more characters.

fn html::escape(&str) -> ~str
fn html::unescape(&str) -> ~str
struct html::fmt::Escape(&str)
struct html::fmt::Unescape(&str)

References:
http://golang.org/pkg/html/
https://docs.python.org/3/library/html.html

The wiki claims this is desired.

lilyball · 2014-04-28T23:19:36Z

I would expect a library named libhtml to cover a lot more than just escaping/unescaping of special characters. Something that simple seems more appropriate as a module std::html, or maybe even serialize::html.

lilyball · 2014-04-28T23:20:10Z

src/libhtml/fmt.rs

@@ -0,0 +1,145 @@
+// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT


ah, I copied this header from src/librustdoc/html/escape.rs. The Escape code was lifted directly from that file. So I figured it was 2013-2014.

Oh it was? Ok then, I guess that's fine.

seanmonstar · 2014-04-28T23:44:54Z

@kballard i don't see a problem with the small scope of the library. There is much smaller libraries: libflate, libfourcc, etc. Plus, the reference libraries (Go, Python) only both with escaping in their html libs.

Other html related things could go here, as well.

lilyball · 2014-04-28T23:46:46Z

src/libhtml/fmt.rs

+
+
+                    try!(write!(fmt.buf, "{}", esc));
+                    last = i + n;


This is likely to behave unexpectedly if you have malformed entity that looks something like &a&b;. Where "behave unexpectedly" will likely manifest as a failure on pile.o_bits.slice(last, i) where last > i.

Instead, you should ditch the for loop and spin the iterator manually. You can then just pass the iterator directly to unescape_entity() instead of creating a new one, and it will consume everything up to (and including) the trailing ;.

This would look something like

loop { let (i,b) = match iter.next() { None => break, Some(x) => x }; if b == '&' as u8 { if last < i { try!(fmt.buf.write(pile_o_bits.slice(last, i).as_bytes())); } try!(fmt.buf.write(unescape_entity(iter.by_ref()).as_slice())); } }

(where unescape_entity() is changed to just return MaybeOwned).

Edit: Changed the call to unescape_entity() to take iter.by_ref(). We can't move our iterator into the function without losing it. The alternative is to have unescape_entity() take its iterator by-reference, but I don't think that's necessary.

Edit 2: This code doesn't handle last properly. It would need some other way of signifying that last needs to be set on the next byte. Either that or the iterator should be wrapped in Peekable. But I haven't explored this as I've already proposed another approach.

As I noted below, I've now decided unescape_entity() should just be given a &str instead of an iterator, to avoid unnecessary allocation. We still need the manual loop so we can spin the iterator again. It would look something like

loop { let (i,b) = match iter.next() { None => break, Some(x) => x }; if b == '&' as u8 { if last break, // Entity has no close Some((i,_)) => i }; try!(fmt.buf.write(unescape_entity(pile_o_bits.slice(start+1, end)))); last = end + 1; } }

Where unescape_entity() is now

fn unescape_entity(s: &str) -> MaybeOwned

and the string argument does not contain the leading & or trailing ;.

lilyball · 2014-04-28T23:56:01Z

@seanmonstar Eh, I withdraw my objection about calling this libhtml. I still think it's slightly misleading to have an entire library named libhtml that turns out to contain nothing more than just escaping. And I worry a bit about whether this API will be compatible with the desired API of a future library that supports full HTML parsing/serialization. But as long as this library isn't marked #[stable] I suppose that's acceptable.

lilyball · 2014-04-28T23:57:42Z

src/libhtml/fmt.rs

+    }
+}
+
+fn unescape_entity<I: Iterator<u8>>(iter: I) -> (MaybeOwned, uint) {


As per my suggestion above, this should just return MaybeOwned. The caller will give it the same iterator it's using and therefore the bytes consumed here will be sufficient.

Upon thinking about this some more, perhaps unescape_entity() should actually just take a &str, containing the full entity. Let the caller figure out where the entity ends. This way you can avoid allocating a StrBuf internally and instead just slice the source string.

let the caller figure out? so, Unescape::fmt? That'd put a lot more information about entities inside that fmt method...

It already knows that & starts an entity, it just has to learn that ; ends the entity.

There are entities that don't end in ;, but I'll still try this way out.

No there aren't. Admittedly, I'm a bit confused at the table, http://www.w3.org/TR/html5/syntax.html#named-character-references does in fact list entity names with and without semicolons. However, it is a parse error to not have a ; after a named character reference.

That said, from reading http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references, it turns out there's cases where the ampersand does not actually start a named character reference. Notably, there's a few characters that may follow the & that indicate it's not a character reference. And there's also some weird backwards-compatibility thing about parsing inside attributes, but I don't know of any html entity encoder/decoder routine that handles that.

I've provided some sample code above for how this modified unescape_entity() should be called. With that approach, this function is given the string containing just the entity name, without the & or ;, so it will be something like "apos" or "#39".

seanmonstar · 2014-04-29T00:13:35Z

I supposed it could also make sense as serialize::html, but there's precedent for this being libhtml. Shrug.

lilyball · 2014-04-29T00:34:25Z

src/libhtml/fmt.rs

+                }
+            }
+        }
+        return (Owned(from_byte(x)), len);


This entire function needs to be rewritten if you take my proposed approach, but I just wanted to note here that parsing of the number can be simplified using std::num::from_str_radix(). Assuming we have an arg s: &str that contains e.g. "#39" it would look like

if s.starts_with("#") { let s = s.slice_from(1); let n: Option<u32> = if s.starts_with("x") { // hex num::from_str_radix(s.slice_from(1), 16) } else { num::from_str_radix(s, 10) }; let n = match n { None => return Slice(""), // elide it completely for a parse error, I guess Some(n) => n }; let c = match char::from_u32(n) { None => return Slice("\uFFFD"), // required according to HTML5 spec Some(c) => c }; return str::from_char(c); }

Also note that this code handles non-ASCII numeric character references, whereas yours didn't.

Yea, I found num::from_str_radix a few minutes ago, been rewriting to use it.

Actually, apparently it doesn't handle numeric character references quite right. http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references has a table of values (covering 0 and most of 0x80..0x9F) that are considered parse errors and are mapped to specific characters (I'm not quite sure why it's a parse error if it has a character you're supposed to return, I guess to give the user agent the opportunity of aborting if it wants to). So that should actually be handled.

lilyball · 2014-04-29T00:48:19Z

I've changed my mind about unescape_entity() again. For proper HTML5 parsing, invalid entities have well-defined rules for how many characters they should consume before the parse error. Since aborting is not really an option when we encounter a parse error, we should try to match its rules and only consume the correct number of characters. Basically, this means that &#1abc is a parse error but should only consume the &#1 for the entity. I'm not quite sure what the right parse error behavior is here (perhaps pretend there was a semicolon?). Similarly &#abc very explicitly should only consume the &, return nothing, and the unescaped string will end up as "#abc".

Given that, and given the desire to avoid allocation, I would suggest that unescape_entity() still receive a s: &str arg, but this arg should be a slice containing the entire rest of the input after the & (not including the &). It would then once again return how many characters it actually consumed.

Alternatively, it could take the Iterator again, but use a fixed-size stack buffer that is large enough to contain the longest valid named character reference. The iterator approach is rather elegant, except that in the parse error case it needs to avoid consuming characters, which is awkward (We can kind of do it with RandomAccessIterator but it ruins a lot of the elegance). Given that I think the approach of taking a &str containing the rest of the string makes sense.

lilyball · 2014-04-29T00:56:32Z

After considering this some more, I believe I know why the ENTITIES table has trailing semicolons on most values, but also contains some values that are duplicates of existing ones but without the trailing semicolon. This has to do with parse error behavior. Specifically, if an identifier without the semicolon is matched, this is a parse error, but the proper recovery behavior is to produce the corresponding entity value anyway. And by that I mean "&aeligtest" is a parse error but should evaluate to "ætest" anyway. However "&acytest", while also being a parse error, would evaluate to "acytest".

For our purposes, I think this can be simplified by removing all the duplicates from the ENTITIES table and instead adding a boolean flag indicating whether the parse error behavior should still consider this entity to be a match without the semicolon.

huonw · 2014-04-29T00:59:53Z

src/libhtml/lib.rs

+//! ```rust
+//! use html;
+//!
+//! fn main() {


Maybe these two examples could be merged to be like

use html; let original = "Dr. Jekyll & Mr. Hyde"; let escaped = html::escape(original) assert_eq!(escaped.as_slice(), "Dr. Jekyll & Mr. Hyde"); let unescaped = html::unescape(escaped); assert_eq!(unescaped.as_slice(), original);

seanmonstar · 2014-04-29T03:35:02Z

@kballard I believe I've addressed all your comments. No longer allocating improved the unescape bench from ~3000ns to ~1300ns. \o/

One of your comments: &#abc => #abc... Other implementations, Firefox and Go, both return unconsume the & as well, leaving the string as is.

alexcrichton · 2014-04-29T04:07:42Z

(travis appears to have a tidy error)

alexcrichton · 2014-04-29T04:08:58Z

mk/crates.mk

@@ -86,6 +86,7 @@ DEPS_workcache := std serialize collections log
 DEPS_log := std sync
 DEPS_regex := std collections
 DEPS_regex_macros = syntax std regex
+DEPS_html := std collections


It looks like you may be able to update this dependency (the collections dependency is gone)

lilyball · 2014-04-29T04:55:47Z

@seanmonstar Ah hah, after reading more of the parsing docs, if the "consume character reference" step returns nothing, the parser emits a & character.

lilyball · 2014-04-29T04:56:47Z

src/libhtml/fmt.rs

+                    try!(fmt.buf.write(pile_o_bits.slice(last, i).as_bytes()));
+                }
+                let (esc, n) = unescape_entity(s.slice_from(i + 1));
+                try!(fmt.buf.write_str(esc.as_slice()));


Based on my most recent comment, if esc.is_empty() then this should write a single & character.

lilyball · 2014-04-30T03:12:52Z

html::entity::lookup() docstring needs to explain that the input should be the entity name without the & or ;, e.g. "amp".

lilyball · 2014-04-30T03:13:50Z

src/libhtml/fmt.rs

+use entity;
+
+/// Wrapper struct which will emit the HTML-escaped version of the contained
+/// string when passed to a format string.


Doc comment suggests this wraps a string. It should clarify that it wraps any Show type.

lilyball · 2014-04-30T04:20:35Z

Your parser still doesn't handle the &aeligtest case correctly. In fact, handling that correctly is probably going to require using a Trie to hold the entities instead of the current flat list, as you're supposed to error out as soon as you find a character that doesn't match one of the entities in the list (and, going back to &aeligtest, some of entities behave differently than others if you find them without a trailing semicolon). Your current behavior of erroring out if you find too many characters is an approximation of this, but if I write!(unescapeWriter, "&aeligtest") I would expect it to have written "ætest" immediately instead of requiring a flush().

seanmonstar · 2014-04-30T06:31:05Z

Oh dang, I didn't notice that. So, with &aeligtest, it would need to lookup(s.slice_to(s.len() -1)) and then lookup(s.slice_to(s.len() -2) and so on...

Escape will escape the 5 chars <, >, &, ', and ". Unescape will unescape a lot more characters. - fn html::escape(&str) -> ~str - fn html::unescape(&str) -> ~str - struct html::fmt::Escape(&str) - struct html::fmt::Unescape(&str) References: http://golang.org/pkg/html/ https://docs.python.org/3/library/html.html

seanmonstar · 2014-05-01T23:09:47Z

got the &aeligtest working. right now, it's rewinding the buf until it finds a match. I've added some benches flexing the longest rewinding. It's not stellar. Looked into a Trie, but noticed that collections::trie is only for uints...

seanmonstar · 2014-05-01T23:12:32Z

I added an allow(unused_mut) to remove the warnings with a FIXME to remove it. The tests don't pass though, as it's wrongly matching a to an x, thanks to #13867

lilyball · 2014-05-02T02:34:35Z

@seanmonstar As an experiment, I just constructed a basic trie from the entity list. Each node had to have an array of 62 children, to allow for [a-zA-Z0-9]. The resulting trie ended up taking roughly 3.9MB of memory. For comparison, the &[(&str, &str, bool)] array structure takes roughly 110KB of memory (that bool is the record of whether the semicolon is optional, which I believe you still aren't representing in your code).

This is rather unfortunate. A basic trie structure would be really easy to work with, because you could trivially tell when you've found a character that doesn't fit. A radix trie (which is what TrieMap is, although that's unusable anyway because of uint keys) would be a lot smaller but would not be usable in the way a basic trie is.

So I guess the static sorted array is the best solution. Perhaps the code could keep a "cursor" into the array, pointing to the first element prefixed by the entity so far, and each new character would cause the "cursor" to walk forward until it finds a prefix-matching value, or hits an element that's sorted too high.

lilyball · 2014-05-02T03:18:28Z

It occurs to me that unescaping might actually makes more sense as a Reader than a Writer. The implementation would be more complex, though, which is unfortunate.

seanmonstar · 2014-05-02T03:34:58Z

Along the reader/writer vein, I was thinking it'd be nice to add a
Transform to std io, sort of like nodejs Transform streams. Implement a
transform method, and it can wrap a reader or writer.
On May 1, 2014 8:18 PM, "Kevin Ballard" notifications@github.com wrote:

It occurs to me that unescaping might actually makes more sense as a
Reader than a Writer. The implementation would be more complex, though,
which is unfortunate.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/13831#issuecomment-41981847
.

lilyball · 2014-05-02T03:52:27Z

I'm not familiar with nodejs Transform streams, but from reading the API documentation, it seems that they are objects that are both Readers and Writers`, and that data written to them is then transformed and available for reading. That's not quite the same sort of thing.

lilyball · 2014-05-02T06:48:34Z

src/libhtml/lib.rs

+        assert_eq!(unescape(UnTest("&am", "p;")), "&".to_owned());
+        assert_eq!(unescape("&fakentity"), "&fakentity".to_owned());
+        assert_eq!(unescape("&aeligtest"), "ætest".to_owned());
+        assert_eq!(unescape("&#0abc"), "abc".to_owned());


This test is wrong. The result should be "\uFFFDabc".

lilyball · 2014-05-02T07:26:29Z

I ended up rewriting your code entirely, because of the difficulty I've had in expressing some things about the HTML parsing. The rewritten version is in my repo. It passes all the existing tests, but we really do need to throw a whole bunch more tests at it to make sure it behaves properly. I haven't tried running make check yet, I only verified that libhtml compiles on its own. I also added in support for various features such as escaping all non-ASCII chars, or providing an allowed additional character to decoding[1]. These features have zero tests at the moment.

How do you want to proceed? You could push my code to your PR, or I could open a new PR and let others decide which implementation they prefer ;). And I apologize for effectively hijacking your project.

[1] It turns out the allowed additional character doesn't actually change the decoding behavior, because we don't error out on parse errors, but it is important if we ever need to add the ability to track parse errors.

seanmonstar · 2014-05-02T16:19:31Z

Awesome. I don't care how we proceed. Is it just easiest if I close this and you open a new one?

I don't care about any hijacking. I worked on this because: 1) I'm liking Rust, and felt it could use an html library, and 2) it seemed small enough to tackle, while still letting me learn more about how to write Rust. Both objectives are happening, I feel.

lilyball · 2014-05-02T16:48:32Z

@seanmonstar Glad to hear it! I'll go ahead and submit as a separate PR and you can close this one. At the very least, that will get rid of all the clutter of old review comments :)

seanmonstar · 2014-05-02T16:50:37Z

Also, thanks for all the reviewing you've done. \o/

…t, r=Veykril fix(completion): remove bound insert of type in trait Fixed rust-lang/rust-analyzer#13819

lilyball reviewed Apr 28, 2014
View reviewed changes

lilyball reviewed Apr 29, 2014
View reviewed changes

huonw reviewed Apr 29, 2014
View reviewed changes

alexcrichton reviewed Apr 29, 2014
View reviewed changes

lilyball reviewed Apr 29, 2014
View reviewed changes

lilyball reviewed Apr 30, 2014
View reviewed changes

This was referenced Apr 30, 2014

Overly aggressive mut lint in match patterns #13866

Closed

Wrong matching with enums and overlapping ranges #13867

Closed

lilyball reviewed May 2, 2014
View reviewed changes

lilyball mentioned this pull request May 2, 2014

Add libhtml #13896

Closed

seanmonstar closed this May 2, 2014

arcnmx pushed a commit to arcnmx/rust that referenced this pull request Jan 9, 2023

Auto merge of rust-lang#13831 - bvanjoi:fix_bound_type_in_trait_inser…

fb0db2a

…t, r=Veykril fix(completion): remove bound insert of type in trait Fixed rust-lang/rust-analyzer#13819

		@@ -0,0 +1,145 @@
		// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT

add libhtml #13831

add libhtml #13831

Uh oh!

Conversation

seanmonstar commented Apr 28, 2014

Uh oh!

lilyball commented Apr 28, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seanmonstar commented Apr 28, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lilyball commented Apr 28, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seanmonstar commented Apr 29, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lilyball commented Apr 29, 2014

Uh oh!

lilyball commented Apr 29, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seanmonstar commented Apr 29, 2014

Uh oh!

alexcrichton commented Apr 29, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lilyball commented Apr 29, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lilyball commented Apr 30, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lilyball commented Apr 30, 2014

Uh oh!

seanmonstar commented Apr 30, 2014

Uh oh!

seanmonstar commented May 1, 2014

Uh oh!

seanmonstar commented May 1, 2014

Uh oh!

lilyball commented May 2, 2014

Uh oh!

lilyball commented May 2, 2014

Uh oh!

seanmonstar commented May 2, 2014

Uh oh!