Skip to content

add libhtml #13831

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

add libhtml #13831

wants to merge 1 commit into from

Conversation

seanmonstar
Copy link
Contributor

Escape will escape the 5 chars <, >, &, ', and ".
Unescape will unescape a lot more characters.

  • fn html::escape(&str) -> ~str
  • fn html::unescape(&str) -> ~str
  • struct html::fmt::Escape(&str)
  • struct html::fmt::Unescape(&str)

References:
http://golang.org/pkg/html/
https://docs.python.org/3/library/html.html

The wiki claims this is desired.

@lilyball
Copy link
Contributor

I would expect a library named libhtml to cover a lot more than just escaping/unescaping of special characters. Something that simple seems more appropriate as a module std::html, or maybe even serialize::html.

@@ -0,0 +1,145 @@
// Copyright 2013-2014 The Rust Project Developers. See the COPYRIGHT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's new code. It should just be Copyright 2014.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I copied this header from src/librustdoc/html/escape.rs. The Escape code was lifted directly from that file. So I figured it was 2013-2014.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it was? Ok then, I guess that's fine.

@seanmonstar
Copy link
Contributor Author

@kballard i don't see a problem with the small scope of the library. There is much smaller libraries: libflate, libfourcc, etc. Plus, the reference libraries (Go, Python) only both with escaping in their html libs.

Other html related things could go here, as well.



try!(write!(fmt.buf, "{}", esc));
last = i + n;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely to behave unexpectedly if you have malformed entity that looks something like &a&b;. Where "behave unexpectedly" will likely manifest as a failure on pile.o_bits.slice(last, i) where last > i.

Instead, you should ditch the for loop and spin the iterator manually. You can then just pass the iterator directly to unescape_entity() instead of creating a new one, and it will consume everything up to (and including) the trailing ;.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would look something like

loop {
    let (i,b) = match iter.next() {
        None => break,
        Some(x) => x
    };
    if b == '&' as u8 {
        if last < i { try!(fmt.buf.write(pile_o_bits.slice(last, i).as_bytes())); }
        try!(fmt.buf.write(unescape_entity(iter.by_ref()).as_slice()));
    }
}

(where unescape_entity() is changed to just return MaybeOwned).


Edit: Changed the call to unescape_entity() to take iter.by_ref(). We can't move our iterator into the function without losing it. The alternative is to have unescape_entity() take its iterator by-reference, but I don't think that's necessary.

Edit 2: This code doesn't handle last properly. It would need some other way of signifying that last needs to be set on the next byte. Either that or the iterator should be wrapped in Peekable. But I haven't explored this as I've already proposed another approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted below, I've now decided unescape_entity() should just be given a &str instead of an iterator, to avoid unnecessary allocation. We still need the manual loop so we can spin the iterator again. It would look something like

loop {
    let (i,b) = match iter.next() {
        None => break,
        Some(x) => x
    };
    if b == '&' as u8 {
        if last < i { try!(fmt.buf.write(pile_o_bits.slice(last, i))); }
        let start = i;
        let end = match iter.find(|&(i,b)| b == ';' as u8) {
            None => break, // Entity has no close
            Some((i,_)) => i
        };
        try!(fmt.buf.write(unescape_entity(pile_o_bits.slice(start+1, end))));
        last = end + 1;
    }
}

Where unescape_entity() is now

fn unescape_entity(s: &str) -> MaybeOwned

and the string argument does not contain the leading & or trailing ;.

@lilyball
Copy link
Contributor

@seanmonstar Eh, I withdraw my objection about calling this libhtml. I still think it's slightly misleading to have an entire library named libhtml that turns out to contain nothing more than just escaping. And I worry a bit about whether this API will be compatible with the desired API of a future library that supports full HTML parsing/serialization. But as long as this library isn't marked #[stable] I suppose that's acceptable.

}
}

fn unescape_entity<I: Iterator<u8>>(iter: I) -> (MaybeOwned, uint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my suggestion above, this should just return MaybeOwned. The caller will give it the same iterator it's using and therefore the bytes consumed here will be sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon thinking about this some more, perhaps unescape_entity() should actually just take a &str, containing the full entity. Let the caller figure out where the entity ends. This way you can avoid allocating a StrBuf internally and instead just slice the source string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let the caller figure out? so, Unescape::fmt? That'd put a lot more information about entities inside that fmt method...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already knows that & starts an entity, it just has to learn that ; ends the entity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are entities that don't end in ;, but I'll still try this way out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there aren't. Admittedly, I'm a bit confused at the table, http://www.w3.org/TR/html5/syntax.html#named-character-references does in fact list entity names with and without semicolons. However, it is a parse error to not have a ; after a named character reference.

That said, from reading http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references, it turns out there's cases where the ampersand does not actually start a named character reference. Notably, there's a few characters that may follow the & that indicate it's not a character reference. And there's also some weird backwards-compatibility thing about parsing inside attributes, but I don't know of any html entity encoder/decoder routine that handles that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've provided some sample code above for how this modified unescape_entity() should be called. With that approach, this function is given the string containing just the entity name, without the & or ;, so it will be something like "apos" or "#39".

@seanmonstar
Copy link
Contributor Author

I supposed it could also make sense as serialize::html, but there's precedent for this being libhtml. Shrug.

}
}
}
return (Owned(from_byte(x)), len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire function needs to be rewritten if you take my proposed approach, but I just wanted to note here that parsing of the number can be simplified using std::num::from_str_radix(). Assuming we have an arg s: &str that contains e.g. "#39" it would look like

if s.starts_with("#") {
    let s = s.slice_from(1);
    let n: Option<u32> = if s.starts_with("x") {
        // hex
        num::from_str_radix(s.slice_from(1), 16)
    } else {
        num::from_str_radix(s, 10)
    };
    let n = match n {
        None => return Slice(""), // elide it completely for a parse error, I guess
        Some(n) => n
    };
    let c = match char::from_u32(n) {
        None => return Slice("\uFFFD"), // required according to HTML5 spec
        Some(c) => c
    };
    return str::from_char(c);
}

Also note that this code handles non-ASCII numeric character references, whereas yours didn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I found num::from_str_radix a few minutes ago, been rewriting to use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, apparently it doesn't handle numeric character references quite right. http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references has a table of values (covering 0 and most of 0x80..0x9F) that are considered parse errors and are mapped to specific characters (I'm not quite sure why it's a parse error if it has a character you're supposed to return, I guess to give the user agent the opportunity of aborting if it wants to). So that should actually be handled.

@lilyball
Copy link
Contributor

I've changed my mind about unescape_entity() again. For proper HTML5 parsing, invalid entities have well-defined rules for how many characters they should consume before the parse error. Since aborting is not really an option when we encounter a parse error, we should try to match its rules and only consume the correct number of characters. Basically, this means that &#1abc is a parse error but should only consume the &#1 for the entity. I'm not quite sure what the right parse error behavior is here (perhaps pretend there was a semicolon?). Similarly &#abc very explicitly should only consume the &, return nothing, and the unescaped string will end up as "#abc".

Given that, and given the desire to avoid allocation, I would suggest that unescape_entity() still receive a s: &str arg, but this arg should be a slice containing the entire rest of the input after the & (not including the &). It would then once again return how many characters it actually consumed.

Alternatively, it could take the Iterator again, but use a fixed-size stack buffer that is large enough to contain the longest valid named character reference. The iterator approach is rather elegant, except that in the parse error case it needs to avoid consuming characters, which is awkward (We can kind of do it with RandomAccessIterator but it ruins a lot of the elegance). Given that I think the approach of taking a &str containing the rest of the string makes sense.

@lilyball
Copy link
Contributor

After considering this some more, I believe I know why the ENTITIES table has trailing semicolons on most values, but also contains some values that are duplicates of existing ones but without the trailing semicolon. This has to do with parse error behavior. Specifically, if an identifier without the semicolon is matched, this is a parse error, but the proper recovery behavior is to produce the corresponding entity value anyway. And by that I mean "&aeligtest" is a parse error but should evaluate to "ætest" anyway. However "&acytest", while also being a parse error, would evaluate to "acytest".

For our purposes, I think this can be simplified by removing all the duplicates from the ENTITIES table and instead adding a boolean flag indicating whether the parse error behavior should still consider this entity to be a match without the semicolon.

//! ```rust
//! use html;
//!
//! fn main() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these two examples could be merged to be like

use html;
let original = "<p>Dr. Jekyll & Mr. Hyde</p>";

let escaped = html::escape(original)
assert_eq!(escaped.as_slice(), "&lt;p&gt;Dr. Jekyll &amp; Mr. Hyde&lt;p&gt;");

let unescaped = html::unescape(escaped);
assert_eq!(unescaped.as_slice(), original);

@seanmonstar
Copy link
Contributor Author

@kballard I believe I've addressed all your comments. No longer allocating improved the unescape bench from ~3000ns to ~1300ns. \o/

One of your comments: &#abc => #abc... Other implementations, Firefox and Go, both return unconsume the & as well, leaving the string as is.

@alexcrichton
Copy link
Member

(travis appears to have a tidy error)

@@ -86,6 +86,7 @@ DEPS_workcache := std serialize collections log
DEPS_log := std sync
DEPS_regex := std collections
DEPS_regex_macros = syntax std regex
DEPS_html := std collections
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you may be able to update this dependency (the collections dependency is gone)

@lilyball
Copy link
Contributor

@seanmonstar Ah hah, after reading more of the parsing docs, if the "consume character reference" step returns nothing, the parser emits a & character.

try!(fmt.buf.write(pile_o_bits.slice(last, i).as_bytes()));
}
let (esc, n) = unescape_entity(s.slice_from(i + 1));
try!(fmt.buf.write_str(esc.as_slice()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my most recent comment, if esc.is_empty() then this should write a single & character.

@lilyball
Copy link
Contributor

html::entity::lookup() docstring needs to explain that the input should be the entity name without the & or ;, e.g. "amp".

use entity;

/// Wrapper struct which will emit the HTML-escaped version of the contained
/// string when passed to a format string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc comment suggests this wraps a string. It should clarify that it wraps any Show type.

@lilyball
Copy link
Contributor

Your parser still doesn't handle the &aeligtest case correctly. In fact, handling that correctly is probably going to require using a Trie to hold the entities instead of the current flat list, as you're supposed to error out as soon as you find a character that doesn't match one of the entities in the list (and, going back to &aeligtest, some of entities behave differently than others if you find them without a trailing semicolon). Your current behavior of erroring out if you find too many characters is an approximation of this, but if I write!(unescapeWriter, "&aeligtest") I would expect it to have written "ætest" immediately instead of requiring a flush().

@seanmonstar
Copy link
Contributor Author

Oh dang, I didn't notice that. So, with &aeligtest, it would need to lookup(s.slice_to(s.len() -1)) and then lookup(s.slice_to(s.len() -2) and so on...

Escape will escape the 5 chars <, >, &, ', and ".
Unescape will unescape a lot more characters.

- fn html::escape(&str) -> ~str
- fn html::unescape(&str) -> ~str
- struct html::fmt::Escape(&str)
- struct html::fmt::Unescape(&str)

References:
http://golang.org/pkg/html/
https://docs.python.org/3/library/html.html
@seanmonstar
Copy link
Contributor Author

got the &aeligtest working. right now, it's rewinding the buf until it finds a match. I've added some benches flexing the longest rewinding. It's not stellar. Looked into a Trie, but noticed that collections::trie is only for uints...

@seanmonstar
Copy link
Contributor Author

I added an allow(unused_mut) to remove the warnings with a FIXME to remove it. The tests don't pass though, as it's wrongly matching a to an x, thanks to #13867

@lilyball
Copy link
Contributor

lilyball commented May 2, 2014

@seanmonstar As an experiment, I just constructed a basic trie from the entity list. Each node had to have an array of 62 children, to allow for [a-zA-Z0-9]. The resulting trie ended up taking roughly 3.9MB of memory. For comparison, the &[(&str, &str, bool)] array structure takes roughly 110KB of memory (that bool is the record of whether the semicolon is optional, which I believe you still aren't representing in your code).

This is rather unfortunate. A basic trie structure would be really easy to work with, because you could trivially tell when you've found a character that doesn't fit. A radix trie (which is what TrieMap is, although that's unusable anyway because of uint keys) would be a lot smaller but would not be usable in the way a basic trie is.

So I guess the static sorted array is the best solution. Perhaps the code could keep a "cursor" into the array, pointing to the first element prefixed by the entity so far, and each new character would cause the "cursor" to walk forward until it finds a prefix-matching value, or hits an element that's sorted too high.

@lilyball
Copy link
Contributor

lilyball commented May 2, 2014

It occurs to me that unescaping might actually makes more sense as a Reader than a Writer. The implementation would be more complex, though, which is unfortunate.

@seanmonstar
Copy link
Contributor Author

Along the reader/writer vein, I was thinking it'd be nice to add a
Transform to std io, sort of like nodejs Transform streams. Implement a
transform method, and it can wrap a reader or writer.
On May 1, 2014 8:18 PM, "Kevin Ballard" notifications@github.com wrote:

It occurs to me that unescaping might actually makes more sense as a
Reader than a Writer. The implementation would be more complex, though,
which is unfortunate.


Reply to this email directly or view it on GitHubhttps://github.com//pull/13831#issuecomment-41981847
.

@lilyball
Copy link
Contributor

lilyball commented May 2, 2014

I'm not familiar with nodejs Transform streams, but from reading the API documentation, it seems that they are objects that are both Readers and Writers`, and that data written to them is then transformed and available for reading. That's not quite the same sort of thing.

assert_eq!(unescape(UnTest("&am", "p;")), "&".to_owned());
assert_eq!(unescape("&fakentity"), "&fakentity".to_owned());
assert_eq!(unescape("&aeligtest"), "ætest".to_owned());
assert_eq!(unescape("&#0abc"), "abc".to_owned());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is wrong. The result should be "\uFFFDabc".

@lilyball
Copy link
Contributor

lilyball commented May 2, 2014

I ended up rewriting your code entirely, because of the difficulty I've had in expressing some things about the HTML parsing. The rewritten version is in my repo. It passes all the existing tests, but we really do need to throw a whole bunch more tests at it to make sure it behaves properly. I haven't tried running make check yet, I only verified that libhtml compiles on its own. I also added in support for various features such as escaping all non-ASCII chars, or providing an allowed additional character to decoding[1]. These features have zero tests at the moment.

How do you want to proceed? You could push my code to your PR, or I could open a new PR and let others decide which implementation they prefer ;). And I apologize for effectively hijacking your project.

[1] It turns out the allowed additional character doesn't actually change the decoding behavior, because we don't error out on parse errors, but it is important if we ever need to add the ability to track parse errors.

@seanmonstar
Copy link
Contributor Author

Awesome. I don't care how we proceed. Is it just easiest if I close this and you open a new one?

I don't care about any hijacking. I worked on this because: 1) I'm liking Rust, and felt it could use an html library, and 2) it seemed small enough to tackle, while still letting me learn more about how to write Rust. Both objectives are happening, I feel.

@lilyball
Copy link
Contributor

lilyball commented May 2, 2014

@seanmonstar Glad to hear it! I'll go ahead and submit as a separate PR and you can close this one. At the very least, that will get rid of all the clutter of old review comments :)

@lilyball lilyball mentioned this pull request May 2, 2014
@seanmonstar
Copy link
Contributor Author

Also, thanks for all the reviewing you've done. \o/

@seanmonstar seanmonstar closed this May 2, 2014
arcnmx pushed a commit to arcnmx/rust that referenced this pull request Jan 9, 2023
…t, r=Veykril

fix(completion): remove bound insert of type in trait

Fixed rust-lang/rust-analyzer#13819
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants