Match elements on qualified names, not just local names #286
There are many places where the HTML spec says to look for an HTML element (i.e., an element in the HTML namespace) with a given tag, but where we were only looking at tag names and ignoring the namespace name. Now we always use qualified names where required. This is implemented using a new GumboQualName type, which is just a bit field that combines a GumboNamespaceEnum with a GumboTag. A series of macros make it easy to construct and inspect these bit fields. It's currently represented by a uintptr_t. This is larger than necessary; only 9 bits are required. At first I attempted to use a `short` but the compiler didn't like that being used with varargs, so then I tried an `unsigned int` but the compiler didn't like casting that to a pointer, so here we are. I also tried `typedef enum { QNSIZE = SHORT_MAX } GumboQualName` to induce more compiler warnings when mistakenly passing a tag to a function that expects a qualified name. This worked nicely but generated warnings about missing cases in switch statements. I'm not sure what the best option is, though I am quite tempted by the extra warnings the enum provides.
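To make the description above concrete, here is a minimal sketch of packing a namespace and a tag into one integer. The enums are small stand-ins for GumboNamespaceEnum and GumboTag, the 8-bit shift follows the QUAL_NAME macro quoted in the compiler output later in this thread, and the unpacking macro names (QN_NAMESPACE, QN_TAG) are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in enums; the real GumboNamespaceEnum and GumboTag are larger. */
typedef enum { NS_HTML, NS_SVG, NS_MATHML } Namespace;
typedef enum { TAG_A, TAG_TITLE, TAG_TD, TAG_LAST } Tag;

/* An integer wide enough to survive a round trip through void*. */
typedef uintptr_t QualName;

/* Pack: tag in the low 8 bits, namespace in the bits above them. */
#define QUAL_NAME(ns, tag) \
  ((QualName)(((QualName)(ns) << 8) | (QualName)(tag)))

/* Unpack (hypothetical helper names, for illustration only). */
#define QN_NAMESPACE(qn) ((Namespace)((qn) >> 8))
#define QN_TAG(qn) ((Tag)((qn) & 0xFF))

/* True only when both the namespace and the tag match. */
static int qual_name_matches(Namespace ns, Tag tag, QualName qn) {
  return QUAL_NAME(ns, tag) == qn;
}
```

With this layout an SVG title and an HTML title produce different values, which is the distinction the patch is after.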
FYI: I am an outsider here so my opinion means nothing! ;-) But I wish you had not used uintptr_t for your GumboQualName type. It really is an unsigned int or int type and not a pointer of any sort. Anything under an int in size (such as shorts) would get automatically promoted to an int when used with varargs, so you might as well use an int.

The whole casting problem comes from this hack in parser.c:

// This is a bit of a hack to stack-allocate a one-element GumboVector name

where they are tricking a GumboVector into storing integers instead of void *, which is technically implementation-defined and not a good idea in general IMHO. But if you look at vector.c/.h you can see how easy it would be to create a variable-length, growing list or array of integers using this code as a basis. If we do that instead, we could use it as the basis of an integer "set type", always build it sorted, and use binary search to check for set inclusion to speed up the parser. We could also use a binary search on the tag name strings to get the enum. That would involve sorting the tag names in alphabetical order.

There are of course alternatives to variable-length integer sets. We could use bit arrays: look at X11's xtrapbits.h for a simple, pure C, header-only implementation of bit arrays to handle the 150 or so tags we have. Alternatively, we could pass around a pointer to a string of 150 bytes (each either a "0" or a "1") and use simple pointer arithmetic or array syntax to very quickly check whether a tag is in a particular set or not.

I kind of hoped that someone official would decide on which basic speedup approach makes the most sense before we move to encode namespace information, since it may change all of the macros and things you have set up depending on how they move forward. Or at least do everything at the same time to prevent two separate rounds of API (order of tags) and ABI changes. Anyway, thanks for your patch!
Something like this is needed - I just wish we could have some direction on better handling of sets of integers and testing for inclusion first or at least simultaneously. |
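As a rough illustration of the sorted integer set idea in the comment above, the following sketch keeps the members in a plain int array, sorts it once, and tests inclusion with the standard bsearch. The function names are made up for this example and are not part of Gumbo's vector.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way comparison for qsort/bsearch, written to avoid overflow. */
static int compare_ints(const void* a, const void* b) {
  int x = *(const int*)a;
  int y = *(const int*)b;
  return (x > y) - (x < y);
}

/* Sorts the array in place so it can be searched with int_set_contains. */
static void int_set_prepare(int* set, size_t count) {
  qsort(set, count, sizeof set[0], compare_ints);
}

/* Returns 1 if value is present in the sorted array set[0..count-1]. */
static int int_set_contains(const int* set, size_t count, int value) {
  return bsearch(&value, set, count, sizeof set[0], compare_ints) != NULL;
}
```

Each lookup costs O(log n) comparisons, which is the trade-off against the constant-time bit-array and byte-array alternatives mentioned in the same comment.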
I guess I'm that someone official :-) Sorry I haven't had a chance to review this patch yet; it's been a hectic week, but I've got a few hours now, so I can comment and hopefully get around to checking out the code. I agree that the tagset representation needs help. It actually wasn't a bottleneck last time I profiled (back when I was still at Google), but now that UTF-8 and entity decoding have both been sped up, it may become one. I also always thought it was a bit of an ugly hack, and had planned to replace it with a bitset implementation like xtrapbits.h when I had a chance. However, I like to have separate pull requests for new features, bug/correctness changes, and performance improvements. This lets us benchmark the performance improvements without conflating them with new features, it lets us get the bugfixes out to existing & new users ASAP, and it makes it easier to debug & test the new features. So I'd like to get this patch in first, as soon as I can verify that it's sane and won't introduce any additional correctness/stability issues, and then someone can worry about performance in a separate patch. It looks like all the implementation is hidden in the GumboQualName type and _QN macros, so it shouldn't be too hard to change afterwards. The main requirement when coming up with a new representation for tag sets is that the declaration of the tag set occur at the same point in the source code that the spec clause uses it. Gumbo has in general tried to make the source code mirror the spec to the greatest extent possible; this was a huge help for code review within Google, and it's continued to be a help as the spec has changed underneath the library. My inclination is to create a struct that's basically 3 arrays (one for each namespace, each about 2-3 words long, 1 bit for each GUMBO_TAG), and then use some preprocessor magic to construct the appropriate bitsets. 
I suspect that the compiler could optimize the calls to tag_in/tag_is out entirely then, becoming simple bitvector operations against stack-allocated temporaries. Benchmarking is necessary to know for sure. As for the initial questions about this PR: I think a uintptr_t is fine for now. I like the extra type-safety in general, but I don't want to introduce compiler warnings to others, and with a possible revamp of the tag set mechanism in the future, it's not worth worrying about now. |
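A minimal sketch of the "struct of three arrays, one bit per tag" idea described above, assuming three namespaces and roughly 150 tags. The type and function names are hypothetical, and the preprocessor layer that would build such a set at the call site is deliberately left out.

```c
#include <assert.h>
#include <limits.h>

enum { NS_HTML, NS_SVG, NS_MATHML, NS_COUNT };
enum { TAG_COUNT = 150 }; /* assumed tag count, per the discussion */

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define WORDS_PER_NS ((TAG_COUNT + WORD_BITS - 1) / WORD_BITS)

/* One bit array per namespace, one bit per tag. */
typedef struct {
  unsigned long bits[NS_COUNT][WORDS_PER_NS];
} TagBitset;

/* Marks (ns, tag) as a member of the set. */
static void tag_bitset_add(TagBitset* set, int ns, int tag) {
  set->bits[ns][tag / WORD_BITS] |= 1UL << (tag % WORD_BITS);
}

/* Returns 1 if (ns, tag) is a member of the set, 0 otherwise. */
static int tag_bitset_has(const TagBitset* set, int ns, int tag) {
  return (int)((set->bits[ns][tag / WORD_BITS] >> (tag % WORD_BITS)) & 1UL);
}
```

Whether a compiler actually reduces such calls to a few bit operations on stack temporaries would need the benchmarking mentioned above.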
Under Issues, I have added an alternative approach that uses binary search and keeps tag sets right in the code. The more I look at it, testing on namespace and tag enum was not the idea in the spec. The spec saw that there were only 5 tags in svg that overlap with html and none at all with mathml. So the tag name is enough. Please take a look at the commits mentioned in the very latest issue. It can provide the necessary speed, keeps the tag lists right in the code, and can easily be used to deal with the ns tag mismatch issues while changing almost nothing in the original design. Kevin |
Couple questions after looking through this:
|
Yes, it was clean except for the switch statement. (There's only one switch: in
I misspoke a bit earlier: the warning wasn't about a missing case, it was about the values in the case statement not being members of the enumerated type:

src/parser.c:586:10: warning: case value not in enumerated type 'GumboQualName' [-Wswitch]
    case HTML_QN(HTML):
         ^
src/parser.c:47:22: note: expanded from macro 'HTML_QN'
#define HTML_QN(tag) QUAL_NAME(GUMBO_NAMESPACE_HTML, GUMBO_TAG_ ## tag)
                     ^
src/parser.c:43:35: note: expanded from macro 'QUAL_NAME'
#define QUAL_NAME(namespace, tag) (GumboQualName)(((namespace) << 8) | (tag))
                                  ^

These warnings go away if you cast to
|
FWIW, I threw afl back at the branch (given it's what hit #278 originally) in case it found anything else with that mess fixed, and it didn't find anything that didn't exist in master (primarily #276). |
Hey! Sorry to interrupt in this already merged PR. I've been working with @aroben on the issue of qualified names.
I'm glad to see we're on the same page, @nostrademons. After @aroben shared this patch with me internally, I went ahead and refactored it to use a "tag set" in order to fix the underlying issue of qualified names when parsing, to simplify the implementation (and get rid of the ugly

Since I obviously wanted to keep all the tag lists inline in the function calls, my first attempt was to use bitsets, like you suggested (this was a couple weeks ago ;d). What I found is that it's non-trivial to initialize a bitset statically at the call site without it being either super ugly or super slow. If all the bits could fit in a single word, then the initialization syntax would be just fine, but with a bitset composed of

So I went with plan B, which was using a byteset (namely, a byte array). IMHO this has many advantages over a bitset, in both simplicity and performance. The implementation itself is pretty straightforward:

typedef char gumbo_tagset[GUMBO_TAG_LAST];
#define TAG(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_HTML)
#define TAG_SVG(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_SVG)
#define TAG_MATHML(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_MATHML)
#define TAGSET_INCLUDES(tagset, namespace, tag) \
    (tag < GUMBO_TAG_LAST && \
     tagset[(int)tag] == (1 << (int)namespace))

static bool node_tag_in_set(const GumboNode* node, const gumbo_tagset tags)
{
    return node->type == GUMBO_NODE_ELEMENT &&
        TAGSET_INCLUDES(tags,
                        node->v.element.tag_namespace,
                        node->v.element.tag);
}

As you can see, preprocessor magic is kept to a minimum, and it instead uses C99 features (designated initializers and compound literals) to set the marked tags natively and efficiently. It uses less than 200 bytes (one for each tag, obviously), which is more than a bitset (as it wastes 4 bits-per-byte), but is much simpler to initialize and much faster to query. The actual calls look like this:

static bool is_mathml_integration_point(const GumboNode* node) {
    return node_tag_in_set(node, (gumbo_tagset) {
        TAG_MATHML(MI), TAG_MATHML(MO), TAG_MATHML(MN),
        TAG_MATHML(MS), TAG_MATHML(MTEXT)});
}

// The list of nodes in the "special" category:
// http://www.whatwg.org/specs/web-apps/current-work/complete/parsing.html#special
static bool is_special_node(const GumboNode* node) {
    assert(node->type == GUMBO_NODE_ELEMENT);
    return node_tag_in_set(node, (gumbo_tagset) {
        /* HTML namespace */
        TAG(ADDRESS), TAG(APPLET), TAG(AREA),
        TAG(ARTICLE), TAG(ASIDE), TAG(BASE),
        TAG(BASEFONT), TAG(BGSOUND), TAG(BLOCKQUOTE),
        TAG(BODY), TAG(BR), TAG(BUTTON), TAG(CAPTION),
        TAG(CENTER), TAG(COL), TAG(COLGROUP),
        TAG(MENUITEM), TAG(DD), TAG(DETAILS), TAG(DIR),
        TAG(DIV), TAG(DL), TAG(DT), TAG(EMBED),
        TAG(FIELDSET), TAG(FIGCAPTION), TAG(FIGURE),
        TAG(FOOTER), TAG(FORM), TAG(FRAME),
        TAG(FRAMESET), TAG(H1), TAG(H2), TAG(H3),
        TAG(H4), TAG(H5), TAG(H6), TAG(HEAD),
        TAG(HEADER), TAG(HGROUP), TAG(HR), TAG(HTML),
        TAG(IFRAME), TAG(IMG), TAG(INPUT), TAG(ISINDEX),
        TAG(LI), TAG(LINK), TAG(LISTING), TAG(MARQUEE),
        TAG(MENU), TAG(META), TAG(NAV), TAG(NOEMBED),
        TAG(NOFRAMES), TAG(NOSCRIPT), TAG(OBJECT),
        TAG(OL), TAG(P), TAG(PARAM), TAG(PLAINTEXT),
        TAG(PRE), TAG(SCRIPT), TAG(SECTION), TAG(SELECT),
        TAG(STYLE), TAG(SUMMARY), TAG(TABLE), TAG(TBODY),
        TAG(TD), TAG(TEXTAREA), TAG(TFOOT), TAG(TH),
        TAG(THEAD), TAG(TITLE), TAG(TR), TAG(UL),
        TAG(WBR), TAG(XMP),
        /* MathML namespace */
        TAG_MATHML(MI), TAG_MATHML(MO), TAG_MATHML(MN), TAG_MATHML(MS),
        TAG_MATHML(MTEXT), TAG_MATHML(ANNOTATION_XML),
        /* SVG namespace */
        TAG_SVG(FOREIGNOBJECT), TAG_SVG(DESC)});
}

It essentially looks like the

Overall I'm pretty happy with how this ended up looking in our fork of Gumbo. Now, of course, I'm writing all this rant to ask: does this direction make sense to you? Can you think of any drawbacks? Would you consider merging this upstream if I backport these changes from our fork? Thanks for your consideration! :) |
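For readers who want to try the byteset technique from the comment above outside of Gumbo, here is a small self-contained sketch. The enums, the IN_* macros, and the function names are stand-ins invented for this example; only the technique (a char array indexed by tag, filled with C99 designated initializers inside a compound literal) mirrors the posted code.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in enums; the names are illustrative, not Gumbo's. */
typedef enum { NS_HTML, NS_SVG, NS_MATHML } Namespace;
typedef enum { T_DIV, T_P, T_TITLE, T_MI, T_DESC, T_LAST } Tag;

/* One byte per tag; the byte holds a namespace bit, 0 means "not in set". */
typedef char tagset[T_LAST];

#define IN_HTML(tag) [T_##tag] = (1 << NS_HTML)
#define IN_SVG(tag) [T_##tag] = (1 << NS_SVG)
#define IN_MATHML(tag) [T_##tag] = (1 << NS_MATHML)

/* Constant-time membership test: a single array load and a compare. */
static bool tagset_includes(const tagset set, Namespace ns, Tag tag) {
  return tag < T_LAST && set[tag] == (1 << ns);
}

/* The set is declared inline, at the point of use, as a compound literal. */
static bool is_example_special(Namespace ns, Tag tag) {
  return tagset_includes(
      (tagset){IN_HTML(DIV), IN_HTML(P), IN_SVG(DESC), IN_MATHML(MI)},
      ns, tag);
}
```

Because the comparison uses equality, as in the posted macro, each tag can be listed under only one namespace per set. A later designated initializer for the same index overrides an earlier one.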
Hi Vicent, I am not official in any way so my point of view is truly meaningless, but this is exactly what I was looking for. This is even better than the binary search approach I scrounged together, since order is now meaningless. I would still rewrite tag.c to have tags in alphabetical order so that tag strings can be easily looked up to find tag enums using a tree or binary search. It even nicely allows the fake math:td possibilities that people here claim are valid ;-) If I have a vote, I would give this implementation all of my votes! I started programming just over 39 years ago, and this is the very first time someone has dropped a perfect solution into my lap! Nicely done! I hope you don't mind, but if it does not get accepted in gumbo, I will still pull it into my fork, as this checks all of my boxes and I think theirs too. Thank you! Kevin
|
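The alphabetical tag table suggested above could look something like the following sketch, which maps a tag name to its enum value with the standard bsearch. The five-entry table and all names are illustrative, and the lookup assumes the name has already been lowercased; a real tag lookup would need to be case-insensitive.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative subset kept in alphabetical order; index == enum value. */
typedef enum { TAG_A, TAG_BODY, TAG_DIV, TAG_P, TAG_TABLE, TAG_UNKNOWN } Tag;
static const char* const kTagNames[] = {"a", "body", "div", "p", "table"};

/* bsearch passes the key first and a pointer to an array element second. */
static int compare_name(const void* key, const void* elem) {
  return strcmp((const char*)key, *(const char* const*)elem);
}

/* Returns the matching enum value, or TAG_UNKNOWN if the name is absent. */
static Tag tag_from_name(const char* name) {
  const char* const* hit =
      bsearch(name, kTagNames, sizeof kTagNames / sizeof kTagNames[0],
              sizeof kTagNames[0], compare_name);
  return hit ? (Tag)(hit - kTagNames) : TAG_UNKNOWN;
}
```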
Hi, Have you thought about adding the following so that the tag_in can use the same tagsets approach?
and then
|
I like this general approach, and am impressed by how clean the code came out afterwards. I have a couple concerns/nitpicks first, but in general yes I would consider merging this approach to upstream.
|
Hi, I may be able to answer a few of your questions about this approach. I have actually implemented it on top of your latest master. I have included his idea of a gumbo_tagset and also added its use for tag_in, using the additional macros I mentioned. That said, the reason I implemented it over your current master and not over the earlier master is that I think we still need some way to compare a single tag and namespace against a single node, so the QUAL_NAME stuff is still needed (I think?). So we would still need the uintptr_t hack too. Of course, perhaps he has other approaches to fix those areas. I still have the expanded set of recognized tags and xhtml parsing additions you would not want, but someone might easily grab pieces of this and use them to help fix the almost 70 uses of node_qualname_in and tag_in, which takes a lot of effort. My master is here: https://github.com/kevinhendricks/gumbo-parser I will incorporate this into my Windows build and Sigil (via a VM) and see if I can get the initializer-as-argument thing to fly with the Windows compiler. I can also run some before and after size and timing runs if you want. Kevin |
Hi, Ah, I bet he used the same gumbo_tagset approach in has_element_in_specific_scope() too, which I neglected to think of. I will fix that up tonight. Kevin |
In case it helps: I added the use of gumbo_tagset for has_an_element_in_specific_scope() and related functions, and no longer need to use uintptr_t for QualName. All pushed to my master if anyone wants to steal pieces of it to fix up the now 70+ uses of node_tag_in, tag_in and specific scope, which I think also removes all use of varargs in the main gumbo tree. |
@kevinhendricks: Thanks for taking the time to work on your own branch. I see a few issues with your implementation (mainly, the fact that it's still using the ugly

I'm afraid I don't have time this week to prepare PRs, but I've pushed our

Please feel free to take a look and grab as much stuff as you need. I'm hoping you'll agree that all the design decisions are sound, and that we've solved all your design concerns rather satisfactorily -- it's a significant reduction in code and a gain in simplicity, the diff is minimal (because we've added small wrappers to all the

@nostrademons: Please feel free to look at the Gist too (FWIW, GitHub has signed Google's CLA, although I don't know if that applies now that you're no longer at Google) and see how you feel. Following up on your questions now:
This is bad news. I'm afraid none of this code can possibly work with MSVC. I saw on the README that this was a C99 library, so I gave this no further thought. Compound literals are a C99 feature that MSVC does not implement (obviously, as it is widely known as the worst C compiler in existence), and what's worse, that is not supported in any of the C++ standards, and hence cannot be used under MSVC even when building in C++ mode. What's worse, anything you come up with to keep the tagsets next to the function call is not going to work with MSVC (unless you stick to the current implementation using
We considered this, but this would require a completely different implementation of all the
I don't have accurate benchmarks for this feature because our branch of Gumbo has gone through several optimization phases, and it's currently between 2x and 3x faster than upstream -- so any benchmarks would be synthetic and not really accurate. My hunch would be that this will make most parsing around 20% faster because of the
No noticeable changes. If anything, the size is slightly smaller for the largest Tag Sets (as the previous
Absolutely. You can see in the previous Gist that this is part of our implementation.
We renamed it to make the API more obvious -- the old API was not kept around. We can of course rename it back.
Thanks for the heads up. I don't foresee any complications merging this either up or down. |
Hi Vicent, Thanks for posting it. My latest master has all of the tag_in and specific_scope changes, and uses unsigned int now for QualName but I did not get a chance to push it until late last night. Your version is actually much cleaner and I will definitely use parts from it. So Thank You! (again!) That said, my project (Sigil) is cross-platform and Windows is one of our targets. So we will either have to use one of the gcc compiler ports to Windows to deal with this issue or simply create all of the tagset definitions outside of the arguments in some way. I am not going backwards as I need a larger tagset and so your speedups are very important to us. Kevin |
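One way to "create all of the tagset definitions outside of the arguments" for a compiler without compound literals or designated initializers is sketched below: a plain C89 table sits next to the function and is expanded into a byteset on first use. This is only an assumption about how such a fallback might look, not code from either fork; all names are invented, and the lazy initialization shown is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NS_HTML, NS_SVG, NS_MATHML } Namespace;
typedef enum { T_DIV, T_MI, T_MO, T_MN, T_MS, T_MTEXT, T_LAST } Tag;

typedef struct { Namespace ns; Tag tag; } TagEntry;

/* Plain C89 table, declared right above the function that uses it. */
static const TagEntry kIntegrationPoint[] = {
  {NS_MATHML, T_MI}, {NS_MATHML, T_MO}, {NS_MATHML, T_MN},
  {NS_MATHML, T_MS}, {NS_MATHML, T_MTEXT}
};

static int is_mathml_integration_point_example(Namespace ns, Tag tag) {
  static char set[T_LAST]; /* zero-initialized byteset, one byte per tag */
  static int ready = 0;
  size_t i;
  if (!ready) { /* first call: expand the table into the byteset */
    for (i = 0; i < sizeof kIntegrationPoint / sizeof kIntegrationPoint[0]; ++i)
      set[kIntegrationPoint[i].tag] = (char)(1 << kIntegrationPoint[i].ns);
    ready = 1;
  }
  return tag < T_LAST && set[tag] == (1 << ns);
}
```

The tag list stays adjacent to the function that uses it, at the cost of a one-time fill and a static flag per set.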
https://msdn.microsoft.com/en-us/library/hh409293 "What's New for Visual C++ in Visual Studio 2013" does include "Supports these ISO C99 language features: Compound literals." (in addition to a few others) so maybe this particular feature can be used in the latest VS? I'm happy to test small code samples in VS2013. |
Oh shit, looks like Microsoft actually got their stuff together in Visual C++ 2013, with support for both compound literals and designated initializers. I see no reason why this patch wouldn't fully work as-is on MSVC, with no additional changes (even though I cannot test it right now). Props to Microsoft I guess! This will certainly make this whole process much more straightforward. |
@twpol
It is not exactly the same as used here but will test the compound initializer when passed as an argument. |
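The test program referred to above did not survive in this copy of the thread. A small stand-in that exercises the same two C99 features, a compound literal passed directly as a function argument and designated initializers inside it, might look like this (all names are invented for the example):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x; int y; } Point;

static int point_sum(Point p) { return p.x + p.y; }

static int array_total(const int* values, size_t count) {
  int total = 0;
  size_t i;
  for (i = 0; i < count; ++i) total += values[i];
  return total;
}

/* Both calls build their argument in place with a compound literal. */
static int compound_literal_check(void) {
  int a = point_sum((Point){.x = 3, .y = 4});          /* struct literal */
  int b = array_total((int[4]){[1] = 10, [3] = 5}, 4); /* array literal */
  return a + b;
}
```

If a compiler accepts this file in C mode and compound_literal_check() returns 22, both features are available there.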
Hi vmg, There are a lot of very nice changes in your parser.c besides the obvious tag_in_set, node_tag_in_set, and tagset
Very nice! (This is one I will be pulling in.) I could never understand why they decided to pass parser all over the place just to allocate and deallocate memory. If someone wanted to use their own memory allocation/deallocation, it would be simpler to just wrap the gumbo ones, or wrap malloc and free themselves, to catch it and redirect it.
I would bet your removal of passing parser all over the place where it is not needed really cleans up the code, makes many of the node / element creation pieces much easier to read, and makes them available for calling routines to modify the tree for special cases from outside the parser itself. Very nice. I had already started the process of removing "parser" from every place it was not needed (i.e. when only memory allocation/deallocation was needed). I gave up until things get more stable, but I really like these ideas and will be pulling almost all of them into our tree in one form or another. Kevin |
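To illustrate the point about redirecting allocation without threading a parser pointer through every call, here is one possible shape for a wrapper layer: library-wide function pointers that default to malloc and free and can be swapped by the caller. This is a guess at the general idea rather than the code in either fork, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef void* (*alloc_fn)(size_t);
typedef void (*free_fn)(void*);

/* Library-wide allocator hooks, defaulting to the C runtime. */
static alloc_fn g_alloc = malloc;
static free_fn g_free = free;

/* Lets an embedding application install its own allocator pair. */
static void set_allocator(alloc_fn a, free_fn f) {
  g_alloc = a;
  g_free = f;
}

/* Internal code calls these instead of taking a parser argument. */
static void* lib_alloc(size_t n) { return g_alloc(n); }
static void lib_free(void* p) { g_free(p); }

/* Example replacement: counts live allocations on top of malloc/free. */
static size_t g_live = 0;
static void* counting_alloc(size_t n) {
  void* p = malloc(n);
  if (p) ++g_live;
  return p;
}
static void counting_free(void* p) {
  if (p) --g_live;
  free(p);
}
```

The trade-off is global state: every parser instance in the process shares one allocator, which is simpler than per-parser hooks but less flexible.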
@kevinhendricks: Looks like things are good for VS2013. |
@nostrademons, Just let me know if that is something that would help. |
@twpol |
Fixes #278
/cc @gsnedders