Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Match elements on qualified names, not just local names #286

Merged
merged 1 commit into from
Feb 9, 2015

Conversation

aroben
Copy link
Contributor

@aroben aroben commented Jan 30, 2015

There are many places where the HTML spec says to look for an HTML element (i.e., an element in the HTML namespace) with a given tag, but where we were only looking at tag names and ignoring the namespace name.

Now we always use qualified names where required. This is implemented using a new GumboQualName type, which is just a bit field that combines a GumboNamespaceEnum with a GumboTag. A series of macros make it easy to construct and inspect these bit fields. It's currently represented by a uintptr_t. This is larger than necessary; only 9 bits are required. At first I attempted to use a short but the compiler didn't like that being used with varargs, so then I tried an unsigned int but the compiler didn't like casting that to a pointer, so here we are. I also tried typedef enum { QNSIZE = SHORT_MAX } GumboQualName to induce more compiler warnings when mistakenly passing a tag to a function that expects a qualified name. This worked nicely but generated warnings about missing cases in switch statements. I'm not sure what the best option is, though I am quite tempted by the extra warnings the enum provides.

Fixes #278

/cc @gsnedders

There are many places where the HTML spec says to look for an HTML
element (i.e., an element in the HTML namespace) with a given tag, but
where we were only looking at tag names and ignoring the namespace name.

Now we always use qualified names where required. This is implemented
using a new GumboQualName type, which is just a bit field that combines
a GumboNamespaceEnum with a GumboTag. A series of macros make it easy to
construct and inspect these bit fields. It's currently represented by a
uintptr_t. This is larger than necessary; only 9 bits are required. At
first I attempted to use a `short` but the compiler didn't like that
being used with varargs, so then I tried an `unsigned int` but the
compiler didn't like casting that to a pointer, so here we are. I also
tried `typedef enum { QNSIZE = SHORT_MAX } GumboQualName` to induce more
compiler warnings when mistakenly passing a tag to a function that
expects a qualified name. This worked nicely but generated warnings
about missing cases in switch statements. I'm not sure what the best
option is, though I am quite tempted by the extra warnings the enum
provides.
@kevinhendricks
Copy link
Contributor

FYI: I am an outsider here so my opinion means nothing! ;-)

But I wish you would have not used uintptr_t for your GumboQualName type. It really is an unsigned int or int type and not a pointer of any sort. Anything under an int in size (such as shorts) would get automatically promoted to an int when used with varargs so you might as well use an int.

The whole casting problem comes from this hack in parser.c

// This is a bit of a hack to stack-allocate a one-element GumboVector name
// 'varname' containing the 'from_var' variable, since it's used in nearly all
// the subsequent helper functions. Note the use of void* and casts instead of
// GumboTag; this is so the alignment requirements are the same as GumboVector
// and the data inside it can be freely accessed as if it were a normal
// GumboVector.
#define DECLARE_ONE_ELEMENT_GUMBO_VECTOR(varname, from_var)
void* varname ## _tmp_array[1] = { (void*) from_var };
GumboVector varname = { varname ## _tmp_array, 1, 1 }

where they are tricking a GumboVector to store integers and not void * which is technically implementation defined and not a good idea in general IMHO.

But if you look at vector.c/.h you can see how easy it would be to create a variable length/growing list or array of integers using this code as a basis.

If we do that instead we could use it as the basis of a integer "set type", and build it always sorted and allow binary search to check for set inclusion or not to speed up the parser. We could also use a binary search on the string name strings to get the enum. That would involve sorting the tag names in alphabetical order.

There are of course alternatives to variable length integer sets, we could use bit arrays - look at X11's xtrapbits.h for a simple pure C header only implementation of bit arrays to handle the 150 or so tags we have, and the like.

Alternatively we could pass around a pointer to a string of 150 bytes (each either a "0" or a "1") and use simple pointer arithmetic or array syntax to very very quickly check if a tag is in a particular set of not.

I kind of hoped that someone official would have decided on what basic speedup approach makes the most sense first before we move to encode namespace information since it may change all of the macros and things you have set up depending on how they move forward. Or at least do everything at the same time to prevent two API (order of tags) and ABI changes at one time.

Anyway, thanks for you patch! Something like this is needed - I just wish we could have some direction on better handling of sets of integers and testing for inclusion first or at least simultaneously.

@nostrademons
Copy link
Contributor

I guess I'm that someone official :-) Sorry I haven't had a chance to review this patch yet; it's been a hectic week, but I've got a few hours now, so I can comment and hopefully get around to checking out the code.

I agree that the tagset representation needs help. It actually wasn't a bottleneck last time I profiled (back when I was still at Google), but now that UTF-8 and entity decoding have both been sped up, it may become one. I also always thought it was a bit of an ugly hack, and had planned to replace it with a bitset implementation like xtrapbits.h when I had a chance.

However, I like to have separate pull requests for new features, bug/correctness changes, and performance improvements. This lets us benchmark the performance improvements without conflating them with new features, it lets us get the bugfixes out to existing & new users ASAP, and it makes it easier to debug & test the new features.

So I'd like to get this patch in first, as soon as I can verify that it's sane and won't introduce any additional correctness/stability issues, and then someone can worry about performance in a separate patch. It looks like all the implementation is hidden in the GumboQualName type and _QN macros, so it shouldn't be too hard to change afterwards.

The main requirement when coming up with a new representation for tag sets is that the declaration of the tag set occur at the same point in the source code that the spec clause uses it. Gumbo has in general tried to make the source code mirror the spec to the greatest extent possible; this was a huge help for code review within Google, and it's continued to be a help as the spec has changed underneath the library. My inclination is to create a struct that's basically 3 arrays (one for each namespace, each about 2-3 words long, 1 bit for each GUMBO_TAG), and then use some preprocessor magic to construct the appropriate bitsets. I suspect that the compiler could optimize the calls to tag_in/tag_is out entirely then, becoming simple bitvector operations against stack-allocated temporaries. Benchmarking is necessary to know for sure.

As for the initial questions about this PR: I think a uintptr_t is fine for now. I like the extra type-safety in general, but I don't want to introduce compiler warnings to others, and with a possible revamp of the tag set mechanism in the future, it's not worth worrying about now.

@kevinhendricks
Copy link
Contributor

Under Issues, I have added an alternative approach that uses binary search and keeps tag sets right in the code. The more I look at it, testing on namespace ang tag enum was not the idea in the spec. The spec saw that there were only 5 overlap tags in svg with html and none at all with mathml.

So the tag name is enough. Please take a look at the commits mentioned in the very latest issue. It can provide the necessary speed, keeps the tag lists right in the code and can easily be used to deal with the ns tag mismatch issues without changing almost anything in the original design.

Kevin

@nostrademons
Copy link
Contributor

Couple questions after looking through this:

  1. When you tried using an enum type for QualName, did you revert it with all warnings clean except for the missing cases in switch statements, or were there still unexplained warnings? I think it's fine if it's just the switch statements (and we can fix those by adding a default case for the switch), but I'd like to know if there was anything else unexplained with the stronger typechecking.
  2. This doesn't change the tag_is/tag_in functions. Is that just out-of-scope for this change? It's fine if it is - we may replace that code anyway - but I'd like to make sure it's intentional.

@aroben
Copy link
Contributor Author

aroben commented Feb 8, 2015

When you tried using an enum type for QualName, did you revert it with all warnings clean except for the missing cases in switch statements, or were there still unexplained warnings?

Yes, it was clean except for the switch statement. (There's only one switch: in get_appropriate_insertion_mode.)

I think it's fine if it's just the switch statements (and we can fix those by adding a default case for the switch)

I misspoke a bit earlier: the warning wasn't about a missing case, it was about the values in the case statement not being members of the GumboQualName enumeration:

src/parser.c:586:10: warning: case value not in enumerated type 'GumboQualName' [-Wswitch]
    case HTML_QN(HTML):
         ^
src/parser.c:47:22: note: expanded from macro 'HTML_QN'
#define HTML_QN(tag) QUAL_NAME(GUMBO_NAMESPACE_HTML, GUMBO_TAG_ ## tag)
                     ^
src/parser.c:43:35: note: expanded from macro 'QUAL_NAME'
#define QUAL_NAME(namespace, tag) (GumboQualName)(((namespace) << 8) | (tag))
                                  ^

These warnings go away if you cast to unsigned int first: switch ((unsigned int)NODE_QUAL_NAME(node)) {.

This doesn't change the tag_is/tag_in functions. Is that just out-of-scope for this change? It's fine if it is - we may replace that code anyway - but I'd like to make sure it's intentional.

tag_is/tag_in are now only used to decide what kind of token we got from the tokenizer. Since tokens don't have a namespace, these functions didn't require any changes.

nostrademons added a commit that referenced this pull request Feb 9, 2015
Match elements on qualified names, not just local names
@nostrademons nostrademons merged commit 00b11e8 into google:master Feb 9, 2015
@aroben aroben deleted the qualnames branch February 9, 2015 22:25
@gsnedders
Copy link
Contributor

as soon as I can verify that it's sane and won't introduce any additional correctness/stability issues

FWIW, I threw afl back at the branch (given it's what hit #278 originally) in case it found anything else with that mess fixed, and it didn't find anything that didn't exist in master (primarily #276).

@vmg
Copy link
Contributor

vmg commented Feb 10, 2015

Hey! Sorry to interrupt in this already merged PR. I've been working with @aroben on the issue of qualified names.

The main requirement when coming up with a new representation for tag sets is that the declaration of the tag set occur at the same point in the source code that the spec clause uses it. Gumbo has in general tried to make the source code mirror the spec to the greatest extent possible; this was a huge help for code review within Google, and it's continued to be a help as the spec has changed underneath the library. My inclination is to create a struct that's basically 3 arrays (one for each namespace, each about 2-3 words long, 1 bit for each GUMBO_TAG), and then use some preprocessor magic to construct the appropriate bitsets.

I'm glad to see we're on the same page, @nostrademons. After @aroben shared with me this patch internally, I went ahead and refactored it to use a "tag set" in order to fix the underlying issue of qualified names when parsing, to simplify the implementation (and get rid of the ugly uintptr_t), and to increase performance.

Since I obviously wanted to keep all the tag lists inline in the function calls, my first attempt was to use bitsets, like you suggested (this was a couple weeks ago ;d). What I found is that it's non-trivial to initialize a bitset statically on the callsite without either being super-ugly, or super slow. If all the bits could fit in a single word, then the initialization syntax would be just fine, but with a bitset composed of N words, static initialization basically implies knowing what word each tag falls into.

So I went with plan B, which was using a byteset (namely, a byte array). IMHO this has many advantages over a bitset, in both simplicity and performance. The implementation itself is pretty straightforward:

typedef char gumbo_tagset[GUMBO_TAG_LAST];
#define TAG(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_HTML)
#define TAG_SVG(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_SVG)
#define TAG_MATHML(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_MATHML)

#define TAGSET_INCLUDES(tagset, namespace, tag) \
  (tag < GUMBO_TAG_LAST && \
   tagset[(int)tag] == (1 << (int)namespace))

static bool node_tag_in_set(const GumboNode* node, const gumbo_tagset tags)
{
  return node->type == GUMBO_NODE_ELEMENT &&
      TAGSET_INCLUDES(tags,
        node->v.element.tag_namespace,
        node->v.element.tag);
}

As you can see, preprocessor magic is limited to the minimum, and it instead uses C99 features (designated initializers and compound literals) to set the marked tags natively and efficiently. It uses less than 200 bytes (one for each tag, obviously), which is more than a bitset (as it wastes 4 bits-per-byte), but is much simpler to initialize and much faster to query.

The actual calls look like this:

static bool is_mathml_integration_point(const GumboNode* node) {
  return node_tag_in_set(node, (gumbo_tagset) {
      TAG_MATHML(MI), TAG_MATHML(MO), TAG_MATHML(MN),
      TAG_MATHML(MS), TAG_MATHML(MTEXT)});
}

// The list of nodes in the "special" category:
// http://www.whatwg.org/specs/web-apps/current-work/complete/parsing.html#special
static bool is_special_node(const GumboNode* node) {
  assert(node->type == GUMBO_NODE_ELEMENT);
  return node_tag_in_set(node, (gumbo_tagset) {
      /* HTML namespace */
      TAG(ADDRESS), TAG(APPLET), TAG(AREA),
      TAG(ARTICLE), TAG(ASIDE), TAG(BASE),
      TAG(BASEFONT), TAG(BGSOUND), TAG(BLOCKQUOTE),
      TAG(BODY), TAG(BR), TAG(BUTTON), TAG(CAPTION),
      TAG(CENTER), TAG(COL), TAG(COLGROUP),
      TAG(MENUITEM), TAG(DD), TAG(DETAILS), TAG(DIR),
      TAG(DIV), TAG(DL), TAG(DT), TAG(EMBED),
      TAG(FIELDSET), TAG(FIGCAPTION), TAG(FIGURE),
      TAG(FOOTER), TAG(FORM), TAG(FRAME),
      TAG(FRAMESET), TAG(H1), TAG(H2), TAG(H3),
      TAG(H4), TAG(H5), TAG(H6), TAG(HEAD),
      TAG(HEADER), TAG(HGROUP), TAG(HR), TAG(HTML),
      TAG(IFRAME), TAG(IMG), TAG(INPUT), TAG(ISINDEX),
      TAG(LI), TAG(LINK), TAG(LISTING), TAG(MARQUEE),
      TAG(MENU), TAG(META), TAG(NAV), TAG(NOEMBED),
      TAG(NOFRAMES), TAG(NOSCRIPT), TAG(OBJECT),
      TAG(OL), TAG(P), TAG(PARAM), TAG(PLAINTEXT),
      TAG(PRE), TAG(SCRIPT), TAG(SECTION), TAG(SELECT),
      TAG(STYLE), TAG(SUMMARY), TAG(TABLE), TAG(TBODY),
      TAG(TD), TAG(TEXTAREA), TAG(TFOOT), TAG(TH),
      TAG(THEAD), TAG(TITLE), TAG(TR), TAG(UL),
      TAG(WBR), TAG(XMP),

      /* MathML namespace */
      TAG_MATHML(MI), TAG_MATHML(MO), TAG_MATHML(MN), TAG_MATHML(MS),
      TAG_MATHML(MTEXT), TAG_MATHML(ANNOTATION_XML),

      /* SVG namespace */
      TAG_SVG(FOREIGNOBJECT), TAG_SVG(DESC)});
}

It essentially looks like the varargs implementation, but it's noticeably faster in benchmarks, shorter, and hopefully correct when it comes to qualified tags.

Overall I'm pretty happy with how this ended up looking in our fork of Goomba. Now, of course, I'm writing all this rant to ask: do you think this direction makes sense to you? Can you think of any drawbacks? Would you consider merging this upstream if I backport these changes from our fork?

Thanks for your consideration! :)

@kevinhendricks
Copy link
Contributor

Hi Vicent,

I am not official in any way so my point of view it truly meaningless, but this is exactly what I was looking for. This is even better than the binary search aproach I scrounged together since order is now meaningless. I would still rewrite tag.c to have tags in alphabetical order so that tag strings can be easily looked up to find tag enums using a tree or binary search. It nicely even allows the fake math:td possibilities that people here claim are valid ;-)

If I have a vote, I would give this implementation all of my votes!

I started programming just over 39 years ago, and this is the very first time someone has dropped a perfect solution into my lap! Nicely done!

I hope you don't mind but if it does not get accepted in gumbo, I will still pull it into my fork as this does check all of my boxes and I think theirs too.

Thank you!

Kevin

On Feb 10, 2015, at 11:04 AM, Vicent Marti notifications@github.com wrote:

Hey! Sorry to interrupt in this already merged PR. I've been working with @aroben on the issue of qualified names.

The main requirement when coming up with a new representation for tag sets is that the declaration of the tag set occur at the same point in the source code that the spec clause uses it. Gumbo has in general tried to make the source code mirror the spec to the greatest extent possible; this was a huge help for code review within Google, and it's continued to be a help as the spec has changed underneath the library. My inclination is to create a struct that's basically 3 arrays (one for each namespace, each about 2-3 words long, 1 bit for each GUMBO_TAG), and then use some preprocessor magic to construct the appropriate bitsets.

I'm glad to see we're on the same page, @nostrademons. After @aroben shared with me this patch internally, I went ahead and refactored it to use a "tag set" in order to fix the underlying issue of qualified names when parsing, to simplify the implementation (and get rid of the ugly uintptr_t), and to increase performance.

Since I obviously wanted to keep all the tag lists inline in the function calls, my first attempt was to use bitsets, like you suggested (this was a couple weeks ago ;d). What I found is that it's non-trivial to initialize a bitset statically on the callsite without either being super-ugly, or super slow. If all the bits could fit in a single word, then the initialization syntax would be just fine, but with a bitset composed of N words, static initialization basically implies knowing what word each tag falls into.

So I went with plan B, which was using a byteset (namely, a byte array). IMHO this has many advantages over a bitset, in both simplicity and performance. The implementation itself is pretty straightforward:

typedef char
gumbo_tagset[GUMBO_TAG_LAST];

define TAG(tag) [GUMBO_TAG_##tag] = (1
<< GUMBO_NAMESPACE_HTML)

define TAG_SVG(tag) [GUMBO_TAG_##tag] = (1
<< GUMBO_NAMESPACE_SVG)

define TAG_MATHML(tag) [GUMBO_TAG_##tag] = (1
<< GUMBO_NAMESPACE_MATHML)

define TAGSET_INCLUDES(tagset, namespace, tag
)
(tag < GUMBO_TAG_LAST &&
tagset[(
int)tag] == (1 << (int
)namespace))

static bool node_tag_in_set(const GumboNode* node, const
gumbo_tagset tags)
{

return
node->type == GUMBO_NODE_ELEMENT &&

TAGSET_INCLUDES
(tags,
node->v.
element.tag_namespace
,
node->v.
element.tag
);
}

As you can see, preprocessor magic is limited to the minimum, and it instead uses C99 features (designated initializers and compound literals) to set the marked tags natively and efficiently. It uses less than 200 bytes (one for each tag, obviously), which is more than a bitset (as it wastes 4 bits-per-byte), but is much simpler to initialize and much faster to query.

The actual calls look like this:

static bool is_mathml_integration_point(const
GumboNode* node) {

return node_tag_in_set
(node, (gumbo_tagset) {

TAG_MATHML(MI), TAG_MATHML(MO), TAG_MATHML
(MN),

TAG_MATHML(MS), TAG_MATHML
(MTEXT)});
}

// The list of nodes in the "special" category:
// http://www.whatwg.org/specs/web-apps/current-work/complete/parsing.html#special
static bool is_special_node(const
GumboNode* node) {

assert
(node->type == GUMBO_NODE_ELEMENT);

return node_tag_in_set
(node, (gumbo_tagset) {

/* HTML namespace */

TAG(ADDRESS), TAG(APPLET), TAG
(AREA),

TAG(ARTICLE), TAG(ASIDE), TAG
(BASE),

TAG(BASEFONT), TAG(BGSOUND), TAG
(BLOCKQUOTE),

TAG(BODY), TAG(BR), TAG(BUTTON), TAG
(CAPTION),

TAG(CENTER), TAG(COL), TAG
(COLGROUP),

TAG(MENUITEM), TAG(DD), TAG(DETAILS), TAG
(DIR),

TAG(DIV), TAG(DL), TAG(DT), TAG
(EMBED),

TAG(FIELDSET), TAG(FIGCAPTION), TAG
(FIGURE),

TAG(FOOTER), TAG(FORM), TAG
(FRAME),

TAG(FRAMESET), TAG(H1), TAG(H2), TAG
(H3),

TAG(H4), TAG(H5), TAG(H6), TAG
(HEAD),

TAG(HEADER), TAG(HGROUP), TAG(HR), TAG
(HTML),

TAG(IFRAME), TAG(IMG), TAG(INPUT), TAG
(ISINDEX),

TAG(LI), TAG(LINK), TAG(LISTING), TAG
(MARQUEE),

TAG(MENU), TAG(META), TAG(NAV), TAG
(NOEMBED),

TAG(NOFRAMES), TAG(NOSCRIPT), TAG
(OBJECT),

TAG(OL), TAG(P), TAG(PARAM), TAG
(PLAINTEXT),

TAG(PRE), TAG(SCRIPT), TAG(SECTION), TAG
(SELECT),

TAG(STYLE), TAG(SUMMARY), TAG(TABLE), TAG
(TBODY),

TAG(TD), TAG(TEXTAREA), TAG(TFOOT), TAG
(TH),

TAG(THEAD), TAG(TITLE), TAG(TR), TAG
(UL),

TAG(WBR), TAG
(XMP),

/* MathML namespace */

TAG_MATHML(MI), TAG_MATHML(MO), TAG_MATHML(MN), TAG_MATHML
(MS),

TAG_MATHML(MTEXT), TAG_MATHML
(ANNOTATION_XML),

/* SVG namespace */

TAG_SVG(FOREIGNOBJECT), TAG_SVG
(DESC)});
}

It essentially looks like the varargs implementation, but it's noticeably faster in benchmarks, shorter, and hopefully correct when it comes to qualified tags.

Overall I'm pretty happy with how this ended up looking in our fork of Goomba. Now, of course, I'm writing all this rant to ask: do you think this direction makes sense to you? Can you think of any drawbacks? Would you consider merging this upstream if I backport these changes from our fork?

Thanks for your consideration! :)


Reply to this email directly or view it on GitHub.

@kevinhendricks
Copy link
Contributor

Hi,

Have you thought about adding the following so that the tag_in can use the same tagsets approach?

#define GUMBO_NAMESPACE_UNKNOWN 3
#define TOKEN(tag) [GUMBO_TAG_##tag] = (1 << GUMBO_NAMESPACE_UNKNOWN)

and then

// Returns true if the specified token is either a start or end tag (specified                                            
// by is_start) with one of the tag types in the list.                                                                    
static bool tag_in(const GumboToken* token, bool is_start, const gumbo_tagset tags ) {
  GumboTag token_tag;
  if (is_start && token->type == GUMBO_TOKEN_START_TAG) {
    token_tag = token->v.start_tag.tag;
  } else if (!is_start && token->type == GUMBO_TOKEN_END_TAG) {
    token_tag = token->v.end_tag;
  } else {
    return false;
  }
  return (tag < GUMBO_TAG_LAST && tags[(int)tag] != 0);
}

@nostrademons
Copy link
Contributor

I like this general approach, and am impressed by how clean the code came out afterwards. I have a couple concerns/nitpicks first, but in general yes I would consider merging this approach to upstream.

  1. I'm a bit worried about how this might affect Visual Studio users. Visual Studio's C99 support is pretty abysmal, but I suspect that Gumbo is in use by some of them as I occasionally get Visual Studio patches from users, who have rather heroically seemed to have made it work. IIRC they may be compiling in C++ mode, using the C subset. I don't have a Windows machine so I can't test - @fishioon, @bugparty, you've submitted VS patches before - would the C99 initializer lists here break anything on Windows?
  2. If there are problems on Windows, perhaps we could change gumbo_tagset from a typedef to a macro to hide the initializer list syntax. So it'd be node_tag_in_set(node, GUMBO_TAGSET(TAG(HTML), TAG(BODY), ...)). Not sure about this - all other things being equal, I'd prefer the typedef over the macro, but it depends what VS can support.
  3. What were your benchmark results? I'd like to get a sense of what sort of speedup might be expected from this.
  4. Similarly, how did it affect the size of the shared library? I suspect it's probably not much, but one thing we noticed at Google was that bloating the binary size would both increase link times and cause instruction-cache misses that could cause generalized slowdowns that weren't reflected in benchmarks.
  5. I do also like the idea of using tagsets for tag_in, like what Kevin posted.
  6. It wasn't clear whether you have it named node_tag_in_set just for illustration or to avoid conflicts with node_tag_in (and keep the latter around)...but it's fine to completely replace node_tag_in with this construct. I'd rather not have two completely separate ways of doing the same thing.
  7. Beware the template branch (Template tag #274), which has added a lot more calls to these functions. It's currently sitting in review, slated to go in for 0.10 (since it has API additions); if you can get a solid pull request together for tag sets I'd probably prefer to merge it before that, for 0.9.4, but if it waits until afterwards you'll need to update those as well.

@kevinhendricks
Copy link
Contributor

Hi,

I may be able to answer a few of your questions about this approach. I have actually implemented it over the top of your latest master. I have included his idea of a gumbo_tagset and also added its use for tag_in as well using the additional macros I mentioned.

That said, the reason I implemented it over your current master and not over the earlier master is that I think we still need some way to handle comparing a single tag and namespace against a single node so the QUAL_NAME stuff is still needed (I think?). So we still would need the uintptr_t hack too. Of course, perhaps he has other approaches to fix those areas. .

Of course I still have the expanded set of recognized tags and xhtml parsing additions you would not want but someone might easily want to grab pieces of this to and use it to help fix the almost 70 uses of node_qualname_in and tag_in which takes a lot of effort.

My master is here: https://github.com/kevinhendricks/gumbo-parser

I will incorporate this into my Windows build and Sigil (via a VM) and see if I can get the initializer as argument thing to fly with the Windows compiler. I can also run some before and after size and timing runs if you want.

Kevin

@kevinhendricks
Copy link
Contributor

Hi,

Ah, I bet he used the same gumbo_tagset approach in has_element_in_specific_scope() too which I neglected to think of. I fix that up yet tonight.

Kevin

@kevinhendricks
Copy link
Contributor

In case it helps,

Added in the use of gumbo_tagset for has_an_element_in_specific_scope() and related and longer need to use uintptr_t for QualName.

All pushed to my master if anyone wants to steal pieces of it to fix up the now 70+ uses of node_tag_in, tag_in and specific scope removing all use of varargs I think as well in the main gumbo tree.

@vmg
Copy link
Contributor

vmg commented Feb 11, 2015

@kevinhendricks: Thanks for taking the time to work on your own branch. I see a few issues on your implementation (mainly, the fact that it's still using the ugly uintptr_t bitset), and that the tag_is sets have not been ported (I do have those ported on my own set).

I'm afraid I don't have time this week to prepare PRs, but I've pushed our parser.c implementation to a Gist for your convenience: https://gist.github.com/vmg/70f08631a0a48b25441a

Please feel free to take a look and grab as much stuff as you need. I'm hoping you'll agree that all the design decisions are sound, and that we've solved all your design concerns rather satisfactorily -- it's a significant reduction on code and simplicity, the diff is minimal (because we've added small wrappers to all the node_tag_is helpers that use the HTML namespace), and we've removed as many hacks as we could while keeping the library performant (most noticeably, the allocation of a static, one-entry-array for many API calls, and of a lot of intermediate allocations that were previously used for checks).

@nostrademons: Please feel free to look at the Gist too (FWIW, GitHub has signed Google's CLA, although I don't know if that applies now that you're no longer at Google) and see how you feel. Following up on your questions now:

  1. I'm a bit worried about how this might affect Visual Studio users

This is bad news. I'm afraid none of this code can possibly work with MSVC. I saw on the README that this was a C99 library, so I gave this no further thought. Compound literals are a C99 feature that MSVC does not implement (obviously, as it is widely known as the worst C compiler in existence), and what's worse, that is not supported in any of the C++ standards, and hence cannot be used under MSVC even when building in C++ mode.

What's worse, anything you come up with to keep the tagsets next to the function call is not going to work with MSVC (unless you stick to the current implementation using va_args). The C89 in MSVC simply doesn't have provisions for passing any kind of non-trivial literal in a function call. I'm afraid I'm at a loss here -- maybe somebody smarter than me can come up with a great idea, but as far as I can tell, you're going to have to declare all the tag sets (regardless of how they are implemented; byte arrays, a single bitset, several bitsets, several arrays of tags...) in their own variables if you want them to work in MSVC.

  1. If there are problems on Windows, perhaps we could change gumbo_tagset from a typedef to a macro to hide the initializer list syntax.

We considered this, but this would require a completely different implementation of all the tag_is APIs only for MSVC (as they would still need to use va_args -- they can't even use an array of tags), and the complexity is daunting (I would advise to have the macro expand to a namespace, tag pair on Windows, and then use va_args loading each pair and checking them sequentially). If you're set on making this a C89 library-with-C99-features-but-not-on-Windows, this is probably your only option, although I'm afraid I won't be able to assist with that implementation (we're committed to modern compilers so we can write simple and more efficient code). :'(

  1. What were your benchmark results? I'd like to get a sense of what sort of speedup might be expected from this.

I don't have accurate benchmarks for this feature because our branch of Gumbo has gone through several optimization phases, and it's currently between 2x and 3x faster than upstream -- so any benchmarks would be synthetic and not really accurate. My hunch would be that this will make most parsing around 20% faster because of the tag_is sets, which are the ones actually consuming time. Regardless, I wrote this for elegance (and simplicity), not for speed... I'm pretty sure we could optimize it further if you're interested.

  1. Similarly, how did it affect the size of the shared library?

No noticeable changes. If anything, the size is slightly smaller for the largest Tag Sets (as the previous va_arg implementations were pushing 4 bytes per tag to the stack). Do note that the tag-sets in compound literals are not really stored in static storage but allocated in the stack on function call (which makes this extra nice for sparse-sets, in both size and speed, and provides roughly the same results for dense sets).

  1. I do also like the idea of using tagsets for tag_in, like what Kevin posted.

Absolutely. You can see in the previous Gist that this is part of our implementation.

  1. It wasn't clear whether you have it named node_tag_in_set just for illustration or to avoid conflicts with node_tag_in (and keep the latter around)...but it's fine to completely replace node_tag_in with this construct.

We renamed it to make the API more obvious -- the old API was not kept around. We can of course rename it back.

  1. Beware the template branch (Template tag #274), which has added a lot more calls to these functions. It's currently sitting in review, slated to go in for 0.10 (since it has API additions);

Thanks for the heads up. I don't foresee any complications merging this either up or down.

@kevinhendricks
Copy link
Contributor

Hi Vicent,

Thanks for posting it. My latest master has all of the tag_in and specific_scope changes, and uses unsigned int now for QualName but I did not get a chance to push it until late last night. Your version is actually much cleaner and I will definitely use parts from it. So Thank You! (again!)

That said, my project (Sigil) is cross-platform and Windows is one of our targets. So we will either have to use one of the gcc compiler ports to Windows to deal with this issue or simply create all of the tagset definitions outside of the arguments in some way. I am not going backwards as I need a larger tagset and so your speedups are very important to us.

Kevin

@twpol
Copy link

twpol commented Feb 11, 2015

Compound literals are a C99 feature that MSVC does not implement

https://msdn.microsoft.com/en-us/library/hh409293 "What's New for Visual C++ in Visual Studio 2013" does include "Supports these ISO C99 language features: Compound literals." (in addition to a few others) so maybe this particular feature can be used in the latest VS?

I'm happy to test small code samples in VS2013.

@vmg
Copy link
Contributor

vmg commented Feb 11, 2015

Oh shit, looks like Microsoft actually got their stuff together in Visual C++ 2013, with support for both compound literals and designated initializers.

I see no reason why this patch wouldn't fully work as-is on MSVC, with no additional changes (even though I cannot test it right now). Props to Microsoft I guess! This will certainly make this whole process much more straightforward.

@kevinhendricks
Copy link
Contributor

@twpol
Try compiling this very simplistic test case

KevinsiMac:Desktop kbhend$ cat test.c
#include <stdio.h>

static void printme(int * ap, int cnt)
{
    for (int i=0; i < cnt; i++) {
        fprintf(stdout, "value is %d\n", ap[i]);
    }
}


int main(int argc, char ** argv)
{
    printme((int []){ 1, 5, 8, 9 }, 4);
    printme((int []) { 100, 99, 9 }, 3);
    return 0;
}
KevinsiMac:Desktop kbhend$ clang -o test test.c
KevinsiMac:Desktop kbhend$ ./test
value is 1
value is 5
value is 8
value is 9
value is 100
value is 99
value is 9

It is not exactly the same as used here but will test the compound initializer when passed as an argument.

@kevinhendricks
Copy link
Contributor

Hi vmg,

There are a lot of very nice changes in your parser.c besides the obvious tag_in_set, node_tag_in_set, and tagset

  1. adding gumbo_malloc and gumbo_free to replace gumbo_parser_allocate pieces so you don't have to pass a parser all over the place when all you want to do is allocate or deallocate memory (and removing it from the options). That allows you to have vector implementations that no longer needs to pass in parser (making it useful in other places) and have create_element, clone_node, vector add and pop, etc and others all no longer need to pass around "parser" as well.

Very Nice! (this is one I will be pulling in). I could never understand why they decided to pass parser all over the place just to allocate and deallocate memory. if someone wanted to use their own memory allocation/deallocation it would be simpler to just wrapper the gumbo ones or wrapper malloc and free themselves to catch it and redirect it.

  1. adding some support for fragment parsing (How complete is it?)
  2. no need for the Qualname macros anyplace! This makes the code much cleaner as well.
  3. recognizing and removing redundancies in the "in_scope" family (I used the tagset approach here as well)
  4. use of more generally understandable gumbo_strdup instead of gumbo_copy_stringz. Why re-invent the wheel!

I would bet your removal of passing parser all over the place where not needed, really cleans up the code, and makes many of the node / element creation pieces much easier to read and available for calling routines to modify the tree for special cases from outside the parser itself

Very nice. I had already started the process of removing "parser" from everyplace it was not needed (ie. when only memory allocation/deallocation was needed). I gave up until things get more stable but I really like these ideas and will be pulling almost all of them into our tree in one form or another.

Kevin

@twpol
Copy link

twpol commented Feb 11, 2015

E:\Documents\temp>more <con >gumbo-parser-286.c
#include <stdio.h>

static void printme(int * ap, int cnt)
{
    for (int i=0; i < cnt; i++) {
        fprintf(stdout, "value is %d\n", ap[i]);
    }
}


int main(int argc, char ** argv)
{
    printme((int []){ 1, 5, 8, 9 }, 4);
    printme((int []) { 100, 99, 9 }, 3);
    return 0;
}
^C
E:\Documents\temp>cl gumbo-parser-286.c
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.31101 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

gumbo-parser-286.c
Microsoft (R) Incremental Linker Version 12.00.31101.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:gumbo-parser-286.exe
gumbo-parser-286.obj

E:\Documents\temp>gumbo-parser-286.exe
value is 1
value is 5
value is 8
value is 9
value is 100
value is 99
value is 9

@kevinhendricks: Looks like things are good for VS2013.

@kevinhendricks
Copy link
Contributor

@nostrademons,
I have incorporated the basics from vmg's parser.c (the tagset approach) without any of the other major improvements he has made. If it would help speed adoption of his changes given they will work for Windows, I would be happy to strip out all of my unrelated changes and create a new branch and add in just those changes and create a pull request (or a patch if you just want a patch).

Just let me know if that is something that would help.

@kevinhendricks
Copy link
Contributor

@twpol
That is good news! Thanks for testing it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
6 participants