Skip to content

tag_defs + AdjustTags() and ResetTags() during parsing is not thread-safe (tags.c) #816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
russianfool opened this issue Apr 24, 2019 · 24 comments

Comments

@russianfool
Copy link

russianfool commented Apr 24, 2019

The usage of the static tags.c table during parsing is not thread safe.

During parsing, tags are installed from a global, non-const table static Dict tag_defs[] into a hash in the doc structure using tagsInstall(). During parsing, when certain conditions are met, the parser will call AdjustTags().

AdjustTags() will:

  1. Use LookupTagDef() to grab the pointer to the value, regardless of whether or not it has been cached in the hash table (table contains only pointers to global static).
  2. Modify the entry through the pointer (either parsing options for the tag, or parser callback functions).

This adjustment affects other threads, and calling ResetTags() is not enough as those threads continue processing while this global state is modified with T1's state.

Maybe related as a root cause to #790.

How I'm using Tidy: As a library, with many parsing threads hitting it simultaneously (5.6.x, but bug is present in latest).
How this ticket manifests itself: Unexplained parsing differences; out-of-stack crashes around "button" tag parsing as we hit some parsing infinite loop; TSAN and similar tools warning on the structure.

Quick solution: Have made the tags structure thread_local in my builds; this solves my issues.
Other (more involved) solutions:

  • (Dumb) This is a critical section, so ... lock it. RIP MT parsing performance.
  • (A little smarter) Copy the table completely into the doc. Set table itself back to const. Easy to implement.
  • (Smarter) Allow hash table to control memory of the node (extra field, "owns_mem"), alloc in adjust (if changing), free in reset. Set table itself back to const.

Let me know what you think about the above problem and which solution seems most inline with your goals for Tidy (e.g. whether you'd prefer (little smarter, with some doc ++memory_overhead) or (smarter), or something else).

@geoffmcl
Copy link
Contributor

@russianfool thanks for raising this issue... yes, the library is not thread safe... it should be... in this case, the table should be set back to const... only some form of in-doc allocated memory can be modified...

So while the dumb critical section is always available to libTidy users, you have given 2 other options - one I understand, but falter on the second -

  • Copy table to doc - the current doc is about 13K... Need to check what will a table copy cost? Maybe an easy, quick solution?
  • Allocate adjusted to doc - not sure I understand control memory of the node ... That needs to be explained more... but like the idea... I think...

Need to check more into the libTidy hashing...

I like an idea that only for legacy document parsing, is a set of tags hashed, and modified, in an init, AdjustTags() step... all hashes are freed when the doc is freed... there should be no such thing as ResetTags()... but still to hash (silly pun!) this out in my head... maybe test some code...

Any solution is very likely to cost more memory allocation... so let's go for the smartest solution available...

Look forward to further feedback, even patches, PR, of a workable solution... thanks...

@russianfool
Copy link
Author

russianfool commented Apr 30, 2019

@geoffmcl: Thanks for the quick response! From my testing, this (and the setlocale shenanigans) were the only bits that weren't MT-safe, so probably it's pretty close.

Allocate adjusted to doc - not sure I understand control memory of the node ...

Add "free_memory" flag to the DictHash structure. When AdjustTags() runs, it calls a new LookupCreateTag() or something instead, which allocates modified nodes from the global nodes; this is the only caller that sets that flag. ResetTags() can then free these from the hash table when it sees that it controls the memory.

... or something similar; the focus is on memory overhead, since you're only allocating memory for the new node. Otherwise, could just have const global entries for the modified node configurations as well, just re-point the hash table to those (no extra overhead per-doc).

... I might also have to look at bit more into how the declare tags work here though.

Any solution is very likely to cost more memory allocation... so let's go for the smartest solution available...

That's true, but there's time and stuff to consider. Aka the most straightforward solutions are (generally) easiest to support. This table has ~154 entries, at 64 bytes a pop (so like 10K). So, never mind :) Copying the whole table is undesirable if your doc size is on the same order of magnitude.

there should be no such thing as ResetTags()

Re-using a doc object is something that happens in some runmodes (as part of the tidy utility's main, as it processes multiple documents? looked into this at some point, don't remember the details). If that's the case, it is necessary to set the doc back to it's initial state so that this use-case is supported.


What're your thoughts on using something like https://www.gnu.org/software/gperf/ to generate a perfect hash table at compile-time, at least for the global tags? Or, well, there are two ways to run this:

  • At compile-time, generating the .h table; this causes gperf to be needed as a build dependency.
  • Anytime the tags_table.h.in is modified, contributor modifying needs to rerun gperf update and commit the resulting struct + function as well (encoded as some make operation for ease-of-use, probably).

The second approach has less dependencies for build, but you could (mechanically) end up with a commit that brings the two out of sync.

@russianfool
Copy link
Author

Anyway, you can ignore a lot of the "theoretical" above and look at the PR. I would love your thoughts on gperf application still - there's a large amount of overhead sunk into the O(N) table lookups when there object stream is mainly small documents. Also for the other couple of O(N) stuff that doesn't have hash tables installed (like messages).

@geoffmcl
Copy link
Contributor

geoffmcl commented May 1, 2019

@russianfool still reviewing PR #817, ie your fork, issue-816 branch... but it is looking good...

Simply, add a new static const table for not html5, and use that to setup a set of 4 hashes, IFF document not html5... great!

Should have thought of this myself, when setting up this mess... ;=))

Only cost is this new static html4 table, 4 x 64 bytes... and the hash table memory for these 4, for every non-html5 document, whether used or not... seems a reasonable cost...

Running tidy, built with MEMORY diags ON, on a given sample input, in_815-1.html, get only a tiny, understandable, increase in memory-

@REM Options to ENABLE very noisy, memory DEBUG info
@set TMPOPTS=%TMPOPTS% -DENABLE_DEBUG_LOG:BOOL=ON
@set TMPOPTS=%TMPOPTS% -DENABLE_MEMORY_DEBUG:BOOL=ON
@set TMPOPTS=%TMPOPTS% -DENABLE_ALLOC_DEBUG:BOOL=ON
@set TMPOPTS=%TMPOPTS% -DENABLE_CRTDBG_MEMORY:BOOL=ON

I then have a perl script to read the output, and summarize the memory use seen, in the temptidy.txt output file...

Using: Tidy 5.7.24.mem2
Found 139 alloc, 0 realloc, and 139 frees...
Total mem 114365, Max mem 37748
Using: Tidy 5.7.24.I816
Found 142 alloc, 0 realloc, and 142 frees...
Total mem 114413, Max mem 37796

Shows just 3 additional allocs, 48 bytes... great... so far...

Did you deliberately skip adding back the const, or was this just an oversight? Seems we can now put back -

-static Dict tag_defs[] =
+static const Dict tag_defs[] =

Still question, or do not fully understand, yet!, the need for TY_(ResetTags), especially in TY_(AdjustTags) and TY_(DocParseStream)... need to explore more...

The tidyCreate() doc starts in default html5 mode... no reset required...

ONLY if tidy decodes another DOCTYPE, is TY_(AdjustTags) called... to do what it now does... setup 4 hashes using the alternate html4 table... which are all cleared on doc release...

Maybe we should discourage the re-use of the tidy doc! But console tidy.c can do it... So maybe the ResetTags is needed! not sure... need to explore more... What do you, or others, think about this?

Did I read somewhere you had done regression testing with no adverse results... and ASAN testing... this sounds good...

Have only just started to read about gperf... My immediate concern would be about Windows... but maybe this should be moved to a separate issue...

Will also continue testing... but further feedback appreciated... thanks...

@russianfool
Copy link
Author

russianfool commented May 1, 2019

Did you deliberately skip adding back the const

Woops, no that was on accident, fixed (dropped it while juggling files).

The tidyCreate() doc starts in default html5 mode... no reset required...

Is that the case? We zero the memory in the the constructor (tidyDocCreate calls TidyClearMemory, but doesn't touch the variable), and the variable is HTML5Mode; when that gets zeroed, it takes the value "no" as that's the first value in the enum.

When the parsing pass starts,ResetTags() sets this guy to yes for the doc, unless we hit the right DOCTYPE during parsing (which calls AdjustTags() to set HTML5Mode to no).

Did I read somewhere you had done regression testing with no adverse results... and ASAN testing... this sounds good...

Yes on the regression, ./testall.sh from tidy-html5-tests. No on ASAN (I can do that too); I did valgrind --leak-check=full --track-origins=yes, slower than ASAN but should catch the same out-of-bounds access (+ leaks when I was allocing memory).

... maybe makes sense to pipe a larger html corpus into here though, not for any out-of-bounds access, but behavioral changes that maybe the regression samples don't cover yet.

Have only just started to read about gperf... My immediate concern would be about Windows... but maybe this should be moved to a separate issue...

Sure, I'll spawn another issue when I'm ready, we can continue conversation there. Win compatibility (if committing the intermediate result files as well) is just Windows dev has to run gperf-on-windows to generate the intermediate file, when changing these tables.

Will also continue testing...

Yeah. See comments/questions on the PR review itself as well :) I'll commit the WIP part that tries to address my ("well what if this changes the larger behavior") by making things consistent for the LookupTagDef callers, so that they get the expected version of the tag definition, but that patch definitely doesn't work. Either I'm missing something, or I'm hitting some legacy issues that need to be resolved at the same time. The uh ... some of the changes to the HTML4 nodes didn't seem too consistent (e.g. change parser for button tag, but not CM_ property?); maybe that's it. With the extra patch regression crashes or blows up the stack.

There are also some direct node comparisons to double-check, e.g.

src/istack.c:170:        if (lexer->istack[i].tag == node->tag)
src/istack.c:188:        if (lexer->istack[lexer->istacksize - 1].tag == node->tag) {
src/istack.c:359:            if (lexer->istack[i].tag == element->tag) {
...
src/parser.c:1593:        if (node->tag == element->tag && node->type == EndTag)
src/parser.c:1646:             && node->tag == element->tag
src/parser.c:1802:                     && (node->tag != element->tag)
...
src/parser.c:1868:            if ( node->tag == element->tag )
src/parser.c:1894:        /* if (node->tag == doc->tags.tag_a && !node->implicit && TY_(IsPushed)(doc, node)) */
src/parser.c:2033:                if (node->tag == parent->tag)
... and so on

Previously it didn't matter if we swapped from html4 to html5, those tags would have the same address. Now that's not the case, so these comparisons can work differently.

@geoffmcl
Copy link
Contributor

geoffmcl commented May 2, 2019

@russianfool thanks for the quick fix on the const...

On reviewing d75c822, Oct 14 2015, I now see/remember! the reason fot the reset...

Every new file, or snippet, passed into DocParseStream, deserves a default library state... that is html5... until the new stream indicates otherwise... so a reverse, a complete 180, on this... sorry for the noise...

Your PR seems to 100% do this... no probs...

... some of the changes to the HTML4 nodes didn't seem too consistent...

As you can seen in the comments, these changes simply grew from tests... there was no analysis... sometimes this, or that, or both... surprised it sort of petered out at 4... expected more, I suppose...

Your idea to create a separate table for this seems the way to go...

Yeah. See comments/questions on the PR review itself as well :)

I hope I have addressed everything raised there... maybe I missed something... I prefer the discussion be in the issue addressed by the PR... not in the possible solution... but please repeat anything I missed, or that you would like an answer on... thanks...

There are also some direct node comparisons to double-check, e.g.

Is not everything in the lexer stack, or as used in the parser, the current pointer? Usually to a hash entry...

At least in one test I did, there is an istack use... But will try to look at this concern further...

Yes, open another issue on gperf... if you want to further this... thanks...

Meantime, continuing testing... especially your change to LookupTagDef... not sure exactly what is being addressed here...

If this is WIP, then it should be done in a new branch... otherwise git bundles them into the PR...

Further feedback appreciated... thanks...

@russianfool
Copy link
Author

russianfool commented May 2, 2019

I prefer the discussion be in the issue addressed by the PR... not in the possible solution...
especially your change to LookupTagDef... not sure exactly what is being addressed here...

Well, sure, but the discussion is relevant to the solution. I.e. it's easier for me to point at the code that I'm modifying instead of trying to get that across here. For the LookupTagDef - I was hoping you could help explain the bigger picture here since I don't know how all this source ties together. In this the patch is useful.

Specifically ... for the LookupTagDef, those are all the users that could get entries returned that don't match their doctype (html4 vs. html5). I think that runs as the cleanup step after the parsing step, and assumes that it'd be in the same mode as the document. Right?

Is not everything in the lexer stack, or as used in the parser, the current pointer? Usually to a hash entry... If this is WIP, then it should be done in a new branch... otherwise git bundles them into the PR...

Yes, that's the intent (to show off the WIP and get feedback, and bundle them into the PR; with this new info you definitely shouldn't merge it just yet. This isn't as straightforward as flipping the table to a thread local... or the initial proposed split).

But back to the tag comparison: sure, yes, that's exactly what they do (... oh wait ... not the global table?). Now imagine our case; we've created extra entries for the same tag (e.g. has an html5 entry and an html4 entry). Well, if you can flip between modes in the middle of such a tag (any tag that has multiple entries), the begin/end comparison no longer matches properly. You'll compare two pointers, but they won't match; one points at html4, one at html5.

That last bit probably requires an extra patch on top of the WIP to fix; not to mention that straightening out whether folks get the html4 or html5 version causes parsing issues ... potentially because of the flipping case I mention above).

@geoffmcl
Copy link
Contributor

geoffmcl commented May 2, 2019

@russianfool thanks for the feedback...

Concerning your change to LookupTagDef, I think that should be discussed here first, before you add it to PR...

If desired, by all means, demonstrate it in actual working code, but this should be in its own branch... devs can still check it out, and provide feedback...

Yes, with your current table change/split, there is the possibility that a caller to this service would, at present, only get the html5 entry. Does that matter? It depends on what is done with this returned pointer...

It seems now used some 7 times - assuming the first part of your PR is applied -

..\..\src\clean.c
     60 <    const Dict* dict = TY_(LookupTagDef)( tid );>
    990 <        node->tag = TY_(LookupTagDef)( TidyTag_DIV );>
  2,032 <                    const Dict* tag = TY_(LookupTagDef)( listType );>

..\..\src\lexer.c
  2,183 <    const Dict* dict = TY_(LookupTagDef)(id);>

..\..\src\parser.c
     75 <    const Dict* tag = TY_(LookupTagDef)(tid);>
  1,153 <                        node->tag = TY_(LookupTagDef)( TidyTag_TH );>
  1,741 <            node->tag = TY_(LookupTagDef)( TidyTag_BR );>

Some we can discount since they are obtaining a tag which we know are not in the current 4, of TidyTag_A, TidyTag_CAPTION, TidyTag_OBJECT, and TidyTag_BUTTON...

But yes, others users of the service can be one of those tag... like in CoerceNode, which can include OBJECT...

But again it depends on what is done with this pointer, and then what information is used from this structure...

In 2 cases, CAPTION and BUTTON, only the parser is changed, and I do not think that parser would ever be used... so no problem here...

Maybe the model changes could sometimes be important... but not sure...

Yes, one way to overcome this, is as you have suggested, extend the LookupTagDef to include say a Bool html5 parameter, or a TidyDocImpl* doc - I prefer the latter - and search this 4 first if not doc->HTML5Mode... but still to convince myself of the need...

We certainly should not go back to declaring a static Bool HTML5Mode = yes;, as in ef50d8e ...

Have yet to fully explore the mentioned istack pointer comparisons... is there a problem here...

To me, right now, you have sort of messed up, what was looking like a very good PR, with a not yet discussed, or agreed, WIP...

Still exploring... look forward to further feedback... thanks...

@geoffmcl
Copy link
Contributor

geoffmcl commented May 2, 2019

@russianfool oops, forgot to mention a second small patch to your initial PR... minor...

diff --git a/src/tags.c b/src/tags.c
index ccfc991..ab4ce2d 100644
--- a/src/tags.c
+++ b/src/tags.c
@@ -807,10 +807,10 @@ void TY_(FreeDeclaredTags)( TidyDocImpl* doc, UserTagType tagType )
 \*/
 void TY_(AdjustTags)( TidyDocImpl *doc )
 {
-    TY_(ResetTags)( doc ); /* Reset tags ensures blank slate for tags we care about */
-
     TidyTagImpl* tags = &doc->tags;
     size_t i;
+    TY_(ResetTags)( doc ); /* Reset tags ensures blank slate for tags we care about */
+
     for (i = 0; i < sizeof(html4_tag_defs)/sizeof(html4_tag_defs[0]); ++i)
     {
         tagsInstall(doc, tags, &html4_tag_defs[i]);

In general, but not always strictly enforced, is like in K&R C, that all local context variables must be before any other code... or something like that...

Have not found a way using gcc, to show this warning/error... but my MSVC10, 2010, flags it as an error...

In this case just moves your ResetTags down two lines... seems a small cost to pay for maintaining this backward compatibility... not a biggie...

I have now looked at some of the pointer comparisons... but in each case checked so far the pointer is to the hash of the tag... That means it is already html 4 or 5... so should all work...

Can you point me to a specific case where this would not be so? Thanks...

@russianfool
Copy link
Author

russianfool commented May 2, 2019

Concerning your change to LookupTagDef, I think that should be discussed here first, before you add it to PR...
If desired, by all means, demonstrate it in actual working code, but this should be in its own branch... devs can still check it out, and provide feedback...

I'm not convinced the PR works without the extra changes. It's more useful to show the extra code and explore it with you. Don't panic though, this is git -- we can get it back to the exact state once we're convinced I'm chasing ghosts. The global goal has stayed the same: transition the tags_def table to something that's MT-safe without introducing regression issues.

Some we can discount since they are obtaining a tag which we know are not in the current 4, of TidyTag_A, TidyTag_CAPTION, TidyTag_OBJECT, and TidyTag_BUTTON...
But yes, others users of the service can be one of those tag... like in CoerceNode, which can include OBJECT...

I'm hesitant to just rely on that first bit; it leaves the code in a more error-prone state (folks need to be careful not to add html4 tags for those). For the second bit, yes, CoerceNode and InferredTag are the main folks I'm worried about not getting the right "hash entry" (global entry in tags_def or the html4 version).

I have now looked at some of the pointer comparisons... but in each case checked so far the pointer is to the hash of the tag... That means it is already html 4 or 5... so should all work...

You're sure we don't (where pointer here means "pointer into the global tags table"; we keep saying "hash table" pointers, but I think we're talking about the same thing: pointers at a specific tags_def entry):

  1. Use one of these lookups to get a tag and cache the pointer.
  2. a. Hit the doctype switch, and use the same lookup to get the same tag from a different table and cache the pointer.
    b. Use a LookupTagDef that doesn't take the new document state into account.
  3. Compare the pointers.

It's a bit of work for me to work through the parser under debug (don't know those bits too well), but I'll try to get a more fleshed out patch next week, if I find there are actually problems in the course of this debugging. If there are, it should be possible to craft a sample for regression, at which point I can maybe explain better with examples (or prove myself wrong).

We certainly should not go back to declaring a static Bool HTML5Mode = yes;, as in ef50d8e ...

It's compatibility for an external API that mimics the previous behavior. In that, this is the closest implementation. I know the symbol is marked as "prvTidy", but that does nothing for exporting it from the .so (I think, unless that's changed in 5.7).

Probably ok-ish to let them hit only the html5 table, but who knows what kind of crazy users are out there. We'd want to deprecate/remove the API asap - just doesn't make sense to break compatibility for it.

but my MSVC10, 2010

Sure, that's fine, I'll move declarations. Seems to be --std=c89 -pedantic is what we want here, or maybe with -pedantic-errors instead. That's provided that these flags kinda match what's being built with now and don't throw other warnings.

@geoffmcl
Copy link
Contributor

geoffmcl commented May 3, 2019

OFF TOPIC:

@russianfool a small point about edited posts...

Of course, correcting some small spelling mistake, grammar, etc, etc, is always no problem... appreciated even...

But adding to paragraphs, adding or changing their meaning, ie substantial modifications, can cause some confusion...

Maybe you do not know it, but we - members, watchers, and @ ids - only get one email notice of your initial post... nothing more about edits... or at least I don't...

That email is what I first get, read, and commence to think about the problems, and maybe get into some tests... form some of the reply... at least mentally...

Thankfully, in replying, I seldom reply directly by email...

In fact I use an editor, to copy your post, and pre-prepare my reply, usually with at least spelling corrections, so when online, to post, see the edited marker, read the final, and it is surprising, at least, if I must now reply to other, new things...

Anyway, this is not a BIG thing!

Just be aware that only the initial comment is notified... If you do have other, additional, thoughts, ideas, etc, - and we do want these! - then they can just as easily be posted as a new comment... like we would have to do if this was all email sent system... no ability to edit... thanks...

Maybe we should have something about this in our docs... would certainly consider it if someone proposes something... maybe an README/ISSUES.md, or... as a new issue...

In searching around about this found isaacs/github#310, on the need for more notifications, which dozens reacted to, and there are probably many others, on a similar vein... But in general, I am ok with the current notifications...

Just please try to keep the edits-of-substance down... thanks...

After more testing, will get around to replying, ON TOPIC, separately...

@geoffmcl
Copy link
Contributor

geoffmcl commented May 3, 2019

@russianfool well in this case, it was your later edits, which gives me a clue to some of your thinking, I think!

This TY_(LookupTagDef) is not part of the Public API.

Yes, no one has yet come up with a way to exclude this, well all these many, internal functions, all named "prvTidyXXXX", all formed by the same TY_(<name>) macro, from the .so library... sad...

They are not exported in the windows shared DLL/LIB, and likewise should not be exported by the unix libTidy... read #743, and maybe others... they are not in any tidy installed header... they are private, just between library modules...

So the only user to use this/these functions are within the library. Does this narrow down the ghost chasing? ;=)) Or did I miss the idea?

Yes, in reading the stream, a tag is encountered, you will get a warning of missing <!DOCTYPE> declaration... then a doctype is encountered... another warning <!DOCTYPE> isn't allowed after elements... and tidy will correct the problem...

Yes, iff it is a html4 doctype, and iff the tag is one of the 4, then maybe there could also later be a compare of a stored pointer problem...

But I am still not convinced there can be such a problem... given the order of things... parsing, then cleaning, then output... where, and how used, etc, etc...

If there is still some doubt, then we could make that 2nd message an error, and force the user to fix the html first...

Yes, had been previously alerted to --std=c89 -pedantic, but this throws up many other warnings... seems a burden, and not what is desired... and maybe should drop the MSVC10 2010 compiler support...

So at this stage would prefer to go back to the original PR... thanks...

Will continue to explore, as time permits, and look forward to further feedback... thanks...

@geoffmcl
Copy link
Contributor

@russianfool well, where are we with this?

It seems the best solution is too add the table, 9K, to the initial doc allocation, 13K... seems a *big hit, to a 23K doc... but perhaps the only way to fully guarantee thread safety (MT)...

Will try to find time to experiment with this... may use some pieces from PR #817... like all table lookups have to include the doc... etc...

Meantime, look forward to further feedback... thanks...

@russianfool
Copy link
Author

russianfool commented Jun 24, 2019

@geoffmcl : Right, I just haven't had a chance to come back and fully understand the parsing logic. The more I look at it, the more I'm convinced it's not quite right. E.g. I think you'd want to determine doctype and charset before you start your main parsing pass, otherwise you have to go back and correct already-parsed nodes as their meaning between html5 and html4 could have changed.

It seems the best solution is too add the table, 9K, to the initial doc allocation, 13K...

It's the easiest in terms of understanding side effects, in that there are none. Another such "works as it does today" is to use __declspec( thread ) and __thread_local (you gotta plex to the compiler though, but the overhead is 9k per thread in the program).

@geoffmcl
Copy link
Contributor

@russianfool thank you for the feedback...

I am afraid I do not share your concern about the current parsing logic...

If tidy first encounters multiple html elements in the stream, then encounters a DOCTYPE, you will be given two warnings, like -

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 2 column 1 - Warning: <!DOCTYPE> isn't allowed after elements

Now that second warning could be expanded to something like ... Those pre-doctype element(s) have been parsed using the default html5... etc, etc... ie a fuller explanation... whatever... ...

But warnings already implies you need to fix the problems in your document, to get correct parsing...

table-to-doc: The initial doc allocation, 13K... seems a *big hit, to 23K per doc... but maybe not really significant given it is over 10 years later... where MT is important... given cpu's speed/power, memory, etc, etc...

As stated, will try to find time to experiment with this... maybe as an optional, like... #ifdef ENABLE_THREAD_SAFETY special build... or something...

Meantime, look forward to further feedback... thanks...

@russianfool
Copy link
Author

@geoffmcl: Sure, that's a reasonable explanation for the most part. You'd still expect not to see side-effects of the strange variety (e.g. html4-button not registering a now-html5-button close or something).

For the extra memory:
What're we trying to optimize exactly? Just per-doc resource usage, thinking that for a user with many such docs this is going to be a significant penalty?

E.g. for the utility ...

Command being timed: "tidy --help"
...
Percent of CPU this job got: 100%
    ...
Maximum resident set size (kbytes): 2308

@russianfool
Copy link
Author

As stated, will try to find time to experiment with this... maybe as an optional, like... #ifdef ENABLE_THREAD_SAFETY special build... or something...

That's not a bad compromise since "advanced" users are anyway fiddling with the compilation. Would it maybe make sense as a run-time config of tidy overall? E.g. pre-some-2.x version of glib they'd require a call to g_thread_init(); I'm not sure what it used to do, but in the Tidy implementation it'd toggle a global flag to instead allocate memory for doctables.

@geoffmcl
Copy link
Contributor

@russianfool as usual, thank you for the feedback...

Sure, if you let your imagination run wild, it is possible to see side-effects of the strange variety, if you present tidy with a document, that does not start with a DOCTYPE declaration, and, after 1, or more, html elements, later slip in a DOCTYPE, of a legacy type... AND you ignore the warnings...

Maybe the second should be an error, rather than a warning... to encourage the user to fix it first... it is bad to misplace a DOCTYPE declaration...

In reality, this seems a very, VERY, slim possibility, if at all...

So far my testing of some current tidy's I have, has not produced an instance... I think we can close that idea... and maybe let it be a new issue... if a reliable sample html to reproduce can be given... and a credible use case...

For the extra memory:
What're we trying to optimize exactly?

No optimize, NO!

I thought this issue is about library thread safety? to explore code, costs, etc to fix that... if pos...

At present, we have removed the const modifier, from the tag_defs static table, and presently modifier that table, IFF a legacy DOCTYPE is encountered... this is not thread safe! but works great...

You presented some code, to put back the const, and split the modified entries into their own second html4 table... added as hash table entries, on switch to html4 mode... IFF leg.doc found...

This still seems better than what we have... even if not yet perfect...

But still exploring, and listening to, options for this thread safety...

Not sure what you are trying to show with $ /usr/bin/time --verbose tidy --help output? With say -v, I saw 4%, for the CPU, and the 2300 RSS seems in line with other small command line processes, like ls -l, ... please explain...

As previously indicated, for tidy memory consumption, I use a special tidy-mem build - add `-DTIDY_RC_NUMBER=mem2 -DENABLE_DEBUG_LOG:BOOL=ON -DENABLE_MEMORY_DEBUG:BOOL=ON -DENABLE_ALLOC_DEBUG:BOOL=ON [-DENABLE_CRTDBG_MEMORY:BOOL=ON]' to cmake... sample output -

...\tidy-html5\build\temp-mem2>release\tidy -v
1: alloc   MEM 00000170723FFF30, size 13456
2: alloc   MEM 00000170723F5340, size 64
HTML Tidy for Windows version 5.7.27.mem2
Root node 00000170723FFF30
1: free    MEM 00000170723F5340
2: free    MEM 00000170723FFF30

I think an option like #ifdef ENABLE_THREAD_SAFETY has to be a COMPILE time build... and that's what I will try to explore... as/when I get the time... unless someone beats me to it...

But hey, look forward to other feedback... thanks...

@russianfool
Copy link
Author

Sure, if you let your imagination run wild, it is possible to see side-effects of the strange variety ...
In reality, this seems a very, VERY, slim possibility, if at all...

It's the legacy version of fuzzing the program :) "Never happens with normal program execution" and "might happen on some input" are pretty different, too, although I agree with you that it's probably not too important to figure out which one given your reasoning above.


No optimize, NO!
Not sure what you are trying to show with $ /usr/bin/time --verbose tidy --help output?

Well, not optimize - but we are trying to gauge the impact of some change on some deployment, right? Which deployment? Using tidy on embedded platform vs. using tidy as a utility vs. using tidy as MT library all have different considerations.

The demonstration was for that: utility only has one TidyDoc instance, and on a Linux system total RSS for program is way larger than whatever alloc is going to take place. But maybe some folks have 100's of TidyDoc instances alive at any given time and can't eat the memory cost? Or the utility/lib doesn't eat nearly that much on embedded systems as stack is smaller or (shared libs?).


I think an option like #ifdef ENABLE_THREAD_SAFETY has to be a COMPILE time build...

The only argument/suggestion I have here is that if it's default OFF, it'll take longer for distributions to adopt it. Meaning that yum/apt will get non-thread-safe versions by default. But ON might be too aggressive for all users.

With an init-time config (or maybe even per-doc config?) it probably doesn't take too many checks in the code, although init-time is probably less error-prone than per-doc.

@geoffmcl
Copy link
Contributor

@russianfool we seem to be spinning in circles somewhat... ;=))

I have not had a chance to code up my ENABLE_THREAD_SAFETY ideas... but am losing faith in the memory costly allocation, 13K to 23K, on a per doc basis... so putting that on the far back burner for now...

But something needs to be done about modifying the tags_def table... that is wrong, bad coding... I was stretched at the time I wrote it... should have done better... it has to go!

Am back to liking your split into two const tables... this is thread safe... that is the first part of your PR #817... as at least an intermediate solution to some of the problems discussed here... will continue testing this...

As indicated, I do not see a misplaced legacy DOCTYPE declaration, as something to be considered here... maybe a new, or other, issues...

Let's fix thread safety first... get back to the const nature of such tables... somehow...

Meantime look forward to further feedback... thanks...

@geoffmcl
Copy link
Contributor

@russianfool although you have put some time and effort into this, it seems a conclusion has not been reached, and since there has been no further feedback in over a year, am reluctantly closing this, and the WIP PR #817... sorry...

While making libTidy thread-safe is a very worthy aim this does not seem there yet... And I am sure any further steps taken in this regard will take into account these comments, and the PR patches... thanks...

@russianfool
Copy link
Author

@geoffmcl : That's fine :) I wasn't able to get the more elegant solution implemented here since maybe my use-case has different performance characteristics than the utility usage of tidy. Just didn't have the time to fully understand all the performance stuff of this project and where shortcuts would be ok.

Specifically: 23k vs. 13k per document allocation doesn't matter to me. Two static tables in the process works just fine for me as well.

For anyone in a similar situation with memory to spare:
Just compile the table as thread-local (in GCC this is just __thread, C11 thread_local). You pay penalty for any thread running in the process, not just threads doing html parsing, but it may not matter to you.

balthisar added a commit that referenced this issue Aug 15, 2021
Ensure thread safety by ensuring that tag_defs[] is thread local.
This has some overhead but is negligible on modern systems.
@balthisar
Copy link
Member

@russianfool, just a heads up that starting with 5.9.10, we're defining the macro TIDY_THREAD_LOCAL which by default is defined to compiler-appropriate __thread or MSVC equivalent, and applying it to tag_defs[].

While this adds some overhead, this is 2021 and I'm not aware of any embedded systems with limited resources running Tidy with 16 kb of RAM, so this pragmatic approach is maintainable and fine. Thanks for all of your earlier work trying to bring back thread safety to LibTidy, and sorry that it's all kind of out the Window in favor of this quick fix.

@russianfool
Copy link
Author

@balthisar : Great to hear! That's the fix I've been running for some time, and it works well since it's some fairly hefty server. Memory in the kb range is absolutely not a consideration there :) No worries about the "wasted work."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants