tag_defs + AdjustTags() and ResetTags() during parsing is not thread-safe (tags.c) #816
Comments
@russianfool thanks for raising this issue... yes, the library is not thread safe... it should be... in this case, the table should be set back to So while the
Need to check more into the I like an idea that only for Any solution is very likely to cost more memory allocation... so let's go for the smartest solution available... Look forward to further feedback, even patches, PR, of a workable solution... thanks... |
@geoffmcl: Thanks for the quick response! From my testing, this (and the setlocale shenanigans) were the only bits that weren't MT-safe, so probably it's pretty close.
Add "free_memory" flag to the DictHash structure. When AdjustTags() runs, it calls a new LookupCreateTag() or something instead, which allocates modified nodes from the global nodes; this is the only caller that sets that flag. ResetTags() can then free these from the hash table when it sees that it controls the memory. ... or something similar; the focus is on memory overhead, since you're only allocating memory for the new node. Otherwise, could just have const global entries for the modified node configurations as well, just re-point the hash table to those (no extra overhead per-doc). ... I might also have to look at bit more into how the declare tags work here though.
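A minimal sketch of the "free_memory" idea above, in C. All names here (`ExDict`, `ExHash`, `ex_install`, `ex_adjust`, `ex_reset`) are hypothetical stand-ins, not the real tidy structures or functions; the point is only that a hash slot either borrows a const global entry or owns a modified heap copy, and the flag tells the reset step which one it is holding.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified tag definition (not tidy's real Dict). */
typedef struct { const char *name; unsigned model; } ExDict;

/* Hypothetical hash slot carrying the proposed ownership flag. */
typedef struct ExHash {
    const ExDict  *tag;
    int            free_memory; /* nonzero: this slot owns *tag */
    struct ExHash *next;
} ExHash;

/* The global entry can now be const: nobody writes through it. */
static const ExDict ex_button = { "button", 1u };

/* Install a slot that borrows the const global entry. */
static ExHash *ex_install(ExHash *head, const ExDict *def)
{
    ExHash *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->tag = def;
    n->free_memory = 0;
    n->next = head;
    return n;
}

/* Adjust: swap the borrowed entry for an owned, modified copy. */
static int ex_adjust(ExHash *slot, unsigned new_model)
{
    ExDict *copy;
    assert(slot != NULL);
    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return 0;
    *copy = *slot->tag;
    copy->model = new_model;
    if (slot->free_memory)
        free((void *)slot->tag);
    slot->tag = copy;
    slot->free_memory = 1;
    return 1;
}

/* Reset: release any owned copy and point back at the global entry. */
static void ex_reset(ExHash *slot, const ExDict *def)
{
    assert(slot != NULL);
    if (slot->free_memory)
        free((void *)slot->tag);
    slot->tag = def;
    slot->free_memory = 0;
}
```

The memory overhead in this sketch is one copied entry per adjusted tag per document, rather than a copy of the whole table.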
That's true, but there's time and stuff to consider. Aka the most straightforward solutions are (generally) easiest to support. This table has ~154 entries, at 64 bytes a pop (so like 10K). So, never mind :) Copying the whole table is undesirable if your doc size is on the same order of magnitude.
Re-using a doc object is something that happens in some runmodes (as part of the tidy utility's main, as it processes multiple documents? looked into this at some point, don't remember the details). If that's the case, it is necessary to set the doc back to its initial state so that this use-case is supported. What're your thoughts on using something like https://www.gnu.org/software/gperf/ to generate a perfect hash table at compile-time, at least for the global tags? Or, well, there are two ways to run this:
The second approach has fewer dependencies for build, but you could (mechanically) end up with a commit that brings the two out of sync. |
Anyway, you can ignore a lot of the "theoretical" above and look at the PR. I would love your thoughts on gperf application still - there's a large amount of overhead sunk into the O(N) table lookups when the object stream is mainly small documents. The same goes for the couple of other O(N) lookups that don't have hash tables installed (like messages). |
@russianfool still reviewing PR #817, ie your fork, Simply, add a new static const table for not Should have thought of this myself, when setting up this mess... ;=)) Only cost is this new static html4 table, 4 x 64 bytes... and the hash table memory for these 4, for every non-html5 document, whether used or not... seems a reasonable cost... Running tidy, built with MEMORY diags ON, on a given sample input, in_815-1.html, get only a tiny, understandable, increase in memory-
I then have a
Shows just 3 additional Did you deliberately skip adding back the

-static Dict tag_defs[] =
+static const Dict tag_defs[] =

Still question, or do not fully understand, yet!, the need for The ONLY if tidy decodes another Maybe we should discourage the re-use of the tidy Did I read somewhere you had done Have only just started to read about Will also continue testing... but further feedback appreciated... thanks... |
Woops, no that was on accident, fixed (dropped it while juggling files).
Is that the case? We zero the memory in the constructor ( When the parsing pass starts,
Yes on the regression, ... maybe makes sense to pipe a larger html corpus into here though, not for any out-of-bounds access, but behavioral changes that maybe the regression samples don't cover yet.
Sure, I'll spawn another issue when I'm ready, we can continue conversation there. Win compatibility (if committing the intermediate result files as well) is just Windows dev has to run gperf-on-windows to generate the intermediate file, when changing these tables.
Yeah. See comments/questions on the PR review itself as well :) I'll commit the WIP part that tries to address my ("well what if this changes the larger behavior") by making things consistent for the There are also some direct node comparisons to double-check, e.g.
Previously it didn't matter if we swapped from html4 to html5, those tags would have the same address. Now that's not the case, so these comparisons can work differently. |
@russianfool thanks for the quick fix on the On reviewing d75c822, Oct 14 2015, I now see/remember! the reason for the Every new file, or snippet, passed into Your PR seems to 100% do this... no probs...
As you can see in the comments, these changes simply grew from Your idea to create a separate table for this seems the way to go...
I hope I have addressed everything raised there... maybe I missed something... I prefer the discussion be in the issue addressed by the PR... not in the possible solution... but please repeat anything I missed, or that you would like an answer on... thanks...
Is not everything in the At least in one test I did, there is an Yes, open another issue on Meantime, continuing testing... especially your change to If this is WIP, then it should be done in a new branch... otherwise git bundles them into the PR... Further feedback appreciated... thanks... |
Well, sure, but the discussion is relevant to the solution. I.e. it's easier for me to point at the code that I'm modifying instead of trying to get that across here. For the Specifically ... for the LookupTagDef, those are all the users that could get entries returned that don't match their doctype (html4 vs. html5). I think that runs as the cleanup step after the parsing step, and assumes that it'd be in the same mode as the document. Right?
Yes, that's the intent (to show off the WIP and get feedback, and bundle them into the PR; with this new info you definitely shouldn't merge it just yet. This isn't as straightforward as flipping the table to a thread local... or the initial proposed split). But back to the tag comparison: sure, yes, that's exactly what they do (... oh wait ... not the global table?). Now imagine our case; we've created extra entries for the same tag (e.g. has an html5 entry and an html4 entry). Well, if you can flip between modes in the middle of such a tag (any tag that has multiple entries), the begin/end comparison no longer matches properly. You'll compare two pointers, but they won't match; one points at html4, one at html5. That last bit probably requires an extra patch on top of the WIP to fix; not to mention that straightening out whether folks get the html4 or html5 version causes parsing issues ... potentially because of the flipping case I mention above). |
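The pointer-comparison concern can be sketched as follows, assuming a simplified model with one const table per doctype. Everything here (`ExTagId`, `ExDict`, the two tables, `ex_lookup`) is hypothetical and only illustrates the failure mode; it is not the code from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag ids and definition type (not tidy's real ones). */
typedef enum { EX_TAG_A, EX_TAG_BUTTON } ExTagId;
typedef struct { ExTagId id; const char *name; unsigned model; } ExDict;

/* After a split, the same tag has one entry per table. */
static const ExDict ex_html5_defs[] = {
    { EX_TAG_A,      "a",      1u },
    { EX_TAG_BUTTON, "button", 1u },
};
static const ExDict ex_html4_defs[] = {
    { EX_TAG_A,      "a",      2u },
    { EX_TAG_BUTTON, "button", 2u },
};
#define EX_NTAGS (sizeof(ex_html5_defs) / sizeof(ex_html5_defs[0]))

/* Look a tag up in the table matching the current doctype mode. */
static const ExDict *ex_lookup(ExTagId id, int is_html5)
{
    const ExDict *table = is_html5 ? ex_html5_defs : ex_html4_defs;
    size_t i;
    for (i = 0; i < EX_NTAGS; ++i)
        if (table[i].id == id)
            return &table[i];
    return NULL;
}

/* Address comparison: breaks if the mode flipped between the lookups. */
static int ex_same_tag_by_pointer(const ExDict *a, const ExDict *b)
{
    return a == b;
}

/* Id comparison: independent of which table the entry came from. */
static int ex_same_tag_by_id(const ExDict *a, const ExDict *b)
{
    assert(a != NULL && b != NULL);
    return a->id == b->id;
}
```

If a start tag is resolved in html4 mode and its end tag in html5 mode, the two addresses differ even though both describe the same tag, so any begin/end matching done by address would need to switch to an id comparison (or the mode would have to be fixed before parsing starts).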
@russianfool thanks for the feedback... Concerning your change to If desired, by all means, demonstrate it in actual working code, but this should be in its own branch... devs can still check it out, and provide feedback... Yes, with your current table change/split, there is the possibility that a caller to this service would, at present, only get the It seems now used some 7 times - assuming the first part of your PR is applied -
Some we can discount since they are obtaining a tag which we know is not in the current 4, of TidyTag_A, TidyTag_CAPTION, TidyTag_OBJECT, and TidyTag_BUTTON... But yes, other users of the service can be one of those tags... like in But again it depends on what is done with this pointer, and then what information is used from this structure... In 2 cases, CAPTION and BUTTON, only the Maybe the Yes, one way to overcome this, is as you have suggested, extend the We certainly should not go back to declaring a Have yet to fully explore the mentioned To me, right now, you have sort of messed up, what was looking like a very good PR, with a not yet discussed, or agreed, WIP... Still exploring... look forward to further feedback... thanks... |
@russianfool oops, forgot to mention a second small patch to your initial PR... minor...

diff --git a/src/tags.c b/src/tags.c
index ccfc991..ab4ce2d 100644
--- a/src/tags.c
+++ b/src/tags.c
@@ -807,10 +807,10 @@ void TY_(FreeDeclaredTags)( TidyDocImpl* doc, UserTagType tagType )
 \*/
 void TY_(AdjustTags)( TidyDocImpl *doc )
 {
-    TY_(ResetTags)( doc ); /* Reset tags ensures blank slate for tags we care about */
-
     TidyTagImpl* tags = &doc->tags;
     size_t i;
+    TY_(ResetTags)( doc ); /* Reset tags ensures blank slate for tags we care about */
+
     for (i = 0; i < sizeof(html4_tag_defs)/sizeof(html4_tag_defs[0]); ++i)
     {
         tagsInstall(doc, tags, &html4_tag_defs[i]);

In general, but not always strictly enforced, is like in K&R C, that all local context variables must be before any other code... or something like that... Have not found a way using gcc, to show this warning/error... but my MSVC10, 2010, flags it as an error... In this case just moves your

I have now looked at some of the pointer comparisons... but in each case checked so far the pointer is to the hash of the tag... That means it is already html 4 or 5... so should all work... Can you point me to a specific case where this would not be so? Thanks... |
I'm not convinced the PR works without the extra changes. It's more useful to show the extra code and explore it with you. Don't panic though, this is git -- we can get it back to the exact state once we're convinced I'm chasing ghosts. The global goal has stayed the same: transition the tags_def table to something that's MT-safe without introducing regression issues.
I'm hesitant to just rely on that first bit; it leaves the code in a more error-prone state (folks need to be careful not to add html4 tags for those). For the second bit, yes,
You're sure we don't (where pointer here means "pointer into the global tags table"; we keep saying "hash table" pointers, but I think we're talking about the same thing: pointers at a specific tags_def entry):
It's a bit of work for me to work through the parser under debug (don't know those bits too well), but I'll try to get a more fleshed out patch next week, if I find there are actually problems in the course of this debugging. If there are, it should be possible to craft a sample for regression, at which point I can maybe explain better with examples (or prove myself wrong).
It's compatibility for an external API that mimics the previous behavior. In that, this is the closest implementation. I know the symbol is marked as "prvTidy", but that does nothing for exporting it from the .so (I think, unless that's changed in 5.7). Probably ok-ish to let them hit only the html5 table, but who knows what kind of crazy users are out there. We'd want to deprecate/remove the API asap - just doesn't make sense to break compatibility for it.
Sure, that's fine, I'll move declarations. Seems to be |
OFF TOPIC: @russianfool a Of course, correcting some small spelling mistake, grammar, etc, etc, is always no problem... appreciated even... But adding to paragraphs, adding or changing their meaning, ie substantial modifications, can cause some confusion... Maybe you do not know it, but we - members, watchers, and That email is what I first get, read, and commence to think about the problems, and maybe get into some tests... form some of the reply... at least mentally... Thankfully, in replying, I seldom reply directly by email... In fact I use an editor, to copy your post, and pre-prepare my reply, usually with at least spelling corrections, so when online, to Anyway, this is not a Just be aware that only the initial comment is notified... If you do have other, additional, thoughts, ideas, etc, - and we do want these! - then they can just as easily be posted as a new comment... like we would have to do if this was all email Maybe we should have something about this in our In searching around about this found isaacs/github#310, on the need for more notifications, which dozens reacted to, and there are probably many others, on a similar vein... But in general, I am ok with the current notifications... Just please try to keep the After more testing, will get around to replying, ON TOPIC, separately... |
@russianfool well in this case, it was your later edits, which gives me a This Yes, no one has yet come up with a way to exclude this, well all these many, internal functions, all named "prvTidyXXXX", all formed by the same They are not So the only Yes, in reading the stream, a Yes, iff it is a html4 doctype, and iff the But I am still not convinced there can be such a problem... given the order of things... parsing, then cleaning, then output... where, and how used, etc, etc... If there is still some doubt, then we could make that 2nd message an error, and force the Yes, had been previously alerted to So at this stage would prefer to go back to the original PR... thanks... Will continue to explore, as time permits, and look forward to further feedback... thanks... |
@russianfool well, where are we with this? It seems the best solution is to add the table, 9K, to the initial doc allocation, 13K... seems a *big hit, to a 23K Will try to find time to experiment with this... may use some pieces from PR #817... like all table lookups have to include the Meantime, look forward to further feedback... thanks... |
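A rough sketch of the "table in the doc allocation" option, under the assumption that each document simply carries its own mutable copy of a const global table. The types and functions (`ExDict`, `ExDoc`, `ex_doc_create`, `ex_adjust`, `ex_reset`) are hypothetical and much smaller than the real ones; with the figures quoted in this thread (about 154 entries of 64 bytes), the real copy would cost 154 × 64 = 9856 bytes per document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified tag definition (not tidy's real Dict). */
typedef struct { const char *name; unsigned model; } ExDict;

/* Const global master table: never modified after start-up. */
static const ExDict ex_tag_defs[] = {
    { "a",       1u },
    { "button",  1u },
    { "caption", 1u },
};
#define EX_NTAGS (sizeof(ex_tag_defs) / sizeof(ex_tag_defs[0]))

/* Each document embeds a private, mutable copy of the table. */
typedef struct { ExDict tags[EX_NTAGS]; } ExDoc;

static ExDoc *ex_doc_create(void)
{
    ExDoc *doc = malloc(sizeof *doc);
    if (doc != NULL)
        memcpy(doc->tags, ex_tag_defs, sizeof ex_tag_defs);
    return doc;
}

/* Lookups take the doc, so they can only ever see that doc's copy. */
static ExDict *ex_lookup(ExDoc *doc, const char *name)
{
    size_t i;
    assert(doc != NULL && name != NULL);
    for (i = 0; i < EX_NTAGS; ++i)
        if (strcmp(doc->tags[i].name, name) == 0)
            return &doc->tags[i];
    return NULL;
}

/* Adjust touches only this document's copy. */
static void ex_adjust(ExDoc *doc)
{
    ExDict *d = ex_lookup(doc, "button");
    if (d != NULL)
        d->model = 2u;
}

/* Reset restores the private copy from the const master table. */
static void ex_reset(ExDoc *doc)
{
    assert(doc != NULL);
    memcpy(doc->tags, ex_tag_defs, sizeof ex_tag_defs);
}
```

The trade-off is the one discussed above: no shared mutable state at all, at the price of one table copy per live document.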
@geoffmcl : Right, I just haven't had a chance to come back and fully understand the parsing logic. The more I look at it, the more I'm convinced it's not quite right. E.g. I think you'd want to determine doctype and charset before you start your main parsing pass, otherwise you have to go back and correct already-parsed nodes as their meaning between html5 and html4 could have changed.
It's the easiest in terms of understanding side effects, in that there are none. Another such "works as it does today" option is thread-local storage, e.g. __declspec(thread) on MSVC or __thread on GCC/Clang (you have to switch on the compiler, and the overhead is 9k per thread in the program). |
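For illustration, the compiler switch for thread-local storage might look like the sketch below. The macro name `EX_THREAD_LOCAL` and the tiny table are assumptions made for this example; `__declspec(thread)`, `__thread` and C11 `_Thread_local` are the actual compiler keywords.

```c
#include <assert.h>
#include <stddef.h>

/* Pick whichever thread-local keyword the compiler provides. */
#if defined(_MSC_VER)
#  define EX_THREAD_LOCAL __declspec(thread)
#elif defined(__GNUC__)
#  define EX_THREAD_LOCAL __thread
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define EX_THREAD_LOCAL _Thread_local
#else
#  error "no thread-local storage keyword known for this compiler"
#endif

/* Hypothetical, simplified tag definition (not tidy's real Dict). */
typedef struct { const char *name; unsigned model; } ExDict;

/* Each thread gets its own copy of this table, initialised as written. */
static EX_THREAD_LOCAL ExDict ex_tag_defs[] = {
    { "a",      1u },
    { "button", 1u },
};
#define EX_NTAGS (sizeof(ex_tag_defs) / sizeof(ex_tag_defs[0]))

/* Writes land in the calling thread's copy only. */
static void ex_set_model(size_t index, unsigned model)
{
    assert(index < EX_NTAGS);
    ex_tag_defs[index].model = model;
}

static unsigned ex_get_model(size_t index)
{
    assert(index < EX_NTAGS);
    return ex_tag_defs[index].model;
}
```

This keeps the existing code shape (a single named table, no extra doc parameter on lookups), but two documents parsed on the same thread still share one copy, so the reset step remains necessary between documents.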
@russianfool thank you for the feedback... I am afraid I do not share your concern about the current parsing logic... If tidy first encounters multiple html elements in the stream, then encounters a
Now that second warning could be expanded to something like ... But warnings already implies you need to fix the problems in your document, to get correct parsing... table-to-doc: The initial doc allocation, 13K... seems a *big hit, to 23K per As stated, will try to find time to experiment with this... maybe as an optional, like... Meantime, look forward to further feedback... thanks... |
@geoffmcl: Sure, that's a reasonable explanation for the most part. You'd still expect not to see side-effects of the strange variety (e.g. html4-button not registering a now-html5-button close or something). For the extra memory: E.g. for the utility ...
That's not a bad compromise since "advanced" users are anyway fiddling with the compilation. Would it maybe make sense as a run-time config of tidy overall? E.g. pre-some-2.x version of glib they'd require a call to g_thread_init(); I'm not sure what it used to do, but in the Tidy implementation it'd toggle a global flag to instead allocate memory for doctables. |
@russianfool as usual, thank you for the feedback... Sure, if you let your imagination run wild, it is possible Maybe the second should be an In reality, this seems a very, VERY, slim possibility, if at all... So far my testing of some current tidy's I have, has not produced an instance... I think we can close that idea... and maybe let it be a new issue... if a reliable sample html to reproduce can be given... and a credible
No I thought this issue is about library thread safety? to explore code, costs, etc to fix that... if pos... At present, we have removed the You presented some code, to put back the This still seems better than what we have... even if not yet perfect... But still exploring, and listening to, options for this Not sure what you are trying to show with As previously indicated, for
I think an option like But hey, look forward to other feedback... thanks... |
It's the legacy version of fuzzing the program :) "Never happens with normal program execution" and "might happen on some input" are pretty different, too, although I agree with you that it's probably not too important to figure out which one given your reasoning above.
Well, not optimize - but we are trying to gauge the impact of some change on some deployment, right? Which deployment? Using tidy on embedded platform vs. using tidy as a utility vs. using tidy as MT library all have different considerations. The demonstration was for that: utility only has one TidyDoc instance, and on a Linux system total RSS for program is way larger than whatever alloc is going to take place. But maybe some folks have 100's of TidyDoc instances alive at any given time and can't eat the memory cost? Or the utility/lib doesn't eat nearly that much on embedded systems as stack is smaller or (shared libs?).
The only argument/suggestion I have here is that if it's default OFF, it'll take longer for distributions to adopt it. Meaning that yum/apt will get non-thread-safe versions by default. But ON might be too aggressive for all users. With an init-time config (or maybe even per-doc config?) it probably doesn't take too many checks in the code, although init-time is probably less error-prone than per-doc. |
@russianfool we seem to be spinning in circles somewhat... ;=)) I have not had a chance to code up my But something needs to be done about modifying the Am back to liking your split into two As indicated, I do not see a misplaced legacy DOCTYPE declaration, as something to be considered here... maybe a new, or other, issues... Let's fix Meantime look forward to further feedback... thanks... |
@russianfool although you have put some time and effort into this, it seems a conclusion has not been reached, and since there has been no further feedback in over a year, am reluctantly closing this, and the WIP PR #817... sorry... While making |
@geoffmcl : That's fine :) I wasn't able to get the more elegant solution implemented here since maybe my use-case has different performance characteristics than the utility usage of tidy. Just didn't have the time to fully understand all the performance stuff of this project and where shortcuts would be ok. Specifically: 23k vs. 13k per document allocation doesn't matter to me. Two static tables in the process works just fine for me as well. For anyone in a similar situation with memory to spare: |
@russianfool, just a heads up that starting with 5.9.10, we're defining the macro While this adds some overhead, this is 2021 and I'm not aware of any embedded systems with limited resources running Tidy with 16 kb of RAM, so this pragmatic approach is maintainable and fine. Thanks for all of your earlier work trying to bring back thread safety to LibTidy, and sorry that it's all kind of out the Window in favor of this quick fix. |
@balthisar : Great to hear! That's the fix I've been running for some time, and it works well since it's some fairly hefty server. Memory in the kb range is absolutely not a consideration there :) No worries about the "wasted work." |
The usage of the static tags.c table during parsing is not thread safe.
During parsing, tags are installed from a global, non-const table static Dict tag_defs[] into a hash in the doc structure using tagsInstall(). During parsing, when certain conditions are met, the parser will call AdjustTags(). AdjustTags() uses LookupTagDef() to grab the pointer to the value, regardless of whether or not it has been cached in the hash table (the table contains only pointers to global statics).
This adjustment affects other threads, and calling ResetTags() is not enough, as those threads continue processing while this global state is modified with T1's state.
Maybe related as a root cause to #790.
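The shape of the problem can be reproduced with a small, single-threaded sketch. The types and functions below (`ExDict`, `ExDoc`, `ex_install_all`, `ex_adjust`) are hypothetical simplifications, not the real tidy code; the assumption is only that every document's hash stores pointers into one mutable global table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the tag table and per-doc hash. */
typedef struct { const char *name; unsigned model; } ExDict;

enum { EX_INLINE = 1u, EX_BLOCK = 2u };

/* Global, non-const table: every document points into it. */
static ExDict ex_tag_defs[] = {
    { "a",      EX_INLINE },
    { "button", EX_INLINE },
};
#define EX_NTAGS (sizeof(ex_tag_defs) / sizeof(ex_tag_defs[0]))

/* The per-doc "hash" holds pointers only, never copies. */
typedef struct { ExDict *hash[EX_NTAGS]; } ExDoc;

static void ex_install_all(ExDoc *doc)
{
    size_t i;
    for (i = 0; i < EX_NTAGS; ++i)
        doc->hash[i] = &ex_tag_defs[i];
}

static ExDict *ex_lookup(ExDoc *doc, const char *name)
{
    size_t i;
    assert(doc != NULL && name != NULL);
    for (i = 0; i < EX_NTAGS; ++i)
        if (strcmp(doc->hash[i]->name, name) == 0)
            return doc->hash[i];
    return NULL;
}

/* Mimics the adjust step: writes through the pointer into the global entry. */
static void ex_adjust(ExDoc *doc)
{
    ExDict *d = ex_lookup(doc, "button");
    if (d != NULL)
        d->model = EX_BLOCK;
}
```

Calling `ex_adjust()` on one document changes what every other document sees for "button", because all of them hold the same address. With several parsing threads, that write also races with concurrent reads, which is the kind of access TSAN reports.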
How I'm using Tidy: As a library, with many parsing threads hitting it simultaneously (5.6.x, but bug is present in latest).
How this ticket manifests itself: Unexplained parsing differences; out-of-stack crashes around "button" tag parsing as we hit some parsing infinite loop; TSAN and similar tools warning on the structure.
Quick solution: Have made the tags structure thread_local in my builds; this solves my issues.
Other (more involved) solutions:
Let me know what you think about the above problem and which solution seems most in line with your goals for Tidy (e.g. whether you'd prefer (little smarter, with some doc ++memory_overhead) or (smarter), or something else).