Ensure queued HostDB requests get rescheduled back to the original thread #5120
Conversation
Force-pushed from 88fabc7 to 217dd79
I don't think we are seeing any corruption in the queues. Is there any specific scenario where this happens?
iocore/hostdb/HostDB.cc (outdated)
```diff
 while ((c = qq.dequeue())) {
-  c->handleEvent(EVENT_IMMEDIATE, nullptr);
+  // resume all queued HostDBCont in the thread associated with the netvc to avoid locking issues.
+  eventProcessor.schedule_imm(c);
```
Presumably, most of the time, the old c->handleEvent() does get handled on the "right" thread right now? Otherwise, this would cause more problems than we see today? If so, is there not a way to check if we're currently on an "acceptable" thread, and conditionally do c->handleEvent() if so ? I haven't looked at what we have available here, but it would be nice to avoid this rescheduling for the common case, and only take these rescheduling costs when absolutely necessary.
Also, is this related to the global session pools at all? It seems a lot of work is being done in the core to address issues that only happen when the global session pool is enabled. My real concern here is changing all these things in the event system when maybe we should revisit the global session pool (like, maybe understand why it's needed, or perhaps see if it can be improved / fixed?).
@zwoop I'll rejigger remove_trigger_pending_dns to only schedule if the affinity thread is different from the current thread. That should cause handleEvent to be called directly in the common case. This is the only place in this PR that replaced a direct handleEvent call with a schedule.
The issue here has nothing to do with the global session pool. It has to do with multiple HostDB requests queuing up for an outstanding DNS request. We sometimes have some odd performance hiccups with DNS in our deployment.
Pushed up a new commit with the rejigger
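A minimal sketch of the conditional dispatch described above, assuming the getThreadAffinity accessor that pairs with the setThreadAffinity call discussed in this PR; the helper name resume_pending_dns is hypothetical, not the literal patch:
```cpp
// Sketch only: resume a queued HostDB continuation, calling it directly when we
// already happen to be on its affinity thread, and rescheduling it otherwise.
static void
resume_pending_dns(Continuation *c) // hypothetical helper name
{
  EThread *affinity = c->getThreadAffinity(); // thread recorded when the request was queued
  if (affinity == nullptr || affinity == this_ethread()) {
    // Common case: we are already on an acceptable thread, so avoid the reschedule cost.
    c->handleEvent(EVENT_IMMEDIATE, nullptr);
  } else {
    // Hand the continuation back to the thread associated with its netvc.
    affinity->schedule_imm(c);
  }
}
```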
Do you have an ASAN report or a backtrace showing that the queues are being corrupted? Are you still seeing corruption after testing with #4379? My worry is that you are running this change on an internal fork that doesn't match master and may behave differently.
@bryancall when I brought in #4379 we rapidly got asserts for failing to get the net handler lock. Which makes sense: since the ET_NET thread holds the net handler lock for most of the time it is active, it is likely that a TRY_LOCK will fail. I'll gather up some stack traces. @dmorilha also commented on IRC about seeing the "BUG: It must have acquired the NetHandler's lock before doing anything else keep_alive_queue" assert while running the latest of the 7.1.x branch.
Force-pushed from 217dd79 to e42a31a
Here is a stack with the assert from #4379 going off.
Here is an example stack trace from the keep_alive_queue corruption. Unfortunately, I did not save aside the ASAN stacks, and the log files have rotated by now.
```diff
-  c->initial_thread = t->tt == DEDICATED ? nullptr : t;
   c->mutex      = cont->mutex;
   c->start_time = Thread::get_hrtime();
+  c->setThreadAffinity(t);
```
Does that mean a CacheVC can be bound to a dedicated thread?
Not sure. Is there any potential risk here?
A CacheVC instance should be bound to a specific thread, the same thread to which the transaction is bound. The point here is that when a transaction does a cache I/O operation, the actual work is done in an AIO thread. When that completes, control should pass back to the exact thread that originally asked for the I/O. This isn't new; it's only a different mechanism. Previously this was a special case in the cache (see CacheVC::initial_thread in "P_CacheInternal.h"), but now it can be done using the general thread affinity. This is in fact one of the use cases that motivated creating the generalized thread affinity.
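A rough illustration of that flow, assuming the setThreadAffinity/getThreadAffinity calls from this PR; the function names and call sites below are hypothetical, not the actual cache code:
```cpp
// Sketch only: when a transaction issues a cache I/O, remember its thread via the
// generalized thread affinity; when the AIO work completes, bounce control back to
// that exact thread instead of handling the event wherever the AIO callback runs.
void
start_cache_io(CacheVC *vc) // hypothetical wrapper around the real I/O path
{
  vc->setThreadAffinity(this_ethread()); // the transaction's thread, formerly CacheVC::initial_thread
  // ... hand the request off to the AIO subsystem ...
}

void
cache_io_completed(CacheVC *vc) // runs on an AIO thread
{
  EThread *home = vc->getThreadAffinity();
  if (home != nullptr) {
    home->schedule_imm(vc); // resume on the thread that originally asked for the I/O
  } else {
    vc->handleEvent(EVENT_IMMEDIATE, nullptr); // no affinity recorded (e.g. the DEDICATED thread case)
  }
}
```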
After #5089, events always schedule to the local thread. So how do I schedule an event to another thread (same etype) when I don't want it to run in the local thread? It seems the same as the …
It is not run in the same thread because cache never reads across … But another case: the default type is ET_CALL, forgot it.
This is the exact same backtrace we also see.
Because …
The alternative to asserting would be to do a blocking lock, which is non-performant. Rescheduling at that point (of shutting down netvcs) isn't feasible, or at least would be very complicated and error-prone.
I agree, it's not lock-free, but why should we assert if the lock fails?
Because the alternative is subtle, inconsistent, and hard-to-track-down memory corruption?
It doesn't necessarily have to be memory corruption if the lock fails, right?
I'm not clear on your point. Can you please share more information?
I think this PR is trying to work around #4379. How are we going to go back to the original thread if the … has its own threads? I think the alternatives to asserting may be inconsistent and less performant, but asserting crashes ATS. I do not think we should crash the server unless the server enters an unreliable state, and I don't think being unable to acquire a mutex puts ATS into an unreliable state. @bryancall @zwoop @shinrich @scw00 @SolidWallOfCode @oknet thoughts?
In this particular case, failure to acquire the mutex means a non-thread-safe data structure can be modified by multiple threads at the same time, creating a race condition. This in turn tends to create memory corruption in that object which doesn't actually cause problems until after the race condition has happened. The observed behavior is random crashes for inexplicable reasons. The assert means the unsafe data access is caught when it happens, rather than having to guess where it might have occurred from a subsequent crash due to the corruption. It doesn't have to be memory corruption without the lock, and in fact most of the time nothing bad will happen. However, "most of the time" is small comfort at production traffic levels. It also means the crashes will be sporadic, making it even more difficult to tell whether the problem is fixed or not.
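For reference, a minimal sketch of the try-lock-then-assert pattern being debated here; MUTEX_TRY_LOCK and ink_release_assert are existing ATS primitives, but the wrapper function and the exact queue call are illustrative assumptions:
```cpp
// Sketch only: guard a NetHandler queue manipulation with a try-lock. If the lock
// cannot be taken we are on the wrong thread; asserting surfaces the unsafe access
// immediately instead of leaving a corrupted queue to crash much later.
void
remove_from_keep_alive_checked(NetHandler *nh, UnixNetVConnection *vc, EThread *t)
{
  MUTEX_TRY_LOCK(lock, nh->mutex, t);
  ink_release_assert(lock.is_locked()); // the assert under discussion; logging an error instead is the softer alternative
  nh->remove_from_keep_alive_queue(vc); // safe only while holding the NetHandler's lock
}
```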
bryancall left a comment:
I would like to get more than 1 person to approve this PR since there is a lot of discussion on this. Someone from another company would be preferable.
Since we were seeing problems with HostDB and #4379 crashed all the time for us, I back-ported this to 7.1.x and we are running it on a low-traffic host.
Changes are looking good for us 24h later.
I think we can take @dmorilha's endorsement as an approval.
I'll be concerned with #4379 until we have enough confidence that HttpSM is always run on the original thread.
OK, after deploying everywhere, the asserts do trigger crashes from time to time. I still think the change is beneficial, and I am making the asserts just errors for now...
Did you convert the asserts to errors in your local repo?
@dmorilha's crashes prove the theory that HttpSM may not always be running on the original thread.
Can you create a PR with this change?
OK, it looks like the remaining concerns are with respect to whether the checks in #4379 error or assert. I am going to go ahead and commit this, since it has been approved and verified by folks from another organization. Plus, bluuurgh on IRC just brought up this issue with 7.1.6. @dmorilha can you put up the PR for 7.1.x?
… not thread-safe" This reverts commit 3599e5e. Seems this causes new, unpredictable crashes. For now, let's revert this, and we can keep working on this for 8.x / 9.x. Reference apache#4379 and apache#4921 and apache#5120.
This change, along with #5089, fixed our corruptions of the NetHandler active_queue and keep_alive_queue. The HostDB continuations that were queued up waiting for the DNS result were all getting reactivated on the same thread, which was unlikely to be the thread associated with the netvcs for the transaction. There were paths that created the CacheVC on a different thread and then caused the NetHandler queue manipulation to be called from the non-net-handler thread.
Also removed the explicit initial_thread member variable from CacheVC and used the thread affinity instead.
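A short sketch of the queuing side implied by this description; queue_pending_dns is a hypothetical helper and the pending-list handling is elided, but the affinity call mirrors what the wake-up loop relies on:
```cpp
// Sketch only: when a HostDB lookup has to wait on an already-outstanding DNS
// request, record the caller's thread before queuing it, so the wake-up path can
// schedule the continuation back to the thread that owns the transaction's netvc.
static void
queue_pending_dns(Continuation *caller) // hypothetical helper
{
  caller->setThreadAffinity(this_ethread()); // the thread the transaction's netvc lives on
  // ... append `caller` to the pending list for the outstanding DNS request ...
}
```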