
Conversation

@shinrich (Member) commented Mar 5, 2019

This change, along with #5089, fixed our corruption of the NetHandler active_queue and keep_alive_queue. The HostDB continuations that were queued up waiting for a DNS result were all getting reactivated on the same thread, which was unlikely to be the thread associated with the netvcs for the transaction. There were paths that created the CacheVC on a different thread and then caused the NetHandler queue manipulation to be called from a non-net-handler thread.

Also removed the explicit initial_thread member variable for CacheVC and used the thread affinity instead.

@shinrich shinrich added the Crash label Mar 5, 2019
@shinrich shinrich added this to the 9.0.0 milestone Mar 5, 2019
@shinrich shinrich self-assigned this Mar 5, 2019
@shinrich shinrich force-pushed the add-more-affinity branch from 88fabc7 to 217dd79 on March 5, 2019 at 18:49
@vmamidi (Contributor) commented Mar 5, 2019

I don't think we are seeing any corruption in the queues. Is there a specific scenario where this happens?

  while ((c = qq.dequeue())) {
-   c->handleEvent(EVENT_IMMEDIATE, nullptr);
+   // resume all queued HostDBCont in the thread associated with the netvc to avoid locking issues.
+   eventProcessor.schedule_imm(c);

A Contributor commented on this diff:
Presumably, most of the time, the old c->handleEvent() does get handled on the "right" thread right now? Otherwise, this would cause more problems than we see today? If so, is there not a way to check whether we're currently on an "acceptable" thread, and conditionally do c->handleEvent() if so? I haven't looked at what we have available here, but it would be nice to avoid this rescheduling for the common case, and only take the rescheduling cost when absolutely necessary.

Also, is this related to the global session pools at all? It seems a lot of work is being done in the core to address issues that only happen when the global session pool is enabled. My real concern here is changing all these things in the event system, when maybe we should revisit the global session pool (like, maybe understand why it's needed, or perhaps see if it can be improved / fixed?).

@shinrich (Member, Author) commented Mar 5, 2019

@zwoop I'll rejigger remove_trigger_pending_dns to only schedule if the affinity thread is different from the current thread. That should cause handleEvent to be called directly in the common case. This is the only place in this PR that replaced a direct handleEvent call with a schedule.

The issue here has nothing to do with the global session pool. It has to do with multiple hostdb requests queuing up for an outstanding DNS request. We sometimes have some odd performance hiccups with DNS in our deployment.
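
For illustration only, a minimal sketch of that conditional dispatch, assuming Continuation::getThreadAffinity() returns the thread set earlier with setThreadAffinity(); this is not the committed diff:

while ((c = qq.dequeue())) {
  EThread *affinity = c->getThreadAffinity(); // assumed accessor for the affinity thread
  if (affinity == nullptr || affinity == this_ethread()) {
    // Common case: we are already on the continuation's thread, call it directly.
    c->handleEvent(EVENT_IMMEDIATE, nullptr);
  } else {
    // Otherwise resume it on its own (netvc-associated) thread so the
    // NetHandler queues are only touched from the thread that owns them.
    eventProcessor.schedule_imm(c);
  }
}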

@shinrich (Member, Author) commented:

Pushed up a new commit with the rejigger

@bryancall (Contributor) commented:

Do you have an ASAN report or a backtrace showing that the queues are being corrupted? Are you still seeing corruption after testing with #4379?

My worry is that you are running this change on an internal fork that doesn't match master and may behave differently.

@shinrich (Member, Author) commented Mar 5, 2019

@bryancall when I brought in #4379 we rapidly got asserts for failing to get the net handler lock, which makes sense: since the ET_NET thread holds the net handler lock for most of the time it is active, it is likely that a TRY_LOCK will fail. I'll gather up some stack traces.

@dmorilha also commented on IRC about seeing the "BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue" assert while running the latest of the 7.1.x branch.

@shinrich shinrich force-pushed the add-more-affinity branch from 217dd79 to e42a31a on March 5, 2019 at 20:48
@shinrich (Member, Author) commented Mar 5, 2019

Here is a stack with the assert from #4379 going off

#0  0x00002acf80fc11d7 in raise () from /lib64/libc.so.6
#1  0x00002acf80fc28c8 in abort () from /lib64/libc.so.6
#2  0x00002acf7e25006b in ink_abort (message_format=message_format@entry=0x2acf7e26dae7 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00002acf7e24d625 in _ink_assert (expression=expression@entry=0x828298 "!\"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.\"", 
    file=file@entry=0x828090 "UnixNetVConnection.cc", line=line@entry=1561) at ink_assert.cc:37
#4  0x000000000077e949 in UnixNetVConnection::add_to_keep_alive_queue() () at UnixNetVConnection.cc:1561
#5  0x0000000000632917 in Http2ConnectionState::release_stream(Http2Stream*) () at Http2ConnectionState.cc:1269
#6  0x000000000061e4fe in Http2Stream::destroy() () at Http2Stream.cc:716
#7  0x0000000000619f22 in Http2Stream::terminate_if_possible() () at Http2Stream.cc:377
#8  0x000000000061a0d8 in Http2Stream::transaction_done() () at Http2Stream.cc:366
#9  0x00000000005ce8bc in HttpSM::kill_this() () at HttpSM.cc:7055
#10 0x00000000005cf7e5 in HttpSM::main_handler (this=0x2ad133c0cdb0, event=2301, data=0x2ad133c0e148) at HttpSM.cc:2762
#11 0x000000000060dfd2 in handleEvent (data=0x2ad133c0e148, event=2301, this=0x2ad133c0cdb0)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver7/iocore/eventsystem/I_Continuation.h:157
#12 HttpTunnel::main_handler (this=0x2ad133c0e148, event=<optimized out>, data=<optimized out>) at HttpTunnel.cc:1657
#13 0x00000000007335a2 in handleEvent (data=0x2ad120a46ae0, event=103, this=0x2ad133c0e148)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver7/iocore/eventsystem/I_Continuation.h:157
#14 calluser (event=103, this=0x2ad120a46890) at P_CacheInternal.h:639
#15 CacheVC::openWriteMain (this=0x2ad120a46890) at CacheWrite.cc:1353
#16 0x00000000007b87cc in handleEvent (data=0x2acf8f116ac0, event=1, this=0x2ad120a46890) at I_Continuation.h:157
#17 EThread::process_event(Event*, int) () at UnixEThread.cc:136
#18 0x00000000007b913e in EThread::process_queue (this=this@entry=0x2acf84a13680, NegativeQueue=NegativeQueue@entry=0x2acf88ffe670, ev_count=ev_count@entry=0x2acf88ffe66c, nq_count=nq_count@entry=0x2acf88ffe668)
    at UnixEThread.cc:175
#19 0x00000000007b95fc in EThread::execute_regular (this=0x2acf84a13680) at UnixEThread.cc:235
#20 0x00000000007b81c2 in spawn_thread_internal (a=0x2acf8206c200) at Thread.cc:85
#21 0x00002acf80354dc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00002acf8108376d in clone () from /lib64/libc.so.6                                                  

@shinrich (Member, Author) commented Mar 5, 2019

Here is an example of a stack trace from the keep_alive_queue corruption. Unfortunately, I did not save the ASAN stacks, and the log files have rotated by now.

(gdb) bt
#0  insert (after=0x2b4fc474bc00, e=0x2b4f31e72000, this=<optimized out>) at P_UnixNetVConnection.h:246
#1  insert (after=0x2b4fc474bc00, e=0x2b4f31e72000, this=<optimized out>) at ../../lib/ts/List.h:377
#2  enqueue (e=0x2b4f31e72000, this=<optimized out>) at ../../lib/ts/List.h:434
#3  NetHandler::add_to_keep_alive_queue (this=0x2b4dedc973a0, vc=0x2b4f31e72000) at UnixNet.cc:701
#4  0x0000000000632c57 in Http2ConnectionState::release_stream(Http2Stream*) () at Http2ConnectionState.cc:1269
#5  0x000000000061e83e in Http2Stream::destroy() () at Http2Stream.cc:716
#6  0x000000000061a262 in Http2Stream::terminate_if_possible() () at Http2Stream.cc:377
#7  0x000000000061e0a9 in Http2Stream::main_event_handler(int, void*) () at Http2Stream.cc:130
#8  0x00000000007b8b2c in handleEvent (data=0x2b4df1a001e0, event=103, this=0x2b4f51956800) at I_Continuation.h:157
#9  EThread::process_event(Event*, int) () at UnixEThread.cc:136
#10 0x00000000007b949e in EThread::process_queue (this=this@entry=0x2b4dedc936c0, NegativeQueue=NegativeQueue@entry=0x2b4deebd3670, ev_count=ev_count@entry=0x2b4deebd366c, nq_count=nq_count@entry=0x2b4deebd3668) at UnixEThread.cc:175
#11 0x00000000007b995c in EThread::execute_regular (this=0x2b4dedc936c0) at UnixEThread.cc:235
#12 0x00000000007b8522 in spawn_thread_internal (a=0x2b4dec86b080) at Thread.cc:85
#13 0x00002b4deaaafdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00002b4deb7de76d in clone () from /lib64/libc.so.6

- c->initial_thread = t->tt == DEDICATED ? nullptr : t;
  c->mutex = cont->mutex;
  c->start_time = Thread::get_hrtime();
+ c->setThreadAffinity(t);

A Member commented on this diff:

Does that mean a CacheVC can be bound to a dedicated thread?

A Member commented:

Not sure. Is there any potential risk here?

A Member commented:

A CacheVC instance should be bound to a specific thread, the same thread to which the transaction is bound. The point here is that when a transaction does a cache I/O operation, the actual work is done on an AIO thread. When that completes, control should pass back to the exact thread that originally asked for the I/O. This isn't new; it's only a different mechanism. Previously this was a special case in the cache (see CacheVC::initial_thread in "P_CacheInternal.h"), but now it can be done using the general thread affinity. This is in fact one of the use cases that motivated creating the generalized thread affinity.
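
As a rough sketch of that flow (illustrative only, reusing the calls shown in the diff above; not the actual cache code path):

// On the net thread, when the transaction starts a cache operation:
CacheVC *c = new_CacheVC(cont);
c->mutex   = cont->mutex;
c->setThreadAffinity(this_ethread()); // remember the thread that issued the I/O

// ... the disk work itself runs on an AIO thread ...

// When the AIO completes, the completion is scheduled rather than handled in
// place; with the affinity set above (and #5089), the CacheVC resumes on the
// thread that originally asked for the I/O.
eventProcessor.schedule_imm(c);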

@scw00 (Member) commented Mar 6, 2019

After #5089, events always schedule to the local thread. So how do I schedule an event to another thread (of the same etype) when I don't want it to run on the local thread? It seems the same as thread->schedule.

@scw00 (Member) commented Mar 6, 2019

Does cache initialization run in one thread now?

It is not run in the same thread, because the cache never reads across Vols.

But here is another case:

/* NOTE:: This state can be called by an AIO thread, so DON'T DON'T
   DON'T schedule any events on this thread using VC_SCHED_XXX or
   mutex->thread_holding->schedule_xxx_local(). ALWAYS use
   eventProcessor.schedule_xxx().
   */
int
Vol::aggWriteDone(int event, Event *e)
{
  cancel_trigger();

  // ensure we have the cacheDirSync lock if we intend to call it later
  // retaking the current mutex recursively is a NOOP
  CACHE_TRY_LOCK(lock, dir_sync_waiting ? cacheDirSync->mutex : mutex, mutex->thread_holding);
  if (!lock.is_locked()) {
    eventProcessor.schedule_in(this, HRTIME_MSECONDS(cache_config_mutex_retry_delay));
    return EVENT_CONT;
  }

The default type is ET_CALL; forget it.

@dmorilha commented Mar 6, 2019

(quoting the backtrace with the #4379 assert from the comment above)

This is the exact same backtrace we also see.

@vmamidi (Contributor) commented Mar 6, 2019

@scw00 @oknet why do we have an assert if we can't acquire the lock in #4379?

@scw00 (Member) commented Mar 6, 2019

Because keep_alive_queue belongs to NetHandler and it is not lock free.

@shinrich (Member, Author) commented Mar 6, 2019

The alternative to asserting would be to do a blocking lock, which is non-performant. Rescheduling at that point (while shutting down netvcs) isn't feasible, or at least would be very complicated and error-prone.

@vmamidi (Contributor) commented Mar 6, 2019

Because keep_alive_queue belongs to NetHandler and it is not lock free.

I agree, it's not lock free, but why should we assert if the lock fails?

@SolidWallOfCode (Member) commented:
Because the alternative is subtle, inconsistent, and hard to track down memory corruption?

@vmamidi (Contributor) commented Mar 6, 2019

It doesn't necessarily have to be a memory corruption if the lock fails, right?

@scw00 (Member) commented Mar 7, 2019

I'm not clear on your point. Can you please share more information?

@vmamidi (Contributor) commented Mar 7, 2019

I think this PR is trying to work around #4379. How are we going to get back to the original thread if it has its own threads? I think the alternatives to asserting may be inconsistent and less performant, but asserting crashes ATS. I do not think we should crash the server unless it enters an unreliable state, and I don't think failing to acquire a mutex puts ATS into an unreliable state. @bryancall @zwoop @shinrich @scw00 @SolidWallOfCode @oknet thoughts?

@SolidWallOfCode (Member) commented:
In this particular case, failure to acquire the mutex means a non-thread-safe data structure can be modified by multiple threads at the same time, creating a race condition. This in turn tends to create memory corruption in that object which doesn't actually cause problems until after the race condition has happened. The observed behavior is random crashes for inexplicable reasons. The assert means the unsafe data access is caught when it happens, rather than having to guess where it might have occurred from a subsequent crash due to the corruption.

It doesn't have to be memory corruption without the lock, and in fact most of the time nothing bad will happen. However, "most of the time" is small comfort at production traffic levels. It also means the crashes will be sporadic, making it even more difficult to tell if the problem is fixed or not.
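
For context, the check added in #4379 is roughly of this shape (paraphrased from the assert text in the backtraces above; not the exact source):

// Try to take the NetHandler's lock without blocking; if we are already on the
// NetHandler's thread this is effectively a re-entrant acquire and succeeds.
MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
if (!lock.is_locked()) {
  // Some other thread owns the NetHandler: touching keep_alive_queue now would
  // be exactly the unsafe concurrent access described above, so fail loudly.
  ink_assert(!"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.");
  return;
}
nh->add_to_keep_alive_queue(this);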

@bryancall (Contributor) left a comment:

I would like to get more than 1 person to approve this PR since there is a lot of discussion on this. Someone from another company would be preferable.

@dmorilha commented:
Since we were seeing problems with HostDB and #4379 crashed all the time for us, I back-ported this to 7.1.x and we are running it on a low-traffic host.

@dmorilha commented:
Changes are looking good for us 24h later.

@SolidWallOfCode (Member) commented:
I think we can take @dmorilha's endorsement as an approval.

@vmamidi (Contributor) commented Mar 13, 2019

I'll be concerned about #4379 until we have enough confidence that HttpSM always runs on its original thread.

@dmorilha commented:
OK, after deploying everywhere, the asserts do fire from time to time. I still think the change is beneficial, and I am making the asserts just errors for now...

@vmamidi (Contributor) commented Mar 13, 2019

Did you convert the asserts to errors in your local repo?

@vmamidi (Contributor) commented Mar 13, 2019

@dmorilha's crashes prove the theory that HttpSM may not always be running on its original thread.
I think there are several scenarios where HttpSM will not run on its original thread, and this PR definitely fixes one of them. If we feel confident that HttpSM is running on its original thread, we shouldn't even try to acquire the mutex in change #4379. If we want to verify that we are on the same thread, we can do ink_assert(thread_holding == this_thread()) instead of trying to acquire the mutex. I am not confident that HttpSM is running on its original thread as of now, and I think we should change the assert to an error until we cover all the scenarios.
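
A sketch of that alternative check as suggested, with nh standing in for the NetHandler whose queue is about to be modified (illustrative, not a tested patch):

// Instead of a TRY_LOCK, assert that we are still on the thread that currently
// holds this NetHandler's mutex; if HttpSM always runs on its original thread,
// this should never fire.
ink_assert(nh->mutex->thread_holding == this_ethread());
nh->add_to_keep_alive_queue(this);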

@dmorilha commented:

netlify@55b3fed

@vmamidi (Contributor) commented Mar 14, 2019

netlify@55b3fed

Can you create a PR with this change?

@shinrich (Member, Author) commented Mar 14, 2019

OK, it looks like the remaining concerns are with respect to whether the checks in #4379 error or assert. I am going to go ahead and commit this, since it has been approved and verified by folks from another organization. Plus, bluuurgh on IRC just brought up this issue with 7.1.6.

@dmorilha can you put up the PR for 7.1.x?

@shinrich shinrich merged commit 33bbe34 into apache:master Mar 14, 2019
zwoop added a commit to zwoop/trafficserver that referenced this pull request Mar 14, 2019
… not thread-safe"

This reverts commit 3599e5e.

Seems this causes new, unpredictable crashes. For now, let's revert this, and we can
keep working on this for 8.x / 9.x. Reference apache#4379 and apache#4921 and apache#5120.
zwoop added a commit that referenced this pull request Mar 14, 2019
… not thread-safe"

This reverts commit 3599e5e.

Seems this causes new, unpredictable crashes. For now, let's revert this, and we can
keep working on this for 8.x / 9.x. Reference #4379 and #4921 and #5120.
@dmorilha commented:
@shinrich @vmamidi I will do it, thanks.

@dmorilha commented:
#5198
