
Conversation

@shinrich (Member) commented Mar 5, 2019

This change, along with #5089, fixed our corruption of the NetHandler active_queue and keep_alive_queue. The HostDB continuations that were queued up waiting for a DNS result were all getting reactivated on the same thread, which was unlikely to be the thread associated with the netvcs for the transaction. There were paths that created the CacheVC on a different thread and then caused the NetHandler queue manipulation to be called from a non-net-handler thread.

Also removed the explicit initial_thread member variable for CacheVC and used the thread affinity instead.

@shinrich shinrich added the Crash label Mar 5, 2019
@shinrich shinrich added this to the 9.0.0 milestone Mar 5, 2019
@shinrich shinrich self-assigned this Mar 5, 2019
@shinrich shinrich force-pushed the add-more-affinity branch from 88fabc7 to 217dd79 on March 5, 2019 at 18:49
@vmamidi (Contributor) commented Mar 5, 2019

I don't think we are seeing any corruption in the queues. Is there a specific scenario where this happens?

  while ((c = qq.dequeue())) {
-   c->handleEvent(EVENT_IMMEDIATE, nullptr);
+   // resume all queued HostDBCont in the thread associated with the netvc to avoid locking issues.
+   eventProcessor.schedule_imm(c);

A Contributor commented on this diff:
Presumably, most of the time, the old c->handleEvent() does get handled on the "right" thread right now? Otherwise, this would cause more problems than we see today? If so, is there not a way to check whether we're currently on an "acceptable" thread, and conditionally do c->handleEvent() if so? I haven't looked at what we have available here, but it would be nice to avoid this rescheduling for the common case, and only take the rescheduling cost when absolutely necessary.

Also, is this related to the global session pools at all? It seems a lot of work is being done in the core to address issues that only happen when the global session pool is enabled. My real concern here is changing all these things in the event system, when maybe we should revisit the global session pool (like, maybe understand why it's needed, or perhaps see if it can be improved / fixed?).

@shinrich (Member, Author) commented Mar 5, 2019

@zwoop I'll rejigger remove_trigger_pending_dns to only schedule if the affinity thread is different from the current thread. That should cause handleEvent to be called directly in the common case. This is the only place in this PR that replaced a direct handleEvent call with a schedule.

The issue here has nothing to do with the global session pool. It has to do with multiple hostdb requests queuing up for an outstanding DNS request. We sometimes have some odd performance hiccups with DNS in our deployment.
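
For illustration only, a minimal sketch of that conditional dispatch, assuming Continuation::getThreadAffinity() returns the thread set earlier with setThreadAffinity(); this is not the committed diff:

while ((c = qq.dequeue())) {
  EThread *affinity = c->getThreadAffinity(); // assumed accessor for the affinity thread
  if (affinity == nullptr || affinity == this_ethread()) {
    // Common case: we are already on the continuation's thread, call it directly.
    c->handleEvent(EVENT_IMMEDIATE, nullptr);
  } else {
    // Otherwise resume it on its own (netvc-associated) thread so the
    // NetHandler queues are only touched from the thread that owns them.
    eventProcessor.schedule_imm(c);
  }
}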

@shinrich (Member, Author) commented:

Pushed up a new commit with the rejigger

@bryancall (Contributor) commented:

Do you have an ASAN report or a backtrace showing that the queues are being corrupted? Are you still seeing corruption after testing with #4379?

My worry is that you are running this change on an internal fork that doesn't match master and may behave differently.

@shinrich (Member, Author) commented Mar 5, 2019

@bryancall when I brought in #4379 we rapidly got asserts for failing to get the net handler lock, which makes sense: since the ET_NET thread holds the net handler lock for most of the time it is active, it is likely that a TRY_LOCK will fail. I'll gather up some stack traces.

@dmorilha also commented on IRC about seeing the "BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue" assert while running the latest of the 7.1.x branch.

@shinrich shinrich force-pushed the add-more-affinity branch from 217dd79 to e42a31a on March 5, 2019 at 20:48
@shinrich (Member, Author) commented Mar 5, 2019

Here is a stack with the assert from #4379 going off

#0  0x00002acf80fc11d7 in raise () from /lib64/libc.so.6
#1  0x00002acf80fc28c8 in abort () from /lib64/libc.so.6
#2  0x00002acf7e25006b in ink_abort (message_format=message_format@entry=0x2acf7e26dae7 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00002acf7e24d625 in _ink_assert (expression=expression@entry=0x828298 "!\"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.\"", 
    file=file@entry=0x828090 "UnixNetVConnection.cc", line=line@entry=1561) at ink_assert.cc:37
#4  0x000000000077e949 in UnixNetVConnection::add_to_keep_alive_queue() () at UnixNetVConnection.cc:1561
#5  0x0000000000632917 in Http2ConnectionState::release_stream(Http2Stream*) () at Http2ConnectionState.cc:1269
#6  0x000000000061e4fe in Http2Stream::destroy() () at Http2Stream.cc:716
#7  0x0000000000619f22 in Http2Stream::terminate_if_possible() () at Http2Stream.cc:377
#8  0x000000000061a0d8 in Http2Stream::transaction_done() () at Http2Stream.cc:366
#9  0x00000000005ce8bc in HttpSM::kill_this() () at HttpSM.cc:7055
#10 0x00000000005cf7e5 in HttpSM::main_handler (this=0x2ad133c0cdb0, event=2301, data=0x2ad133c0e148) at HttpSM.cc:2762
#11 0x000000000060dfd2 in handleEvent (data=0x2ad133c0e148, event=2301, this=0x2ad133c0cdb0)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver7/iocore/eventsystem/I_Continuation.h:157
#12 HttpTunnel::main_handler (this=0x2ad133c0e148, event=<optimized out>, data=<optimized out>) at HttpTunnel.cc:1657
#13 0x00000000007335a2 in handleEvent (data=0x2ad120a46ae0, event=103, this=0x2ad133c0e148)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver7/iocore/eventsystem/I_Continuation.h:157
#14 calluser (event=103, this=0x2ad120a46890) at P_CacheInternal.h:639
#15 CacheVC::openWriteMain (this=0x2ad120a46890) at CacheWrite.cc:1353
#16 0x00000000007b87cc in handleEvent (data=0x2acf8f116ac0, event=1, this=0x2ad120a46890) at I_Continuation.h:157
#17 EThread::process_event(Event*, int) () at UnixEThread.cc:136
#18 0x00000000007b913e in EThread::process_queue (this=this@entry=0x2acf84a13680, NegativeQueue=NegativeQueue@entry=0x2acf88ffe670, ev_count=ev_count@entry=0x2acf88ffe66c, nq_count=nq_count@entry=0x2acf88ffe668)
    at UnixEThread.cc:175
#19 0x00000000007b95fc in EThread::execute_regular (this=0x2acf84a13680) at UnixEThread.cc:235
#20 0x00000000007b81c2 in spawn_thread_internal (a=0x2acf8206c200) at Thread.cc:85
#21 0x00002acf80354dc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00002acf8108376d in clone () from /lib64/libc.so.6                                                  

@shinrich (Member, Author) commented Mar 5, 2019

Here is an example of a stack trace from the keep_alive_queue corruption. Unfortunately, I did not save the ASAN stacks, and the log files have rotated by now.

(gdb) bt
#0  insert (after=0x2b4fc474bc00, e=0x2b4f31e72000, this=<optimized out>) at P_UnixNetVConnection.h:246
#1  insert (after=0x2b4fc474bc00, e=0x2b4f31e72000, this=<optimized out>) at ../../lib/ts/List.h:377
#2  enqueue (e=0x2b4f31e72000, this=<optimized out>) at ../../lib/ts/List.h:434
#3  NetHandler::add_to_keep_alive_queue (this=0x2b4dedc973a0, vc=0x2b4f31e72000) at UnixNet.cc:701
#4  0x0000000000632c57 in Http2ConnectionState::release_stream(Http2Stream*) () at Http2ConnectionState.cc:1269
#5  0x000000000061e83e in Http2Stream::destroy() () at Http2Stream.cc:716
#6  0x000000000061a262 in Http2Stream::terminate_if_possible() () at Http2Stream.cc:377
#7  0x000000000061e0a9 in Http2Stream::main_event_handler(int, void*) () at Http2Stream.cc:130
#8  0x00000000007b8b2c in handleEvent (data=0x2b4df1a001e0, event=103, this=0x2b4f51956800) at I_Continuation.h:157
#9  EThread::process_event(Event*, int) () at UnixEThread.cc:136
#10 0x00000000007b949e in EThread::process_queue (this=this@entry=0x2b4dedc936c0, NegativeQueue=NegativeQueue@entry=0x2b4deebd3670, ev_count=ev_count@entry=0x2b4deebd366c, nq_count=nq_count@entry=0x2b4deebd3668) at UnixEThread.cc:175
#11 0x00000000007b995c in EThread::execute_regular (this=0x2b4dedc936c0) at UnixEThread.cc:235
#12 0x00000000007b8522 in spawn_thread_internal (a=0x2b4dec86b080) at Thread.cc:85
#13 0x00002b4deaaafdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00002b4deb7de76d in clone () from /lib64/libc.so.6

- c->initial_thread = t->tt == DEDICATED ? nullptr : t;
  c->mutex = cont->mutex;
  c->start_time = Thread::get_hrtime();
+ c->setThreadAffinity(t);

A Member commented on this diff:

Does that mean a CacheVC can be bound to a dedicated thread?

A Member commented:

Not sure. Is there any potential risk here?

A Member commented:

A CacheVC instance should be bound to a specific thread, the same thread to which the transaction is bound. The point here is that when a transaction does a cache I/O operation, the actual work is done on an AIO thread. When that completes, control should pass back to the exact thread that originally asked for the I/O. This isn't new; it's only a different mechanism. Previously this was a special case in the cache (see CacheVC::initial_thread in "P_CacheInternal.h"), but now it can be done using the general thread affinity. This is in fact one of the use cases that motivated creating the generalized thread affinity.
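
As a rough sketch of that flow (illustrative only, reusing the calls shown in the diff above; not the actual cache code path):

// On the net thread, when the transaction starts a cache operation:
CacheVC *c = new_CacheVC(cont);
c->mutex   = cont->mutex;
c->setThreadAffinity(this_ethread()); // remember the thread that issued the I/O

// ... the disk work itself runs on an AIO thread ...

// When the AIO completes, the completion is scheduled rather than handled in
// place; with the affinity set above (and #5089), the CacheVC resumes on the
// thread that originally asked for the I/O.
eventProcessor.schedule_imm(c);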

@scw00 (Member) commented Mar 6, 2019

After #5089, events always schedule to the local thread. So how do I schedule an event to another thread (of the same etype) when I don't want it to run on the local thread? It seems the same as thread->schedule.

@scw00 (Member) commented Mar 6, 2019

Does cache initialization run in one thread now?

It is not run in the same thread, because the cache never reads across Vols.

But here is another case:

/* NOTE:: This state can be called by an AIO thread, so DON'T DON'T
   DON'T schedule any events on this thread using VC_SCHED_XXX or
   mutex->thread_holding->schedule_xxx_local(). ALWAYS use
   eventProcessor.schedule_xxx().
   */
int
Vol::aggWriteDone(int event, Event *e)
{
  cancel_trigger();

  // ensure we have the cacheDirSync lock if we intend to call it later
  // retaking the current mutex recursively is a NOOP
  CACHE_TRY_LOCK(lock, dir_sync_waiting ? cacheDirSync->mutex : mutex, mutex->thread_holding);
  if (!lock.is_locked()) {
    eventProcessor.schedule_in(this, HRTIME_MSECONDS(cache_config_mutex_retry_delay));
    return EVENT_CONT;
  }

The default type is ET_CALL; forget it.

@dmorilha commented Mar 6, 2019

(quoting the backtrace with the #4379 assert from the comment above)

This is the exact same backtrace we also see.

@vmamidi (Contributor) commented Mar 6, 2019

@scw00 @oknet why do we have an assert if we can't acquire the lock in #4379?

@scw00 (Member) commented Mar 6, 2019

Because keep_alive_queue belongs to NetHandler and it is not lock free.

@shinrich (Member, Author) commented Mar 6, 2019

The alternative to asserting would be to do a blocking lock, which is non-performant. Rescheduling at that point (while shutting down netvcs) isn't feasible, or at least would be very complicated and error-prone.

@vmamidi (Contributor) commented Mar 6, 2019

Because keep_alive_queue belongs to NetHandler and it is not lock free.

I agree, it's not lock free, but why should we assert if the lock fails?

@SolidWallOfCode (Member) commented:
Because the alternative is subtle, inconsistent, and hard to track down memory corruption?

@vmamidi (Contributor) commented Mar 6, 2019

It doesn't necessarily have to be a memory corruption if the lock fails, right?

@scw00 (Member) commented Mar 7, 2019

I'm not clear on your point. Can you please share more information?

@vmamidi (Contributor) commented Mar 7, 2019

I think this PR is trying to work around #4379. How are we going to get back to the original thread if it has its own threads? I think the alternatives to asserting may be inconsistent and less performant, but asserting crashes ATS. I do not think we should crash the server unless it enters an unreliable state, and I don't think failing to acquire a mutex puts ATS into an unreliable state. @bryancall @zwoop @shinrich @scw00 @SolidWallOfCode @oknet thoughts?

@SolidWallOfCode (Member) commented:
In this particular case, failure to acquire the mutex means a non-thread-safe data structure can be modified by multiple threads at the same time, creating a race condition. This in turn tends to create memory corruption in that object which doesn't actually cause problems until after the race condition has happened. The observed behavior is random crashes for inexplicable reasons. The assert means the unsafe data access is caught when it happens, rather than having to guess where it might have occurred from a subsequent crash due to the corruption.

It doesn't have to be memory corruption without the lock, and in fact most of the time nothing bad will happen. However, "most of the time" is small comfort at production traffic levels. It also means the crashes will be sporadic, making it even more difficult to tell if the problem is fixed or not.
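
For context, the check added in #4379 is roughly of this shape (paraphrased from the assert text in the backtraces above; not the exact source):

// Try to take the NetHandler's lock without blocking; if we are already on the
// NetHandler's thread this is effectively a re-entrant acquire and succeeds.
MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
if (!lock.is_locked()) {
  // Some other thread owns the NetHandler: touching keep_alive_queue now would
  // be exactly the unsafe concurrent access described above, so fail loudly.
  ink_assert(!"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.");
  return;
}
nh->add_to_keep_alive_queue(this);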

@bryancall (Contributor) left a comment:

I would like to get more than 1 person to approve this PR since there is a lot of discussion on this. Someone from another company would be preferable.

@dmorilha commented:
Since we were seeing problems with HostDB and #4379 crashed all the time for us, I back-ported this to 7.1.x and we are running it on a low-traffic host.

@dmorilha commented:
Changes are looking good for us 24h later.

@SolidWallOfCode (Member) commented:
I think we can take @dmorilha's endorsement as an approval.

@vmamidi (Contributor) commented Mar 13, 2019

I'll be concerned about #4379 until we have enough confidence that HttpSM always runs on its original thread.

@dmorilha commented:
OK, after deploying everywhere, the asserts do fire from time to time. I still think the change is beneficial, and I am making the asserts just errors for now...

@vmamidi (Contributor) commented Mar 13, 2019

Did you convert the asserts to errors in your local repo?

@vmamidi (Contributor) commented Mar 13, 2019

@dmorilha's crashes prove the theory that HttpSM may not always be running on its original thread.
I think there are several scenarios where HttpSM will not run on its original thread, and this PR definitely fixes one of them. If we feel confident that HttpSM is running on its original thread, we shouldn't even try to acquire the mutex in change #4379. If we want to verify that we are on the same thread, we can do ink_assert(thread_holding == this_thread()) instead of trying to acquire the mutex. I am not confident that HttpSM is running on its original thread as of now, and I think we should change the assert to an error until we cover all the scenarios.
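
A sketch of that alternative check as suggested, with nh standing in for the NetHandler whose queue is about to be modified (illustrative, not a tested patch):

// Instead of a TRY_LOCK, assert that we are still on the thread that currently
// holds this NetHandler's mutex; if HttpSM always runs on its original thread,
// this should never fire.
ink_assert(nh->mutex->thread_holding == this_ethread());
nh->add_to_keep_alive_queue(this);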

@dmorilha commented:

netlify@55b3fed

@vmamidi (Contributor) commented Mar 14, 2019

netlify@55b3fed

Can you create a PR with this change?

@shinrich (Member, Author) commented Mar 14, 2019

OK, it looks like the remaining concerns are with respect to whether the checks in #4379 error or assert. I am going to go ahead and commit this, since it has been approved and verified by folks from another organization. Plus, bluuurgh on IRC just brought up this issue with 7.1.6.

@dmorilha can you put up the PR for 7.1.x?

@shinrich shinrich merged commit 33bbe34 into apache:master Mar 14, 2019
zwoop added a commit to zwoop/trafficserver that referenced this pull request Mar 14, 2019
… not thread-safe"

This reverts commit 3599e5e.

Seems this causes new, unpredictable crashes. For now, let's revert this, and we can
keep working on this for 8.x / 9.x. Reference apache#4379 and apache#4921 and apache#5120.
zwoop added a commit that referenced this pull request Mar 14, 2019
… not thread-safe"

This reverts commit 3599e5e.

Seems this causes new, unpredictable crashes. For now, let's revert this, and we can
keep working on this for 8.x / 9.x. Reference #4379 and #4921 and #5120.
@dmorilha commented:
@shinrich @vmamidi I will do it, thanks.

@dmorilha commented:
#5198
