Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Admin deadlock when concurrent connections query some stats tables #4415

Closed
5 tasks done
JavierJF opened this issue Jan 10, 2024 · 0 comments
Closed
5 tasks done

Admin deadlock when concurrent connections query some stats tables #4415

JavierJF opened this issue Jan 10, 2024 · 0 comments

Comments

@JavierJF
Copy link
Collaborator

JavierJF commented Jan 10, 2024

  • A clear description of the issue

A deadlock can take place if a concurrent Admin connection query the stats database meanwhile other Admin session is serving a query that requires calling GenericRefreshStatistics.

  • ProxySQL version

Affected versions v2.5.2 to v2.5.5.

  • OS version

ArchLinux

  • Detailed explanation

The deadlock takes place when the following flow is exercised:

  1. A first session is being processed by the Admin interface, the query in this session is forcing a stats refresh via GenericRefreshStatistics. This calls stats___proxysql_servers_checksums, starting a transaction against SQLite here.
  2. A second session is waiting in Admin global mutex sql_query_global_mutex, for now, let's assume this session query is SELECT * FROM stats_mysql_gtid_executed, more details about why later.
  3. Several lines later, the first session unlocks global Admin mutex, preventing a potential deadlock with Cluster.
  4. Since the mutex is unlocked, the second session is able to take it, and continue its processing. The previously started transaction on SQLite connection for statsdb object, is still open.
  5. Since the query being executed in the second session is stats_mysql_gtid_executed, no COMMITs operations using the same connection (statsdb), are found in the execution flow, before reaching ProxySQL_Admin::vacuum_stats. This is crucial for the deadlock to be triggered, because otherwise the created transaction in the first session would have been closed by this one.
  6. Since both are Admin sessions (no sessions using stats user), when the first query tries to be executed in ProxySQL_Admin::vacuum_stats, it will try to do so using the admindb connection to SQLite, since there is a transaction already started by the first session in statsdb connection, SQLite will respond to this execution with error SQLITE_LOCKED.
  7. ProxySQL will keep retrying the operation in hope for a temporary lock, waiting for the other transaction to finish, but since the first session is locked in Admin global mutex, that will never happen.
  • The steps to reproduce the issue

Files: repro.tar.gz

Using the provided init.gdb start ProxySQL with gdb via the following command:

gdb -x init.gdb -ex run --args ./src/proxysql -M --sqlite3-server --idle-threads -f -D test_datadir

In another terminal use the supplied script that opens to Admin connections:

./admin_conns_br.sh

The script just performs the following operations:

mysql --prompt="admin> " -uadmin -padmin -h127.0.0.1 -P6032 -e "SELECT * FROM stats_proxysql_servers_checksums" &
sleep 1
mysql --prompt="admin> " -uadmin -padmin -h127.0.0.1 -P6032 -e "SELECT * FROM stats_mysql_gtid_executed" &

wait

This will trigger the breakpoints defined in init.gdb in the intended sequence. Once the script is waiting, we must go back to gdb and make the stopped threads continue in the following order:

  1. Thread in Breakpoint 1, at ProxySQL_Admin::stats___proxysql_servers_checksums
  2. Thread at Breakpoint 2, at ProxySQL_Admin::vacuum_stats

Example:

Thread 17 "proxysql" hit Breakpoint 1, ProxySQL_Admin::stats___proxysql_servers_checksums (this=0x7ffff6c59c00) at ProxySQL_Admin.cpp:9730
9730		resultset=GloProxyCluster->get_stats_proxysql_servers_checksums();
(gdb)
Thread 18 "proxysql" hit Breakpoint 2, ProxySQL_Admin::vacuum_stats (this=0x7ffff6c59c00, is_admin=true) at ProxySQL_Admin.cpp:5424
5424		for (auto it = tablenames.begin(); it != tablenames.end(); it++) {
Quit
(gdb) thread 17
[Switching to thread 17 (Thread 0x7fffee7ff6c0 (LWP 2732346))]
#0  ProxySQL_Admin::stats___proxysql_servers_checksums (this=0x7ffff6c59c00) at ProxySQL_Admin.cpp:9730
9730		resultset=GloProxyCluster->get_stats_proxysql_servers_checksums();
(gdb) c
Continuing.
^C
Thread 1 "proxysql" received signal SIGINT, Interrupt.
0x00007ffff76f37f5 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffffffc840, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	  r = INTERNAL_SYSCALL_CANCEL (clock_nanosleep_time64, clock_id, flags, req,
(gdb) c -a
Continuing.

At this point the Admin interface will be deadlock, meaning the previous connections will be stuck, and no other Admin connections will be accepted. Example of thread status in this scenario:

Thread 19 (Thread 0x7fffef5ff6c0 (LWP 2732596) "proxysql"):
#0  futex_wait (private=0, expected=2, futex_word=0x7ffff6c59f00) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7ffff6c59f00, private=0) at lowlevellock.c:49
#2  0x00007ffff76adec2 in lll_mutex_lock_optimized (mutex=0x7ffff6c59f00) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7ffff6c59f00) at pthread_mutex_lock.c:93
#4  0x0000555555c4e8fe in admin_session_handler (sess=0x7fffee80c000, _pa=0x7ffff6c59c00, pkt=0x7fffee80c1a0) at ProxySQL_Admin.cpp:3745
#5  0x0000555555beb554 in MySQL_Session::handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_QUERY___not_mysql (this=0x7fffee80c000, pkt=...) at MySQL_Session.cpp:3532
#6  0x0000555555bed51a in MySQL_Session::get_pkts_from_client (this=0x7fffee80c000, wrong_pass=@0x7fffef5fcc1f: false, pkt=...) at MySQL_Session.cpp:4058
#7  0x0000555555bf06f7 in MySQL_Session::handler (this=0x7fffee80c000) at MySQL_Session.cpp:4797
#8  0x0000555555c5801d in child_mysql (arg=0x7fffefa00010) at ProxySQL_Admin.cpp:5545
#9  0x00007ffff76aa9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#10 0x00007ffff772e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 18 (Thread 0x7fffed9ff6c0 (LWP 2732351) "proxysql"):
#0  0x00007ffff76f37f5 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffed9fbb20, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007ffff77058c7 in __GI___nanosleep (req=req@entry=0x7fffed9fbb20, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007ffff772e349 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:31
#3  0x0000555555b302a7 in SQLite3DB::execute (this=0x7ffff6da2230, str=0x7fffecc0c7c0 "DELETE FROM stats.stats_mysql_commands_counters") at sqlite3db.cpp:175
#4  0x0000555555c578fa in ProxySQL_Admin::vacuum_stats (this=0x7ffff6c59c00, is_admin=true) at ProxySQL_Admin.cpp:5428
#5  0x0000555555c56bbf in admin_session_handler (sess=0x7fffecc13000, _pa=0x7ffff6c59c00, pkt=0x7fffecc131a0) at ProxySQL_Admin.cpp:5365
#6  0x0000555555beb554 in MySQL_Session::handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_QUERY___not_mysql (this=0x7fffecc13000, pkt=...) at MySQL_Session.cpp:3532
#7  0x0000555555bed51a in MySQL_Session::get_pkts_from_client (this=0x7fffecc13000, wrong_pass=@0x7fffed9fcc1f: false, pkt=...) at MySQL_Session.cpp:4058
#8  0x0000555555bf06f7 in MySQL_Session::handler (this=0x7fffecc13000) at MySQL_Session.cpp:4797
#9  0x0000555555c5801d in child_mysql (arg=0x7fffefa00460) at ProxySQL_Admin.cpp:5545
#10 0x00007ffff76aa9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#11 0x00007ffff772e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 17 (Thread 0x7fffee7ff6c0 (LWP 2732346) "proxysql"):
#0  futex_wait (private=0, expected=2, futex_word=0x7ffff6c59f00) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7ffff6c59f00, private=0) at lowlevellock.c:49
#2  0x00007ffff76adec2 in lll_mutex_lock_optimized (mutex=0x7ffff6c59f00) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7ffff6c59f00) at pthread_mutex_lock.c:93
#4  0x0000555555c77384 in ProxySQL_Admin::stats___proxysql_servers_checksums (this=0x7ffff6c59c00) at ProxySQL_Admin.cpp:9731
#5  0x0000555555c4d14d in ProxySQL_Admin::GenericRefreshStatistics (this=0x7ffff6c59c00, query_no_space=0x7fffeda00150 "SELECT * FROM stats_proxysql_servers_checksums", query_no_space_length=46, admin=true) at ProxySQL_Admin.cpp:3385
#6  0x0000555555c50de8 in admin_session_handler (sess=0x7fffeda29000, _pa=0x7ffff6c59c00, pkt=0x7fffeda291a0) at ProxySQL_Admin.cpp:4294
#7  0x0000555555beb554 in MySQL_Session::handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_QUERY___not_mysql (this=0x7fffeda29000, pkt=...) at MySQL_Session.cpp:3532
#8  0x0000555555bed51a in MySQL_Session::get_pkts_from_client (this=0x7fffeda29000, wrong_pass=@0x7fffee7fcc1f: false, pkt=...) at MySQL_Session.cpp:4058
#9  0x0000555555bf06f7 in MySQL_Session::handler (this=0x7fffeda29000) at MySQL_Session.cpp:4797
#10 0x0000555555c5801d in child_mysql (arg=0x7fffefa004a0) at ProxySQL_Admin.cpp:5545
#11 0x00007ffff76aa9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#12 0x00007ffff772e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

In the previous output we can see Threads 18 and Thread 17 in the deadlock, while Thread 19 is a new connection waiting for the lock that won't be released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant