-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-31953 madvise(..., MADV_FREE) is causing a performance regression / MDEV-24670 avoid OOM by linux kernel co-operative memory management #2805
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I think that the general idea is good.
e4f9169
to
7d94122
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost good to go. Please squash my commit to yours, and just note in the commit message that the logic of buf_pool_t::garbage_collect()
was provided by me.
I think that the buf_pool_t::garbage_collect()
needs to be reviewed by @vlad-lesin or @Thirunarayanan. We’d also want subject this to some stress testing with a MDEV-25341 type workload and maybe also using the revised debug instrumentation.
7d94122
to
90c1e91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Hopefully this time the stress test run will not reveal any problems with buf_pool_t::garbage_collect()
.
90c1e91
to
826b1a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let’s wait for the stress test results.
The RQG testing on sdp:/data1/results/1699453476/TBR-2080$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio Testing scenario:
|
The failure that @mleich1 reported on 826b1a0 is in the
While this thread is waiting, diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index d46446a4b58..ed434e77fa4 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -2148,11 +2148,16 @@ inline void buf_pool_t::garbage_collect()
if (state < buf_page_t::READ_FIX && bpage->lock.u_lock_try(true))
{
ut_ad(!bpage->is_io_fixed());
- const lsn_t oldest_modification= bpage->oldest_modification();
+ lsn_t oldest_modification= bpage->oldest_modification();
switch (oldest_modification) {
case 1:
mysql_mutex_lock(&buf_pool.flush_list_mutex);
- buf_pool.delete_from_flush_list(bpage);
+ oldest_modification= bpage->oldest_modification();
+ if (oldest_modification)
+ {
+ ut_ad(oldest_modification == 1);
+ buf_pool.delete_from_flush_list(bpage);
+ }
mysql_mutex_unlock(&buf_pool.flush_list_mutex);
/* fall through */
case 0: |
fc4effa
to
00fa511
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK to push after addressing these. It was a bit tricky to stabilize the regression test.
SELECT CAST(VARIABLE_VALUE AS INTEGER) INTO @dirty_prev FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty'; | ||
|
||
set debug_dbug="d,trigger_garbage_collection"; | ||
SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size; | ||
|
||
SELECT VARIABLE_VALUE < @dirty_prev AS LESS_DIRTY_IS_GOOD FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are missing a CAST
in the second SELECT
. For some reason, the return value of get_one_variable()
is a string. I did not track down where the comparison is being executed and why it would fail, but I assume that one of the operands is a string and the other is INTEGER
.
set @save_dbug=@@debug_dbug; | ||
|
||
CREATE TABLE t1 (t TEXT) ENGINE=InnoDB; | ||
INSERT INTO t1 SELECT CONCAT(REPEAT('junk text', 500), CAST(seq AS CHAR)) FROM seq_1_to_1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fill the buffer pool with undo log pages due to row-level undo logging. We should enable the bulk insert interface to fill the buffer pool mainly with the pages of the table t1
. Therefore, it can happen that before and after the buf_pool_t::garbage_collect()
, the buffer pool is filled with undo log pages.
Because we are using debug instrumentation anyway, we can also make use of innodb_limit_optimistic_insert_debug
and write considerably less redo log.
We’d better wait that all undo logs have been purged before we start the test. Before I added that, the test would fail after a couple of dozen repetitions.
diff --git a/mysql-test/suite/innodb/t/mem_pressure.test b/mysql-test/suite/innodb/t/mem_pressure.test
index 2d140fb22ad..60f5acc291d 100644
--- a/mysql-test/suite/innodb/t/mem_pressure.test
+++ b/mysql-test/suite/innodb/t/mem_pressure.test
@@ -9,17 +9,27 @@
--echo #
set @save_dbug=@@debug_dbug;
-
-CREATE TABLE t1 (t TEXT) ENGINE=InnoDB;
-INSERT INTO t1 SELECT CONCAT(REPEAT('junk text', 500), CAST(seq AS CHAR)) FROM seq_1_to_1000;
-
+set @save_limit=@@GLOBAL.innodb_limit_optimistic_insert_debug;
+SET GLOBAL innodb_max_purge_lag_wait=0;
+
+CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
+SET GLOBAL innodb_limit_optimistic_insert_debug=2;
+SET unique_checks=0, foreign_key_checks=0;
+INSERT INTO t1 SELECT * FROM seq_1_to_1000;
+SET GLOBAL innodb_limit_optimistic_insert_debug=@save_limit;
DROP TABLE t1;
-SELECT CAST(VARIABLE_VALUE AS INTEGER) INTO @dirty_prev FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty';
+SET unique_checks=1, foreign_key_checks=1;
+
+SELECT CAST(VARIABLE_VALUE AS INTEGER) INTO @dirty_prev
+FROM INFORMATION_SCHEMA.GLOBAL_STATUS
+WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty';
set debug_dbug="d,trigger_garbage_collection";
SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size;
-SELECT VARIABLE_VALUE < @dirty_prev AS LESS_DIRTY_IS_GOOD FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty';
+SELECT CAST(VARIABLE_VALUE AS INTEGER) < @dirty_prev AS LESS_DIRTY_IS_GOOD
+FROM INFORMATION_SCHEMA.GLOBAL_STATUS
+WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty';
let SEARCH_FILE= $MYSQLTEST_VARDIR/log/mysqld.1.err;
let SEARCH_PATTERN= InnoDB: Memory pressure event freed;
@@ -27,6 +37,4 @@ let SEARCH_PATTERN= InnoDB: Memory pressure event freed;
set debug_dbug=@save_dbug;
---echo #
--echo # End of 10.11 tests
---echo #
The test as modified above passed my local test:
innodb.mem_pressure 'innodb' w22 [ 1000 pass ] 139
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 5053.123 of 291 seconds executing testcases
Completed: All 42000 tests were successful.
storage/innobase/include/buf0buf.h
Outdated
/** Collect garbage (release pages from the LRU list) */ | ||
#ifdef __linux__ | ||
inline void garbage_collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Doxygen could complain about the stray comment if not run with -D__linux__
. Better move it inside the #ifdef
.
storage/innobase/buf/buf0buf.cc
Outdated
/** Memory Pressure */ | ||
|
||
#ifdef __linux__ | ||
/* | ||
based off https://www.kernel.org/doc/html/latest/accounting/psi.html#pressure-interface | ||
and https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory | ||
*/ | ||
#include <poll.h> | ||
#include <fstream> | ||
|
||
class mem_pressure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the /**
Doxygen comment right before class mem_pressure
, and consider including the whole comment there, in case someone is actually generating documentation from the source code and reading it:
#ifdef __linux__
# include <poll.h>
# include <fstream>
/** Memory Pressure
based on …
*/
class mem_pressure
buf_page_t::set_os_unused(): Remove the system call that had been added in commit 16c9718 and revised in commit c1fd082 for Microsoft Windows. buf_pool_t::garbage_collect(): A new function to collect any garbage from the InnoDB buffer pool that can be removed without writing any log or data files. This will also invoke madvise() for all of buf_pool.free. To trigger this the following MDEV is implemented: MDEV-24670 avoid OOM by linux kernel co-operative memory management To avoid frequent triggers that caused the MDEV-31953 regression, while still preserving the 10.11 functionality of non-greedy kernel memory usage, memory triggers are used. On the triggering of memory pressure, if supported in the Linux kernel, trigger the garbage collection of the innodb buffer pool. The hard coded triggers occur where there is: * some memory pressure in 5 of the last 10 seconds * a full stall on memory pressure for 10ms in the last 2 seconds The kernel will trigger only one in each of these time windows. To avoid mariadb being in a constant state of memory garbage collection, this has been limited to once per minute. For a small set of kernels in 2023 (6.5, 6.6), there was a limit requiring CAP_SYS_RESOURCE that was lifted[1] to support the use case of user memory pressure. It not currently possible to set CAP_SYS_RESOURCES in a systemd service as its setting a capability inside a usernamespace. Running under systemd v254+ requires the default MemoryPressureWatch=auto (or alternately "on"). Functionality was tested in a 6.4 kernel Fedora successfully under a systemd service. Running in a container requires that (unmask=)/sys/fs/cgroup be writable by the mariadbd process. To aid testing, the buf_pool_resize was a convient trigger point on which to trigger garbage collection. ref [1]: https://lore.kernel.org/all/CAMw=ZnQ56cm4Txgy5EhGYvR+Jt4s-KVgoA9_65HKWVMOXp7a9A@mail.gmail.com/T/#m3bd2a73c5ee49965cb73a830b1ccaa37ccf4e427 Co-Author: Daniel Black (on memory pressure trigger) Reviewed by: Marko Mäkelä, Vladislav Vaintroub, Vladislav Lesin, Thirunarayanan Balathandayuthapani Tested by: Matthias Leich
00fa511
to
4f58de1
Compare
Description
The madvice(FREE) added in 10.11 cased performance regressions. To trigger it less often the memory pressure mechanism of Linux was used.
How can this PR be tested?
Manually like in MDEV-25341.
Basing the PR against the correct MariaDB version
PR quality check