-
Notifications
You must be signed in to change notification settings - Fork 429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix slow getaddrs call in global distribution #4086
Conversation
Fixes flaky mod_global_distrib_SUITE:test_host_refreshing testcase
small_tests_24 / small_tests / 6b3daf7 elasticsearch_and_cassandra_25 / elasticsearch_and_cassandra_mnesia / 6b3daf7 small_tests_25 / small_tests / 6b3daf7 small_tests_25_arm64 / small_tests / 6b3daf7 dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia / 6b3daf7 ldap_mnesia_24 / ldap_mnesia / 6b3daf7 ldap_mnesia_25 / ldap_mnesia / 6b3daf7 dynamic_domains_mysql_redis_25 / mysql_redis / 6b3daf7 metrics_session_SUITE:session:login_one{error,
{{badmatch,{value,6}},
[{metrics_helper,assert_counter,3,
[{file,"/home/circleci/project/big_tests/tests/metrics_helper.erl"},
{line,36}]},
{metrics_session_SUITE,'-login_one/1-fun-0-',3,
[{file,
"/home/circleci/project/big_tests/tests/metrics_session_SUITE.erl"},
{line,79}]},
{escalus_story,story,4,
[{file,
"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
{line,72}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1782}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1291}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1223}]}]}} metrics_session_SUITE:session_global:session_unique{error,
{{uniqueSessionCount,
{value,1},
[{times,25,
{error,
{badmatch,{value,6}},
[{metrics_helper,assert_counter,3,
[{file,
"/home/circleci/project/big_tests/tests/metrics_helper.erl"},
{line,36}]},
{mongoose_helper,do_wait_until,2,
[{file,
"/home/circleci/project/big_tests/tests/mongoose_helper.erl"},
{line,360}]},
{metrics_session_SUITE,'-session_unique/1-fun-0-',2,
[{file,
"/home/circleci/project/big_tests/tests/metrics_session_SUITE.erl"},
{line,120}]},
{escalus_story,story,4,
[{file,
"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
{line,72}]},
{test_server,ts_tc,3,
[{file,"test_server.erl"},{line,1782}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1291}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1223}]}]}}],
ok},
[{mongoose_helper,do_wait_until,2,
[{file,"/home/circleci/project/big_tests/tests/mongoose_helper.erl"},
{line,357}]},
{metrics_session_SUITE,'-session_unique/1-fun-0-',2,
[{file,
"/home/circleci/project/big_tests/tests/metrics_session_SUITE.erl"},
{line,120}]},
{escalus_story,story,4,
[{file,
"/home/circleci/pr... metrics_api_SUITE:global:session_counters{error,{{assertEqual,[{module,metrics_api_SUITE},
{line,231},
{expression,"fetch_global_gauge_value ( uniqueSessionCount , Config )"},
{expected,2},
{value,7}]},
[{metrics_api_SUITE,'-session_counters/1-fun-3-',4,
[{file,"/home/circleci/project/big_tests/tests/metrics_api_SUITE.erl"},
{line,231}]},
{escalus_story,story,4,
[{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
{line,72}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1782}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1291}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1223}]}]}} mongooseimctl_SUITE:sessions:sessions_info{error,
{test_case_failed,
#{user_info =>
["alice@domain.example.com/res1\tc2s\t127.0.0.1\t42140\t0\tmongooseim@localhost\t0",
"kate@domain.example.com/res1\tc2s\t127.0.0.1\t42154\t0\tmongooseim@localhost\t0",
"neustradamus_log_one_scram_sha1_plus_1620@domain.example.com/res1\tc2s_tls\t127.0.0.1\t47280\t0\tmongooseim@localhost\t169",
"bob@domain.example.com/res1\tc2s\t127.0.0.1\t42148\t0\tmongooseim@localhost\t0",
"neustradamus_log_one_scram_sha512_plus_1619@domain.example.com/res1\tc2s_tls\t127.0.0.1\t47262\t0\tmongooseim@localhost\t169",
"neustradamus_log_one_scram_sha256_plus_1617@domain.example.com/res1\tc2s_tls\t127.0.0.1\t47272\t0\tmongooseim@localhost\t169",
"neustradamus_log_one_scram_sha384_plus_1621@domain.example.com/res1\tc2s_tls\t127.0.0.1\t47308\t0\tmongooseim@localhost\t169",
"neustradamus_log_one_scram_sha224_plus_1618@domain.example.com/res1\tc2s_tls\t127.0.0.1\t47256\t0\tmongooseim@localhost\t169"],
users =>
[{client,<<"alice@domain.example.com/res1">>,escalus_tcp,
<0.31303.1>,
[{event_manager,<0.31302.1>},
{server,<<"domain.example.com">>},
{username,<<"alicE">>},
{resource,<<"res1">>}],
[{event_client,
[{event_manager,<0.31302.1>},
{server,<<"domain.example.com">>},
{username,<<"alicE">>},
{resource,<<"res1">>}]},
{resource,<<"res1">>},
... mongooseimctl_SUITE:stats:stats_host{error,{{badmatch,{"7\n",0}},
[{mongooseimctl_SUITE,'-stats_host/1-fun-0-',3,
[{file,"/home/circleci/project/big_tests/tests/mongooseimctl_SUITE.erl"},
{line,1139}]},
{escalus_story,story,4,
[{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
{line,72}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1782}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1291}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1223}]}]}} dynamic_domains_mssql_mnesia_25 / odbc_mssql_mnesia / 6b3daf7 dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / 6b3daf7 pgsql_mnesia_24 / pgsql_mnesia / 6b3daf7 internal_mnesia_25 / internal_mnesia / 6b3daf7 pgsql_mnesia_25 / pgsql_mnesia / 6b3daf7 mysql_redis_25 / mysql_redis / 6b3daf7 mssql_mnesia_25 / odbc_mssql_mnesia / 6b3daf7 |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #4086 +/- ##
==========================================
- Coverage 83.90% 83.90% -0.01%
==========================================
Files 527 527
Lines 33158 33164 +6
==========================================
+ Hits 27822 27826 +4
- Misses 5336 5338 +2
☔ View full report in Codecov by Sentry. |
small_tests_24 / small_tests / e845a71 elasticsearch_and_cassandra_25 / elasticsearch_and_cassandra_mnesia / e845a71 small_tests_25 / small_tests / e845a71 small_tests_25_arm64 / small_tests / e845a71 dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia / e845a71 ldap_mnesia_24 / ldap_mnesia / e845a71 ldap_mnesia_25 / ldap_mnesia / e845a71 dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / e845a71 dynamic_domains_mysql_redis_25 / mysql_redis / e845a71 dynamic_domains_mssql_mnesia_25 / odbc_mssql_mnesia / e845a71 pgsql_mnesia_24 / pgsql_mnesia / e845a71 mysql_redis_25 / mysql_redis / e845a71 pgsql_mnesia_25 / pgsql_mnesia / e845a71 internal_mnesia_25 / internal_mnesia / e845a71 mssql_mnesia_25 / odbc_mssql_mnesia / e845a71 |
small_tests_24 / small_tests / 7ad6c5d elasticsearch_and_cassandra_25 / elasticsearch_and_cassandra_mnesia / 7ad6c5d small_tests_25 / small_tests / 7ad6c5d small_tests_25_arm64 / small_tests / 7ad6c5d ldap_mnesia_24 / ldap_mnesia / 7ad6c5d dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia / 7ad6c5d ldap_mnesia_25 / ldap_mnesia / 7ad6c5d dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / 7ad6c5d dynamic_domains_mysql_redis_25 / mysql_redis / 7ad6c5d internal_mnesia_25 / internal_mnesia / 7ad6c5d dynamic_domains_mssql_mnesia_25 / odbc_mssql_mnesia / 7ad6c5d pgsql_mnesia_24 / pgsql_mnesia / 7ad6c5d mam_SUITE:rdbms_async_cache_muc_all:muc_configurable_archiveid:muc_no_elements{failed,
{mam_SUITE,end_per_testcase,
{'EXIT',
{{room_archive_size,0,[{times,200,1}],ok},
[{mongoose_helper,do_wait_until,2,
[{file,
"/home/circleci/project/big_tests/tests/mongoose_helper.erl"},
{line,357}]},
{mam_helper,wait_for_room_archive_size,3,
[{file,
"/home/circleci/project/big_tests/tests/mam_helper.erl"},
{line,789}]},
{mam_helper,clean_room_archive,1,
[{file,
"/home/circleci/project/big_tests/tests/mam_helper.erl"},
{line,733}]},
{mam_helper,destroy_room,1,
[{file,
"/home/circleci/project/big_tests/tests/mam_helper.erl"},
{line,726}]},
{mam_SUITE,end_per_testcase,2,
[{file,
"/home/circleci/project/big_tests/tests/mam_SUITE.erl"},
{line,962}]},
{test_server,do_end_per_testcase,4,
[{file,"test_server.erl"},{line,1627}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1335}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1224}]}]}}}} mam_SUITE:rdbms_async_cache_muc_all:muc_configurable_archiveid:muc_only_stanzaid{failed,
{mam_SUITE,end_per_testcase,
{'EXIT',
{{room_archive_size,0,[{times,200,1}],ok},
[{mongoose_helper,do_wait_until,2,
[{file,
"/home/circleci/project/big_tests/tests/mongoose_helper.erl"},
{line,357}]},
{mam_helper,wait_for_room_archive_size,3,
[{file,
"/home/circleci/project/big_tests/tests/mam_helper.erl"},
{line,789}]},
{mam_helper,clean_room_archive,1,
[{file,
"/home/circleci/project/big_tests/tests/mam_helper.erl"},
{line,733}]},
{mam_helper,destroy_room,1,
[{file,
"/home/circleci/project/big_tests/tests/mam_helper.erl"},
{line,726}]},
{mam_SUITE,end_per_testcase,2,
[{file,
"/home/circleci/project/big_tests/tests/mam_SUITE.erl"},
{line,965}]},
{test_server,do_end_per_testcase,4,
[{file,"test_server.erl"},{line,1627}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1335}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1224}]}]}}}} pgsql_mnesia_25 / pgsql_mnesia / 7ad6c5d mysql_redis_25 / mysql_redis / 7ad6c5d mssql_mnesia_25 / odbc_mssql_mnesia / 7ad6c5d |
small_tests_24 / small_tests / 46915d4 elasticsearch_and_cassandra_25 / elasticsearch_and_cassandra_mnesia / 46915d4 small_tests_25 / small_tests / 46915d4 small_tests_25_arm64 / small_tests / 46915d4 ldap_mnesia_24 / ldap_mnesia / 46915d4 dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia / 46915d4 service_mongoose_system_metrics_SUITE:module_opts:rdbms_module_opts_are_reported{error,
{{assertEqual,
[{module,service_mongoose_system_metrics_SUITE},
{line,370},
{expression,
"is_module_opt_reported ( atom_to_binary ( Module ) , Key , Value )"},
{expected,true},
{value,false}]},
[{service_mongoose_system_metrics_SUITE,check_module_opt,3,
[{file,
"/home/circleci/project/big_tests/tests/service_mongoose_system_metrics_SUITE.erl"},
{line,370}]},
{service_mongoose_system_metrics_SUITE,rdbms_module_opts_are_reported,
1,
[{file,
"/home/circleci/project/big_tests/tests/service_mongoose_system_metrics_SUITE.erl"},
{line,236}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1783}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1292}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1224}]}]}} dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / 46915d4 dynamic_domains_mysql_redis_25 / mysql_redis / 46915d4 pgsql_mnesia_24 / pgsql_mnesia / 46915d4 ldap_mnesia_25 / ldap_mnesia / 46915d4 internal_mnesia_25 / internal_mnesia / 46915d4 dynamic_domains_mssql_mnesia_25 / odbc_mssql_mnesia / 46915d4 mysql_redis_25 / mysql_redis / 46915d4 pgsql_mnesia_25 / pgsql_mnesia / 46915d4 mssql_mnesia_25 / odbc_mssql_mnesia / 46915d4 dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia / 46915d4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. That's like the best PR description github has ever seen. Very good debug, nice conclusions, cool fix. Thanks! 😃
This PR addresses "MIM-1990".
Fixes flaky mod_global_distrib_SUITE:test_host_refreshing testcase
There was an error happening from time to time (once every 10 CI jobs). Report
The error does not contain a lot of information, but we know that the previous test group was advertised_endpoints, where we try to do something with endpoints.
The
trees_for_connections_present
checks thatmod_global_distrib_server_mgr
creates a supervisor for each of global-distribution domains.I.e. logic is:
While ensure_server_started looks like:
The initial testcase had very little info. I.e. it only checked that one of the Erlang nodes had issues. Let's improve logging:
europe_node2
ismim2
node.So, we know that there is not enough outgoing connections on
mim2
.But the only host we need to create connections is
reg1
. (i.e. refresh function only callsmod_global_distrib_outgoing_conns_sup:ensure_server_started(<<"reg1">>)
).What is
is_available(Server)
call? it is agen_server:call
:So, the server process is busy.
It is actually could be blocked up to 16 seconds and it is doing inet:getaddrs:
First of all, we see that it is sitting in
refresh_in_init
, which means it is ininit/1
callback of a gen_server.Oh, it is also crashing after 16 seconds in init function:
So, this means it blocks the supervisor for 16 seconds, which is pretty bad (i.e. we cannot for example, init connections in parallel, because supervisor is busy). So, improvement - we could do delayed initialisation.
But what about our
nxdomain
?Apparently, a single init:getaddrs call could take 2 seconds or more:
Also, non-existing domain is set in the previous testcase
advertised_endpoints
.(investigating why a testcase is affected by a testcase, executed before is outside of the bug investigation. Eventually, GD would retry creating an outgoing connection with the proper endpoints. I would assume, that the mod_global_distrib is not fully restarted between test groups). I.e.:
So, we could run
getaddrs
in parallel using pmap and limit maximum execution time to 5 seconds.It fixes the testcase (verified with
{repeat_until_any_fail, 60}
for the compile PR which was able to consistently reproduce the bug without the fix).Proposed changes include:
src/global_distrib/mod_global_distrib_server_mgr.erl
getaddrs
call in parallel and limit how long it can take to 5 seconds.Stuff we've learned:
gen_server:init/1
should be fast and not fail, ideally.Though, I am not trying to redesign global distribution in this PR, just fix a flaky test :)
Debug code for this fix could be found in #4084 (just for the reference, if someone wants to debug something similar again).