Async pools #3433
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3433 +/- ##
==========================================
+ Coverage 80.78% 80.83% +0.04%
==========================================
Files 415 415
Lines 32306 32266 -40
==========================================
- Hits 26100 26083 -17
+ Misses 6206 6183 -23
Continue to review full report at Codecov.
Force-pushed from c909eac to 5a687a9.
Force-pushed from 5a687a9 to 976ac65.
Force-pushed from 976ac65 to 60b4a88.
Force-pushed from c04f26a to 7671aea.
The code looks good in general; my main concern is the low test coverage: the workers are tested only for storing messages one by one. A load test of this PR for the PM and MUC scenarios with MAM enabled should be enough here.
looks pretty nice!
InsertResult =
    case MessageCount of
        MaxSize ->
            mongoose_rdbms:execute(HostType, insert_mam_messages, lists:append(Rows));
This line could be covered with batch_size = 1.
case InsertResult of
    {updated, _Count} -> ok;
    {error, Reason} ->
        mongoose_metrics:update(HostType, modMamDropped, MessageCount),
We would need a drop test to verify that the metric is incremented. Maybe not in this PR.
src/mongoose_batch_worker.erl (outdated)
              flush_queue_len => length(State#state.flush_queue)}),
    NewState = do_run_flush(State#state{flush_queue = lists:reverse(Queue),
                                        flush_interval_tref = undefined}),
    erlang:garbage_collect(),
Wondering if that would collect Queue (something tells me it will still keep Queue, because the caller is still referencing it).
There is another way to collect for a gen_server: return {noreply, NewState, hibernate} after a flush, which could be more efficient.
hibernate is not more efficient at all: it will not only GC the heap, but also collapse the whole process memory into the PCB, which then requires a new heap when the process is woken up. That heap will be very small, because hibernation also imposes a shrink, and it will very likely need to grow again once the queue starts filling... no way.
But I'll make sure the queue is GC'ed here, hold on.
You were right, it wasn't collected. But I found out how to do it, see ee83696 😁
Force-pushed from 617c016 to ee83696.
Force-pushed from 294568a to ee83696.
Looks ok to me, only the load test concerns remain.
Force-pushed from ee83696 to 89acc2b.
src/mongoose_batch_worker.erl (outdated)
% Don't leak the tasks to logs, can contain private information
format_status(_Opt, [_PDict, State | _]) ->
    [{data, [{"State", State#state{flush_queue = []}},
flush_queue = censored ;)
Actually, setting it to a valid value here (i.e. []) could be pretty confusing when people call for the status.
It could be = hidden or sanitized.
`censored` returns a dialyzer error here, but I can add a type for that in the typedef.
Sad :(
Actually, why not convert State into a map in format_status? :)
It is much easier to read ;)
I think the canonical example for format_status was actually converting the state into a proplist :)
(it is what I think about when I see that gen_server callback).
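For illustration, a minimal sketch of what that suggestion could look like. This is not the code that ended up in the PR, and the record fields other than flush_queue are assumptions:

```erlang
%% Illustrative sketch only: batch_size and flush_interval are assumed field
%% names, not necessarily the ones in mongoose_batch_worker.
format_status(_Opt, [_PDict, #state{} = State | _]) ->
    StateMap = #{batch_size => State#state.batch_size,
                 flush_interval => State#state.flush_interval,
                 flush_queue_len => length(State#state.flush_queue),
                 flush_queue => censored},   % don't leak potentially private tasks
    [{data, [{"State", StateMap}]}].
```

Using a map also sidesteps the dialyzer complaint mentioned above, since a map value is not constrained by the record's field type.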
Force-pushed from 89acc2b to d4f68a1.
Force-pushed from d4f68a1 to 92a3089.
Brilliant. Merging after a load test done by Nelson :)
Force-pushed from 92a3089 to 19e818d.
This is a gen_server whose purpose is to be part of a worker pool. The server accumulates tasks to act on, until either the number of tasks reaches a configured threshold, or a timeout fires, counted from the first task submitted in the current batch (this ensures that no queue awaits action for too long). Note that timeouts are only armed while the queue is non-empty, to avoid waking up an idle worker. The worker also takes a callback, and a map with any metadata the callback may need. When it is time to flush a queue, this callback is run with the queue in the order in which it was submitted, and with the optionally provided metadata. Most of the code for this worker is derived from MAM's async workers, with some changes and lessons learnt:
* This worker emits info logs on start and terminate.
* There is also a debug log on queue flushes.
* The terminate callback flushes the task queue if any task is pending, to avoid losing tasks inadvertently.
* The format_status callback prevents printing the actual tasks on crashes or system state lookups, as the tasks might contain data private to clients under GDPR-like legal constraints.
* Garbage collection is triggered after every flush, first to save memory, and second to release any binaries this worker might be holding.
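To make the pattern concrete, here is a minimal, self-contained sketch of such a batching gen_server. It is not the actual mongoose_batch_worker code; the module name, record fields and task shape are assumptions made for illustration:

```erlang
-module(batch_worker_sketch).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

-record(state, {batch_size :: pos_integer(),
                flush_interval :: pos_integer(),   % milliseconds
                flush_callback :: fun(([term()], map()) -> ok | {error, term()}),
                flush_extra = #{} :: map(),
                queue = [] :: [term()],            % accumulated tasks, newest first
                queue_len = 0 :: non_neg_integer(),
                tref :: undefined | reference()}).

start_link(Opts) ->
    gen_server:start_link(?MODULE, Opts, []).

init(#{batch_size := BatchSize, flush_interval := Interval, flush_callback := CB} = Opts) ->
    {ok, #state{batch_size = BatchSize, flush_interval = Interval, flush_callback = CB,
                flush_extra = maps:get(flush_extra, Opts, #{})}}.

handle_call(Msg, _From, State) ->
    {reply, {error, {unexpected_call, Msg}}, State}.

%% Accumulate a task; flush when the batch is full, otherwise make sure the
%% flush timer is armed only while the queue is non-empty.
handle_cast({task, Task}, #state{queue = Q, queue_len = Len, batch_size = BatchSize} = State) ->
    State1 = State#state{queue = [Task | Q], queue_len = Len + 1},
    case Len + 1 >= BatchSize of
        true  -> {noreply, flush(State1)};
        false -> {noreply, maybe_arm_timer(State1)}
    end.

%% The timer fired before the batch filled up: flush whatever has accumulated.
handle_info({timeout, TRef, flush}, #state{tref = TRef} = State) ->
    {noreply, flush(State#state{tref = undefined})};
handle_info(_Msg, State) ->
    {noreply, State}.   % e.g. a stale timeout message after an early flush

%% Flush any pending tasks on shutdown so they are not silently lost.
terminate(_Reason, #state{queue_len = 0}) -> ok;
terminate(_Reason, State) -> _ = flush(State), ok.

maybe_arm_timer(#state{tref = undefined, flush_interval = Interval} = State) ->
    State#state{tref = erlang:start_timer(Interval, self(), flush)};
maybe_arm_timer(State) ->
    State.

flush(#state{queue = Q, flush_callback = CB, flush_extra = Extra, tref = TRef} = State) ->
    case TRef of
        undefined -> ok;
        _ -> erlang:cancel_timer(TRef)
    end,
    %% The queue is stored newest first, so reverse it to preserve submission order.
    _ = CB(lists:reverse(Q), Extra),
    State#state{queue = [], queue_len = 0, tref = undefined}.
```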
This module implements a worker pool abstraction to handle batch workers. It starts a transient supervisor under ejabberd_sup, which spawns a pool using the `worker_pool` library. That pool then supervises the three supervisors that `worker_pool` manages (time-manager, queue-manager and worker-sup). One of these, worker-sup, in turn spawns all the mongoose_batch_worker workers with the correct configuration.
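Roughly, bringing such a pool up with `worker_pool` could look like the sketch below. The function name and worker options are placeholders, and the option list is based on wpool's documented API as I understand it, not on this module's actual code:

```erlang
%% Illustrative only: start a supervised pool of batch workers via wpool.
start_pool(PoolName, PoolSize, WorkerOpts) ->
    WPoolOpts = [{workers, PoolSize},
                 {worker, {mongoose_batch_worker, WorkerOpts}}],
    wpool:start_sup_pool(PoolName, WPoolOpts).
```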
First of all, remove all the old modules implementing the MAM RDBMS async pools. Then, create a new module called mod_mam_rdbms_arch_async, which abstracts what the previous two were doing by using the modules created in the two previous commits. This module starts the pools with all the correct parameters, and registers the hooks, which then call `worker_pool` using the `hash_worker` strategy, so that all MAM entries for a single archive are handled by the same worker while different archives are processed in parallel. A few lessons here:
* Selecting the worker with the `rem` operator isn't sound, because it does not guarantee a uniform distribution of archives over workers. `worker_pool` uses `erlang:phash2/2`, which does ensure such a uniform distribution.
* The callbacks for the batch workers abstract all the metrics.
* mod_mam_meta parsing has breaking changes: this is what was reported as broken in the previous implementation. There is now an entire TOML section called `async_writer` that can be enabled, disabled and configured globally for MAM, or in a fine-grained manner within the `pm` and `muc` sections.
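For illustration, handing a MAM entry over to the pool with the `hash_worker` strategy could look like this. The task shape is a placeholder and pool_name/1 is a hypothetical lookup (see the persistent_term sketch further down); this is not the actual hook code:

```erlang
%% Illustrative only: a hook handler submitting a MAM entry to the pool.
archive_message(#{archive_id := ArcID} = Params, HostType) ->
    PoolName = pool_name(HostType),
    %% Hashing on the archive id keeps all entries of one archive on the same
    %% worker (and in order), while different archives spread over the pool.
    wpool:cast(PoolName, {task, Params}, {hash_worker, ArcID}).
```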
The purpose of this is for the supervisor to be able to dynamically generate data that will be passed to the workers; for example, the supervisor might create ETS tables. We keep this in the context of the supervisor and its workers to keep the abstraction isolated to this supervision tree: an ETS table may not make any sense outside of it. We might want the table not to even be named, so that only the workers get an opaque reference to it and nobody outside of the tree can access it. We also want such tables to be cleaned up when the supervision tree dies, as the data might not make sense on a newly restarted tree, or we may want the tree to die fast if the table is lost. This init callback can also set global configuration flags, send messages, log something, define new metrics, or do many other things that, for example, the current MongooseIM init callbacks for `mongoose_wpool` already do.
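A tiny sketch of the kind of init callback described here. The callback name and the shape of the returned map are assumptions, not the actual mongoose_async_pools API:

```erlang
%% Illustrative only: pool-level init creating worker-shared, tree-scoped data.
init_pool_extra(HostType, _PoolOpts) ->
    %% An unnamed table: only workers that receive this reference can use it,
    %% and it is destroyed together with the owning supervision tree.
    Tid = ets:new(async_pool_cache, [set, public]),
    #{host_type => HostType, cache_table => Tid}.
```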
The whole operation of building lists, appending them and generating an atom at the end is quite expensive: not only does it generate a lot of lists and copies of lists that later need to be garbage collected, but the `list_to_atom` operation is also a concurrency bottleneck: the atom table is optimised for reads, not for inserts, so an insert needs to grab locks, check whether the atom already exists, and so on. Instead, store the name in a persistent_term record and simply fetch it when needed, making the hotter part of the code, finding the pool to submit a task to, faster.
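For example, a sketch of that pattern; the key shape and function names are assumptions, only the persistent_term calls are standard OTP:

```erlang
%% Illustrative only: store the pool name once, fetch it on the hot path.
put_pool_name(HostType, PoolName) ->
    persistent_term:put({mam_async_pool, HostType}, PoolName).

%% Hot path: a single lookup instead of list building and list_to_atom/1.
pool_name(HostType) ->
    persistent_term:get({mam_async_pool, HostType}).
```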
Note that running garbage collection within a function will not collect the references that function still holds, so the queue was actually not being cleared after a flush as expected. Instead, we use the `async` option for the GC: the process finishes its reductions, and once it has been preempted it is scheduled for garbage collection; afterwards, a message is delivered notifying it that the collection has happened.
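As an illustration of the mechanism (a sketch, not the exact code in this commit), requesting the asynchronous GC and handling its completion message could look like this:

```erlang
%% Illustrative only: ask the runtime to GC this process asynchronously.
request_gc() ->
    Ref = make_ref(),
    _ = erlang:garbage_collect(self(), [{async, Ref}]),
    Ref.

%% ...and in handle_info/2 of the worker (other clauses omitted):
handle_info({garbage_collect, _Ref, _Result}, State) ->
    {noreply, State}.
```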
Force-pushed from 19e818d to 6479445.
CI results for 6479445:
* small_tests_24 / small_tests
* small_tests_23 / small_tests
* dynamic_domains_pgsql_mnesia_23 / pgsql_mnesia
* dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia
* dynamic_domains_mysql_redis_24 / mysql_redis
* ldap_mnesia_24 / ldap_mnesia
* ldap_mnesia_23 / ldap_mnesia
* internal_mnesia_24 / internal_mnesia
* elasticsearch_and_cassandra_24 / elasticsearch_and_cassandra_mnesia
* pgsql_mnesia_24 / pgsql_mnesia
* pgsql_mnesia_23 / pgsql_mnesia
* mysql_redis_24 / mysql_redis
* riak_mnesia_24 / riak_mnesia
* dynamic_domains_mssql_mnesia_24 / odbc_mssql_mnesia, with the following failures:
  * mam_SUITE:rdbms_async_cache_mam_all:mam06:retract_message — {archive_size,1,[{times,200,0}]}: mongoose_helper:do_wait_until/2 timed out (mongoose_helper.erl:377, called from mam_SUITE.erl:1927 within escalus_story:story/4)
  * mam_SUITE:rdbms_async_cache_mam_all:mam06:retract_wrong_message — {archive_size,1,[{times,200,0}]}: same wait_until timeout
  * mam_SUITE:rdbms_async_cache_mam_all:mam06:querying_for_all_messages_with_jid — {test_case_failed,"Respond size is 7, 17 is expected."}
  * service_domain_db_SUITE:db:db_keeps_syncing_after_cluster_join — {test_case_failed,{[<<"example1.com">>],[<<"example1.com">>,<<"example2.com">>]}}
  * service_domain_db_SUITE:db:rest_with_auth:rest_delete_domain_cleans_data_from_mam — timeout_when_waiting_for_stanza in escalus_client:wait_for_stanza/2 (escalus_client.erl:136)
  * service_domain_db_SUITE:db:rest_without_auth:rest_delete_domain_cleans_data_from_mam — timeout_when_waiting_for_stanza in escalus_client:wait_for_stanza/2 (escalus_client.erl:136)
  * vcard_SUITE:ro_full:retrieve_own_card — {test_case_failed,"Expected <<\"alice\">> got undefined"}
This PR learns from the MAM async workers and tries to separate two different concepts into two different abstractions: one is the concept of a pool of workers that accumulate tasks and act on them every time the batch size is reached, or a timeout since the first submitted task fires; the second is how MAM acts on the batches.
Here I attempt to implement the first concept in a very generic way, using the new modules mongoose_batch_worker and mongoose_async_pools, as an abstraction that takes the batch size, the flush timeout, and a callback to be run on the accumulated queues. This is then used by a new MAM async module, but the goal is that it can also be used, for example, by a future inbox async module, or any other extension that would benefit from batching (I know of quite a few myself!). Note that in the case of inbox, the batches would accumulate inbox changes for a user and then submit only the last one to the DB: unlike MAM, which is append-only, inbox is update-only, so applying all inbox changes sequentially is equivalent to applying just the last one, as sketched below.
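A small sketch of that last-write-wins idea, assuming a {Key, Change} task shape that is purely a placeholder:

```erlang
%% Illustrative only: collapse a batch of update-only tasks (e.g. inbox
%% changes) to the last change per key before touching the database.
collapse_updates(Queue) ->
    %% Queue is in submission order, so later entries overwrite earlier ones.
    Last = lists:foldl(fun({Key, Change}, Acc) -> Acc#{Key => Change} end,
                       #{}, Queue),
    maps:to_list(Last).
```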
NOTE: please do read the commit messages, as they contain detailed information on the code submitted in each of them.