osiris_replica_reader: Stop with normal if the leader is already gone during `init/1`

[Why]
In the context of RabbitMQ, if a stream queue is deleted right after
being declared, there is a chance that some Osiris processes might not
be ready yet at the time the queue is deleted.

In particular, the `osiris_replica_reader` process monitors the given
leader (an `osiris_writer` process in the context of a RabbitMQ stream
queue) during its `init/1` and that process might be stopped already.

When this happens, here is the crash that is logged:

    [error] <0.1548.0> ** Generic server <0.1548.0> terminating
    [error] <0.1548.0> ** Last message in was {'DOWN',#Ref<0.1118981177.1281884162.97904>,process,
    [error] <0.1548.0>                                <0.1535.0>,noproc}
    [error] <0.1548.0> ** When Server state == {state,
    [error] <0.1548.0>                          {osiris_log,
    [error] <0.1548.0>                           {cfg,
    [error] <0.1548.0>                            ".../__delete_queue_1716383944197847531",
    [error] <0.1548.0>                            <<"__delete_queue_1716383944197847531">>,500000000,
    [error] <0.1548.0>                            256000,#{},[],
    [error] <0.1548.0>                            {write_concurrency,
    [error] <0.1548.0>                             #Ref<0.1118981177.1282015234.97903>},
    [error] <0.1548.0>                            {osiris_replica_reader,
    [error] <0.1548.0>                             {resource,<<"/">>,queue,<<"delete_queue">>},
    [error] <0.1548.0>                             {127,0,0,1},
    [error] <0.1548.0>                             6489},
    [error] <0.1548.0>                            #Fun<osiris_writer.0.78287785>,
    [error] <0.1548.0>                            #Ref<0.1118981177.1282015234.97826>,16},
    [error] <0.1548.0>                           {read,data,0,tcp,all,8,undefined},
    [error] <0.1548.0>                           undefined,undefined,
    [error] <0.1548.0>                           {file_descriptor,prim_file,
    [error] <0.1548.0>                            #{handle => #Ref<0.1118981177.1282015238.91045>,
    [error] <0.1548.0>                              owner => <0.1548.0>,
    [error] <0.1548.0>                              r_buffer => #Ref<0.1118981177.1282015234.97902>,
    [error] <0.1548.0>                              r_ahead_size => 0}}},
    [error] <0.1548.0>                          <<"__delete_queue_1716383944197847531">>,tcp,
    [error] <0.1548.0>                          #Port<0.84>,<33363.1916.0>,<0.1535.0>,
    [error] <0.1548.0>                          #Ref<0.1118981177.1281884162.97904>,
    [error] <0.1548.0>                          {write_concurrency,
    [error] <0.1548.0>                           #Ref<0.1118981177.1282015234.97903>},
    [error] <0.1548.0>                          {osiris_replica_reader,
    [error] <0.1548.0>                           {resource,<<"/">>,queue,<<"delete_queue">>},
    [error] <0.1548.0>                           {127,0,0,1},
    [error] <0.1548.0>                           6489},
    [error] <0.1548.0>                          -1,0}
    [error] <0.1548.0> ** Reason for termination ==
    [error] <0.1548.0> ** noproc

This happens because the `osiris_replica_reader` process receives the
`DOWN` message for its leader monitor with the `noproc` reason and reuses
it as its own exit reason. Because `noproc` is an abnormal exit reason,
a crash is logged.
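
The `noproc` reason can be reproduced outside Osiris. The following is a minimal sketch, not Osiris code: the module name `noproc_monitor_sketch` and its function are hypothetical and only illustrate that monitoring a process that has already exited yields an immediate `DOWN` message with reason `noproc`.

```erlang
-module(noproc_monitor_sketch).
-export([monitor_dead_process/0]).

%% Hypothetical helper, for illustration only.
%% Spawn a process that exits right away, wait until it is really gone,
%% then monitor it and return the reason carried by the 'DOWN' message.
monitor_dead_process() ->
    {Pid, SpawnRef} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', SpawnRef, process, Pid, _} -> ok
    end,
    %% The process no longer exists here, so this monitor fires at once.
    LateRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', LateRef, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling `noproc_monitor_sketch:monitor_dead_process()` should return `noproc`, the same reason that appears in the crash report above.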

[How]
There is no reason to log such a crash when the process tree is being
shut down concurrently. `osiris_replica_reader` can terminate with a
`normal` reason.

That is what this patch does: if the leader exit reason is `noproc`, it
terminates with the `normal` reason instead.
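
The effect of the mapping can be sketched with a standalone `gen_server`. This is an illustration under assumptions, not the Osiris module: `replica_reader_sketch`, its map-based state and `exit_reason_for_dead_leader/0` are hypothetical names, and `gen_server:start_monitor/3` requires OTP 23 or later.

```erlang
-module(replica_reader_sketch).
-behaviour(gen_server).

-export([exit_reason_for_dead_leader/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Monitor the leader during init/1, as the real process does.
init(LeaderPid) ->
    Ref = erlang:monitor(process, LeaderPid),
    {ok, #{leader_monitor_ref => Ref}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Same idea as the patch: `noproc` means the leader was already gone
%% when it was monitored, so stop with `normal` instead of crashing.
handle_info({'DOWN', Ref, process, _Pid, Info},
            #{leader_monitor_ref := Ref} = State) ->
    Reason = case Info of
                 noproc -> normal;
                 _ -> Info
             end,
    {stop, Reason, State};
handle_info(_Other, State) ->
    {noreply, State}.

%% Hypothetical driver: start the server against a dead "leader" and
%% return the reason the server exits with.
exit_reason_for_dead_leader() ->
    {Leader, LeaderRef} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', LeaderRef, process, Leader, _} -> ok
    end,
    %% start_monitor/3 sets up the monitor atomically with the spawn,
    %% so the server's exit reason is not missed.
    {ok, {Server, ServerRef}} =
        gen_server:start_monitor(?MODULE, Leader, []),
    receive
        {'DOWN', ServerRef, process, Server, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

With the mapping in place, `replica_reader_sketch:exit_reason_for_dead_leader()` should return `normal`, and `gen_server` does not emit a crash report for that reason. Any other leader exit reason is still propagated unchanged.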
dumbbell committed Jun 24, 2024
1 parent 2994958 commit 8a4ec95
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion src/osiris_replica_reader.erl
@@ -250,7 +250,14 @@ handle_info({'DOWN', Ref, _, _, Info},
               [Info, 10]),
     %% this should be enough to make the replica shut down
     ok = close(Transport, Sock),
-    {stop, Info, State};
+    %% If the reason is `noproc`, it means the leader is already gone at the
+    %% time `init/1` was called. Therefore the set of processes is being shut
+    %% down concurrently. We can exit with the `normal` reason in this case.
+    Reason = case Info of
+                 noproc -> normal;
+                 _ -> Info
+             end,
+    {stop, Reason, State};
 handle_info({tcp_closed, Socket},
             #state{name = Name, socket = Socket} = State) ->
     ?DEBUG_(Name, "Socket closed. Exiting...", []),
