osiris_replica_reader: Stop with `normal` if the leader is already gone during `init/1` · rabbitmq/osiris@8a4ec95

[Why]
In the context of RabbitMQ, if a stream queue is deleted right after
being declared, there is a chance that some Osiris processes might not
be ready yet at the time the queue is deleted.

In particular, the `osiris_replica_reader` process monitors the given
leader (an `osiris_writer` process in the context of a RabbitMQ stream
queue) during its `init/1` and that process might be stopped already.

When this happens, here is the crash that is logged:

    [error] <0.1548.0> ** Generic server <0.1548.0> terminating
    [error] <0.1548.0> ** Last message in was {'DOWN',#Ref<0.1118981177.1281884162.97904>,process,
    [error] <0.1548.0>                                <0.1535.0>,noproc}
    [error] <0.1548.0> ** When Server state == {state,
    [error] <0.1548.0>                          {osiris_log,
    [error] <0.1548.0>                           {cfg,
    [error] <0.1548.0>                            ".../__delete_queue_1716383944197847531",
    [error] <0.1548.0>                            <<"__delete_queue_1716383944197847531">>,500000000,
    [error] <0.1548.0>                            256000,#{},[],
    [error] <0.1548.0>                            {write_concurrency,
    [error] <0.1548.0>                             #Ref<0.1118981177.1282015234.97903>},
    [error] <0.1548.0>                            {osiris_replica_reader,
    [error] <0.1548.0>                             {resource,<<"/">>,queue,<<"delete_queue">>},
    [error] <0.1548.0>                             {127,0,0,1},
    [error] <0.1548.0>                             6489},
    [error] <0.1548.0>                            #Fun<osiris_writer.0.78287785>,
    [error] <0.1548.0>                            #Ref<0.1118981177.1282015234.97826>,16},
    [error] <0.1548.0>                           {read,data,0,tcp,all,8,undefined},
    [error] <0.1548.0>                           undefined,undefined,
    [error] <0.1548.0>                           {file_descriptor,prim_file,
    [error] <0.1548.0>                            #{handle => #Ref<0.1118981177.1282015238.91045>,
    [error] <0.1548.0>                              owner => <0.1548.0>,
    [error] <0.1548.0>                              r_buffer => #Ref<0.1118981177.1282015234.97902>,
    [error] <0.1548.0>                              r_ahead_size => 0}}},
    [error] <0.1548.0>                          <<"__delete_queue_1716383944197847531">>,tcp,
    [error] <0.1548.0>                          #Port<0.84>,<33363.1916.0>,<0.1535.0>,
    [error] <0.1548.0>                          #Ref<0.1118981177.1281884162.97904>,
    [error] <0.1548.0>                          {write_concurrency,
    [error] <0.1548.0>                           #Ref<0.1118981177.1282015234.97903>},
    [error] <0.1548.0>                          {osiris_replica_reader,
    [error] <0.1548.0>                           {resource,<<"/">>,queue,<<"delete_queue">>},
    [error] <0.1548.0>                           {127,0,0,1},
    [error] <0.1548.0>                           6489},
    [error] <0.1548.0>                          -1,0}
    [error] <0.1548.0> ** Reason for termination ==
    [error] <0.1548.0> ** noproc

That is because the `osiris_replica_reader` process receives a `DOWN`
message with the `noproc` reason from its monitor on the leader, and
reuses that reason as its own exit reason. Because `noproc` is an
abnormal exit reason, a crash is logged.
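
The `noproc` reason can be reproduced outside Osiris. The following is a minimal sketch, not code from this repository: the module name `noproc_demo` and the function `monitor_dead/0` are made up for illustration. It relies only on the documented behaviour of `erlang:monitor/2`, which delivers a `DOWN` message right away when the monitored process does not exist.

```erlang
-module(noproc_demo).
-export([monitor_dead/0]).

%% Monitor a process that has already exited and return the reason
%% carried by the resulting 'DOWN' message.
monitor_dead() ->
    %% spawn_monitor/1 lets us wait until the process is really gone.
    {Pid, Ref0} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', Ref0, process, Pid, normal} -> ok
    end,
    %% Monitoring a dead pid does not fail: the runtime immediately
    %% delivers a 'DOWN' message whose reason is noproc.
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling `noproc_demo:monitor_dead()` should return `noproc`, which is the same reason the replica reader sees when the leader is stopped before `init/1` sets up the monitor.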

[How]
There is no reason to log such a crash when the process tree is being
shut down concurrently. `osiris_replica_reader` can terminate with a
`normal` reason.

That is what this patch does: if the leader exit reason is `noproc`, it
terminates with the `normal` reason instead.
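
The same idea can be sketched in a standalone `gen_server`. This is an illustration under stated assumptions, not the actual `osiris_replica_reader` code: the module `leader_watcher`, its state (just the monitor reference) and the helper `exit_reason_for_dead_leader/0` are hypothetical, and `gen_server:start_monitor/3` requires OTP 23 or later.

```erlang
-module(leader_watcher).
-behaviour(gen_server).

-export([exit_reason_for_dead_leader/0, stop_reason/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% The leader is monitored during init/1, as in the commit message.
init(Leader) ->
    Ref = erlang:monitor(process, Leader),
    {ok, Ref}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% When the leader goes down, stop with a reason derived from the
%% leader's exit reason.
handle_info({'DOWN', Ref, process, _Pid, Info}, Ref) ->
    {stop, stop_reason(Info), Ref};
handle_info(_Other, State) ->
    {noreply, State}.

%% noproc means the leader was already gone when the monitor was set
%% up, so exit quietly. Any other reason is passed through unchanged.
stop_reason(noproc) -> normal;
stop_reason(Info) -> Info.

%% Start a watcher on a leader that is already dead and return the
%% watcher's own exit reason.
exit_reason_for_dead_leader() ->
    {Leader, LRef} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', LRef, process, Leader, _} -> ok
    end,
    %% start_monitor/3 sets up the monitor atomically with the start,
    %% so the watcher cannot exit before we are watching it.
    {ok, {Watcher, WRef}} = gen_server:start_monitor(?MODULE, Leader, []),
    receive
        {'DOWN', WRef, process, Watcher, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Here `leader_watcher:exit_reason_for_dead_leader()` should return `normal`, and because `normal` is not an abnormal exit reason, `gen_server` does not log a crash report for it.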

dumbbell committed Jun 24, 2024

1 parent 2994958 commit 8a4ec95

src/osiris_replica_reader.erl

@@ -250,7 +250,14 @@ handle_info({'DOWN', Ref, _, _, Info},
                [Info, 10]),
         %% this should be enough to make the replica shut down
         ok = close(Transport, Sock),
-        {stop, Info, State};
+        %% If the reason is `noproc`, it means the leader is already gone at the
+        %% time `init/1` was called. Therefore the set of processes is being shut
+        %% down concurrently. We can exit with the `normal` reason in this case.
+        Reason = case Info of
+                     noproc -> normal;
+                     _ -> Info
+                 end,
+        {stop, Reason, State};
     handle_info({tcp_closed, Socket},
                 #state{name = Name, socket = Socket} = State) ->
         ?DEBUG_(Name, "Socket closed. Exiting...", []),