Describe the bug
When a stream is deleted, the stream coordinator may delete the underlying stream while the deletion from rabbit_db_queue fails, since the coordinator and the metadata store are independent Raft clusters. If this happens and the stream coordinator fully deletes the stream, the stream queue becomes effectively stuck: subsequent calls to rabbit_stream_queue:delete/4 time out, because the stream coordinator does not reply to the caller when it receives a {delete_stream, StreamId, #{}} command for a stream it does not know about.
Reproduction steps
This is probably very hard to reproduce in practice, but we can fake the state easily from a shell:
1. make run-broker
2. stream-perf-test --time 1 (create the stream)
3. [SQ] = rabbit_db_queue:get_all().
4. rabbit_stream_coordinator:process_command({delete_stream, maps:get(name, amqqueue:get_type_state(SQ)), #{}}). (simulate the partial failure by deleting only from the stream coordinator and not from the metadata store)
After this, the stream cannot be deleted via AMQP, the stream protocol, the management UI, etc.
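From the same shell, a subsequent delete attempt illustrates the hang (a sketch; the exact argument values passed for if-unused, if-empty and the acting user are assumptions):

```erlang
%% This call now times out because the stream coordinator never replies
%% to the {delete_stream, StreamId, #{}} command for an unknown stream:
rabbit_stream_queue:delete(SQ, false, false, <<"guest">>).
```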
Expected behavior
I think it is reasonable for the stream coordinator to reply ok when it receives a delete_stream command for an unknown stream ID, making the command idempotent.
Additional context
The stream coordinator replies only after it performs the deletion. If there is no such stream we hit the clause here, which produces no reply, so calls to delete a stream that does not exist (according to the coordinator) time out. We could add a clause for the delete_stream command against an undefined stream that results in an ok reply, although it looks like this may take some refactoring. rabbit_stream_coordinator:delete/2 would then continue on to delete from the metadata store.
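A minimal sketch of what the extra clause could look like. The function head, the state record fields and the return/4 helper are assumptions about the coordinator's internals, not the actual implementation:

```erlang
%% Hypothetical sketch: handle delete_stream for an unknown StreamId.
%% Clause placement, the #?MODULE{} state record and the return/4
%% helper are assumptions about rabbit_stream_coordinator internals.
apply(Meta, {delete_stream, StreamId, #{}},
      #?MODULE{streams = Streams} = State)
  when not is_map_key(StreamId, Streams) ->
    %% The stream is already gone (or never existed); reply ok so the
    %% command is idempotent and the caller does not time out. The
    %% caller, rabbit_stream_coordinator:delete/2, then continues on
    %% to delete the queue record from the metadata store.
    return(Meta, State, ok, []).
```

With a clause like this, replaying a delete_stream command after a partial failure would succeed instead of timing out, which also unsticks streams that ended up in the state described above.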