CI Failure (Consumed from an unexpected offset) in PartitionMoveInterruption.test_cancelling_partition_move #17847

Closed
vbotbuildovich opened this issue Apr 12, 2024 · 23 comments · Fixed by #18021 or #18153
Assignees: ztlpn
Labels: area/replication, auto-triaged, ci-failure, ci-rca/redpanda

vbotbuildovich commented Apr 12, 2024

https://buildkite.com/redpanda/redpanda/builds/47713

Module: rptest.tests.partition_move_interruption_test
Class: PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments: {
    "recovery": "restart_recovery",
    "compacted": false,
    "unclean_abort": true,
    "replication_factor": 3
}
test_id:    PartitionMoveInterruption.test_cancelling_partition_move
status:     FAIL
run time:   141.675 seconds

Exception('VerifiableConsumer-0-139821824201184-worker-1: Traceback (most recent call last):\n  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 38, in _protected_worker\n    self._worker(idx, node)\n  File "/root/tests/rptest/services/verifiable_consumer.py", line 356, in _worker\n    raise e\n  File "/root/tests/rptest/services/verifiable_consumer.py", line 338, in _worker\n    handler.handle_records_consumed(event, self.logger)\n  File "/root/tests/rptest/services/verifiable_consumer.py", line 101, in handle_records_consumed\n    raise AssertionError(msg)\nAssertionError: Consumed from an unexpected offset (1455, 0) for partition TopicPartition(topic=\'topic-zrjtbbdhfp\', partition=0)\n')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 104, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 199, in test_cancelling_partition_move
    self.consumer.stop()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 86, in stop
    self._propagate_exceptions()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 100, in _propagate_exceptions
    raise Exception(self.errors)
Exception: VerifiableConsumer-0-139821824201184-worker-1: Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 356, in _worker
    raise e
  File "/root/tests/rptest/services/verifiable_consumer.py", line 338, in _worker
    handler.handle_records_consumed(event, self.logger)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 101, in handle_records_consumed
    raise AssertionError(msg)
AssertionError: Consumed from an unexpected offset (1455, 0) for partition TopicPartition(topic='topic-zrjtbbdhfp', partition=0)

JIRA Link: CORE-2353

vbotbuildovich added the auto-triaged and ci-failure labels Apr 12, 2024
ztlpn self-assigned this Apr 22, 2024
ztlpn added a commit to ztlpn/redpanda that referenced this issue Apr 23, 2024
Previously, when force-aborting a reconfiguration, we appended an
aborting configuration on all replicas. This can lead to log inconsistencies
as on followers the configuration will be duplicated (one from own append,
one replicated by the leader). Although these inconsistencies are
expected for force-abort, if the leader is alive, we can minimize the chance
of their appearance by waiting on followers for the aborting config to be
replicated from the leader.

Fixes redpanda-data#17847
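
To make the duplication described in the commit message concrete, here is a toy model of it. This is only a sketch: `Replica`, `force_abort_old`, `force_abort_new`, and `make_group` are hypothetical names, logs are plain Python lists, and nothing here is Redpanda code. It assumes the leader is alive, which is the case the commit targets.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one replica's log (not a Redpanda type)."""
    name: str
    log: list = field(default_factory=list)


def force_abort_old(leader, followers):
    """Old behaviour: every replica appends the aborting config locally."""
    for replica in [leader, *followers]:
        replica.log.append("abort_cfg")
    # The leader then replicates its own copy, so followers end up with two.
    for follower in followers:
        follower.log.append("abort_cfg")


def force_abort_new(leader, followers):
    """New behaviour: followers wait for the leader's copy to arrive."""
    leader.log.append("abort_cfg")
    for follower in followers:
        follower.log.append("abort_cfg")  # arrives via replication only


def make_group():
    """Build a leader and two followers that start with identical logs."""
    data = ["d0", "d1"]
    leader = Replica("leader", list(data))
    followers = [Replica("f1", list(data)), Replica("f2", list(data))]
    return leader, followers
```

In this model the old path leaves each follower one entry longer than the leader, while the new path keeps all three logs identical.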

ztlpn commented Apr 23, 2024

This was indirectly caused by #17789, which fixed a bug in offset translation of the log end offset and, as a result, made fetch offset validation stricter. With force-abort there is a log discrepancy between the leader and the followers; after a leadership change, this leads to an offset-out-of-range error and a fetch offset reset. Previously this did not surface because fetch offset validation was incorrect. Although the discrepancy is somewhat expected for force-abort, we can minimize the chance of it occurring; see the attached PR.
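
A rough sketch of how an out-of-range fetch offset could turn into the assertion in the report. Everything here is an assumption for illustration: the function names are hypothetical, the log range is invented, the reset policy is assumed to be "earliest" (consistent with the consumed offset 0 in the failure), and the tuple in the message is assumed to mean (expected position, first consumed offset). It is not the actual `verifiable_consumer.py` or broker code.

```python
class OffsetOutOfRange(Exception):
    """Raised when a fetch offset falls outside the log's valid range."""


def validate_fetch_offset(fetch_offset, log_start, log_end):
    """Strict validation: the offset must lie within [log_start, log_end]."""
    if not (log_start <= fetch_offset <= log_end):
        raise OffsetOutOfRange(fetch_offset)
    return fetch_offset


def next_fetch_offset(position, log_start, log_end):
    """Return where the consumer fetches from, resetting to earliest on error."""
    try:
        return validate_fetch_offset(position, log_start, log_end)
    except OffsetOutOfRange:
        return log_start  # assumed "earliest" reset policy


def check_consumed(position, min_offset, partition):
    """Fail if the first consumed offset is not the expected position."""
    if min_offset != position:
        raise AssertionError(
            "Consumed from an unexpected offset (%d, %d) for partition %s"
            % (position, min_offset, partition))
```

With a position of 1455 and an invented log range of 0 to 1454, `next_fetch_offset` falls back to 0, and `check_consumed(1455, 0, ...)` then raises with the same `(1455, 0)` pair seen in the traceback.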

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Apr 24, 2024
(cherry picked from commit 8e221d3)
ztlpn changed the title from "CI Failure (key symptom) in PartitionMoveInterruption.test_cancelling_partition_move" to "CI Failure (Consumed from an unexpected offset) in PartitionMoveInterruption.test_cancelling_partition_move" Apr 24, 2024
ztlpn added the ci-rca/redpanda label Apr 24, 2024
ztlpn added a commit to ztlpn/redpanda that referenced this issue Apr 24, 2024
(cherry picked from commit 8e221d3)