CI Failure (TimeoutError: Node docker-rp-xx draining leaderships) in BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth #10136

Closed
dotnwat opened this issue Apr 17, 2023 · 31 comments
Labels: area/kafka, area/tests, ci-failure, kind/bug, sev/low

Comments

@dotnwat (Member) commented Apr 17, 2023

https://buildkite.com/redpanda/redpanda/builds/27277#01878dc4-9515-414e-88a6-89d717325c82

Module: rptest.tests.pandaproxy_test
Class:  BasicAuthUpgradeTest
Method: test_upgrade_and_enable_basic_auth
Arguments:
{
  "base_release": [
    22,
    3
  ],
  "next_release": [
    23,
    1
  ]
}
====================================================================================================
test_id:    rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.3.next_release=.23.1
status:     FAIL
run time:   54.194 seconds


    TimeoutError('Node docker-rp-23 draining leaderships')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/pandaproxy_test.py", line 2127, in test_upgrade_and_enable_basic_auth
    self.redpanda.rolling_restart_nodes(self.redpanda.nodes)
  File "/root/tests/rptest/services/redpanda.py", line 2289, in rolling_restart_nodes
    restarter.restart_nodes(nodes,
  File "/root/tests/rptest/services/rolling_restarter.py", line 88, in restart_nodes
    wait_until(lambda: has_drained_leaders(node),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Node docker-rp-23 draining leaderships

JIRA Link: CORE-1276
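
For reference, the wait in the traceback above is ducktape's standard polling helper: rolling_restarter.py puts the broker into maintenance mode and then polls the Admin API until the node reports that leadership draining has finished. Below is a minimal sketch of that pattern, not the exact code; the function name, timeout, and backoff values are illustrative.

from ducktape.utils.util import wait_until

def wait_for_leadership_drain(admin, redpanda, node, timeout_sec=90):
    # Illustrative sketch of the wait performed by rolling_restarter.py:
    # after the maintenance-mode PUT succeeds, poll the Admin API until the
    # broker reports that leadership draining has started and finished.
    def has_drained_leaders():
        node_id = redpanda.idx(node)
        status = admin.get_broker(node_id, node=node)["maintenance_status"]
        return status["draining"] and status["finished"]

    wait_until(has_drained_leaders,
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg=f"Node {node.account.hostname} draining leaderships")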

@dotnwat added the kind/bug and ci-failure labels on Apr 17, 2023
@dotnwat (Member, Author) commented Apr 17, 2023

Probably related to #9049.

@dlex (Contributor) commented May 20, 2023

A very similar failure in https://buildkite.com/redpanda/redpanda/builds/29432#01883193-77a7-4f46-bac1-be8fac94ff52

test_id:    rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=False
status:     FAIL
run time:   1 minute 31.113 seconds


    TimeoutError('Node docker-rp-23 draining leaderships')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_bootstrap_test.py", line 137, in test_change_bootstrap_configs_during_upgrade
    self.redpanda.rolling_restart_nodes(self.redpanda.nodes,
  File "/root/tests/rptest/services/redpanda.py", line 953, in rolling_restart_nodes
    restarter.restart_nodes(nodes,
  File "/root/tests/rptest/services/rolling_restarter.py", line 88, in restart_nodes
    wait_until(lambda: has_drained_leaders(node),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Node docker-rp-23 draining leaderships

Log excerpts from docker-rp-23 (node_id 1):

[DEBUG - 2023-05-19 02:04:33,890 - admin - _request - lineno:332]: Dispatching put http://docker-rp-23:9644/v1/brokers/1/maintenance
DEBUG 2023-05-19 02:04:33,893 [shard 0] admin_api_server - admin_server.cc:365 - [admin] PUT http://docker-rp-23:9644/v1/brokers/1/maintenance 
INFO  2023-05-19 02:04:33,894 [shard 0] cluster - members_table.cc:221 - marking node 1 in maintenance state 
INFO  2023-05-19 02:04:33,894 [shard 1] cluster - members_table.cc:221 - marking node 1 in maintenance state    
...
INFO  2023-05-19 02:04:33,894 [shard 0] cluster - drain_manager.cc:49 - Node draining is starting                                                                                                                              
INFO  2023-05-19 02:04:33,894 [shard 0] cluster - drain_manager.cc:143 - Node draining has started                                                                                                                             
INFO  2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2946 - Starting leadership transfer from {id: {1}, revision: {0}} to {id: {3}, revision: {0}} in term 1                      
TRACE 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2801 - transfer leadership: preparing target={id: {3}, revision: {0}}, dirty_offset=21                                       
TRACE 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2807 - transfer leadership: cleared oplock                                                                                   
DEBUG 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2839 - transfer leadership: node {id: {3}, revision: {0}} doesn't need recovery or is already recovering (is_recovering false dirty offset 21)
DEBUG 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2872 - transfer leadership: node {id: {3}, revision: {0}} is not recovering, proceeding (dirty offset 21)                    
INFO  2023-05-19 02:04:33,894 [shard 0] cluster - drain_manager.cc:207 - Draining leadership from 1 groups                                                                                                                     
INFO  2023-05-19 02:04:33,894 [shard 1] cluster - drain_manager.cc:49 - Node draining is starting                                                                                                                              
INFO  2023-05-19 02:04:33,894 [shard 1] cluster - drain_manager.cc:143 - Node draining has started                                                                                                                             
INFO  2023-05-19 02:04:33,894 [shard 1] cluster - drain_manager.cc:256 - Node draining has completed on shard 1                                                                                                                
TRACE 2023-05-19 02:04:33,894 [shard 0] storage - readers_cache.cc:328 - {redpanda/controller/0} - removing reader: [0,21] lower_bound: 22                                                                                     
[DEBUG - 2023-05-19 02:04:33,895 - admin - _request - lineno:361]: Response OK                                                                                                                                                 
[INFO  - 2023-05-19 02:04:33,895 - rolling_restarter - restart_nodes - lineno:86]: Waiting for node docker-rp-23 leadership drain                                                                                              
[DEBUG - 2023-05-19 02:04:33,895 - admin - _request - lineno:332]: Dispatching get http://docker-rp-23:9644/v1/brokers/1                                                                                                       
INFO  2023-05-19 02:04:33,895 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:187 - [leadership_transfer] Stepping down as leader in term 1, dirty offset 21                                 
DEBUG 2023-05-19 02:04:33,895 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2618 - triggering leadership notification with term: 1, new leader: {nullopt}                                  
DEBUG 2023-05-19 02:04:33,895 [shard 0] cluster - health_monitor_backend.cc:395 - aborting current refresh request to 1                                                                                          
DEBUG 2023-05-19 02:04:33,895 [shard 0] cluster - feature_manager.cc:117 - Controller leader notification term 1                                                                                                 
TRACE 2023-05-19 02:04:33,895 [shard 0] cluster - partition_leaders_table.cc:160 - updated partition: {redpanda/controller/0} leader: {term: 1, current leader: {nullopt}, previous leader: {1}, revision: 0}    
INFO  2023-05-19 02:04:33,895 [shard 0] cluster - drain_manager.cc:239 - Draining leadership from 1 groups 1 succeeded                                                                                           
INFO  2023-05-19 02:04:33,895 [shard 0] cluster - drain_manager.cc:256 - Node draining has completed on shard 0                                                                                                  
TRACE 2023-05-19 02:04:33,895 [shard 1] cluster - partition_leaders_table.cc:160 - updated partition: {redpanda/controller/0} leader: {term: 1, current leader: {nullopt}, previous leader: {1}, revision: 0}    
TRACE 2023-05-19 02:04:33,898 [shard 0] request_auth - request_auth.cc:126 - Authenticated user {admin}                                                                                                          
DEBUG 2023-05-19 02:04:33,898 [shard 0] admin_api_server - admin_server.cc:365 - [admin] GET http://docker-rp-23:9644/v1/brokers/1                                                                               
[DEBUG - 2023-05-19 02:04:33,899 - admin - _request - lineno:355]: Response OK, JSON: {                 
    'node_id': 1,                       
    'num_cores': 2,                     
    'membership_status': 'active',      
    'maintenance_status': {             
        'draining': False,              
        'finished': False,              
        'errors': False,                
        'partitions': 0,                
        'eligible': 0,                  
        'transferring': 0,              
        'failed': 0                     
    }                                   
}                                       

Whereas the check in rolling_restarter.py is:

        def has_drained_leaders(node):
            try:
                node_id = self.redpanda.idx(node)
                broker_resp = admin.get_broker(node_id, node=node)
                maintenance_status = broker_resp["maintenance_status"]
                return maintenance_status["draining"] and maintenance_status["finished"]
            except Exception:
                # The original excerpt is truncated here; the except clause is
                # assumed to treat request errors as "not drained yet".
                return False

So it looks like node draining finished before the test fetched the draining status for the first time. The test wants to observe the draining state before moving forward, but it misses that window and times out.
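
One way to make the check robust against this race (a sketch only, not necessarily the fix that was adopted; the helper name is hypothetical): since the restarter has already issued the maintenance PUT before it starts polling, the predicate could also accept the state captured in the JSON above, where the drain has already run to completion and nothing is left to transfer, instead of insisting on catching the transient draining window.

def leadership_drain_settled(admin, redpanda, node):
    # Hypothetical, more tolerant predicate (illustrative, not the actual
    # change): accept either the draining/finished state the original check
    # waits for, or a status showing the drain already completed with
    # nothing left to move.
    node_id = redpanda.idx(node)
    status = admin.get_broker(node_id, node=node)["maintenance_status"]
    if status["draining"] and status["finished"]:
        return True
    # Matches the "Response OK" JSON captured above: draining and finished
    # both read False after the drain completed, with zero partitions
    # eligible, transferring, or failed.
    return (status["partitions"] == 0 and status["eligible"] == 0
            and status["transferring"] == 0 and status["failed"] == 0)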

@dlex added the area/tests and sev/medium labels on May 20, 2023
@piyushredpanda (Contributor) commented:

Sounds more like a test failure, and hence sev/low, to me. @michael-redpanda, could you keep me honest, please?

@piyushredpanda added the sev/low label and removed the sev/medium label on Jun 23, 2023
@oleiman self-assigned this on Oct 24, 2023
@michael-redpanda (Contributor) commented:

Reported as stale by pandatriage.

@vbotbuildovich reopened this on Nov 3, 2023
@oleiman removed their assignment on Nov 29, 2023
@piyushredpanda (Contributor) commented:

Closing older-bot-filed CI issues as we transition to a more reliable system.
