
Testnet 21 not exiting on SIGINT - API engine suspected #3759

Closed
Ben-PH opened this issue Apr 3, 2023 · 8 comments · Fixed by #3776
Labels
api Issues related to the API

Comments

Ben-PH commented Apr 3, 2023

Describe the bug
On testnet 21, SIGINT (ctrl-c) handling causes a deadlock/blockage while stopping the various processes

To Reproduce

deploy labnet on testnet 21 (haven't checked with sandbox)
run a node that bootstraps off a labnet server
once the client is up and running, provide a SIGINT (ctrl-c works well)
observe that "Stopping consensus worker..." is logged, with no follow-up

Expected behavior
The consensus worker exits correctly, followed by clean exit out of the program shortly after

Additional context
With these implementations added to ConsensusControllerImpl in massa-consensus-worker/src/controller.rs:

impl Clone for ConsensusControllerImpl {
    fn clone(&self) -> Self {
        warn!("cloning {}", self.name);
        Self {
            command_sender: self.command_sender.clone(),
            channels: self.channels.clone(),
            shared_state: self.shared_state.clone(),
            bootstrap_part_size: self.bootstrap_part_size.clone(),
            broadcast_enabled: self.broadcast_enabled.clone(),
            name: self.name.clone(),
        }
    }
}
impl Drop for ConsensusControllerImpl {
    fn drop(&mut self) {
        warn!("dropping ctrler imple: {}", self.name);
    }
}

as well as changes such as these in massa-node/src/main.rs:

+    consensus_controller.set_name("base for protocol_controller");
+    let mut cloned_1 = consensus_controller.clone();
+    cloned_1.set_name("protocol_controller");
+    consensus_controller.set_name("base-again 1");
 
     let protocol_manager = start_protocol_controller(
         protocol_config,
         protocol_receivers,
         protocol_senders.clone(),
-        consensus_controller.clone(),
+        cloned_1,

A cleaned-up version of the logs looks like this:

WARN massa_consensus_worker::controller: cloning base for protocol_controller    
WARN massa_consensus_worker::controller: cloning base for factory_channel    
WARN massa_consensus_worker::controller: cloning factory_channels    
WARN massa_consensus_worker::controller: cloning base for bootstrap_server    
WARN massa_consensus_worker::controller: cloning base for massa-api    

WARN massa_node: sending signal to stop
WARN massa_node: signal sent
INFO massa_node: interrupt signal received

INFO massa_api: API finished cleanly
INFO massa_api: API finished cleanly
INFO massa_api: API finished cleanly

INFO massa_factory_worker::manager: stopping factory...
WARN massa_consensus_worker::controller: dropping ctrler imple: factory_channels    
WARN massa_consensus_worker::controller: dropping ctrler imple: factory_channels    
INFO massa_factory_worker::manager: factory stopped

INFO massa_protocol_exports::protocol_controller: stopping protocol controller...
WARN massa_consensus_worker::controller: dropping ctrler imple: protocol_controller    
INFO massa_protocol_worker::protocol_worker: protocol worker finished cleanly
INFO massa_protocol_exports::protocol_controller: protocol controller stopped

INFO massa_pool_worker::controller_impl: stopping pool workers...
INFO massa_pool_worker::controller_impl: pool workers stopped

INFO massa_bootstrap::server: ip white/blacklist updater exited cleanly

WARN massa_consensus_worker::controller: dropping ctrler imple: bootstrap_server  

Note: there are 5 clones of the consensus controller, so we would expect 6 drops (one for each clone, plus the original).

BUT

I count only 4 drops. It looks like the instances provided to the public API and the massa API are not dropped.

This is important because the consensus manager stops by joining the thread that is spawned to run ConsensusWorker::run(). That worker exits its main loop when it detects that the channel it receives on has been disconnected.

The disconnection only occurs once all the consensus controllers cloned from the same base controller have been dropped, because each clone also clones the sender side of that channel.
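As an illustration of that sender-drop semantics, here is a minimal std-only sketch (not massa code; massa uses a crossbeam channel, which disconnects the same way once the last sender is dropped): the receive loop can only exit after every clone of the sender is gone.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // The worker's receive loop only exits when recv() reports the channel
    // as disconnected, which happens once *every* Sender clone is dropped.
    let (tx, rx) = mpsc::channel::<&'static str>();
    let tx_clone = tx.clone(); // analogous to cloning the controller

    let worker = thread::spawn(move || {
        let mut commands_seen = 0;
        while rx.recv().is_ok() {
            commands_seen += 1;
        }
        // recv() returned Err: all senders are gone, so the loop exits.
        commands_seen
    });

    tx.send("cmd").unwrap();
    drop(tx); // dropping the original sender is not enough...
    tx_clone.send("cmd").unwrap();
    drop(tx_clone); // ...the last clone must go too, or the worker blocks forever

    assert_eq!(worker.join().unwrap(), 2);
    println!("worker exited after all senders were dropped");
}
```

If any clone (here, the ones handed to the APIs) is never dropped, `rx.recv()` keeps blocking and the join in the manager never returns.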

@Ben-PH Ben-PH added api Issues related to the API node labels Apr 3, 2023
@Ben-PH Ben-PH changed the title Testnet 21 not exiting on SIGINT Testnet 21 not exiting on SIGINT - API engine suspected Apr 3, 2023
aoudiamoncef commented Apr 3, 2023

Hi @Ben-PH,

We have a lack of examples/documentation for jsonrpsee. I read literally all of its codebase, and it seems that we did the integration the right way.

The issue is that the Massa and Public APIs never stop, even if I call self.server_handle.stopped() or drop every reference to the ServerHandle. The signal to trigger the stop is sent, but it never takes effect.

2023-04-03T22:41:45.388469Z  INFO massa_node: Node version : SAND.20.0
2023-04-03T22:41:45.400245Z  INFO massa_network_worker: The node_id of this node is: N1DoT6WwDHSKPijrNYVrMDN2LR5xgGntoEPgJkfuM9ofAdwCJdH
2023-04-03T22:41:45.449219Z  INFO massa_consensus_worker::worker::init: Started node at time 2023-04-03T22:41:45Z, cycle 0, period 0, thread 0    
2023-04-03T22:41:45.449250Z  INFO massa_consensus_worker::worker::init: 0 days, 0 hours, 0 minutes, 9 seconds remaining to genesis    
2023-04-03T22:41:45.450848Z  WARN massa_bootstrap::server::white_black_list: error on load whitelist/blacklist file : base_config/bootstrap_blacklist.json | No such file or directory (os error 2)    
2023-04-03T22:41:45.452210Z  INFO massa_node: API | EXPERIMENTAL JsonRPC | listening on: 0.0.0.0:33036
2023-04-03T22:41:45.453636Z  INFO massa_node: API | gRPC | listening on: 0.0.0.0:33037
2023-04-03T22:41:45.453723Z  INFO massa_node: API | PRIVATE JsonRPC | listening on: 127.0.0.1:33034
2023-04-03T22:41:45.453809Z  INFO massa_node: API | PUBLIC JsonRPC | listening on: 0.0.0.0:33035
2023-04-03T22:41:45.453814Z  WARN massa_node: deadlocks detector will run every 10s
2023-04-03T22:41:54.171413Z  INFO massa_node: stop command received from private API
2023-04-03T22:41:54.171742Z  INFO massa_bootstrap::server: ip white/blacklist updater exited cleanly
2023-04-03T22:41:54.173736Z  INFO massa_node: Start stopping API's
2023-04-03T22:41:54.174004Z  INFO massa_grpc::server: gRPC API finished cleanly    
2023-04-03T22:41:54.174039Z  INFO massa_api: API finished cleanly
2023-04-03T22:41:54.174844Z  INFO massa_node: API | PRIVATE JsonRPC | stopped
2023-04-03T22:41:54.174878Z  INFO massa_api: API finished cleanly
2023-04-03T22:41:55.403466Z  INFO massa_consensus_worker::worker::main_loop: Massa network has started ! 🎉    
2023-04-03T22:41:55.403595Z  INFO massa_consensus_worker::worker::main_loop: Started cycle 0

It's the same generic code for the PUBLIC, PRIVATE and EXPERIMENTAL APIs, but only the PRIVATE one is able to shut down. I changed the code to use self.server_handle.stopped().await in order to wait indefinitely and see the issue clearly.

The new gRPC API works as expected too.

cc @modship, if you have an idea about this nightmare, feel free to help us 👍🏻

AurelienFT commented:
I also have problems when stopping my node. Let's merge all the pending PRs and fix the stop of the node afterwards.

@AurelienFT AurelienFT self-assigned this Apr 4, 2023
aoudiamoncef commented:
I see a strange behaviour: stop_node triggers the node stop, but the Public and Massa APIs still listen on their ports. Sending any request to the Massa or Public API triggers their shutdown, and then we get a clean shutdown of the node. We should investigate more. I tried removing the consensus component from the Massa API, with no effect. It seems to be something in the Public API?

@AurelienFT AurelienFT assigned aoudiamoncef and modship and unassigned AurelienFT Apr 4, 2023
aoudiamoncef commented Apr 5, 2023

@Ben-PH @AurelienFT

We did a deep dive into the jsonrpsee code with @modship. The result: you can't stop your server gracefully if no connection has ever been initiated to it.

	/// Start responding to connections requests.
	///
	/// This will run on the tokio runtime until the server is stopped or the `ServerHandle` is dropped.
	pub fn start(mut self, methods: impl Into<Methods>) -> Result<ServerHandle, Error> {
		let methods = methods.into().initialize_resources(&self.resources)?;
		let (stop_tx, stop_rx) = watch::channel(());

		let stop_handle = StopHandle::new(stop_rx);

		match self.cfg.tokio_runtime.take() {
			Some(rt) => rt.spawn(self.start_inner(methods, stop_handle)),
			None => tokio::spawn(self.start_inner(methods, stop_handle)),
		};

		Ok(ServerHandle::new(stop_tx))
	}

	async fn start_inner(self, methods: Methods, stop_handle: StopHandle) {
		let max_request_body_size = self.cfg.max_request_body_size;
		let max_response_body_size = self.cfg.max_response_body_size;
		let max_log_length = self.cfg.max_log_length;
		let allow_hosts = self.cfg.allow_hosts;
		let resources = self.resources;
		let logger = self.logger;
		let batch_requests_supported = self.cfg.batch_requests_supported;
		let id_provider = self.id_provider;
		let max_subscriptions_per_connection = self.cfg.max_subscriptions_per_connection;

		let mut id: u32 = 0;
		let connection_guard = ConnectionGuard::new(self.cfg.max_connections as usize);
		let mut connections = FutureDriver::default();
		let mut incoming = Monitored::new(Incoming(self.listener), &stop_handle);

		loop {
			match connections.select_with(&mut incoming).await {
				Ok((socket, remote_addr)) => {
					let data = ProcessConnection {
						remote_addr,
						methods: methods.clone(),
						allow_hosts: allow_hosts.clone(),
						resources: resources.clone(),
						max_request_body_size,
						max_response_body_size,
						max_log_length,
						batch_requests_supported,
						id_provider: id_provider.clone(),
						ping_interval: self.cfg.ping_interval,
						stop_handle: stop_handle.clone(),
						max_subscriptions_per_connection,
						conn_id: id,
						logger: logger.clone(),
						max_connections: self.cfg.max_connections,
						enable_http: self.cfg.enable_http,
						enable_ws: self.cfg.enable_ws,
					};
					process_connection(&self.service_builder, &connection_guard, data, socket, &mut connections);
					id = id.wrapping_add(1);
				}
				Err(MonitoredError::Selector(err)) => {
					tracing::error!("Error while awaiting a new connection: {:?}", err);
				}
				Err(MonitoredError::Shutdown) => break,
			}
		}

		connections.await;
	}
}
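To see that failure mode in isolation, here is a minimal, hypothetical std-only analogue (no jsonrpsee, no tokio) of an accept loop whose stop flag is only checked when a connection event wakes it. It reproduces the behaviour observed above, where sending any request finally triggers the shutdown:

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let stop = Arc::new(AtomicBool::new(false));
    let stop_seen_by_loop = Arc::clone(&stop);

    let server = thread::spawn(move || {
        // Like the accept loop above, the stop condition is only observed
        // when accept() returns, i.e. when a connection wakes the loop.
        loop {
            let _conn = listener.accept().unwrap();
            if stop_seen_by_loop.load(Ordering::SeqCst) {
                break; // stop observed, but only because a connection arrived
            }
        }
    });

    stop.store(true, Ordering::SeqCst); // stop requested: nothing happens yet,
    // because the loop is still parked in accept(). A dummy connection
    // unblocks it and lets it notice the stop request.
    let _wake = TcpStream::connect(addr).unwrap();
    server.join().unwrap();
    println!("server loop exited only after a connection woke it");
}
```

A select-style loop that also polls the stop signal (rather than only the listener) would not need the wake-up connection; the pasted jsonrpsee code relies on the `Monitored` wrapper for that, which is where the investigation pointed.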

GG to @modship for his efforts on debugging this

AurelienFT commented:
Nice finding. So do we add a stop command in the consensus as a workaround for now, until we provide a fix to jsonrpsee?

Ben-PH commented Apr 5, 2023

is this happening in the bootstrap server?

modship commented Apr 5, 2023

is this happening in the bootstrap server?

It's the jsonrpsee lib here too, so it may look "identical", but it's not correlated.

modship commented Apr 5, 2023

This works with the same versions as massa (tokio = 1.23, jsonrpsee = 0.16.2):

use std::net::SocketAddr;
use std::time::Duration;

use jsonrpsee::server::{RpcModule, ServerBuilder};

fn main() {
    tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async {
            let server = ServerBuilder::default()
                .http_only()
                .ping_interval(Duration::from_secs(1))
                .build("0.0.0.0:8080".parse::<SocketAddr>().unwrap())
                .await
                .unwrap();
            let mut module = RpcModule::new(());
            module
                .register_method("say_hello", |_, _| Ok("lo"))
                .unwrap();

            println!("server start");
            let handle = server.start(module).unwrap();
            assert!(!handle.is_stopped());
            let res = handle.stop();
            dbg!(res);
            handle.stopped().await;
            println!("server stop");
        });

}

In massa, this works (main.rs):

    let api_public_handle = api_public
        .serve(&SETTINGS.api.bind_public, &api_config)
        .await
        .expect("failed to start PUBLIC API");
    info!(
        "API | PUBLIC JsonRPC | listening on: {}",
        api_config.bind_public
    );

    api_public_handle.stop().await;
    dbg!("stopped");

where stop() is:

    pub async fn stop(self) {
        match self.server_handler.stop() {
            Ok(_) => {
                info!("API finished cleanly");
            }
            Err(err) => warn!("API thread panicked: {:?}", err),
        }

        self.server_handler.stopped().await;
    }
