Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1563 - Delete connections optionally on deletion of tcpListener/tcpConnector #1570

Closed

Conversation

gabordozsa
Copy link
Collaborator

@gabordozsa gabordozsa commented Jul 16, 2024

Fixes #1563: Delete corresponding connections optionally on deletion of tcpListener and tcpConnector

New config parameters added to tcpListener and tcpConnector entities in order to make deletion of connections optional.

@gabordozsa gabordozsa added the enhancement New feature or request label Jul 16, 2024
*
* @param connections Reference to the list of connections
*/
static void terminate_connections(qd_tcp_connection_list_t *connections)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a function called close_connection_XSIDE_IO() which can be used to terminate tcp connections. It take care of freeing qd_tcp_connection_t objects and well as its corresponding qdr_connection_t objects
Like so -

static void terminate_listener_connections(qd_tcp_listener_t *listener) {
        //
        // Deliberately call qd_tcp_listener_incref() to make sure that freeing the connections, don't
        // free the listener. Then we call qd_tcp_listener_free() to forcefully free the listener without checking the listener->ref_count
        //
        qd_tcp_listener_incref(listener);
        qd_tcp_connection_t *conn = DEQ_HEAD(listener->connections);
        while (conn) {
            qd_tcp_connection_t *next_conn = DEQ_NEXT(conn);
            close_connection_XSIDE_IO(conn);
            conn = next_conn;
        }
        qd_tcp_listener_free(listener);
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that close_connection_XSIDE_IO() cannot be called directly in a loop because it makes proton calls on the connection. Another proactor thread may process the same connection at the same time which is prohibited. I maybe misunderstood this paragraph from threading-info.txt:

The Proton subsystem limits concurrency to a single connection. That

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you did not misunderstand the paragraph. What I suggested - calling close_connection_XSIDE_IO() - from a management thread - is wrong. I will get back to you shortly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might try calling qdr_core_close_connection(conn->core_conn)
like here - https://github.com/skupperproject/skupper-router/blob/main/src/adaptors/http2/http2_adaptor.c#L3403

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the pointer! If I call that function then I do not need to change the type of the closed flag from bool to sys_atomic_t in qdr_connection_t. Is there any other reason why the current PR is not a proper solution? I'm just asking because I would like to understand the router code better :-).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I have updated to PR to use qdr_core_close_connection() for initiating the connection termination.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things

  1. You don't need to call pn_raw_connection_wake() since it is already called as part of the call to qdr_core_close_connection()
  2. You also don't need to lock on the activaton_lock since you are not going to call pn_raw_connection_wake() anymore

Copy link
Collaborator Author

@gabordozsa gabordozsa Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure that I understand correctly the code, would the call sequence be roughly be as follows:

  1. qdr_core_close_connection() will activate the core thread
  2. qdr_core_close_connection_CT() will set conn->closed=true and activate the connection
  3. the TCP adaptor activation callback CORE_activate() will call pn_raw_connection_wake()

I think the check for conn->raw_opened can also be removed from terminate_connections() because the same check is done in CORE_activate().

Copy link
Contributor

@ganeshmurthy ganeshmurthy Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. qdr_core_close_connection(qdr_connection_t *conn) will add a core thread action with callback qdr_core_close_connection_CT() and wake up the core thread
  2. The qdr_core_close_connection_CT() is executed by the core thread and will set the conn->closed=true and activate the connection to be deleted via a call to qdr_connection_activate_CT(core, conn);
  3. qdr_connection_process() in connections.c is executed on an I/O thread and an adaptor callback function is called (line 359 of connections.c) - conn->protocol_adaptor->conn_close_handler(conn->protocol_adaptor->user_context, conn, conn->error);
  4. The conn_close_handler is set here - https://github.com/skupperproject/skupper-router/blob/main/src/adaptors/tcp/tcp_adaptor.c#L2548 which is static void CORE_connection_close(void *context, qdr_connection_t *conn, qdr_error_t *error) where there is a call to pn_raw_connection_close(tcp_conn->raw_conn); which will generate a PN_RAW_CONNECTION_DISCONNECTED proton event and the handler for that event ultimately calls close_connection_XSIDE_IO(conn) where the qd_tcp_connection_t object and its contents are ultimately freed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ganeshmurthy
Copy link
Contributor

Also, this PR needs to include a system test using vanflow -
The test must start two routers, create a bunch of connections from a tcpListener on one router and then delete the tcpListener and validate that the associated connections are gone, same with tcpConnector (on the other router). The checking for the presence and deletion of a tcpConnection can be done using the vanflow API.

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from cd1050a to 7a7fb49 Compare July 17, 2024 10:15
@gabordozsa gabordozsa marked this pull request as draft July 17, 2024 10:19
@gabordozsa
Copy link
Collaborator Author

Also, this PR needs to include a system test using vanflow - The test must start two routers, create a bunch of connections from a tcpListener on one router and then delete the tcpListener and validate that the associated connections are gone, same with tcpConnector (on the other router). The checking for the presence and deletion of a tcpConnection can be done using the vanflow API.

I'm working on this system test now (so I converted the PR to draft for now). Could you point me to an existing system test which could be a good starting point for me for this new test?

@ganeshmurthy
Copy link
Contributor

Also, this PR needs to include a system test using vanflow - The test must start two routers, create a bunch of connections from a tcpListener on one router and then delete the tcpListener and validate that the associated connections are gone, same with tcpConnector (on the other router). The checking for the presence and deletion of a tcpConnection can be done using the vanflow API.

I'm working on this system test now (so I converted the PR to draft for now). Could you point me to an existing system test which could be a good starting point for me for this new test?

Please take a look at #1566 and see if you understand how the vanflow_snooper.py works and let me know if you have any questions

@ganeshmurthy
Copy link
Contributor

@gabordozsa Just wanted to mention that if you are working on a github issue, please make sure you assign the issue to yourself before you start working on it. That will let other developers know who is responsible for this issue, so they will not work on that issue. Also if the issue is already assigned to someone, and you want to work on it, please contact that person to see if they are not already working on it, if not, you can that it be assigned to you. Thanks.

@gabordozsa
Copy link
Collaborator Author

@gabordozsa Just wanted to mention that if you are working on a github issue, please make sure you assign the issue to yourself before you start working on it. That will let other developers know who is responsible for this issue, so they will not work on that issue. Also if the issue is already assigned to someone, and you want to work on it, please contact that person to see if they are not already working on it, if not, you can that it be assigned to you. Thanks.

Sure, I will do! Thanks for the reminder and for assigning 1563 to me!

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from 7a7fb49 to cfede6b Compare July 17, 2024 15:38
Copy link

codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 95.58824% with 9 lines in your changes missing coverage. Please review.

Project coverage is 62.5%. Comparing base (1b1d4b9) to head (35d4d8b).
Report is 34 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1570     +/-   ##
=======================================
- Coverage   63.0%   62.5%   -0.5%     
=======================================
  Files        214     214             
  Lines      51857   52308    +451     
  Branches    5986    6146    +160     
=======================================
+ Hits       32681   32727     +46     
- Misses     16966   17281    +315     
- Partials    2210    2300     +90     
Flag Coverage Δ
pysystemtests 78.8% <98.1%> (+0.6%) ⬆️
systemtests 55.5% <84.6%> (-0.7%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
unittests 62.5% <68.0%> (-0.5%) ⬇️
systemtests 62.5% <68.0%> (-0.5%) ⬇️

//
if (connector->adaptor_config->terminate_conns) {
qd_tcp_connection_t *conn;
conn = DEQ_HEAD(listener->connections);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be conn = DEQ_HEAD(connector->connections)' not conn = DEQ_HEAD(listener->connections)`?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops, copy-paste issue.

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from cfede6b to 7a8ffc7 Compare July 17, 2024 16:30
@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from 7a8ffc7 to d9ec438 Compare July 25, 2024 12:35
@gabordozsa gabordozsa marked this pull request as ready for review July 25, 2024 13:31
@gabordozsa
Copy link
Collaborator Author

@ganeshmurthy I have added a system test. I think the PR is ready for more reviews.


# vflow Ids are used to associate flows with tcpListeners and tcpConnectors
vflow_ids = {}
self.assertTrue(self.router_1_vflow_id is not None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use self.assertIsNotNone instead of self.assertTrue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

'LISTENER', address)
vflow_ids['listener_2'] = self.get_tcp_entity_vflow_id(self.router_2_vflow_id,
'LISTENER', address)
self.assertTrue(vflow_ids['listener_1'] is not None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, use self.assertIsNotNone instead of self.assertTrue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ganeshmurthy
Copy link
Contributor

This test must also include another scenario where a TLS enabled tcpListener and TLS enabled tcpConnector are deleted. You can attach an sslProfile to a tcpListener or a tcpConnector to make them do TLS. Examples of this can be found in system_tests_tcp_adaptor_tls

@ganeshmurthy
Copy link
Contributor

To test the TLS case, you can use the EchoClientRunner to start the echo client which can be found in system_tests_tcp_adaptor. It has the test_ssl flag which can be set to True

pass
return None

def run_skmanage(self, cmd, router):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a class in system_test.py called QdManager that runs the skmanage command for you. You can use that class instead of directly running skmanage

Copy link
Collaborator Author

@gabordozsa gabordozsa Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased and removed the run_skmanage() method.

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from 806511d to 16d5cb7 Compare July 25, 2024 18:25
@ganeshmurthy
Copy link
Contributor

@gabordozsa I introduced a commit - f1f61ee - where I renamed the QdManager class to SkManager so it is more in line with the actual name of the tool skmanage
Please rebase with the latest main branch and use the SkManager class in your PR

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from 16d5cb7 to 9fea144 Compare July 26, 2024 13:05
@gabordozsa gabordozsa requested a review from ganeshmurthy July 26, 2024 13:42
@ganeshmurthy
Copy link
Contributor

@gabordozsa This looks good to me, great work !
I want @kgiusti to look at this and once he approves it, we can push this PR to main branch. He is back on Monday !

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from 7122289 to 35d4d8b Compare August 12, 2024 14:50
@ganeshmurthy
Copy link
Contributor

@gabordozsa there seem to be some CI failures in this PR like this one - https://github.com/skupperproject/skupper-router/actions/runs/10354198908/job/28658907680?pr=1570

I have kicked off the main branch CI just to make sure that it is clean, let's see what happens on main branch CI.

@ganeshmurthy
Copy link
Contributor

ganeshmurthy commented Aug 12, 2024

@@ -66,7 +66,8 @@ qd_error_t qd_load_adaptor_config(qdr_core_t *core, qd_adaptor_config_t *config,
config->site_id = qd_entity_opt_string(entity, "siteId", 0); CHECK();
config->ssl_profile_name = qd_entity_opt_string(entity, "sslProfile", 0); CHECK();
config->authenticate_peer = qd_entity_opt_bool(entity, "authenticatePeer", false); CHECK();
config->verify_host_name = qd_entity_opt_bool(entity, "verifyHostname", false); CHECK();
config->verify_host_name = qd_entity_opt_bool(entity, "verifyHostname", false);
CHECK();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is clang format causing this diff to appear ? Can you simply restore this file to its original version and not run the clang formatter ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was done by clang format. I plan to squash all commits when the PR gets stabilised and restore this file then to its original.

@ganeshmurthy
Copy link
Contributor

@gabordozsa
Copy link
Collaborator Author

Leak of qd_tls_domain_t object https://github.com/skupperproject/skupper-router/actions/runs/10355435503/job/28663084819?pr=1570#step:34:5092

@ganeshmurthy In the same error log I can see (https://github.com/skupperproject/skupper-router/actions/runs/10355435503/job/28663084819?pr=1570#step:34:4777):

alloc.c: Items of type 'qd_tcp_connection_t' remain allocated at shutdown: 1
Leak: 2024-08-12 16:14:00.677874 +0000 type: qd_tcp_connection_t address: 0x5160001f9610

I'm wondering if the problem could be in ADAPTOR_final() https://github.com/gabordozsa/skupper-router/blob/2c3629b9e17b5a9ce0bb7591d6d9e2dd9b715a52/src/adaptors/tcp/tcp_adaptor.c#L2648. ADAPTOR_final() iterates through the connections list and trigger closing them. In the main branch every accepted connection is added to the connections list. However, we delay adding it to the list now (until the link is setup). So if a new connection is
accepted but ADAPTOR_final() is called before the link is setup for it then this conn can leak. Does that make sense?

@ganeshmurthy
Copy link
Contributor

However, we delay adding it to the list now (until the link is setup). So if a new connection is
accepted but ADAPTOR_final() is called before the link is setup for it then this conn can leak. Does that make sense?

Good catch @gabordozsa . It looks like the on_accept() was called and the connection was created there but before we could get a PN_RAW_CONNECTION_CONNECTED event from proton, the router seems to be shutting down which leaves the newly created connection (which is not yet in the list) to leak.

If you ask me, this should not really be a problem. The router is anyway shutting down, who cares about the leak ? But unfortunately, this is how we do things in the router.

@gabordozsa gabordozsa force-pushed the delete_tcp_connections branch from 234585b to 83c4988 Compare August 14, 2024 18:51
@gabordozsa
Copy link
Collaborator Author

Closing this as code got merged via #1601

@gabordozsa gabordozsa closed this Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete corresponding connections on deletion of tcpListener and tcpConnector
3 participants