Fixes #742: Removed granted read buffer tracking in http2 and tcp adaptors #743

ganeshmurthy · 2022-09-28T20:39:26Z

codecov-commenter · 2022-09-28T20:45:19Z

Codecov Report

Attention: Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.

Project coverage is 27.6%. Comparing base (a9b9c90) to head (59ad085).
Report is 502 commits behind head on main.

Additional details and impacted files

@@          Coverage Diff          @@
##            main    #743   +/-   ##
=====================================
  Coverage   27.6%   27.6%           
=====================================
  Files        131     131           
  Lines      31431   31418   -13     
  Branches    5033    5032    -1     
=====================================
  Hits        8679    8679           
+ Misses     21647   21634   -13     
  Partials    1105    1105

| Flag | Coverage Δ |
|---|---|
| unittests | `27.6% <0.0%> (+<0.1%)` ⬆️ |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
|---|---|
| unittests | `27.6% <0.0%> (+<0.1%)` ⬆️ |
| systemtests | `27.6% <0.0%> (+<0.1%)` ⬆️ |

jiridanek

Here is the test failure:
https://github.com/skupperproject/skupper-router/actions/runs/3146738179/jobs/5115896447#step:29:2330
Is the 400-second timeout set here? https://github.com/skupperproject/skupper-router/blob/main/.github/workflows/build.yaml#L167
Is that an individual timeout of 400 seconds per test?

Yes, SkstatTest::test_yy_query_many_links had 400 seconds to finish, but it did not finish. I already logged an issue for this test: #637

Also, I had some thoughts in #672, if we had that, it would be easier to debug this. Currently we cannot tell if this problem is caused by #535 or if it is a completely new issue.

Also, the PR attempts to close all outstanding adaptor (tcp, http) listeners, connectors and non-AMQP connections before the router shuts down in the tests. This way, I don't have to track granted read buffers in the adaptor code. But those buffers are still leaking here - https://github.com/skupperproject/skupper-router/pull/743/checks#step:6:505
It looks like we will have to do the same thing (shut down all connections) in the container image test that runs h2spec.

What the test does now is described in the comment

skupper-router/tests/testcontainers/test_h2spec.rs

Lines 92 to 99 in 000de22

    
     * Performs the DISPATCH-1940 reproducer steps using Container images. To run manually:
     *
     * 1. Write the `ROUTER_CONFIG` above to a file, say `h2spec_router.conf`
     *     - change `host: nghttpd` to `host: localhost`
     * 2. Run router, nghttpd and h2spec, each in its own terminal, using docker or (for skrouterd) without it
     *     - `docker run --rm -it --network=host nixery.dev/nghttp2:latest nghttpd -a 0.0.0.0 --no-tls -d /tmp 8888`
     *     - `skrouterd -c h2spec_router.conf`
     *     - `docker run --rm -it --network=host summerwind/h2spec:2.6.0 -h localhost -p 24162 --verbose --insecure --timeout 10`

What you suggest is that we need to run skmanage, find out which connections are active, and terminate them using a management operation? Or is the trick to shut down nghttpd first, and then shut down the router? How do I detect that the connection has already been torn down in the router? With a skmanage call?

Wouldn't it be simpler to tear down all connections in the router itself if it is running in a memory-debug mode?
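
One way to answer the "how do I detect that the connection is gone" question from the test side is to poll until no adaptor connections are reported. The following is only a minimal sketch: `wait_until_closed` and the `list_connections` callable are hypothetical names, and the callable stands in for whatever management query (for example, a skmanage call) the test would really use.

```python
import time


def wait_until_closed(list_connections, timeout=10.0, delay=0.1):
    """Poll list_connections() until it returns nothing or the timeout expires.

    list_connections is a hypothetical callable that returns the
    connections the router still reports. Returns True when none are
    left, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = list_connections()
        if not remaining:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay)
```

A test would call this after requesting the close and before shutting the router down, and fail (or at least log) when it returns False.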

jiridanek · 2022-09-30T11:34:10Z

@ganeshmurthy When I put #750 on top of this, the leak from image tests disappears for me

jiridanek · 2022-09-30T13:45:08Z

https://github.com/skupperproject/skupper-router/actions/runs/3159057422/jobs/5141859848#step:10:5046

33/33 Test #57: system_tests_http1_adaptor ...........***Timeout 1199.71 sec

What just happened?

ganeshmurthy · 2022-09-30T13:51:51Z

https://github.com/skupperproject/skupper-router/actions/runs/3159057422/jobs/5141859848#step:10:5046
33/33 Test #57: system_tests_http1_adaptor ...........***Timeout 1199.71 sec
What just happened?

I don't think that failure has anything to do with this PR in particular but I am not sure. This PR is trying to shut down all connections just before the router shuts down, so I am hoping that the PR did not cause the error. @kgiusti, what do you think ?

kgiusti

I like what this does, but I'm concerned that there's something not right if the router is calling close connection twice now.

I would also break this up into two separate PRs: one patch to the CI that does the adaptor/connector/connection cleanup and a follow up PR patch that removes the adaptor buffer tracking. These really are two separate fixes.

kgiusti · 2022-09-30T14:46:30Z

tests/system_test.py

+                qd_manager.delete_all_entities(long_type)
+            retry_assertion(self.delete_adaptor_connections(qd_manager))
+        except Exception as e:
+            # We tried to delete some stuff but ran into issues. There could


Are exceptions being hit in your tests? If so, what kind?

In any case at least add a print statement indicating what happened, the exception raised, etc. rather than failing silently.

+1. Also, catching Exception will catch SyntaxError, etc. Best to have an explicit list, especially because it is hard to tighten this later on.

Sometimes a proton ConnectionException is thrown, which proton does not seem to expose?

It should be exposed

https://github.com/apache/qpid-proton/blob/a4375a8351c3435bd5025bb2481307efe97dc99e/python/proton/__init__.py#L57

so

from proton import ConnectionException

?

If I understand correctly, the problem is that QdManager is used to do the operation. QdManager isn't using proton directly: it spawns a process that runs skmanage as a shell process, then simply raises Exception should the process return an error code.

So we lose any access to the proton exception details.

QdManager does indeed wrap all exceptions into an Exception object, but the exception details are not lost. QdManager calls `raise Exception("%s %s" % rc)`, which on failure prints the underlying error message. For example, on a proton ConnectionException it prints:
ConnectionException: Connection amqp://0.0.0.0:23860 disconnected: Condition('proton.pythonio', 'Connection refused to all addresses')
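
The two review points above (catch an explicit exception type, and report the failure instead of discarding it) could be combined roughly as follows. This is a sketch under stated assumptions, not the actual QdManager code: `ManagementCommandError`, `run_management_command` and `cleanup` are hypothetical names, and the command line is whatever the caller passes in.

```python
import subprocess


class ManagementCommandError(Exception):
    """Hypothetical error type that keeps the failed command's details."""

    def __init__(self, returncode, output):
        super().__init__("%s %s" % (returncode, output))
        self.returncode = returncode
        self.output = output


def run_management_command(argv):
    """Run a command-line tool; raise ManagementCommandError on a non-zero exit."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    if proc.returncode != 0:
        raise ManagementCommandError(proc.returncode, proc.stdout)
    return proc.stdout


def cleanup(argv):
    """Best-effort cleanup: report the expected failure, let other errors propagate."""
    try:
        run_management_command(argv)
        return True
    except ManagementCommandError as e:
        # The tool's output (e.g. a ConnectionException message) is preserved.
        print("cleanup failed: %s" % e)
        return False
```

Because only `ManagementCommandError` is caught, a programming error such as a NameError inside the cleanup path still fails the test run loudly.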

ganeshmurthy · 2022-09-30T16:02:15Z

I like what this does, but I'm concerned that there's something not right if the router is calling close connection twice now.

I would also break this up into two separate PRs: one patch to the CI that does the adaptor/connector/connection cleanup and a follow up PR patch that removes the adaptor buffer tracking. These really are two separate fixes.

The reason I had all this in the same patch is that these changes are related. Deleting the connectors/connections is what allows me to take the tracking of the granted_read_buffers out of the adaptors. If I do one PR with the management change and another PR which just removes the granted_read_buffs, I am sure someone looking at the second PR will wonder why we are suddenly able to remove the granted_read_buffs.

kgiusti · 2022-09-30T20:46:29Z

https://github.com/skupperproject/skupper-router/actions/runs/3159057422/jobs/5141859848#step:10:5046
33/33 Test #57: system_tests_http1_adaptor ...........***Timeout 1199.71 sec
What just happened?
I don't think that failure has anything to do with this PR in particular but I am not sure. This PR is trying to shut down all connections just before the router shuts down, so I am hoping that the PR did not cause the error. @kgiusti, what do you think ?

Oh great... yet another buffer corruption issue!

From the server log:

2022-09-30 13:26:40.230373 +0000 HTTP_ADAPTOR (trace) [C1] HTTP request/response codec done. Octets read: 393570 written: 210 (../src/adaptors/http1/http1_server.c:1176)

Yep, those numbers look good. Now what did the client see? From the client log:

2022-09-30 13:31:40.304957 +0000 HTTP_ADAPTOR (trace) [C66] HTTP request msg-id=18 cancelled. Octets read: 210 written: 195584 (../src/adaptors/http1/http1_client.c:965)

TL;DR test client is hanging since it hasn't received the full response although the server sent it.
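
The comparison of the two log lines can be made explicit with a few lines of Python. This is only an illustrative sketch: the `octets` helper is hypothetical, and it assumes that "read" on the server-facing side and "written" on the client-facing side both count response octets.

```python
import re

SERVER_LINE = ("2022-09-30 13:26:40.230373 +0000 HTTP_ADAPTOR (trace) [C1] "
               "HTTP request/response codec done. Octets read: 393570 "
               "written: 210 (../src/adaptors/http1/http1_server.c:1176)")
CLIENT_LINE = ("2022-09-30 13:31:40.304957 +0000 HTTP_ADAPTOR (trace) [C66] "
               "HTTP request msg-id=18 cancelled. Octets read: 210 "
               "written: 195584 (../src/adaptors/http1/http1_client.c:965)")

OCTETS = re.compile(r"Octets read: (\d+) written: (\d+)")


def octets(line):
    """Return (read, written) octet counts from an HTTP_ADAPTOR trace line."""
    match = OCTETS.search(line)
    if match is None:
        raise ValueError("no octet counters in line: %r" % line)
    return int(match.group(1)), int(match.group(2))


server_read, server_written = octets(SERVER_LINE)
client_read, client_written = octets(CLIENT_LINE)

# Request direction: 210 octets read from the client, 210 written to the server.
request_intact = client_read == server_written
# Response direction: octets read from the server that never reached the client.
missing = server_read - client_written
print("request intact:", request_intact, "response octets missing:", missing)
```

Under that reading, the request made it across intact, while 197986 of the 393570 response octets never reached the test client.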

I'll go re-open the old Issue... Oh look! It's about Beer O'clock on a Friday! So long, folks!!

kgiusti · 2022-10-03T18:06:20Z

tests/system_test.py

+            try:
+                self._qd_manager = QdManager(address=self.addresses[0])
+            except Exception as e:
+                return None


Probably don't want to silently discard the exception - it will be very hard to figure out what happened if the QdManager fails in GH actions.

Probably don't want to silently discard the exception - it will be very hard to figure out what happened if the QdManager fails in GH actions.

You mean, it will be hard to figure out why the QdManager constructor fails in GHA? The only reason it could fail is if there is no self.addresses, right? I will add a print(e) before returning None.

Is not having a self.addresses[0] a bug, or is it OK as long as the caller checks for a None? I think that's my point of confusion: I don't understand why the exception is being ignored. If the API allows a None return value (not a bug), then I would simply comment the code to explain that this is why the exception is discarded.

As it stands now, not all routers have a valid self.addresses[0], so it is ok to return a None. I will add a comment clarifying that. But the person who fixes #753 will make not having a self.addresses[0] illegal and require every router to have a management port on self.addresses[0]
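
The outcome of this thread (return None, but document why and print the exception) might look roughly like the sketch below. `RouterUnderTest` and `FakeManager` are hypothetical stand-ins for the real test classes, and catching `IndexError` assumes that a missing `self.addresses[0]` really is the only expected failure.

```python
from typing import List, Optional


class FakeManager:
    """Stand-in for the real management client (hypothetical)."""

    def __init__(self, address: str):
        self.address = address


class RouterUnderTest:
    def __init__(self, addresses: List[str]):
        self.addresses = addresses
        self._manager: Optional[FakeManager] = None

    @property
    def manager(self) -> Optional[FakeManager]:
        """Return a management client, or None if the router has no address."""
        if self._manager is None:
            try:
                self._manager = FakeManager(address=self.addresses[0])
            except IndexError as e:
                # Not every router has a management address today, so a
                # None return is legal; callers must check for it.
                print("no management address configured: %r" % e)
                return None
        return self._manager
```

Callers then write `if router.manager is not None:` before attempting any cleanup, which keeps the "no management port" case visible in the CI log.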

ganeshmurthy requested review from kgiusti and jiridanek September 29, 2022 13:48

jiridanek reviewed Sep 29, 2022

ganeshmurthy force-pushed the ISSUE-742 branch from 0515a7a to 87314de Compare September 29, 2022 19:15

ganeshmurthy force-pushed the ISSUE-742 branch from 87314de to 6a08ee3 Compare September 30, 2022 13:19

ganeshmurthy mentioned this pull request Sep 30, 2022

Occasional leak of qd_adaptor_buffer_t during system_tests_http1_adaptor #751

Closed

kgiusti reviewed Sep 30, 2022

kgiusti mentioned this pull request Sep 30, 2022

CI Enhancement: All routers under test should have a consistent API for accessing the management interface #753

Open

kgiusti reviewed Sep 30, 2022

kgiusti mentioned this pull request Sep 30, 2022

system_tests_http1_adaptor.Http1AdaptorEdge2EdgeTest::test_01_concurrent_requests timeout #754

Closed

ganeshmurthy force-pushed the ISSUE-742 branch 3 times, most recently from ebd2031 to 834e8a3 Compare October 3, 2022 16:01

kgiusti reviewed Oct 3, 2022

kgiusti approved these changes Oct 3, 2022

Fixes skupperproject#742: Removed granted read buffer tracking in http2 and tcp adaptors (59ad085)

ganeshmurthy force-pushed the ISSUE-742 branch from 834e8a3 to 59ad085 Compare October 3, 2022 18:52

ganeshmurthy merged commit 220804c into skupperproject:main Oct 3, 2022
