Fixes #742: Removed granted read buffer tracking in http2 and tcp adaptors #743
Conversation
Codecov Report

@@            Coverage Diff             @@
##              main     #743    +/-   ##
=========================================
  Coverage     27.6%    27.6%
=========================================
  Files          131      131
  Lines        31431    31418    -13
  Branches      5033     5032     -1
=========================================
  Hits          8679     8679
+ Misses       21647    21634    -13
  Partials      1105     1105
Here is the test failure -
https://github.com/skupperproject/skupper-router/actions/runs/3146738179/jobs/5115896447#step:29:2330
Is the 400-second timeout set here: https://github.com/skupperproject/skupper-router/blob/main/.github/workflows/build.yaml#L167?
Is that an individual timeout of 400 seconds per test?
Yes, SkstatTest::test_yy_query_many_links had 400 seconds to finish, but it did not. I have already logged an issue for this test: #637
Also, I had some thoughts in #672; if we had that, it would be easier to debug this. Currently we cannot tell whether this problem is caused by #535 or is a completely new issue.
Also, the PR attempts to close all outstanding adaptor (tcp, http) listeners, connectors, and non-AMQP connections before the router shuts down in a test. That way, I don't have to track granted read buffers in the adaptor code. But those buffers are still leaking here: https://github.com/skupperproject/skupper-router/pull/743/checks#step:6:505
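To make that concrete, here is a minimal sketch of what such a teardown helper could look like, reusing the QdManager.delete_all_entities() call that appears in the diff later in this thread. The import path and the long entity-type names are assumptions for illustration, not verified against the management schema; the PR additionally deletes the remaining adaptor connections in a separate step (see delete_adaptor_connections in the diff below).

```python
# Hedged sketch: best-effort cleanup of adaptor entities on one router before
# the test shuts the router down, so the adaptors no longer need to track
# granted read buffers across shutdown.
from system_test import QdManager  # test-framework helper (assumed import path)

# Assumed long type names for the adaptor entities; check the actual schema.
ADAPTOR_ENTITY_TYPES = [
    "io.skupper.router.tcpListener",
    "io.skupper.router.tcpConnector",
    "io.skupper.router.httpListener",
    "io.skupper.router.httpConnector",
]


def close_adaptor_entities(router_address):
    """Delete all tcp/http listener and connector entities on the router
    reachable at router_address. Remaining non-AMQP connections are cleaned
    up separately in the PR (delete_adaptor_connections)."""
    qd_manager = QdManager(address=router_address)
    for long_type in ADAPTOR_ENTITY_TYPES:
        qd_manager.delete_all_entities(long_type)
```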
It looks like we will have to do the same thing (shut down all connections) in the container-image test that runs h2spec.
What the test does now is described in the comment in skupper-router/tests/testcontainers/test_h2spec.rs, lines 92 to 99 at 000de22:
 * Performs the DISPATCH-1940 reproducer steps using Container images. To run manually:
 *
 * 1. Write the `ROUTER_CONFIG` above to a file, say `h2spec_router.conf`
 *    - change `host: nghttpd` to `host: localhost`
 * 2. Run router, nghttpd and h2spec, each in its own terminal, using docker or (for skrouterd) without it
 *    - `docker run --rm -it --network=host nixery.dev/nghttp2:latest nghttpd -a 0.0.0.0 --no-tls -d /tmp 8888`
 *    - `skrouterd -c h2spec_router.conf`
 *    - `docker run --rm -it --network=host summerwind/h2spec:2.6.0 -h localhost -p 24162 --verbose --insecure --timeout 10`
What you suggest is that we need to run skmanage, find out what connections are active, and terminate them using a management operation? Or is the trick to shut down nghttpd first and then shut down the router? How do I detect that the connection has already been scrapped in the router? With a skmanage call?
Wouldn't it be simpler to tear down all connections in the router itself if it is running in memory-debug mode?
Force-pushed from 0515a7a to 87314de.
@ganeshmurthy When I put #750 on top of this, the leak from the image tests disappears for me.
Force-pushed from 87314de to 6a08ee3.
What just happened?
I don't think that failure has anything to do with this PR in particular, but I am not sure. This PR is trying to shut down all connections just before the router shuts down, so I am hoping that the PR did not cause the error. @kgiusti, what do you think?
I like what this does, but I'm concerned that something is not right if the router is now calling close connection twice.
I would also break this up into two separate PRs: one patch to the CI that does the adaptor/connector/connection cleanup, and a follow-up patch that removes the adaptor buffer tracking. These really are two separate fixes.
tests/system_test.py (Outdated)

        qd_manager.delete_all_entities(long_type)
        retry_assertion(self.delete_adaptor_connections(qd_manager))
    except Exception as e:
        # We tried to delete some stuff but ran into issues. There could
Are exceptions being hit in your tests? If so, what kind?
In any case, at least add a print statement indicating what happened, the exception raised, etc., rather than failing silently.
+1. Also, catching Exception will catch SyntaxError, etc. Best to have an explicit list, especially because it is hard to improve this later on.
Sometimes a proton ConnectionException is thrown, which proton does not seem to expose?
It should be exposed.
So, `from proton import ConnectionException`?
If I understand correctly, the problem is that QdManager is used to do the operation. QdManager isn't using proton directly: it spawns skmanage as a shell process and simply raises Exception if the process returns an error code.
So we lose any access to the proton exception details.
QdManager does indeed wrap all exceptions into an Exception object, but the exception details are not lost. QdManager calls raise Exception("%s %s" % rc), which on failure prints the underlying error message; for a proton ConnectionException, for example:

    ConnectionException: Connection amqp://0.0.0.0:23860 disconnected: Condition('proton.pythonio', 'Connection refused to all addresses')
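For context, here is a minimal sketch of the behaviour described above, assuming QdManager shells out to skmanage via subprocess. The function name run_skmanage and the exact command-line arguments (including --bus) are illustrative assumptions, not the real implementation.

```python
import subprocess


def run_skmanage(args, address):
    """Hedged sketch of how QdManager surfaces skmanage failures, as described
    above: skmanage output (including any proton ConnectionException text) is
    captured and folded into the Exception message, so the details are not
    lost even though proton is not used directly by the wrapper."""
    proc = subprocess.Popen(["skmanage"] + args + ["--bus", address],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    out, _ = proc.communicate()
    rc = (proc.returncode, out)
    if proc.returncode != 0:
        # Mirrors the `raise Exception("%s %s" % rc)` quoted above.
        raise Exception("%s %s" % rc)
    return out
```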
The reason I had all this in the same patch is that the changes are related. Deleting the connectors/connections is what allows me to take out the tracking of the granted_read_buffers from the adaptors. If I do one PR with the management change and another PR that just removes the granted_read_buffs, I am sure someone looking at the second PR will wonder why we are suddenly able to remove the granted_read_buffs.
Oh great... yet another buffer corruption issue! From the server log:

    2022-09-30 13:26:40.230373 +0000 HTTP_ADAPTOR (trace) [C1] HTTP request/response codec done. Octets read: 393570 written: 210 (../src/adaptors/http1/http1_server.c:1176)

Yep, those numbers look good. Now what did the client see? From the client log:

    2022-09-30 13:31:40.304957 +0000 HTTP_ADAPTOR (trace) [C66] HTTP request msg-id=18 cancelled. Octets read: 210 written: 195584 (../src/adaptors/http1/http1_client.c:965)

TL;DR: the test client is hanging since it hasn't received the full response although the server sent it. I'll go re-open the old issue... Oh look! It's about Beer O'clock on a Friday! So long, folks!!
Force-pushed from ebd2031 to 834e8a3.
    try:
        self._qd_manager = QdManager(address=self.addresses[0])
    except Exception as e:
        return None
Probably don't want to silently discard the exception - it will be very hard to figure out what happened if the QdManager fails in GH actions.
> Probably don't want to silently discard the exception - it will be very hard to figure out what happened if the QdManager fails in GH actions.

You mean it will be hard to figure out why the QdManager constructor fails in GHA? The only reason it could fail is if there is no self.addresses, right? I will add a print(e) before returning None.
Is not having a self.addresses[0] a bug, or is it OK as long as the caller checks for None? I think that's my point of confusion: I don't understand why the exception is being ignored. If the API allows a None return value (not a bug), then I would simply comment the code to explain why the exception is discarded.
As it stands now, not all routers have a valid self.addresses[0], so it is OK to return None. I will add a comment clarifying that. But whoever fixes #753 will make not having a self.addresses[0] illegal and require every router to have a management port on self.addresses[0].
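A minimal sketch of the pattern agreed on above, assuming the QdManager helper from tests/system_test.py; the method name _get_qd_manager is illustrative, not the name used in the PR.

```python
def _get_qd_manager(self):
    """Return a QdManager for this router, or None if the router has no
    usable management address yet (see #753). Hedged sketch of the pattern
    discussed above."""
    try:
        return QdManager(address=self.addresses[0])
    except Exception as e:
        # Not all routers currently expose a management port on
        # self.addresses[0], so a failure here is tolerated. Print the
        # exception instead of discarding it silently, so GH Actions logs
        # show why the QdManager could not be created.
        print("QdManager creation failed: %s" % e)
        return None
```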
Force-pushed from 834e8a3 to 59ad085.