Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #742: Removed granted read buffer tracking in http2 and tcp ada… #743

Merged
merged 1 commit into from
Oct 3, 2022

Conversation

ganeshmurthy
Copy link
Contributor

…ptors

@codecov-commenter
Copy link

codecov-commenter commented Sep 28, 2022

Codecov Report

Attention: Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.

Project coverage is 27.6%. Comparing base (a9b9c90) to head (59ad085).
Report is 502 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main    #743   +/-   ##
=====================================
  Coverage   27.6%   27.6%           
=====================================
  Files        131     131           
  Lines      31431   31418   -13     
  Branches    5033    5032    -1     
=====================================
  Hits        8679    8679           
+ Misses     21647   21634   -13     
  Partials    1105    1105           
Flag Coverage Δ
unittests 27.6% <0.0%> (+<0.1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
unittests 27.6% <0.0%> (+<0.1%) ⬆️
systemtests 27.6% <0.0%> (+<0.1%) ⬆️

Copy link
Contributor

@jiridanek jiridanek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the test failure -
https://github.com/skupperproject/skupper-router/actions/runs/3146738179/jobs/5115896447#step:29:2330
The timeout of 400 seconds, is it set here - https://github.com/skupperproject/skupper-router/blob/main/.github/workflows/build.yaml#L167
Is that an individual timeout per test of 400 seconds ?

Yes, SkstatTest::test_yy_query_many_links had 400 seconds to finish, but it did not finish. I logged issue for this test already #637

Also, I had some thoughts in #672, if we had that, it would be easier to debug this. Currently we cannot tell if this problem is caused by #535 or if it is a completely new issue.

Also, the PR is attempting to close all outstanding adaptor (tcp, http) listeners, connectors and non amqp connections before the router shuts down in test. This way, I don't have to track granted read buffers in the adaptor code. But those buffers are still leaking here - https://github.com/skupperproject/skupper-router/pull/743/checks#step:6:505
It looks like we will have to do the same thing (shutdown all connections) in the container image test that is running h2spec.

What the test does now is described in the comment

* Performs the DISPATCH-1940 reproducer steps using Container images. To run manually:
*
* 1. Write the `ROUTER_CONFIG` above to a file, say `h2spec_router.conf`
* - change `host: nghttpd` to `host: localhost`
* 2. Run router, nghttpd and h2spec, each in its own terminal, using docker or (for skrouterd) without it
* - `docker run --rm -it --network=host nixery.dev/nghttp2:latest nghttpd -a 0.0.0.0 --no-tls -d /tmp 8888`
* - `skrouterd -c h2spec_router.conf`
* - `docker run --rm -it --network=host summerwind/h2spec:2.6.0 -h localhost -p 24162 --verbose --insecure --timeout 10`

What you suggest is that we need to run skmanage, find out what connections are active, and terminate the connections using a management operation? Or is the trick to shutdown the nghttpd first, and then shutdown the router? How do I detect that the connection is already scrapped in the router? With a skmanage call?

Wouldn't it be simpler to teardown all connections in the router itself if it is running in a memory-debug mode?

@jiridanek
Copy link
Contributor

@ganeshmurthy When I put #750 on top of this, the leak from image tests disappears for me

@jiridanek
Copy link
Contributor

https://github.com/skupperproject/skupper-router/actions/runs/3159057422/jobs/5141859848#step:10:5046

33/33 Test #57: system_tests_http1_adaptor ...........***Timeout 1199.71 sec

What just happened?

@ganeshmurthy
Copy link
Contributor Author

https://github.com/skupperproject/skupper-router/actions/runs/3159057422/jobs/5141859848#step:10:5046

33/33 Test #57: system_tests_http1_adaptor ...........***Timeout 1199.71 sec

What just happened?

I don't think that failure has anything to do with this PR in particular but I am not sure. This PR is trying to shut down all connections just before the router shuts down, so I am hoping that the PR did not cause the error. @kgiusti, what do you think ?

Copy link
Contributor

@kgiusti kgiusti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what this does, but I'm concerned that there's something not right if the router is calling close connection twice now.

I would also break this up into two separate PRs: one patch to the CI that does the adaptor/connector/connection cleanup and a follow up PR patch that removes the adaptor buffer tracking. These really are two separate fixes.

qd_manager.delete_all_entities(long_type)
retry_assertion(self.delete_adaptor_connections(qd_manager))
except Exception as e:
# We tried to delete some stuff but ran into issues. There could
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are exceptions being hit in your tests? If so, what kind?

In any case at least add a print statement indicating what happened, the exception raised, etc. rather than failing silently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, also, catching Exception will catch SyntaxError, etc... Best to have explicit list, esp. because it is hard to improve this later on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes a proton ConnectionException is thrown which proton does not seem to expose ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly the problem is that qdmanager is used to do the operation. Qdmanager isn't using proton directly - it spawns a process that runs skmanage as a shell process then simply raises Exception should the process return an error code.

So we lose any access to the proton exception details.

Copy link
Contributor Author

@ganeshmurthy ganeshmurthy Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QdManager does indeed wrap all exceptions into an Exception object but the exception details are not lost. QdManager calls raise Exception("%s %s" % rc) which upon failure prints the following error message (for example a ConnectionException)
ConnectionException: Connection amqp://0.0.0.0:23860 disconnected: Condition('proton.pythonio', 'Connection refused to all addresses') - prints on a proton ConnectionException

@ganeshmurthy
Copy link
Contributor Author

I like what this does, but I'm concerned that there's something not right if the router is calling close connection twice now.

I would also break this up into two separate PRs: one patch to the CI that does the adaptor/connector/connection cleanup and a follow up PR patch that removes the adaptor buffer tracking. These really are two separate fixes.

The reason, I had all this in the same patch is because these are related. Deleting the connectors/connections is what is allowing me to take out the tracking of the granted_read_buffers from the adaptors. If I do a PR with the management change separately and another PR which just removes the granted_read_buffs, I am sure someone might wonder looking at the second PR why suddenly we are able to remove the granted_read_buffs.

@kgiusti
Copy link
Contributor

kgiusti commented Sep 30, 2022

https://github.com/skupperproject/skupper-router/actions/runs/3159057422/jobs/5141859848#step:10:5046

33/33 Test #57: system_tests_http1_adaptor ...........***Timeout 1199.71 sec

What just happened?

I don't think that failure has anything to do with this PR in particular but I am not sure. This PR is trying to shut down all connections just before the router shuts down, so I am hoping that the PR did not cause the error. @kgiusti, what do you think ?

Oh great... yet another buffer corruption issue!

From the server log:

2022-09-30 13:26:40.230373 +0000 HTTP_ADAPTOR (trace) [C1] HTTP request/response codec done. Octets read: 393570 written: 210 (../src/adaptors/http1/http1_server.c:1176)

Yep, those numbers look good. Now what did the client see? From the client log:

2022-09-30 13:31:40.304957 +0000 HTTP_ADAPTOR (trace) [C66] HTTP request msg-id=18 cancelled. Octets read: 210 written: 195584 (../src/adaptors/http1/http1_client.c:965)

TL;DR test client is hanging since it hasn't received the full response although the server sent it.

I'll go re-open the old Issue... Oh look! It's about Beer O'clock on a Friday! So long, folks!!

try:
self._qd_manager = QdManager(address=self.addresses[0])
except Exception as e:
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't want to silently discard the exception - it will be very hard to figure out what happened if the QdManager fails in GH actions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't want to silently discard the exception - it will be very hard to figure out what happened if the QdManager fails in GH actions.

You mean, it will be hard to figure out why the QdManager constructor fails in GHA ? The only one reason it could fail is if there is no self.addresses, right ? I will add a print(e) before returning None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not having a self.addresses[0] a bug, or is it OK as long as the caller checks for a None? I think that's my point of confusion: I don't understand why the exception is being ignored. If the API allows a None return code (not a bug), then I would simply comment the code to explain that is why the exception is discarded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands now, not all routers have a valid self.addresses[0], so it is ok to return a None. I will add a comment clarifying that. But the person who fixes #753 will make not having a self.addresses[0] illegal and require every router to have a management port on self.addresses[0]

@ganeshmurthy ganeshmurthy merged commit 220804c into skupperproject:main Oct 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants