Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds ability to set timeouts for a knxdclient.KNXDConnection #2

Merged
merged 24 commits into from
Dec 2, 2023

Conversation

mrctrifork
Copy link

First of all, thank you for creating this library. It's nice and easy to use.

Issue

I have a server running knxd that reboots once a day and I've noticed that if one uses iterate_group_telegrams and knxd reboots it leaves the coroutine hanging; awaiting for a message that will never come since the connection was dropped. This is caused because there were no timeouts in place.

Suggested fix

I've added an optional parameter to the class KNXDConnection that can be used as a second argument to asyncio.wait_for and it is used in places where we are reading from knxd. The timeout errors will allow us to break out of the while True polling statements.

If the timeout is None old behavior is used instead.

Thank you for taking a look at this.

Miquel

@codecov-commenter
Copy link

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

Copy link
Owner

@mhthies mhthies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, Thank you very much for your contribution! :)

I like the general approach and I noticed that I have had the unaddressed TODO comment to add a timeout for the open_group_socket method all along.

I've left a few comments, because I'd like to have the architecture, how the timeouts are implemented, a bit different. After writing the comments, I'm now wondering whether we need this timeout at all: Shouldn't the TCP connection be closed, when your server reboots, such that we get an asyncio.IncompleteReadError and the run() coroutine raises an exception? If that doesn't work, maybe we should try to configure the TCP keepalive on the network socket?

We'd still need some internal Event to let the other coroutines fail with some Exception as well (see my inline comment for more details), instead of waiting infinitely for a packet on a broken socket connection. But it feels cleaner to rely on the operating system's detection of connection loss than applying our own (duplicated) timeout logics.

knxdclient/client.py Outdated Show resolved Hide resolved
knxdclient/client.py Outdated Show resolved Hide resolved
if self._timeout is not None:
read_task = self._read_raw_knxpacket()
data = await asyncio.wait_for(read_task, self._timeout)
else:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool to have some kind of "keep alive" feature here, i.e. trigger some request to KNXD every few seconds or so, so that we can be sure to receive a packet from KNXD regularly. Then we don't have to rely on KNX bus activity to avoid false timeouts and can chose an even shorter timeout value.

I will take a look at the KNXD protocol spec to see if there is a function that we can (mis)use as a keepalive.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Unfortunately, according to the BCUSDK docs about EIBD's/KNXD's interface, there is no appropriate function to use as a keepalive, that reliably sends a response packet, once we opened the 'group socket' connection.

I'm somehow unconvinced to rely on KNX bus traffic to avoid running into timeouts.

Have you tested whether the event notification mechanism, you implemented today, is sufficient to detect the connection loss and stop the receive loop, when your KNXD server reboots – without the timeout here in the run() coroutine (or with a very very long timeout)?
If it doesn't work: Can you test whether setting up TCP keepalive for the socket here allows us to detect the connection loss reliably?

sock = self._writer.get_extra_info("socket")
# The following seems to depend on your OS and is specific for Linux. See https://stackoverflow.com/q/12248132/10315508
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, after_idle_sec)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_sec)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, max_fails)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the mechanism manually by connecting to a Debian docker image with KNXD installed and used knxtool to simulate some traffic which then I stopped and it seems to work.

It's possible that there is some stuff about coroutines / tasks that goes over my head as I am seeing the CI/CD failing.

The definitive test will come in 5h and 20 minutes :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested whether the event notification mechanism, you implemented today, is sufficient to detect the connection loss and stop the receive loop, when your KNXD server reboots – without the timeout here in the run() coroutine (or with a very very long timeout)?

It was enough as I kept receiving data throughout the night! 🎉

But I see how setting the TCP props on the socket will also be of use.

@mrctrifork
Copy link
Author

Hi, thanks a lot for your feedback.

I am newish writing python. My background is more TypeScript / Kotlin during these last years. I lack knowledge on how async/await actually works in python so I just attempted to make it work.

In regards this:

After writing the comments, I'm now wondering whether we need this timeout at all: Shouldn't the TCP connection be closed, when your server reboots, such that we get an asyncio.IncompleteReadError and the run() coroutine raises an exception? If that doesn't work, maybe we should try to configure the TCP keepalive on the network socket?

Well – Your judgement will be better than mine. My programming background is mostly in other languages so I just don't know the python internals about async/await. I just figured that the "program got stuck because the queue was not closed yet the connection got dropped".

I'll go over your comments when I get the chance and ping you once the changes have been submitted.

Again; thank you for your time and feedback and I'll take this as far as I can :)

@mrctrifork mrctrifork requested a review from mhthies November 28, 2023 09:34
Copy link
Owner

@mhthies mhthies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reworking your code. That looks really good to me! :)

I'd like to test if we now need the timeout mechanism at all (see my new inline comment). Apart from that, you should take a look at the failing unittest and the MyPy complaints. Then, I'm fine with merging the PR.

@mrctrifork
Copy link
Author

As I've stated in this thread

We'll have the official test some time from now :), once that happens I'll consider the new approval to be working.

And you're right. CI/CD is failing now so I'll take a look at that tomorrow

@mrctrifork
Copy link
Author

I can confirm it did the trick!

@@ -148,7 +148,7 @@ async def test_connection_failure(self) -> None:
await connection.open_group_socket()
finally:
await connection.stop()
with suppress(asyncio.CancelledError, ConnectionAbortedError):
Copy link
Owner

@mhthies mhthies Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a ConnectionAbortedError is raised here (in an explicit stop(), I would consider this a bug, so we should not ignore the exception.

EDIT: Sorry, I misread the diff. You already removed the suppressing here again. Then, everything is fine.

return await reader.readexactly(length)
if self._reader is None or self._reader.at_eof():
raise ConnectionError("No connection to KNXD has been established yet or the previous connection's "
"StreamReader is at EOF")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work when manually closing the connection via close() such that no exception is raised in this case?

@mhthies
Copy link
Owner

mhthies commented Nov 29, 2023

I can confirm it did the trick!

Then, we can remove the timeout parameter, can't we?
We can also keep it, but than we should document in the docstring of the __init__method (via an @param timeout ... line) that it should not actually be required and if it is set, it will require regular KNX bus activity.

@mhthies
Copy link
Owner

mhthies commented Nov 30, 2023

I took the liberty to do some small improvements to the code style and add a test for the new behaviour. Please take a look, if this still is in your sense; then I will merge the PR.

@mhthies mhthies merged commit 08d82c6 into mhthies:master Dec 2, 2023
4 checks passed
@mrctrifork
Copy link
Author

Good afternoon, I've been away from the computer for a bit. Thank you for your time to make these improvements and looking at the changes.


The other day I was answering your comment but I did not hit send in the end; shut down the machine and forget about everything.

Then, we can remove the timeout parameter, can't we?
We can also …

I meant that I just tried with my changes – didn't try with the TCP keepalive options

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants