Skip to content

Handling network errors

Hinrich Mahler edited this page Dec 7, 2024 · 14 revisions

Background

Network errors can happen at many levels, on the local side, on the remote or somewhere along the way.

It is hard for applications to identify when an error occurs as it is not always something that can be identified in real time but rather when an additional event happens.

Network errors are a problem which each application have to deal with. Each application for its own use case and desired service level.

Where things can break?

Locally (in PTB)

For something to "break" locally we need to close the network socket. This is something that PTB doesn't do voluntarily and basically if you've encountered such a case, it's a bug.

Remotely on Telegram servers

There is a suspicion that Telegram are closing connections on some conditions. Whether it is true or not we can't really tell. The application should still be able to deal with it.

Somewhere along the way

How to mitigate?

Analysis

There is no one way to handle networking issues. Each bot has its own usage pattern and unique network behaviour.

Some bots will require providing answer to user requests within a limited time, while others can take all the time they need. Some bots will be ok with "losing" a sent message once in a while, while others must ensure that the message was sent within a limited time frame. Some bots send big files, while others only receive small files. Etc., etc.

In addition, the network connection takes an important factor. Hosting your bot in a region with poor Internet (bandwidth / packet lost / disconnections) is the obvious example, but it is not the only problem. In case the bot is hosted "far" from the Telegram Bot servers you'll experience greater latency and possibly more packet loss. TCP should be able to handle packet loss, however, there are side effects for that (when will TCP identify the loss of a packet? what will be the window size? etc.).

Hosting the bot on a slow server can also affect performance and may be observed as networking issues.

Once you've understood the usage pattern of your bot (that includes the QoS) you can go into fine-tuning the ptb library for your needs.

Tweaking PTB

Networking backend tweaks

PTB performs HTTPS requests using the telegram.request.BaseRequest interface. The method BaseRequest.do_request accepts four parameters for controlling timeouts:

  • read_timeout specifies the maximum amount of time (in seconds) to wait for a response from Telegram’s server
  • write_timeout specifies the maximum amount of time (in seconds) to wait for a write operation to complete (in terms of a network socket; i.e. POSTing a request or uploading a file)
  • connect_timeout specifies the maximum amount of time (in seconds) to wait for a connection attempt to a Telegram server
  • pool_timeout specifies the maximum amount of time (in seconds) to wait for a connection to become available from the connection pool

The built-in request.HTTPXRequest (which implements BaseRequest) uses a default of 5 seconds for {read, write, connect}_timeout and 1 second for pool_timeout.

The write_timeout is overwritten to 20 seconds when using send methods which attach files (send_audio, send_document, etc.). When sending big files, calling Bot.send_*(write_timeout=BIGGER_VALUE) might be a good idea.

When using the standard HTTPXRequest, changing the defaults of {read, write, connect, pool}_timeout & connect_timeout is done when initializing the Application.

Application.builder().token("Token").read_timeout(7).get_updates_read_timeout(42).build()

Note that there is both ApplicationBuilder.read_timeout() and ApplicationBuilder.get_updates_read_timeout (and similarly for the other timeouts) since PTB uses two different request objects for Bot.get_updates and all other bot methods.

See also the wiki page on the builder pattern.

Bot.get_updates

When using Application.run_polling()/Updater.start_polling(), getUpdates is achieved using Long Polling. This means that the get_updates request is kept alive for a (Telegram defined) timeout of X seconds to respond with new Update(s). Only if updates are available before the X seconds have passed, the request is closed right away and get_updates is called again. This also means that if a network error occurs during that timeout, it will be discovered only after the timeout has passed.

To be specific, when using the raw API (i.e. using Bot directly), the value you pass for timeout is added to the value of read_timeout and that is the exact timing when the error will be detected. In that mode, the default value of timeout is 0 and the default value of read_timeout is 2.

If you're using Updater.start_polling/Application.run_polling() then the default value of timeout is overridden to 10. That means that the average time to identify network error is 6 seconds.

Depending on the SLA you'd like to provide, if you suffer from frequent errors during getUpdates consider lowering the timeout value to catch the errors faster. Please take in mind that too short polling intervals may have undesirable side effects:

  • Telegram servers might conceive it as abusive and block your bot.
  • Increase the load on your server and network connection.

Socket options

Depending on your OS you may be able to set socket options to perform low level tweaks. In the default networking backend, HTTPXRequest, there is a corresponding parameter for this.

Stabilizing your app

When a network error occurs, be prepared to catch the raised exception and handle it according to your policy (do you want to retry? ignore? other?) or use PTBs built-in mechanism for exception handling. Note that for (network) errors that happen as part of Application.run_polling/Updater.start_polling are also processed via Application.process_error.

PTB

If you think of another way to improve stability from within ptb, please contact us (the maintainers).

Clone this wiki locally