[openhabcloud] Tuning of reconnection parameters #12121

Closed
ssalonen opened this issue Jan 26, 2022 · 4 comments
Labels
enhancement An enhancement or new feature for an existing add-on

Comments

@ssalonen
Contributor

ssalonen commented Jan 26, 2022

As discussed in community forums with @digitaldan

... consider tuning the reconnection parameters such that it would be more graceful to the cloud?

That would be great, the initial websocket connection to our cloud service is an expensive operation, so when we have tens of thousands of OH’s all trying to connect at once, we almost DDOS the system.

To my knowledge, the cloud connector currently reconnects through two different mechanisms:

  1. automatically, as dictated by IO.socket
  2. "manually" by the CloudClient, when the socket is disconnected and we receive Socket.EVENT_ERROR

Both use exponential backoff with random jitter to spread out the reconnects.

Parameters are as follows

IO.socket reconnects (case 1):

  • defaults, that is:

https://github.com/socketio/socket.io-client-java/blob/89ef9d09cee799ffb85e0252bcaf1993f9d49af9/src/main/java/io/socket/client/Manager.java#L146-L149

this.reconnectionDelay(opts.reconnectionDelay != 0 ? opts.reconnectionDelay : 1000); // "MIN"
this.reconnectionDelayMax(opts.reconnectionDelayMax != 0 ? opts.reconnectionDelayMax : 5000); // "MAX" 
this.randomizationFactor(opts.randomizationFactor != 0.0 ? opts.randomizationFactor : 0.5); // "JITTER"

Cloud client reconnects, manually set (case 2):

reconnectBackoff.setMin(1000);
reconnectBackoff.setMax(30_000);
reconnectBackoff.setJitter(0.5);

The delay is calculated as follows

https://github.com/socketio/socket.io-client-java/blob/89ef9d09cee799ffb85e0252bcaf1993f9d49af9/src/main/java/io/socket/backo/Backoff.java#L16-L27

    public long duration() {
// [this.ms = "MIN"]
        BigInteger ms = BigInteger.valueOf(this.ms)
                .multiply(BigInteger.valueOf(this.factor).pow(this.attempts++));
        if (jitter != 0.0) {
            double rand = Math.random();
            BigInteger deviation = BigDecimal.valueOf(rand)
                    .multiply(BigDecimal.valueOf(jitter))
                    .multiply(new BigDecimal(ms)).toBigInteger();
            ms = (((int) Math.floor(rand * 10)) & 1) == 0 ? ms.subtract(deviation) : ms.add(deviation);
        }
        return ms.min(BigInteger.valueOf(this.max)).longValue();
    }

Distilling that code we get:

delay = ( 1 +- rand * JITTER ) * MIN * FACTOR ^ attempts // factor = 2

where rand is a random term in range 0...1.

There is a 50/50 chance of a positive or negative jitter term.
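To make the formula easier to experiment with, here is a rough Python port of the Java `duration()` method quoted above. It is only a sketch: the function name `backoff_duration` and the optional `rand` argument (added so the result can be reproduced) are my own, and floating point arithmetic stands in for the `BigInteger`/`BigDecimal` math of the original. Note that, as in the Java code, the same random number decides both the size and the sign of the jitter.

```python
import math
import random


def backoff_duration(attempt, min_ms=1000, max_ms=5000, factor=2, jitter=0.5, rand=None):
    """Approximate port of Backoff.duration(); `attempt` starts at 0."""
    if rand is None:
        rand = random.random()  # uniform in [0, 1), like Math.random()
    ms = min_ms * factor ** attempt
    if jitter != 0.0:
        # Truncate like BigDecimal.toBigInteger() does for positive values.
        deviation = int(rand * jitter * ms)
        # Even first decimal digit of rand: subtract, odd: add.
        if (math.floor(rand * 10) & 1) == 0:
            ms -= deviation
        else:
            ms += deviation
    # The cap is applied after the jitter.
    return min(ms, max_ms)


print(backoff_duration(0, rand=0.25))  # 875
print(backoff_duration(0, rand=0.5))   # 1250
print(backoff_duration(4, rand=0.5))   # 5000 (capped)
```

The defaults used here are the IO.socket ones (case 1); passing `max_ms=30_000` would correspond to the CloudClient settings (case 2).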

In other words,

// Note: ignoring max delay here
delay = MIN * FACTOR ^ attempts * (1 +- rand * JITTER) // FACTOR = 2, MIN currently 1000 ms, JITTER currently 0.5, rand is a random factor in range 0...1

So the delay is between ( 1 - JITTER ) * MIN * FACTOR ** attempts and ( 1 + JITTER ) * MIN * FACTOR ** attempts. In addition, the IO.socket default currently caps the delay at 5000 ms, which means that from the 5th connection attempt onwards the delay is constant.

attempt #   delay (ms)
1           500...1500
2           1000...3000
3           2000...5000 (without cap would be 2000...6000)
4           4000...5000 (without cap would be 4000...12 000)
5           5000...5000 (without cap would be 8000...24 000)

Resulting in (*)
image
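The capped ranges per attempt can also be computed directly rather than by hand. A minimal sketch, assuming the same MIN/MAX/JITTER/FACTOR semantics as above; the helper name `delay_range` is made up for this illustration:

```python
def delay_range(attempt, min_ms=1000, max_ms=5000, jitter=0.5, factor=2):
    """Return the (lowest, highest) possible delay in ms; `attempt` starts at 0."""
    base = min_ms * factor ** attempt
    low = min((1 - jitter) * base, max_ms)   # jitter fully subtracted, then capped
    high = min((1 + jitter) * base, max_ms)  # jitter fully added, then capped
    return int(low), int(high)


for attempt in range(5):
    low, high = delay_range(attempt)
    print(f"attempt {attempt + 1}: {low}...{high} ms")
```

Calling `delay_range(attempt, min_ms=1000, max_ms=30_000)` gives the corresponding ranges for the CloudClient settings (case 2).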

Could we optimize the parameters for more graceful interaction with the cloud when all or many clients are disconnected at the same time? To be honest, I am not sure which of the reconnect mechanisms (CloudClient vs IO.socket) kicks in in these cases.

We could increase the randomness by increasing JITTER. We could also increase the base delay MIN...

cc @digitaldan

(*)

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
MIN=1000.0; FACTOR=2.0; JITTER=0.5
def delay(attempts): 
    return [( 1 - JITTER ) * MIN * FACTOR ** attempts, ( 1 + JITTER ) * MIN * FACTOR ** attempts]
delays = list(map(delay, range(6)))
plt.plot(list(range(1, len(delays)+1)), delays)
plt.axhline(5000, linestyle='--') # current max delay
plt.ylim(0, 8000)
plt.ylabel('Delay [ms]')
plt.xlabel('attempt #')
plt.title(f'Delay, MIN={MIN}, JITTER={JITTER}')
plt.gca().xaxis.set_major_locator(mticker.MultipleLocator(1))
plt.savefig(f'/tmp/delay_min={MIN}_JITTER={JITTER}.png')
plt.close('all')
delays
@ssalonen ssalonen added the enhancement An enhancement or new feature for an existing add-on label Jan 26, 2022
@Flole998
Member

Flole998 commented Feb 7, 2022

I think it's a good idea to add some kind of configuration option there. I am running 2 instances, one of those I want to reconnect as fast as possible (probably even faster than what is currently "allowed" if I could) and another instance which could very much stay offline for a long time. If there is an option to configure the "need" for reconnection (something simple like a number between 1 and 10 would be enough) then I would set one of the instances to 1 and the other one to 10. With a default of 5 that would allow users who would otherwise build their own version with lower (potentially too low) delays to decrease it and others who have instances just for testing or whatever can increase it so they don't put any unnecessary load on the system. So maybe make min configurable?

@ssalonen
Contributor Author

Best practices for different delay strategies are discussed in https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
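For comparison, one of the strategies described in that article is "full jitter", where the delay is drawn uniformly between 0 and the capped exponential value instead of being centred on it. A minimal sketch of that idea follows; the function and parameter names are my own, and this is not what the cloud connector or IO.socket currently does:

```python
import random


def full_jitter_delay(attempt, base_ms=1000, cap_ms=5000, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2 ** attempt)]."""
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    return rng.uniform(0, ceiling)


for attempt in range(5):
    print(f"attempt {attempt + 1}: {full_jitter_delay(attempt):.0f} ms")
```

Compared with the +-JITTER scheme, this spreads clients over the whole interval down to zero, at the price of some clients retrying almost immediately.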

@ssalonen
Contributor Author

@digitaldan how would you adjust the parameters for more graceful behaviour towards the cloud?

Just as a suggestion to increase randomness: MIN=2000 (ms), JITTER=0.75, with a MAX delay of 15 000 ms

Delay would be as follows:

1: 500     ... 3500
2: 1000    ... 7000
3: 2000    ... 14 000
4: 4000    ... 15 000 (without cap would be 4000 ... 28 000)
5: 8000    ... 15 000 (without cap would be 8000 ... 56 000)
6: 15 000  ... 15 000 (without cap would be 16 000 ... 112 000)

image
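To get a rough feeling for what this means for the cloud, one can simulate many clients that are disconnected at the same instant and count how many first reconnect attempts land in the busiest 100 ms window. This is only a toy model under my own assumptions (10 000 clients, all dropped at t=0, only the first retry considered, integer truncation of the deviation ignored); the function names are invented for the example:

```python
import random
from collections import Counter


def first_retry_delay(min_ms, jitter, rng):
    """First retry delay in ms, using the same sign rule as Backoff.duration()."""
    rand = rng.random()
    deviation = rand * jitter * min_ms
    if (int(rand * 10) & 1) == 0:
        return min_ms - deviation
    return min_ms + deviation


def peak_per_bucket(min_ms, jitter, clients=10_000, bucket_ms=100, seed=0):
    """Largest number of clients whose first retry falls into one time bucket."""
    rng = random.Random(seed)
    buckets = Counter(
        int(first_retry_delay(min_ms, jitter, rng) // bucket_ms)
        for _ in range(clients)
    )
    return max(buckets.values())


print("current  (MIN=1000, JITTER=0.5): ", peak_per_bucket(1000, 0.5))
print("proposed (MIN=2000, JITTER=0.75):", peak_per_bucket(2000, 0.75))
```

In this model the proposed parameters give a lower peak, since the first retries are spread over a 3000 ms range instead of a 1000 ms range.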

@ssalonen
Contributor Author

See #14251
