Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the index of a seed node address in the list of nodes at the #4445

Merged

Conversation

chimp1984
Copy link
Contributor

@chimp1984 chimp1984 commented Aug 28, 2020

Use the index of a seed node address in the list of nodes at the repository to determine the hour to restart.
This should avoid the riks that multiple seed nodes restart at the same time which can lead to data loss.

repository to determine the hour to restart.
Copy link
Member

@sqrrm sqrrm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

Looks ok, but I'll leave a but in case @chimp1984 wants to act on my comment

"\n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n\n", target);
shutDown(gracefulShutDownHandler);
}
}, TimeUnit.MINUTES.toSeconds(10));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use CHECK_SHUTDOWN_SEC here?

Copy link
Contributor Author

@chimp1984 chimp1984 Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK_SHUTDOWN_SEC is 1 hour. That has risk that we miss the hour we are targeting. If one timer event is triggered just 1 sec. before our target hour and then due cpu load the next 1 hour scheduled call takes longer we would miss our target hour. E.g. once call at 13:59, next call at 15:01, 14:00 would be our target.

With 10 min. Its safe enough and does not cause performance costs. We could maybe go higher like 30 min as well...but prefer to keep it on the safe side.

@wiz
Copy link
Contributor

wiz commented Aug 29, 2020

I ran this on wizseed3 and it restarted once in a day during the 9th hour of the day, so the restart code tested fine. However, I have a few comments on the implementation used to decide when to restart. Are you sure it's a good idea to sort the list of hostnames before deriving the IDs? By doing it that way, whenever we add new seednodes to the list it will change all the IDs, and since not everyone will upgrade on the same day, this might cause 2 seednodes to have the same ID and restart at the same time, causing the exact issue we are trying to prevent from happening.

Alternatively, why not use a deterministic function to derive a number from the digest of the onion hostname? For example, since there are 1440 minutes in a day, so if you take the the md5sum of the hostname wizseed3d376esppbmbjxk2fhk2jg5fpucddrzj2kxtbxbx4vrnwclad, which is 0xd633dfb94a557650c94b995f4c42bf61, and mod that by 1440 you get 289, so the node should restart on the 289th minute of every day. This is quite safe because it only takes a node about a minute to restart, and the chance of 12 seednodes colliding in a 1440 namespace is less than 1%, which seems like it would be more reliable than only using the 24 hours of a day.

Additionally, regardless of which implementation is used, the restart code should use UTC regardless of the timezone that the server is set to. Some servers have their timezone set to local times which would offset the deterministic restart time.

@sqrrm
Copy link
Member

sqrrm commented Aug 29, 2020

@wiz it's using
int currentHour = ZonedDateTime.ofInstant(Instant.now(), ZoneId.of("GMT0")).getHour();
Does that mean it's on daylight savings or is GMT the same as UTC? I hate time zones, and daylight savings even more.

@mrosseel
Copy link
Contributor

@sqrrm this page below confirmed my thoughts that both are equivalent, using UTC seems more the standard way of doing it though.
https://www.timeanddate.com/time/gmt-utc-time.html

@wiz agreed that the hash is more elegant, but by using the ID the chance of collision is 0% (up to 24 seednodes) which is even better

@sqrrm
Copy link
Member

sqrrm commented Aug 29, 2020

@mrosseel thanks for the link, good info.

The whole point of this exercise was to have a deterministic restart schedule to have 0 chance that all nodes restart at the same time. Two nodes restarting at the same time is probably fine, three nodes might cause trouble in extreme cases. Merging.

@sqrrm sqrrm merged commit 591d68a into bisq-network:master Aug 29, 2020
@chimp1984
Copy link
Contributor Author

Re timezone: Local timezone is ignored, so it does not matter what server has set. It is just important to set it same for all as otherwise hour 3 might be same as hour 5 if timeezone of 2 nodes is 2 hours different.

Re sorting: Good point with adding a node. I could use the list as it is defined, but in the code its a HashSet, so no order guaranteed but I could change that. Hash with mod will result in a random value which will not avoid out birthday problem. It would be same as now with random restart. The birthday problem is particularily a case for hash collusions for hash use cases.

@chimp1984 chimp1984 deleted the restart-seednodes-deterministically branch August 29, 2020 17:56
@ripcurlx ripcurlx added this to the v1.3.8 milestone Aug 29, 2020
@chimp1984
Copy link
Contributor Author

There is one problem with using the order in the node list: If a node gets removed we run into the same problem. I keep the sorting for now until a "perfect" solution comes up. If a node gets added we have to update all nodes quickly and restart them.

@wiz
Copy link
Contributor

wiz commented Aug 30, 2020

@chimp1984 do you not like my suggestion of using the deterministic hash function(onion) mod 1440 ?

@chimp1984
Copy link
Contributor Author

@wiz This does not solve the problem of potential collusions. The only benefit over a random selection is that you can run it first and see the results and can check if there are collusions. But if you add a node you have to run again and if it produces a collusion you are in bigger troubles as with the sorted list model. The you need to change the mod or hash and try to get no collusion again. But with any further additions it can be the same again.... Not saying it is likely but does not provide what we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants