Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] P2P: Use magnitude safe time types #1316

Merged
merged 3 commits into from
Jun 20, 2023
Merged

Conversation

heifner
Copy link
Member

@heifner heifner commented Jun 17, 2023

Use std::chrono types instead of tstamp so that magnitude is maintained. Fixes issue with heartbeat calculation being off on some machines causing pre-mature close of what it thought was a idle connection.

Resolves #1315

@heifner heifner added the OCI Work exclusive to OCI team label Jun 17, 2023
Copy link
Contributor

@greg7mdp greg7mdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what was the issue with tstamp. I felt that standardizing on the number of nanoseconds since epoch was nice and simple, but I'm clearly missing what the issue was here.

void connection::check_heartbeat( tstamp current_time ) {
if( latest_msg_time > 0 ) {
void connection::check_heartbeat( std::chrono::system_clock::time_point current_time ) {
if( latest_msg_time > std::chrono::system_clock::time_point::min() ) {
if( current_time > latest_msg_time + hb_timeout ) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue was here. current_time was in nanoseconds. lastest_msg_time was also in nanoseconds, but hb_timeout was in milliseconds. So when the timer ran the latest_msg_time had to be within 10000 nanoseconds (10 microseconds) or it would close the connection.

Copy link
Contributor

@greg7mdp greg7mdp Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue was that hb_timeout was in milliseconds. It seems incorrect to me, as it is defined as std::chrono::system_clock::duration::rep, so I feel we should consistently use system_clock duration in tstamp (on my linux system, system_clock's duration is defined as chrono::nanoseconds.

I assume that by using std::chrono types, we will benefit from an automatic conversion in if( current_time > latest_msg_time + hb_timeout ) which will prevent the error. But tstamp was already a std::chrono duration, so I wonder if most of the changes, including this one, are really needed.

My feeling is that the change at line 610 (hb_timeout = msec;) would have been enough for the fix, as the problem was that we used count() which just returned a tick count, so we didn't benefit from the conversion from milliseconds to nanoseconds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be. Either way, this is much cleaner/safer code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this makes the code cleaner or safer. I think if you changed line 610 to hb_timeout = msec.count() the issue would come back exactly as before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused at what you are suggesting should be done. time_point, milliseconds, nanoseconds types are well defined and arithmetic operations are handled correctly. tstamp (std::chrono::system_clock::duration::rep) on the other hand is defined as rep - signed arithmetic type representing the number of ticks in the clock's duration which is system dependent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is nanoseconds too sensitive in P2P?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, nanoseconds is kind of ridiculous to use. However, that is what we were already using for time_message and time of handshake_message. Internally now with this change we just use whatever std::chrono::system_clock::now() returns.

Copy link
Contributor

@greg7mdp greg7mdp Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tstamp should be either defined as:

  1. std::chrono::nanoseconds (in which case we would benefit from safe type checking and conversions),
  2. or as int64_t (in which case it would be clear it is simply a number and up to the programmer to ensure we always store nanoseconds into it).

As far as I can see, the bug was that we are in case #2 (although it is somewhat obfuscated and unreliable because we use a std::chrono internal type) and that we stored milliseconds in a tstamp (line #611).

Your change fixes the bug at line 611, and in addition replaces the use of tstamp with std::chrono::system_clock::time_point. I think it makes the code less readable.

My preference would be to implement #1 above, i.e. redefine tstamp as std::chrono::nanoseconds, which I think would provide the additional safety in a cleaner way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely removed tstamp type and backported the better time_message handling from main.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kevin for the change. I'm not quite sure is the initialization with std::chrono::system_clock::time_point::min() has a specific benefit, I kinda feel that 0 is more appropriate than a negative value. And I also thing that defining tstamp as std::chrono::nanoseconds and using it for our connection time tracking members would be better. But maybe we can talk about this further for 5.0.

Copy link
Member

@linh2931 linh2931 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add some tests for the original problem and/or those changes?

@heifner heifner linked an issue Jun 19, 2023 that may be closed by this pull request
@heifner heifner merged commit e9016b0 into release/4.0 Jun 20, 2023
@heifner heifner deleted the GH-1315-time-4.0 branch June 20, 2023 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4.0.2: p2p constantly disconnects
4 participants