Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of DateTime.UtcNow on Windows #50263

Merged
merged 2 commits into from
Mar 27, 2021

Conversation

GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Mar 25, 2021

Follow-up to #44771. Builds atop work done in #45479 and #46690.
Resolves #13091.

This PR is based on a series of conversations @tarekgh and I have had recently with regard to short-circuiting our leap second handling logic where possible. For background, Windows understands that days normally have 86,400 seconds, but this can be off by ±1 second due to positive or negative leap seconds. .NET's DateTime type requires each day to have exactly 86,400 seconds. Previous .NET releases resolve this discrepancy through the following special-cased logic: "Is the current time 23:59:60.xxx? If so, I'll return a DateTime instance whose time component reads 23:59:59.999." Unfortunately, this check is not free, and it shows up in benchmarks when we access DateTime.UtcNow.

This PR relies on the following two properties of leap seconds:

  • leap seconds only ever kick in on the last second of the day, and
  • any given day will never be adjusted beyond ±1 second due to leap seconds.

This optimization works by keeping a cache which maps the last-seen Win32 FILETIME struct to its equivalent .NET DateTime struct. The cache is valid for some window of time (by default, 5 minutes). If a call to DateTime.UtcNow takes place during the cache's validity window, we don't bother asking the OS to deconstruct the FILETIME struct back into a SYSTEMTIME struct (which accounts for leap seconds). Instead, we know that we can just perform simple addition without worrying about leap second handling.

There is one condition we need to check: Does a leap second actually occur within the next 5 minutes? If so, we can't store (now, now + 5 min) into the cache, otherwise DateTime.UtcNow will start returning incorrect values as soon as midnight rolls around. To account for this, we slide the cache window backward, choosing to set the cache validity period to (today at 23:54:59, today at 23:59:59) instead of (now, now + 5 min).

The reason we use a 5 minute validity window instead of a 24-hour validity window is called out in code comments. tl;dr: The OS uses shared memory mapping to publish leap second tables to all processes, and it can extend that table without killing active processes or rebooting the machine. There is a slim but non-zero chance that this table can be updated to insert leap second data into the past. If this happens, our cache will be incorrect. After consultation with the Windows team, we decided that a 5-minute validity window is an appropriate tradeoff that improves performance in the common case and minimizes the window in which an application might encounter problems if this situation were to occur.

Method Job Toolchain Mean Error StdDev Ratio
GetUtcNow Job-JYYDLR main 65.15 ns 0.103 ns 0.097 ns 1.00
GetUtcNow Job-RHLQLD proto 27.74 ns 0.359 ns 0.336 ns 0.43

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 25, 2021

Since it's very difficult to unit test this type, I'm performing manual testing by breaking into a debugger and mutating registers. Will validate edge case conditions and report back here.

Edit: Tarek already added some UtcNow unit tests a while back when initial leap second support was added. See dotnet/corefx#33939.

// If FileTimeToSystemTime fails, call GetSystemTime and try again until it succeeds.
while (Interop.Kernel32.FileTimeToSystemTime(&fileTimeNow, &systemTimeNow) == Interop.BOOL.FALSE)
{
goto TryAgain;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this loops forever for some reason -- I guess we assume the OS is broken and it doesn't really matter that we didn't fail?

Copy link
Member Author

@GrabYourPitchforks GrabYourPitchforks Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had Environment.FailFast in here but figured that there might be a race condition where the OS is publishing new data at the same time we're running this method. If something's borked in the OS we'll probably loop forever. Hopefully we'll only loop until midnight (which is 2 seconds away at worst). :)

If this is a concern I can perhaps keep a counter limiting us to 10 tries before we failfast? Or maybe always fall back to new DateTime(GetSystemTime())?

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 25, 2021

Test case information

  • Ensure DateTime.UtcNow uses cache when inside validity period (line 30)
  • Ensure DateTime.UtcNow doesn't use cache when before validity period (line 33)
  • Ensure DateTime.UtcNow doesn't use cache when after validity period (line 33)

The three test cases above were verified by trapping calls to GetSystemTimePreciseAsFileTime (call site at line 19) in a debugger and manipulating the value before it's returned.

(For the following tests, ImageFileExecutionOptions was used to mark corerun.exe as "leap-second aware".)

  • Ensure cache early-exit is hit if seconds field reads 60 (line 215)

This test case was verified by using the w32tm tool to insert a positive leap second into the system table, then trapping calls to GetSystemTimePreciseAsFileTime to return the FILETIME corresponding to 23:59:60 on that date, then stepping through the debugger and validating that the returned SYSTEMTIME struct had its seconds field set to 60.

  • Ensure validity window is shifted back if a positive leap second would occur and that entire validity window is valid (line 253)

This test case was verified by using the w32tm tool to insert a positive leap second into the system table, then changing the system time to 23:59:30, then stepping through the debugger to ensure that the DateTime instance cached had the time value 23:54:59 and the DateTime instance returned had the time value 23:59:30.

  • Ensure validity window is shifted back if a negative leap second would occur and that entire validity window is valid (line 253)

This test case was verified by using the w32tm tool to insert a negative leap second into the system table, then changing the system time to 23:59:30, then stepping through the debugger to ensure that the DateTime instance cached had the time value 23:54:59 and the DateTime instance returned had the time value 23:59:30.

Tools and resources for testing

@mattjohnsonpint
Copy link
Contributor

In real life, leap seconds can only occur on two specific dates, June 30th and Dec 31st (UTC). Would it be worth optimizing further for that? Or do we want to accept any leap second Windows throws at us?

@GrabYourPitchforks
Copy link
Member Author

@mattjohnsonpint Good question. I don't see any good opportunity for us to take advantage of that. Even in the original abandoned PR (#44771), which was pretty aggressive about caching, we didn't look at the date component. We were actually asked to be less aggressive. :) So I knocked the cache validity window down from 23h59m59s to 5 min.

Given this, do you see an opportunity for us to use the date component to increase the size of the window? Nothing comes to mind on my end.

@tarekgh
Copy link
Member

tarekgh commented Mar 26, 2021

I want to mention a historical leap second can be inserted to the system too. if we make the cache update only at the end of June or December, that increase the risk when that leap second gets inserted.

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for getting this ready.

One suggestion is to do some testing using w32tm.exe to insert some leap seconds and look how the .NET would report when doing that.

@GrabYourPitchforks
Copy link
Member Author

@tarekgh You mean 4.8? Sure, I can do that. Any other runtimes of interest?

Open question: is it possible that DateTime.UtcNow might now be recursive? Like, some debugging or introspection engine hooking allocations and and calling DateTime.UtcNow as part of its logging mechanism? If this is a possibility, I can add a guard against it.

@tarekgh
Copy link
Member

tarekgh commented Mar 26, 2021

You mean 4.8? Sure, I can do that. Any other runtimes of interest?

No, I meant try the tool with your changes. the tool can allow you insert leap seconds. so, you can try run your test and in the middle run the tool to insert a leap second and look if the test continues behave nicely :-)

Open question: is it possible that DateTime.UtcNow might now be recursive? Like, some debugging or introspection engine hooking allocations and and calling DateTime.UtcNow as part of its logging mechanism? If this is a possibility, I can add a guard against it.

I am not aware of any case like that which can cause the recursion. I am not even sure if this is possible with UtcNow.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 26, 2021

Pondered this a bit more, and it turns out there are edge case scenarios (exception filters, possibly others) where arbitrary user code might run inside UtcNow. Makes me think we should have a guard against recursive calls, just so that we don't bring down the app in these weird edge cases. It won't add any overhead to the fast "cache is valid" path, so I don't think there's any code risk from adding them.

@tarekgh
Copy link
Member

tarekgh commented Mar 26, 2021

Makes me think we should have a guard against recursive calls,

Were you able to get into this case? could you please share more info how this can happen?

@GrabYourPitchforks
Copy link
Member Author

Were you able to get into this case? could you please share more info how this can happen?

I didn't try it. It was just a thought experiment. But in theory:

try
{
    DoWork();
}
catch when (MyExceptionFilter())
{
    /* blah blah */
}

If DoWork throws an exception, MyExceptionFilter could end up running within DoWork's frame. (.NET doesn't unwind the call stack.)

If DoWork calls DateTime.UtcNow, and if DateTime.UtcNow throws an exception (perhaps an OOM due to the allocation in the cache miss path), and if MyExceptionFilter also calls DateTime.UtcNow (perhaps as part of logging), then we're in a recursive call scenario. I really don't think this is likely in practice, but I can't prove to myself that it will never happen.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 26, 2021

@tarekgh I tried all kinds of wonderful and horrible combinations, such as using w32tm to add leap seconds while this method was executing, changing the system time while this method was executing, changing the process's "60-second leap seconds" support while this method was executing, and so on. (I didn't try removing leap seconds while this method was executing, as that would violate OS invariants. The w32tm tool doesn't allow this without a system reboot anyway.)

In all cases, the cutoff point is the code below.

ulong fileTimeAtEndOfValidityPeriod = fileTimeNow + LeapSecondCache.ValidityPeriodInTicks;
Interop.Kernel32.SYSTEMTIME systemTimeAtEndOfValidityPeriod;
if (Interop.Kernel32.FileTimeToSystemTime(&fileTimeAtEndOfValidityPeriod, &systemTimeAtEndOfValidityPeriod) == Interop.BOOL.FALSE)
{
goto TryAgain;
}

If the clock / leap seconds are changed before this point, we correctly respect it. Once this point is hit, we're already down the "start calculating values to cache" code path. So if the clock / leap seconds change after this point, we'll end up caching incorrect values for 5 minutes. Then after that 5 minute window is up, we'll evict our cache and query the OS for the actual time again.

To the application, the observed behavior is that the return value of DateTime.UtcNow might jump a little bit forward or backward once the cache expires. It seems to be no different than today's behavior if I were to change the system clock by a second or two while the app is running. I didn't see any ill effect.

I'll update the code comments to clarify this.

@tarekgh
Copy link
Member

tarekgh commented Mar 26, 2021

@GrabYourPitchforks thanks for testing with w32tm tool. I think at least we have confirmed the expected behavior and what would be the worst-case scenario which still be acceptable considering these are rare scenarios. We should be good to go.

@tarekgh
Copy link
Member

tarekgh commented Mar 26, 2021

If DoWork calls DateTime.UtcNow, and if DateTime.UtcNow throws an exception (perhaps an OOM due to the allocation in the cache miss path), and if MyExceptionFilter also calls DateTime.UtcNow (perhaps as part of logging), then we're in a recursive call scenario. I really don't think this is likely in practice, but I can't prove to myself that it will never happen.

I don't think this is a specific problem for that case but it is generic issue which can happen in many ways.

CC @jkotas if he has any thoughts around that.

@mnmr
Copy link

mnmr commented Mar 26, 2021

I was wondering if there has been a discussion about changing the way leap seconds are handled?

Reading this discussion makes me think that the same code running on Windows and Linux will report different times, which is unfortunate. Google has in interesting solution where leap seconds are "distributed" over an hour, which has the nice side effect that time only ever moves forward. Is something like this being considered at all?

Apologies if this is not the right forum for voicing such an idea.

@GrabYourPitchforks
Copy link
Member Author

Google has in interesting solution where leap seconds are "distributed" over an hour, which has the nice side effect that time only ever moves forward. Is something like this being considered at all?

This is also the default behavior for Windows applications. Applications must opt in to leap second support. If your application has not opted in, the OS fudges the time it reports back to you by slowing down the clock over the last 2 seconds.

Examples:

Actual time GetSystemTime returns
23:59:59.000 23:59:59.000
23:59:59.500 23:59:59.250
23:59:60.000 23:59:59.500
23:59:60.500 23:59:59.750

The logic proposed by this PR handles this scenario fine.

@jkotas
Copy link
Member

jkotas commented Mar 26, 2021

I don't think this is a specific problem for that case but it is generic issue which can happen in many ways.

Yes, OOM can happen at atribitrary points, e.g. due to JITing.

Makes me think we should have a guard against recursive calls, just so that we don't bring down the app in these weird edge cases

Not worth it. The app is going down once you hit this situation anyway.

@GrabYourPitchforks GrabYourPitchforks merged commit eed3c76 into dotnet:main Mar 27, 2021
@GrabYourPitchforks GrabYourPitchforks deleted the datetime_utcnow branch March 27, 2021 04:52
@adamsitnik adamsitnik added the tenet-performance Performance related issue label Mar 29, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression: DateTime.UtcNow and DateTimeOffset.UtcNow 2.5x slower
8 participants