Skip to content

Conversation

@AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Mar 29, 2020

Description

More aggressive version of PR #646

  • One division instead of two
  • To avoid overflow, subtract first observed counter from every counter
  • To avoid compatibility breaking, add adjusted first counter back to final result
  • Caches all in header, so no ABI break, also save on CRT DLL call
  • Ad-hoc std::atomic, but previous concerns addressed, now does exactly as real std::atomic

Checklist

Be sure you've read README.md and understand the scope of this repo.

If you're unsure about a box, leave it unchecked. A maintainer will help you.

  • Identifiers in product code changes are properly _Ugly as per
    https://eel.is/c++draft/lex.name#3.1 or there are no product code changes.
  • The STL builds successfully and all tests have passed (must be manually
    verified by an STL maintainer before automated testing is enabled on GitHub,
    leave this unchecked for initial submission).
  • These changes introduce no known ABI breaks (adding members, renaming
    members, adding virtual functions, changing whether a type is an aggregate
    or trivially copyable, etc.).
  • These changes were written from scratch using only this repository,
    the C++ Working Draft (including any cited standards), other WG21 papers
    (excluding reference implementations outside of proposed standard wording),
    and LWG issues as reference material. If they were derived from a project
    that's already listed in NOTICE.txt, that's fine, but please mention it.
    If they were derived from any other project (including Boost and libc++,
    which are not yet listed in NOTICE.txt), you must mention it here,
    so we can determine whether the license is compatible and what else needs
    to be done.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 29, 2020 06:03
Actually there's no quarantee in docs that QPC is non-negative
performance difference cannot be noticed
@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Mar 29, 2020

I discovered the optimization is incorrect. Original code:

const long long _Part  = (_Ctr % _Freq) * period::den / _Freq;

assumes that _Freq*period::den fits 64 bit.

after change the code assumes is that min(_Freq, period::den)*_Ctr fits 64 bit, which is what the original approach protects against.

@AlexGuteniev AlexGuteniev deleted the chrono_cache_qpf_2 branch March 29, 2020 19:42
@AlexGuteniev AlexGuteniev restored the chrono_cache_qpf_2 branch March 29, 2020 19:45
@AlexGuteniev AlexGuteniev reopened this Mar 29, 2020
@AlexGuteniev AlexGuteniev deleted the chrono_cache_qpf_2 branch March 29, 2020 19:57
@AlexGuteniev AlexGuteniev restored the chrono_cache_qpf_2 branch March 30, 2020 03:57
@AlexGuteniev
Copy link
Contributor Author

Now I applied different approach, this may work

@AlexGuteniev AlexGuteniev reopened this Apr 2, 2020
@AlexGuteniev AlexGuteniev changed the title <chrono>: Cache QueryPerformanceFrequency() and one division results <chrono>: Cache QueryPerformanceFrequency() and divide just once Apr 2, 2020
@StephanTLavavej StephanTLavavej added performance Must go faster info needed We need more info before working on this labels Apr 6, 2020
@StephanTLavavej
Copy link
Member

This is paying a significant amount of implementation complexity - I think we need a strong performance justification for making this change. Can you provide performance numbers for steady_clock::now:

  1. Before <chrono>: Cache QueryPerformanceFrequency() #646
  2. After <chrono>: Cache QueryPerformanceFrequency() #646, but before this PR
  3. After this PR

@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Apr 7, 2020

@StephanTLavavej ,
See https://github.com/AlexGuteniev/benchmark_653

My results are
x86:

6042.98  before 646
5904.46  after 646
4520.97  after 653

x64:

3769.84  before 646
3649.27  after 646
2756.91  after 653

It would be simpler if it would go as vNext issue, then no fake atomic needed, it would be possible to move the most of it to .cpp.

@AlexGuteniev
Copy link
Contributor Author

Another option to achieve the same is to use _mul128 / _div128.
But then the optimization will be applied to x64 only, and it will take bringing in _mul128 / _div128 intrinsics, not currently available in "intrin0.h"

Copy link
Member

@StephanTLavavej StephanTLavavej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @cbezault, I pushed relatively small changes after you approved.

simply for physical proximity, so we'll maintain them together with the rest of the atomic machinery.
stl/inc/chrono Outdated
#define _CHRONO_
#include <yvals_core.h>
#if _STL_COMPILER_PREPROCESSOR
#include <intrin0.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it strange if that makes me happy?

@StephanTLavavej StephanTLavavej self-assigned this May 13, 2020
@StephanTLavavej StephanTLavavej merged commit ad1a26a into microsoft:master May 14, 2020
@StephanTLavavej
Copy link
Member

Thanks for this significant performance improvement! I would have never noticed that this was possible, so I'm glad you did. 😺

@AlexGuteniev AlexGuteniev deleted the chrono_cache_qpf_2 branch May 15, 2020 07:59
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this pull request Jul 2, 2020
…nce (microsoft#653)"

This reverts commit ad1a26a.

# Conflicts:
#	stl/inc/chrono
#	stl/inc/xatomic.h
StephanTLavavej added a commit to AlexGuteniev/STL that referenced this pull request Jul 2, 2020
StephanTLavavej added a commit that referenced this pull request Jul 3, 2020
This reverts commit ad1a26a which was later modified by #694.

Fixes #971.

Co-authored-by: Stephan T. Lavavej <stl@microsoft.com>
@jpark37
Copy link

jpark37 commented Nov 14, 2020

I see this has been backed out since, but is there a reason to use atomics and cache to function-local static variables? Wouldn't it be simpler and faster (no branch) to pull those variables out of the function and initialize them before main()? I see _Query_perf_frequency() in xtime.cpp is also implemented with function static atomic.

This seems better to me, but maybe I'm missing something:

const long long freq_cached = [] {
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);
    return frequency.QuadPart;
}();

long long _Query_perf_frequency() {
    return freq_cached;
}

@BillyONeal
Copy link
Member

@jpark That would add a dynamic initializer to the STL and I believe we don't have any of those right now. I seem to recall adding any causing issues with /clr:pure code that tried to statically link the STL in a partially trusted context, since they can't necessarily run any native code here.

But my memory is a bit foggy on the specifics. I know we had to avoid the dynamic initializer but I don't remember the exact technical limitations forcing it.

@BillyONeal
Copy link
Member

(Also I don't know if QueryPerformanceFrequency is in the "safe to be called from DllMain" subset since the default answer for "is this safe to call from DllMain" is "no".)

@jpark37
Copy link

jpark37 commented Nov 15, 2020

Thanks for the info.

Aren't people allowed to use STL before main() in general? I've seen people declare STL collections as global variables at least. Right now, if someone calls steady_clock::now() in a global variable initializer, and it's the first call into _Query_perf_frequency(), that will call QueryPerformanceFrequency().

Thinking about it more, it may be unsafe to call my version of _Query_perf_frequency() from another initializer before main() because of not being able to control the order of initialization across translation units, so freq_cached might still be 0 with unlucky order. I don't know if the story is any different with /MT vs. /MD.

@BillyONeal
Copy link
Member

Aren't people allowed to use STL before main() in general?

The STL allows people to call it before main(), but if the user is a DLL they still need to observe all the usual platform requirements regarding DllMain.

Right now, if someone calls steady_clock::now() in a global variable initializer, and it's the first call into _Query_perf_frequency(), that will call QueryPerformanceFrequency().

Right, that's a situation we'd like to avoid because it introduces initialization order issues; and QPC isn't explicitly documented as being safe in DllMain.

Thinking about it more, it may be unsafe to call my version of _Query_perf_frequency() from another initializer before main()

It looks correct to me. It might result in 2 threads which both do QPF/QPC but they should still get correct answers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants