add env var to override sys::Instant::actually_monotonic() for windows and unix #84448
Conversation
(rust-highfive has picked a reviewer for you, use r? to override)

@rustbot label T-libs
Force-pushed from 08e8214 to 79512a4.
I think the general approach is very reasonable. As funny as I find "twice_a_day", I think this should use more mundane naming. I also think it needs […]. I would suggest […].
…s and linux the environment variable is read on first use and cached in an atomic
Force-pushed from 79512a4 to 70d551a.
Sure, done.

Are environment controls considered part of the standard library's stable API?
@the8472 To answer your other question:
I think we should absolutely document this. I also think we should very explicitly say "if you need to use […]".

Ok. We still don't have stability markers for module documentation, so I'll add a "subject to change" note to the prose so it shouldn't be considered insta-stable.
I think the environment variable should perhaps be an opt-in to the "maybe buggy" behavior rather than the reverse. Otherwise I would expect it to become common advice to set it: most code won't notice the performance hit of the mutex, but would likely notice the possibility of panics and want to avoid that. I think I'd feel differently if we expected this to be limited to non-tier-1 targets or similar, but from what I can tell this is likely to be a problem for us indefinitely without a (somewhat costly) solution. Perhaps that cost can be pushed up into the kernel, I'm not sure; unless we know that the function we're calling to get the timestamp is backed by a similar mutex (whether in kernel or other host environment), it seems like we should just ensure correctness ourselves.
Some of that was already discussed in https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/more.20broken.20clocks
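For context, here is a minimal, hypothetical sketch of the kind of lock-based fixup being weighed above. It is not the actual std implementation (the real code lives in the `sys` layer and works on platform-specific timestamp types); it just illustrates why every `Instant::now()` would pay for a process-wide lock:

```rust
use std::sync::Mutex;
use std::time::Duration;

// Simplified stand-in for "the last timestamp handed out".
static LAST_NOW: Mutex<Duration> = Mutex::new(Duration::ZERO);

// Clamp every new reading against the previous one so callers never observe
// time going backwards. The cost is a global lock on each call.
fn monotonize(raw: Duration) -> Duration {
    let mut last = LAST_NOW.lock().unwrap();
    if raw > *last {
        *last = raw;
        raw
    } else {
        *last
    }
}
```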
Force-pushed from f11dd98 to 9cc0348.
Transplanting my concerns/comments from Zulip: I think that people who deploy their software to any broken hardware will be inclined to add something like

```rust
fn main() {
    didnt_check_time_before(); // but now does
    env::set_var("RUST_CLOCK_ASSUME_MONOTONIC", "0");
    Instant::now(); // this is supposed to set up the cache.
}
```

and would observe a change in behaviour once […].

EDIT: I doubt that saying "this is meant for debugging and is not covered by stability guarantees" will prevent anybody from relying on this environment variable to make their code work on the sandybridges they may come across, to be honest.
Do we have any data on how big of a performance impact the current workaround is?
Down that road lie bad equilibria.

The current approach is a one-way ratchet where we mark some platform as non-monotonic when a user reports an issue and then never undo that, because there is no mechanism to gauge the opposite case. Ideally we would make […].

Yes, the API contract of CLOCK_MONOTONIC is that it is monotonic. Windows makes similar claims, and yet we treat Windows as never monotonic. So we should treat this as an OS or hardware bug instead. The issue is easy to paper over, and we should have an option to do that. But making it visible should also be an option; we shouldn't guarantee to hide platform bugs forever.

The problem is with the environment though, so the intent is to give people the option to mark specific environments as unreliable, not to make it part of the code.
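For reference, this is roughly the contract in question on unix, sketched with the `libc` crate (an assumption for illustration; std uses the same underlying syscall for `Instant` on these platforms): `CLOCK_MONOTONIC` is documented to never go backwards, which is precisely what the buggy hardware/hypervisor combinations violate.

```rust
// Sketch only: query the monotonic clock directly via libc.
fn raw_monotonic() -> libc::timespec {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    // SAFETY: `ts` is a valid, writable timespec.
    let ret = unsafe { libc::clock_gettime(libc::CLOCK_MONOTONIC, &mut ts) };
    assert_eq!(ret, 0, "clock_gettime(CLOCK_MONOTONIC) failed");
    ts
}
```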
Do we have anything like […]? I don't want to use an unstable function because that makes it more difficult for users of […]. I could also remove the documentation part referring to the environment variable and only leave the request to report bugs, and we can then direct users to the variable when they have filed a bug.
That depends on whether you're interested in throughput or worst-case latency. It also depends on how many threads you have and how frequently they call […].

The kernel can do a better workaround than perfectly serializing because rdtsc actually returns a 64bit value, instead of the 128bit […]. My optimization in #83093 does not change the general picture for unix systems, it only reduces the constant overhead, not the asymptotics. On Windows we can do better since it exposes a 64bit counter value.
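To illustrate that last point: when the raw reading fits in 64 bits (a tick count rather than a full seconds-plus-nanoseconds pair), the fixup can be a single lock-free atomic max instead of a lock. A rough sketch with illustrative names, not the actual std or kernel code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static LAST_TICKS: AtomicU64 = AtomicU64::new(0);

// Lock-free monotonization of a 64-bit counter: the result is the max of the
// new reading and the last value handed out.
fn monotonic_ticks(raw: u64) -> u64 {
    let prev = LAST_TICKS.fetch_max(raw, Ordering::AcqRel);
    prev.max(raw)
}
```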
Addressing this particular comment now because I recently heard a relevant analogy (VMs are cattle, not pets) from a coworker with regards to cloud instances. Any organization dealing with more than a couple of instances in the cloud will want to ensure that the instances themselves carry no configuration specific to the particular machines their applications run on. For that reason all sorts of details end up in the application code instead. This environment variable would very likely end up there too, most likely set unconditionally. I can see the same rationale applying to applications that run on user machines.
I don't believe so.
I do understand the desire, but I do not want to actively support it because it leads to bad places, as already outlined in my previous comment. It's similar to protocol ossification or permanent technical debt that nobody wants to clean up. If anything this makes me think we need a less stable mechanism to make this available. Maybe call it […].
Since controlling this via environment variables seems contentious, here's another approach: […]

Instead of panics, backslides would now saturate to zero. The downside is that […].
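A rough sketch of the saturating idea, expressed against the public API (the actual change would live inside std): a reversed comparison clamps to zero instead of panicking.

```rust
use std::time::{Duration, Instant};

// Saturating difference: zero if `earlier` is actually later than `later`.
fn saturating_since(later: Instant, earlier: Instant) -> Duration {
    later.checked_duration_since(earlier).unwrap_or(Duration::ZERO)
}
```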
I'm in favor of saturating (#88652 (comment)).

I would be in favor of saturating by default as well.

☔ The latest upstream changes (presumably #88652) made this pull request unmergeable. Please resolve the merge conflicts.
```rust
// Slow path: read the environment variable once, cache the result in an
// atomic, and return it.
let new = match env::var("RUST_CLOCK_ASSUME_MONOTONIC").as_deref() {
    Ok("1") => InstantReliability::AssumeMonotonic,
    Ok("0") => InstantReliability::AssumeBroken,
    Ok(_) => {
        eprintln!("unsupported value in RUST_CLOCK_ASSUME_MONOTONIC; using default");
        InstantReliability::Default
    }
    _ => InstantReliability::Default,
};
INSTANT_RELIABILITY.store(new as u8, Ordering::Relaxed);
new
} else {
```
Sorry, just passing by - should this be a nested `#[cold]` function with the outer function marked `#[inline]`? Just to reduce the best-case cost of `actually_monotonic`.
Although maybe this isn't relevant at all anymore? If so, never mind.
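For what it's worth, the suggested shape would look roughly like this (names and the sentinel value are placeholders, not the patch's actual code): the hot path stays a single inlinable atomic load, and the env-var parsing moves into a `#[cold]` function.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// In this sketch, 0 acts as the "not yet initialized" sentinel.
static INSTANT_RELIABILITY: AtomicU8 = AtomicU8::new(0);

#[inline]
fn instant_reliability() -> u8 {
    match INSTANT_RELIABILITY.load(Ordering::Relaxed) {
        0 => init_instant_reliability(),
        cached => cached,
    }
}

#[cold]
fn init_instant_reliability() -> u8 {
    // Placeholder for reading and parsing RUST_CLOCK_ASSUME_MONOTONIC.
    let value = 1;
    INSTANT_RELIABILITY.store(value, Ordering::Relaxed);
    value
}
```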
Closing in favor of #89926.
…mulacrum make `Instant::{duration_since, elapsed, sub}` saturating and remove workarounds

This removes all mutex/atomic-based workarounds for non-monotonic clocks and makes the previously panicking methods saturating instead. Additionally `saturating_duration_since` becomes deprecated since `duration_since` now fills that role.

Effectively this moves the fixup from `Instant` construction to the comparisons. This has some observable effects, especially on platforms without monotonic clocks:

* Incorrectly ordered Instant comparisons no longer panic in release mode. This could hide some programming errors, but since debug mode still panics, tests can still catch them.
* `checked_duration_since` will now return `None` in more cases. Previously it only happened when one compared instants obtained in the wrong order or manually created ones. Now it also does on backslides.
* Non-monotonic intervals will not be transitive, i.e. `b.duration_since(a) + c.duration_since(b) != c.duration_since(a)`.

The upsides are reduced complexity and lower overhead of `Instant::now`.

## Motivation

Currently we must choose between two poisons. One is high worst-case latency and jitter of `Instant::now()` due to explicit synchronization; see rust-lang#83093 for benchmarks, the worst-case overhead is > 100x. The other is sporadic panics on specific, rare combinations of CPU/hypervisor/operating system due to platform bugs.

Use-cases where low-overhead, fine-grained timestamps are needed - such as syscall tracing, performance profiles or sensor data acquisition (drone flight controllers were mentioned in a libs meeting) in multi-threaded programs - are negatively impacted by the synchronization. The panics are user-visible (program crashes), hard to reproduce and can be triggered by any dependency that might be using Instants for any reason.

A solution that is fast _and_ doesn't panic is desirable.

----

closes rust-lang#84448
closes rust-lang#86470
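As a quick usage-level illustration of the saturating behaviour described above, as it behaves on a current toolchain (only public API, deterministic by constructing the reversed pair explicitly):

```rust
use std::time::{Duration, Instant};

fn main() {
    let earlier = Instant::now();
    let later = earlier + Duration::from_millis(10);

    // Well-ordered comparison behaves as before.
    assert_eq!(later.duration_since(earlier), Duration::from_millis(10));

    // Reversed comparison saturates to zero instead of panicking...
    assert_eq!(earlier.duration_since(later), Duration::ZERO);
    // ...and the checked variant reports the reversal as `None`.
    assert_eq!(earlier.checked_duration_since(later), None);
}
```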
The environment variable is read on first use and cached in an atomic.
Various operating systems promise monotonic clocks, and so do hardware and various hypervisors, often with explicit flags such as `constant_tsc`. And if that flag is absent, they already force the timers to be monotonic through atomic operations (just as the standard library does). This tends to work on most systems.

But there's the occasional broken hardware or hypervisor that makes this promise and then doesn't deliver. That's why the standard library doesn't rely on the API guarantees in some cases (e.g. Windows). In other cases (e.g. x86 Linux) it does trust the OS guarantee, and then this gets broken because Rust trusts the OS, which trusts the hypervisor, which trusts the hardware, which is broken.

The result is that either we err on the side of caution and introduce cache contention in an operation that should be very fast and perfectly scalable even on systems that have reliable clocks, or we trust too much and our guarantees get violated on some fraction of systems.

With the environment variable we can offer a way out: we can make a default decision for a particular platform depending on how common broken hardware is, and then give users who encounter the opposite case a way to override it.
Questions: