CPU feature detection in core #3469

Amanieu · 2023-08-03T23:03:30Z

This RFC moves the is_*_feature_detected macros into core, but keeps the logic for actually doing feature detection (which requires OS support) in std.

programmerjake · 2023-08-04T02:19:12Z

I think it would be useful to have the is_*_feature_detected macros instead run code like:

use core::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

macro_rules! is_x86_feature_detected {
    ("sse") => { feature_detect(X86Feature::SSE) };
    ("sse2") => { feature_detect(X86Feature::SSE2) };
    ("avx") => { feature_detect(X86Feature::AVX) };
    ("avx2") => { feature_detect(X86Feature::AVX2) };
    // ...
}
#[repr(u16)]
pub enum X86Feature {
    // arbitrary order since I didn't bother to look it up,
    // we'd probably want to base these off `cpuid` bit positions
    // or something similar, since that makes `rust_cpu_feature_detect` much simpler
    SSE,
    SSE2,
    AVX,
    AVX2,
    // ... lots of elided features
    AVX512F, // assume this is the last one
}
impl X86Feature {
    pub const MAX: Self = Self::AVX512F; // assume this is the last one
}
#[derive(Copy, Clone, Eq, PartialEq, Debug, Default)]
pub struct X86Features(pub [usize; Self::ARRAY_SIZE]);
impl X86Features {
    pub const ARRAY_SIZE: usize = X86Feature::MAX as usize / usize::BITS as usize + 1;
}
extern "Rust" {
    // this should have some kind of weak linking or something
    fn rust_cpu_feature_detect() -> X86Features;
}
#[inline]
pub fn feature_detect(feature: X86Feature) -> bool {
    const Z: AtomicUsize = AtomicUsize::new(0);
    static CACHE: [AtomicUsize; X86Features::ARRAY_SIZE] = [Z; X86Features::ARRAY_SIZE];
    static CACHE_VALID: AtomicBool = AtomicBool::new(false);
    #[cold]
    fn fill_cache() {
        for (cache, v) in CACHE.iter().zip(&unsafe { rust_cpu_feature_detect() }.0) {
            cache.store(*v, Ordering::Relaxed);
        }
        CACHE_VALID.store(true, Ordering::Release);
    }
    // intentionally only use atomic store/load to avoid needing cmpxchg or similar for cpus without support for that
    if !CACHE_VALID.load(Ordering::Acquire) {
        fill_cache();
    }
    let index = feature as usize;
    let bit = index % usize::BITS as usize;
    let index = index / usize::BITS as usize;
    (CACHE[index].load(Ordering::Relaxed) >> bit) & 1 != 0
}

Lokathor · 2023-08-04T02:24:03Z

That's approximately how std_detect works already, could you explain what in particular is different about your example code?

programmerjake · 2023-08-04T02:26:49Z

> That's approximately how std_detect works already, could you explain what in particular is different about your example code?

it's not what @Amanieu proposed? in particular the code has the initialization driven by the threads calling is_*_feature_detected instead of depending on main or similar to fill it in and hope that's before you need it.

it also needs much less cache space for platforms without atomic fetch_or

Lokathor · 2023-08-04T02:28:41Z

Ah, well when you put it like that I see the difference :3

petrochenkov · 2023-08-04T03:04:08Z

text/0000-core_detect.md

+
+## Using a lang item to call back into `std`
+
+Instead of having `std` "push" the CPU features to `core` at initialization time, an alternative design would be for `core` to "pull" this information from `std` by calling a lang item defined in `std`. The problem with this approach is that it doesn't provide a clear path for how this would be exposed to no-std programs which want to do their own feature detection.


A lang item with two "grades", weak and strong: core defines a weak version of the lang item so users are not required to provide their own lang item definition, but anyone can override it with a strong version of the same lang item (libstd will do that as well).

(A lang item can have a stable alias like #[panic_handler].)

bjorn3 · 2023-08-10

"Weak lang items" is already used as the name for lang items like #[panic_handler], which need not be defined where they are used but do need to be defined at link time if they were used anywhere. Maybe use "preemptible lang items" as the name here?
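Neither weak nor "preemptible" lang items can be written in ordinary Rust today, so the following is only a rough stand-in for the idea: a default detector that anyone can override at run time through a function-pointer hook. Every name here (`DetectFn`, `set_detector`, `detect`) is hypothetical and not part of the RFC.

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical detector signature: returns a bitmask of detected features.
pub type DetectFn = fn() -> usize;

/// Plays the role of the "weak" definition that `core` would supply:
/// it reports no features at all.
fn default_detect() -> usize {
    0
}

// Holds the installed detector as a type-erased pointer; null means "use the default".
static DETECTOR: AtomicPtr<()> = AtomicPtr::new(core::ptr::null_mut());

/// Plays the role of the "strong" override: `std`, or a no_std program,
/// installs its own detector.
pub fn set_detector(f: DetectFn) {
    DETECTOR.store(f as *mut (), Ordering::Release);
}

/// What `core` would call to "pull" the feature mask.
pub fn detect() -> usize {
    let p = DETECTOR.load(Ordering::Acquire);
    if p.is_null() {
        default_detect()
    } else {
        // SAFETY: the only non-null value ever stored is a `DetectFn`
        // written by `set_detector`.
        let f: DetectFn = unsafe { core::mem::transmute::<*mut (), DetectFn>(p) };
        f()
    }
}
```

Unlike a real weak/strong symbol pair, this hook is resolved at run time rather than at link time, so it costs an extra atomic load and an indirect call on every query.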

Amanieu · 2023-08-04T13:44:42Z

> I think it would be useful to have the is_*_feature_detected macros instead run code like:

This is what I described in alternative designs here.

The main issues are:

- this requires a much larger API surface, which makes stabilization more difficult.
- this is actually slightly slower since it requires an atomic check that the CPU features have been initialized.

> it also needs much less cache space for platforms without atomic fetch_or

This could also be solved by requiring that mark_*_feature_as_detected is only called before multiple threads access the CPU features (since it's unsafe anyways). This works for static initializers, even for dynamic libraries, since they are executed before any other code.
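A minimal sketch of that "push" model, assuming a single-word cache and made-up bit positions. The names `mark_feature_as_detected` and `is_feature_detected` are simplified stand-ins for the per-architecture APIs in the RFC, not their real signatures.

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical bit assignments, chosen only for this sketch.
pub const SSE2: usize = 1 << 0;
pub const AVX: usize = 1 << 1;

static FEATURES: AtomicUsize = AtomicUsize::new(0);

/// Records `mask` as detected.
///
/// # Safety
/// The caller must guarantee that the features really are available and that
/// no other thread is reading or writing the cache yet (for example, because
/// this runs from a static initializer).
pub unsafe fn mark_feature_as_detected(mask: usize) {
    // Because of the single-threaded precondition, a plain load followed by a
    // store is enough; no atomic read-modify-write such as fetch_or is needed.
    let old = FEATURES.load(Ordering::Relaxed);
    FEATURES.store(old | mask, Ordering::Relaxed);
}

/// Query path: one relaxed load and a mask test, with no "initialized?" check.
#[inline]
pub fn is_feature_detected(mask: usize) -> bool {
    (FEATURES.load(Ordering::Relaxed) & mask) == mask
}
```

The query side is as cheap as it can be, at the price of pushing the ordering requirement onto whoever calls the `unsafe` function.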

programmerjake · 2023-08-05T01:00:35Z

> > I think it would be useful to have the is_*_feature_detected macros instead run code like:
>
> This is what I described in alternative designs here.
>
> The main issues are:
>
> - this requires a much larger API surface, which makes stabilization more difficult.

we don't have to have all that API surface stabilized, we can just stabilize the existence of rust_cpu_feature_detect and X86Features with just ARRAY_SIZE and the field .0 and stabilizing the location of each feature bit (copying cpuid locations would make that relatively uncontroversial), which imho is a rather small API surface for what it does.

> - this is actually slightly slower since it requires an atomic check that the CPU features have been initialized.

we'll likely want an atomic check anyway so we can be sure all the features are correctly synchronized with each other.
e.g. if we check is_aarch64_feature_detected!("v8.1a") followed by is_aarch64_feature_detected!("v8.2a") then, assuming the function computing the cpu features always returns the same value within each program run, !v8_1a && v8_2a should never be true. if we don't do an acquire-load from the same location that the function computing cpu features release-stored to after writing all features to memory, then we can end up with inconsistent features, since relaxed-loads from different locations can be reordered with each other.
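a small sketch of that release/acquire pairing, assuming the two features happen to live in different cache words. `WORD0`, `WORD1`, `READY`, `publish` and `snapshot` are hypothetical names, and the bit positions are made up:

```rust
use core::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static WORD0: AtomicUsize = AtomicUsize::new(0); // holds the hypothetical v8.1a bit
static WORD1: AtomicUsize = AtomicUsize::new(0); // holds the hypothetical v8.2a bit
static READY: AtomicBool = AtomicBool::new(false);

pub const V8_1A: usize = 1 << 0; // bit within WORD0
pub const V8_2A: usize = 1 << 0; // bit within WORD1

/// Writer: fill every word first, then publish with a release-store.
pub fn publish(word0: usize, word1: usize) {
    WORD0.store(word0, Ordering::Relaxed);
    WORD1.store(word1, Ordering::Relaxed);
    READY.store(true, Ordering::Release);
}

/// Reader: returns `None` until the data is published. Once the acquire-load
/// observes `true`, both relaxed loads are guaranteed to see the values
/// written before the release-store, so the words are mutually consistent.
pub fn snapshot() -> Option<(usize, usize)> {
    if !READY.load(Ordering::Acquire) {
        return None;
    }
    Some((WORD0.load(Ordering::Relaxed), WORD1.load(Ordering::Relaxed)))
}
```

without the `READY` flag, a reader could see the new `WORD1` and the old `WORD0`, which is exactly the !v8_1a && v8_2a case.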

> > it also needs much less cache space for platforms without atomic fetch_or
>
> This could also be solved by requiring that mark_*_feature_as_detected is only called before multiple threads access the CPU features (since it's unsafe anyways). This works for static initializers, even for dynamic libraries, since they are executed before any other code.

the problem with that is that users are often specifically trying to avoid static initializers due to ordering issues and stuff:
rust-lang/rust#111921

lyphyser · 2023-08-10T09:31:29Z

There's also the option of including all detection code in libcore, using raw system calls in assembly if OS interaction is needed.

It makes libcore OS-specific, but I think that might be fine as long as it doesn't use any library.

bjorn3 · 2023-08-10T09:34:20Z

Apart from Linux most OSes don't allow direct syscalls. Go tried, but had to revert back to using libc on most platforms after macOS changed the ABI of one syscall breaking every Go program in existence and OpenBSD added an exploit mitigation that makes your process crash if you try to do a syscall while outside of libc.

lyphyser · 2023-08-10T09:54:25Z

> Apart from Linux most OSes don't allow direct syscalls. Go tried, but had to revert back to using libc on most platforms after macOS changed the ABI of one syscall breaking every Go program in existence and OpenBSD added an exploit mitigation that makes your process crash if you try to do a syscall while outside of libc.

I guess on those systems libcore can just link the C library. After all, it seems that no useful program can exist on those systems without linking the C library (since without system calls, the only things a program can do are to loop infinitely, waste a number of CPU cycles and then crash, or try to exploit the CPU or kernel), so might as well link it from libcore.

There's also the issue that in some cases CPU feature detection needs to access the ELF auxiliary values, but argv and envp are passed to static initializers, and the ELF auxiliary entries are guaranteed to be after the environment pointers which are after the argument pointers (at least on x86-64, but I guess all ABIs are like that), so there should be no need to call getauxval to access them.
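As a rough illustration of walking from `envp` to the auxiliary vector, here is a sketch that treats the initial stack as an array of machine words. `find_auxv` is a hypothetical helper; `AT_HWCAP` (16) and `AT_NULL` (0) are the Linux key values. It assumes the layout described above, which should be checked per ABI before relying on it.

```rust
/// Linux auxiliary-vector key for the hardware-capability bitmask.
pub const AT_HWCAP: usize = 16;
/// Key that terminates the auxiliary vector.
pub const AT_NULL: usize = 0;

/// Skips the NULL-terminated `envp` array and scans the (key, value) pairs
/// that follow it, returning the value stored under `key` if present.
///
/// # Safety
/// `envp` must point at a NULL-terminated array of words that is immediately
/// followed by an auxiliary vector ending in an `AT_NULL` entry.
pub unsafe fn find_auxv(envp: *const usize, key: usize) -> Option<usize> {
    unsafe {
        let mut p = envp;
        // Skip the environment pointers, then step over their NULL terminator.
        while *p != 0 {
            p = p.add(1);
        }
        p = p.add(1);
        // Each auxv entry is two words: the key, then the value.
        while *p != AT_NULL {
            if *p == key {
                return Some(*p.add(1));
            }
            p = p.add(2);
        }
        None
    }
}
```

The function only reads memory the kernel already placed on the initial stack, so it needs no libc call, but it is only as safe as the layout assumption it is given.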

YurySolovyov · 2023-08-10T09:57:54Z

might be worth thinking about something like AVX10 which uses point versioning like 10.1, 10.2, etc.

the8472 · 2023-08-21T16:19:20Z

AVX10 will be more complicated than that because cores can have different capabilities. We currently have no way to represent those differences. We don't even have thread pinning in std which would be a prerequisite to make that work.

YurySolovyov · 2023-08-22T05:16:36Z

I thought the whole idea of AVX10 was to give cores identical capabilities 🤔

the8472 · 2023-08-22T14:22:32Z

> I thought the whole idea of AVX10 was to give cores identical capabilities 🤔

https://cdrdv2.intel.com/v1/dl/getContent/784267 Introduction section on page 14 as of revision 2.0 of the document

> The same instructions will be supported across all cores. But maximum register width may vary across cores. Specifically P-cores may support ZMM registers while E-cores may be limited to YMM.

And in the CPUID table on page 15 I'm not seeing anything to query the minimum set across all cores instead of the current CPU...

m-ou-se · 2024-07-30T15:28:10Z

@rfcbot merge

rfcbot · 2024-07-30T15:28:13Z

Team member @m-ou-se has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

joshtriplett · 2024-07-30T15:29:56Z

Could this RFC please add an unresolved question about the efficiency of having (say) a hundred of these at startup, and whether this can be reasonably optimized?

rfcbot · 2024-07-30T15:47:54Z

🔔 This is now entering its final comment period, as per the review above. 🔔

rfcbot · 2024-08-09T15:49:01Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

programmerjake · 2024-08-09T22:10:52Z

in unresolved questions, can you add the problem that some features imply other features (e.g. avx2 implies avx) and we should try to make updates synchronized so we don't see a transient state where avx2 is enabled but avx is disabled.

(I would have posted this on the tracking issue, but there isn't one yet...)

Lokathor · 2024-08-09T22:19:07Z

The implementation of feature detection by std sounds separate from if core-only code can call upon feature detection and expect other linked code to provide it.

programmerjake · 2024-08-09T23:52:18Z

> The implementation of feature detection by std sounds separate from if core-only code can call upon feature detection and expect other linked code to provide it.

If this is in reply to #3469 (comment), I meant that the proposed implementation is just reading from some atomics in core, so we should synchronize updates to those atomics to prevent transient inconsistent features.

clarfonthey · 2024-08-09T23:56:07Z

Feels like multiple problems could be solved at once if the features were represented using some kind of bitflags-type struct that was applied over a slice of atomics. Things like transitive features (avx2 -> avx) are pretty naturally representable this way, since you can just make each feature set the bits it affects. If you make these relations between features clear to the people who need to set them, they should in theory be able to do a minimum number of set calls, rather than say, setting 100 flags at once like people were worried about.
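A sketch of what that could look like, assuming a two-word cache and invented bit positions. `Feature`, `mark`, `is_detected` and the `OTHER` feature are hypothetical; only the avx2 -> avx implication comes from the discussion above.

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

/// Where a feature lives in the cache, plus every bit that setting it implies.
#[derive(Copy, Clone)]
pub struct Feature {
    pub word: usize,    // index of the atomic that holds this feature
    pub bit: usize,     // the feature's own bit
    pub implied: usize, // own bit plus the bits of all implied features
}

pub const AVX: Feature = Feature { word: 0, bit: 1 << 0, implied: 1 << 0 };
// avx2 implies avx, so its mask carries both bits.
pub const AVX2: Feature = Feature { word: 0, bit: 1 << 1, implied: (1 << 1) | AVX.implied };
// An unrelated, made-up feature stored in the second word.
pub const OTHER: Feature = Feature { word: 1, bit: 1 << 0, implied: 1 << 0 };

static CACHE: [AtomicUsize; 2] = [AtomicUsize::new(0), AtomicUsize::new(0)];

/// One fetch_or sets the feature together with everything it implies, so no
/// reader can observe avx2 without avx.
pub fn mark(f: Feature) {
    CACHE[f.word].fetch_or(f.implied, Ordering::Relaxed);
}

pub fn is_detected(f: Feature) -> bool {
    (CACHE[f.word].load(Ordering::Relaxed) & f.bit) != 0
}
```

The single-`fetch_or` guarantee only holds while a feature and everything it implies share one word, which constrains how bits are assigned; it also relies on an atomic `fetch_or`, which not every target provides.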

Lokathor · 2024-08-10T00:14:24Z

@programmerjake yes that was my intention.

My point is that this RFC is extending an existing and stable part of std to also work in core, and that's all it's doing. Any concerns about how the std detection system works should be tracked separately in other issues or rfcs or whatever appropriate location.

programmerjake · 2024-08-10T00:36:33Z

> @programmerjake yes that was my intention.
>
> My point is that this RFC is extending an existing and stable part of std to also work in core, and that's all it's doing.

no, it's also adding the core::arch::mark_*_feature_as_detected APIs...my concern is about how those are implemented, since if they just naively fetch_or the one bit for each feature and don't think about which order to set the features in, it's very easy to end up reading avx2 as true since it happened to be set first but avx as false since that was set slightly later.

clarfonthey · 2024-08-10T21:16:30Z

> > @programmerjake yes that was my intention.
> >
> > My point is that this RFC is extending an existing and stable part of std to also work in core, and that's all it's doing.
>
> no, it's also adding the core::arch::mark_*_feature_as_detected APIs...my concern is about how those are implemented, since if they just naively fetch_or the one bit for each feature and don't think about which order to set the features in, it's very easy to end up reading avx2 as true since it happened to be set first but avx as false since that was set slightly later.

Right, which is why I mentioned bitfields. If your atomics are integers instead of bools, then you can just set both bits at once and it works as expected.

Of course, this does require explicitly specifying these interdependencies between features, but I'd argue we should be doing that even without this RFC. It doesn't make sense for avx to be ever unavailable when avx2 is available, and I'd imagine many people already rely on this.

Add RFC for core_detect

2ae475b

ehuss added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 4, 2023

petrochenkov reviewed Aug 4, 2023

konsumlamm reviewed Aug 4, 2023

Apply suggestions from code review

57e0c0b

Co-authored-by: konsumlamm <44230978+konsumlamm@users.noreply.github.com>

Noratrieb reviewed Aug 22, 2023

Amanieu added the I-libs-api-nominated Indicates that an issue has been nominated for prioritizing at the next libs-api team meeting. label Jul 23, 2024

Update text/0000-core_detect.md

5a46a19

Co-authored-by: Nilstrieb <48135649+Nilstrieb@users.noreply.github.com>

rfcbot added proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. labels Jul 30, 2024

rfcbot added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Jul 30, 2024

rfcbot removed the proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. label Jul 30, 2024

Add concern about performance.

6f74c8b

Amanieu removed the I-libs-api-nominated Indicates that an issue has been nominated for prioritizing at the next libs-api team meeting. label Aug 6, 2024

rfcbot added finished-final-comment-period The final comment period is finished for this RFC. and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Aug 9, 2024

rfcbot added the to-announce label Aug 9, 2024

hanna-kruppe mentioned this pull request Oct 15, 2024

Simple seedable insecure random number generation, stable across Rust versions rust-lang/libs-team#394

Closed
