CAS Atomic is not NUMA aware #97

Open
SoilRos opened this issue Dec 12, 2023 · 3 comments

@SoilRos

SoilRos commented Dec 12, 2023

I just measured my machine (dual-socket AMD EPYC Milan) and got results very similar to https://github.com/nviennot/core-to-core-latency#dual-amd-epyc-7r13-48-cores-milan-3rd-gen-2021-q1.

However, I was not happy with the asymmetric result in the dual-socket case, so I re-implemented the CAS benchmark (in C++, since I don't know Rust). I found that the strange behavior is just an artifact of the atomic variable being stored in one NUMA domain (first-touch policy) but used in another. The solution is either to create a new atomic variable on every new cycle, or to move the page containing the atomic to the NUMA domain of the ping/pong threads.
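As a rough illustration of the first option, here is a minimal C++ sketch (not the code used for the measurements in this issue). It assumes a Linux-like system where an anonymous `mmap` page is physically allocated on first write, so the ping thread performing that first write places the page in its own NUMA domain. The name `ping_pong_first_touch` is hypothetical, and thread pinning and timing are left out to keep it short.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>
#include <thread>
#include <sys/mman.h>  // mmap, munmap (POSIX; MAP_ANONYMOUS assumed available)
#include <unistd.h>    // sysconf

constexpr int PING = 0;
constexpr int PONG = 1;

// Hypothetical helper: runs `rounds` CAS round trips on an atomic that is
// first touched by the ping thread. Returns the number of completed round
// trips, or -1 if the page could not be mapped.
int ping_pong_first_touch(int rounds) {
    assert(rounds >= 0);

    // Map a fresh page but do NOT write to it here: under a first-touch
    // policy the physical page is placed when it is first written.
    const std::size_t page_size =
        static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* page = mmap(nullptr, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return -1;

    std::atomic<std::atomic<int>*> flag_ptr{nullptr};
    int completed = 0;

    std::thread ping([&] {
        // A real benchmark would pin this thread to core A before this line.
        // Constructing the atomic is the first write to the page.
        auto* flag = new (page) std::atomic<int>(PING);
        flag_ptr.store(flag, std::memory_order_release);
        for (int i = 0; i < rounds; ++i) {
            int expected = PONG;
            // Wait for the pong thread's write, then flip it back.
            while (!flag->compare_exchange_weak(expected, PING,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
                expected = PONG;
            }
            ++completed;
        }
    });

    std::thread pong([&] {
        // A real benchmark would pin this thread to core B.
        std::atomic<int>* flag = nullptr;
        while ((flag = flag_ptr.load(std::memory_order_acquire)) == nullptr) {
        }
        for (int i = 0; i < rounds; ++i) {
            int expected = PING;
            while (!flag->compare_exchange_weak(expected, PONG,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
                expected = PING;
            }
        }
    });

    ping.join();
    pong.join();
    munmap(page, page_size);  // a fresh page is mapped for the next core pair
    return completed;
}
```

Calling this once per core pair gives every measurement its own freshly placed atomic. The second option, migrating an existing page, would need an OS-specific call and is not shown.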

@nviennot
Owner

It's actually a feature, not a bug!

Granted, the location of the variable is completely arbitrary, so it would make sense to sample a bunch of locations.

But if we did sample all of them, how would we show the results in a meaningful way? I'm not sure how to do this correctly. We could easily lose the visual representation that the farther a core is from that variable, the higher the latency.

I'm open to suggestions. What do you think would make things better?

At the very least, we could provide an offset (or seed) to move the location of the variable around and see the effect.

@SoilRos
Author

SoilRos commented Dec 12, 2023

A result that depends on how the OS gods woke up today seems like a bug to me ;-) For instance, my picture was blue in the second socket because the process seems to have started there. Tomorrow, when I restart the machine or install another OS, I may get a different picture.

Now, your alternative sounds interesting if one wanted to measure the latency to the farthest main memory. But the name of this repository implies that we want to measure how long it takes a core A to send a message to, or receive one from, a core B. So I would (perhaps naively) assume that the message is owned by one of the two cores, A or B, and not by an arbitrary third core C (which in this case happened to be where the program started).

Naturally, this is the point of view of someone used to fine-tuning the placement of variables in memory because it yields better performance. But I understand that on other OSs, such as Darwin (macOS), you cannot even do this: they force you to interleave memory, or do not even allow you to pin threads.

@SoilRos
Author

SoilRos commented Dec 12, 2023

Dual Socket AMD EPYC 7713

Arbitrary placement

[Image: core-to-core-latency-milan(1)]

Core A (ping) placement

[Image: core-to-core-latency-milan(2)]
