
Investigate 128-bit CUDA concurrency issues on certain chips like the AD102 or GA104 when shared RAM is not properly isolated #31

obriensystems opened this issue Jan 11, 2025 · 3 comments
obriensystems commented Jan 11, 2025

See #12.

The issue may be that the per-thread max copy was outside the if check on the thread index:
https://github.com/ObrienlabsDev/performance/blob/main/gpu/nvidia/cuda/cpp/128bit/collatz_cuda/kernel_collatz.cu#L100

```diff
     if (threadIndex < threads) {
 ...
             } while (!((current0 == 1ULL) && (current1 == 0ULL)));
+            // move max copy inside the thread if check (to avoid concurrency issues)
+            _output0[threadIndex] = max0;
+            _output1[threadIndex] = max1;
     }
-            _output0[threadIndex] = max0;
-            _output1[threadIndex] = max1;
```
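For context, a minimal self-contained sketch of the hazard (hypothetical names, not the actual kernel_collatz.cu code): the launch grid is typically padded up to a multiple of the block size, so threads with `threadIndex >= threads` still run, and any global store they make outside the bounds check writes past the end of the output arrays.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reduced illustration of the guard pattern from the fix above.
// 'threads' is the logical problem size; the grid is padded up, so some
// threads have threadIndex >= threads and must not touch the outputs.
__global__ void collatzMaxKernel(unsigned long long* output0,
                                 unsigned long long* output1,
                                 int threads) {
    int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long max0 = 0ULL, max1 = 0ULL;
    if (threadIndex < threads) {
        // ... per-thread 128-bit Collatz iteration updating (max1:max0) ...
        output0[threadIndex] = max0;  // safe: guarded by the bounds check
        output1[threadIndex] = max1;
    }
    // A store to output0[threadIndex] here, outside the guard, also runs in
    // the padded threads and writes past the end of the output arrays.
}

int main() {
    const int threads = 1000;                                    // logical work items
    const int blockSize = 256;
    const int gridSize = (threads + blockSize - 1) / blockSize;  // padded: 4 * 256 = 1024 threads
    unsigned long long *d0, *d1;
    cudaMalloc(&d0, threads * sizeof(unsigned long long));
    cudaMalloc(&d1, threads * sizeof(unsigned long long));
    collatzMaxKernel<<<gridSize, blockSize>>>(d0, d1, threads);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    cudaFree(d0);
    cudaFree(d1);
    return 0;
}
```

The padded threads do no useful work, but with the stores outside the guard they would still execute the two writes, which could explain occasional corruption of adjacent device memory.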
@obriensystems
Copy link
Member Author

obriensystems commented Jan 12, 2025

Still occurring on an RTX-A4000:

```
GPU01:Sec: 4411 path: 650 GlobalMax: 0:871673828443: 21714:6140004720918243904 last search: 871673886723
GPU01:Sec: 6894 path: 330 GlobalMax: 0:1356612632585: 1073741824:6867851452468 last search: 1356612669443
```

The high word should be 41764:…, not 1073741824.

RUN 2

```
GPU01:Sec: 4723 path: 650 GlobalMax: 0:871673828443: 21714:6140004720918243904 last search: 871673886723
GPU01:Sec: 14641 path: 1029 GlobalMax: 0:2674309547647: 41764:10130355336659361648 last search: 2674309550083
```

The record below shows the corrupted high word again (9223372036854775808 = 2^63):

```
GPU01:Sec: 19707 path: 334 GlobalMax: 0:3588713153587: 9223372036854775808:16149209191144 last search: 3588713175043
```
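Worth noting for triage (a host-side sketch, not part of the repository): the bad high words in these runs are exact powers of two, i.e. a single spurious bit, which points at bit-level corruption rather than a 128-bit arithmetic error.

```cuda
#include <cstdio>

int main() {
    // Suspect high words copied from the logs above.
    unsigned long long suspects[] = {
        1073741824ULL,          // run 1, seed 1356612632585
        9223372036854775808ULL, // run 2, seed 3588713153587
    };
    for (unsigned long long x : suspects) {
        // A power of two has exactly one set bit: x & (x - 1) == 0.
        bool singleBit = (x != 0) && ((x & (x - 1)) == 0);
        int bit = 0;
        while ((x >> bit) != 1ULL) ++bit; // index of the highest set bit
        printf("%llu -> single set bit: %s (bit %d of the high word, bit %d of 128)\n",
               x, singleBit ? "yes" : "no", bit, 64 + bit);
    }
    return 0;
}
```

Both values are single set bits, and the run 2 value lands on bit 127, matching the "right to the max 127th bit" observation in the next comment.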

obriensystems commented

RTX-A4000

  • right to the max 127th bit (high word 9223372036854775808 = 2^63) - will be easier to triage; see the CPU cross-check after the log
```
GPU01:Sec: 243 path: 555 GlobalMax: 0:45871962271: 4:8554672607184627540 last search: 45871974403
GPU01:Sec: 273 path: 770 GlobalMax: 0:51739336447: 6:3959152699356688744 last search: 51739340803
GPU01:Sec: 304 path: 222 GlobalMax: 0:57321458927: 9223372036854775808:3305444168212 last search: 57321492483
GPU00:Sec: 3228 path: 365 GlobalMax: 0:595778435027 : 9223372036854775808:12883345023220 last search: 595778498563
GPU00:Sec: 7353 path: 374 GlobalMax: 0:1348118669267 : 9223372036854775808:12956552620720 last search: 1348118732803
GPU00:Sec: 24326 path: 311 GlobalMax: 0:4416404150227 : 9223372036854775808:63668029459696 last search: 4416404213763
GPU00:Sec: 39226 path: 304 GlobalMax: 0:7091587479507 : 9223372036854775808:153351250882900 last search: 7091587543043
```
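As a triage aid, a self-contained CPU cross-check (hypothetical, not from the repository; the path-count convention may differ from the kernel's) can recompute the 128-bit path maximum for a suspect seed with unsigned __int128, so the GPU's high word can be verified independently:

```cuda
#include <cstdio>

int main() {
    unsigned long long seed = 45871962271ULL; // first seed from the log above
    unsigned __int128 n = seed, max = seed;
    long path = 0;
    while (n != 1) {
        // Standard Collatz step in 128-bit arithmetic (gcc/clang extension).
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        if (n > max) max = n;
        ++path;
    }
    unsigned long long high = (unsigned long long)(max >> 64);
    unsigned long long low  = (unsigned long long)max;
    // The log format above is high:low; the GPU log reports
    // max 4:8554672607184627540 for this seed, so compare against that.
    printf("seed %llu path %ld max %llu:%llu\n", seed, path, high, low);
    return 0;
}
```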

obriensystems commented

On a dual RTX-4090 consumer system I built, running 24/7 for 3 days at 25% TDP, I started to see the overflow issue again.
However, an RTX-A6000 professional card has been running with no problems for a week.

