Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault on copy while setting ZGEMM_DEFAULT_Q and ZGEMM_DEFAULT_P to 2048 #2150

Closed
quickwritereader opened this issue Jun 3, 2019 · 12 comments

Comments

@quickwritereader
Copy link
Contributor

How to reproduce:
I set ZGEMM_DEFAULT_Q and ZGEMM_DEFAULT_P to 2048

sed -i -e "3040s/#define[ \t]\+ZGEMM_DEFAULT_Q[ \t]\+[0-9a-z_]\+/#define ZGEMM_DEFAULT_Q 2048/g" param.h
sed -i -e "3026s/#define[ \t]\+ZGEMM_DEFAULT_P[ \t]\+[0-9a-z_]\+/#define ZGEMM_DEFAULT_P 2048/g" param.h
make TARGET=GENERIC DEBUG=1 NO_LAPACK=1

It should segfault if it is built for generic. if it passes while building for other targets then try to run it against huge K from 1900-2200 on benchmark/zgemm.goto

Here is seg on generic build
#0 0x7F8A0CAEBE08
#1 0x7F8A0CAEAF90
#2 0x7F8A0C2054AF
#3 0x49E971 in zgemm_oncopy at zgemm_ncopy_2.c:173 (discriminator 1)
#4 0x41EEC3 in zgemm_nn at level3.c:346 (discriminator 2)
#5 0x41C419 in zgemm_ at gemm.c:430
#6 0x40C831 in zchk1_ at zblat3.f:544
#7 0x41B6C4 in MAIN__ at zblat3.f:290

@quickwritereader quickwritereader changed the title Segfault on copy Segfault on copy while setting ZGEMM_DEFAULT_Q and ZGEMM_DEFAULT_P to 2048 Jun 3, 2019
@quickwritereader
Copy link
Contributor Author

for now, I will postpone fixing it. just opened issue

@brada4
Copy link
Contributor

brada4 commented Jun 3, 2019

Would be nice to see syscall made from ncopy.c
Adresses 0-2 say it was there, while code in that file says there was none.

@martin-frbg
Copy link
Collaborator

Probably need valgrind to look at this. (P and Q tend to be small by default, perhaps larger values lead to an overflow somewhere)

@brada4
Copy link
Contributor

brada4 commented Jun 3, 2019

somewhere

That syscall comes out of nowhere in a source file that has no syscalls....

        ctemp1  = *(a_offset + 0);
        ctemp2  = *(a_offset + 1);
-> here the syscall....
        *(b_offset + 0) = ctemp1;
        *(b_offset + 1) = ctemp2;

@brada4
Copy link
Contributor

brada4 commented Jun 3, 2019

Does not repeat for me. ./zgemm.goto 55 4400 55
@quickwritereader did you run "make clean" after patching param.h so that benchmark directory is clean?

@quickwritereader
Copy link
Contributor Author

quickwritereader commented Jun 3, 2019

for generic it segfaulted while building, for power9 it did while running, try running with step 1
I did for generic after clone so benchmark was already clean

@brada4
Copy link
Contributor

brada4 commented Jun 3, 2019

1900 2200 3 - no crash, valgrind is against mprotecting odd-sized allocations though.

@quickwritereader
Copy link
Contributor Author

quickwritereader commented Jun 3, 2019

My bad, I forgot to state
export OMP_NUM_THREADS=1
Then build
make TARGET=GENERIC DEBUG=1 NO_LAPACK=1

@martin-frbg
Copy link
Collaborator

The zblat3 crash is reproducible on (multithreaded) Kaby Lake with (at least) gcc 4.8.5, valgrind reports an "invalid write of size 8" at that location but no earlier errors.

@brada4
Copy link
Contributor

brada4 commented Jun 4, 2019

gcc7 and sandybridge - no problem.

@quickwritereader
Copy link
Contributor Author

quickwritereader commented Jun 5, 2019

Ok so far everything is ok. just buffer-size smaller. that's all.
I thought it is calculated based on those numbers.
Here is where things start not working:
./driver/others/parameter.c
zgemm_r = (((BUFFER_SIZE - ((ZGEMM_P * ZGEMM_Q * 16 + GEMM_OFFSET_A + GEMM_ALIGN) & ~GEMM_ALIGN)) / (ZGEMM_Q * 16)) - 15) & ~15;

@martin-frbg
Copy link
Collaborator

Well, it is probably safe to assume that one cannot make these parameters arbitrarily large and still expect things to work - BUFFER_SIZE itself is a somewhat dubious "magic" number that is defined in the common_XX.h for each architecture. (If we had a real "Developer manual" in the wiki, the interdependencies of these values would be stated there I guess... as far as I remember none of K.Goto's publications touched on these details of the implementation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants