Testing against other blas version #19

Open
MaximumProgrammer opened this issue Feb 8, 2018 · 18 comments

Comments

@MaximumProgrammer

Hello,
I can't figure out how to test against other libraries such as OpenBLAS. Where should I add the corresponding references, for example in CMake?

Best regards

@MaximumProgrammer
Author

OK, I found it: I have to change the link paths in the Makefile in the test_problems directory, and I also have to set BLASFEO_TESTING = 1. But then I get this error:

[ 84%] Built target blasfeo
[ 86%] Building C object test_problems/CMakeFiles/s_blas.dir/test_s_blas.c.o
[ 88%] Linking C executable s_blas
[ 88%] Built target s_blas
Scanning dependencies of target s_aux
[ 90%] Building C object test_problems/CMakeFiles/s_aux.dir/test_s_aux.c.o
[ 92%] Linking C executable s_aux
../libblasfeo.a(s_aux_ext_dep_lib4.c.o): In function `PRINT_TO_STRING_TRAN_STRVEC':
s_aux_ext_dep_lib4.c:(.text+0x9d8): multiple definition of `PRINT_TO_STRING_TRAN_STRVEC'
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o):s_aux_ext_dep_libref.c:(.text+0x1a0): first defined here
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o): In function `PRINT_TO_STRING_STRMAT':
s_aux_ext_dep_libref.c:(.text+0x164): undefined reference to `PRINT_TO_STRING_MAT'
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o): In function `PRINT_TO_STRING_TRAN_STRVEC':
s_aux_ext_dep_libref.c:(.text+0x1b4): undefined reference to `PRINT_TO_STRING_MAT'
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/s_aux.dir/build.make:96: recipe for target 'test_problems/s_aux' failed
make[2]: *** [test_problems/s_aux] Error 1
CMakeFiles/Makefile2:203: recipe for target 'test_problems/CMakeFiles/s_aux.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/s_aux.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
nvidia@tegra-ubuntu:/USB_Drive/TX2_Programs/blasfeo/build$ sudo cmake-gui
QXcbConnection: XCB error: 145 (Unknown), sequence: 164, resource id: 0, major code: 139 (Unknown), minor code: 20

@roversch
Contributor

This build error should be fixed by bf6f17d. Could you check again, please?

@MaximumProgrammer
Author

OK, thanks, I am going to check it.

@MaximumProgrammer
Author

Afterwards I get this error:
[ 86%] Built target blasfeo
[ 86%] Linking C executable s_blas
/usr/bin/ld: cannot open output file s_blas: Permission denied
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/s_blas.dir/build.make:95: recipe for target 'test_problems/s_blas' failed
make[2]: *** [test_problems/s_blas] Error 1
CMakeFiles/Makefile2:128: recipe for target 'test_problems/CMakeFiles/s_blas.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/s_blas.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

and

[ 96%] Linking C executable d_blas
CMakeFiles/d_blas.dir/test_d_blas.c.o: In function `main':
test_d_blas.c:(.text.startup+0x9c): undefined reference to `openblas_set_num_threads'
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/d_blas.dir/build.make:95: recipe for target 'test_problems/d_blas' failed
make[2]: *** [test_problems/d_blas] Error 1
CMakeFiles/Makefile2:202: recipe for target 'test_problems/CMakeFiles/d_blas.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/d_blas.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I guess -lopenblas is missing in link.txt, because the link command should be:

/usr/bin/cc -O2 -fPIC -DLA=HIGH_PERFORMANCE -DTARGET=ARMV8A_ARM_CORTEX_A57 -DLA_HIGH_PERFORMANCE -DEXT_DEP -DOS_LINUX -DREF_BLAS_OPENBLAS -I/opt/openblas/include -DTARGET_ARMV8A_ARM_CORTEX_A57 -march=armv8-a+crc+crypto+fp+simd CMakeFiles/d_blas.dir/test_d_blas.c.o -o d_blas -rdynamic ../libblasfeo.a -lm -lopenblas

With that change it compiles.

Now when I run the test I get this output:

BLAS performance test - float precision

Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).

Testing BLAS version for VFPv4 instruction set, 32 bit (optimized for ARM Cortex A15): theoretical peak 26.4 Gflops

n sgemm_blasfeo sgemm_blas

n Gflops % Gflops %

4 0.22 0.83 inf inf
8 0.85 3.21 inf inf
12 1.71 6.47 inf inf
16 2.68 10.16 inf inf
20 3.08 11.66 inf inf
24 4.00 15.16 inf inf
28 4.92 18.65 inf inf
32 2.60 9.86 inf inf
36 2.75 10.42 inf inf
40 3.18 12.05 inf inf
44 3.54 13.41 inf inf
48 3.72 14.09 inf inf
52 3.52 13.35 inf inf
56 3.50 13.25 inf inf
60 3.90 14.79 inf inf
64 3.96 15.01 inf inf
68 3.90 14.77 inf inf
72 4.25 16.10 inf inf
76 4.46 16.90 inf inf
80 4.36 16.50 inf inf
84 4.08 15.44 inf inf
88 4.55 17.24 inf inf
92 4.62 17.48 inf inf
96 4.63 17.55 inf inf
100 4.55 17.25 inf inf
104 4.63 17.52 inf inf
108 4.71 17.86 inf inf
112 4.74 17.94 inf inf
116 4.58 17.35 inf inf
120 4.75 17.99 inf inf
124 4.88 18.47 inf inf
128 5.01 18.96 inf inf
132 4.87 18.46 inf inf
136 5.03 19.04 inf inf
140 4.88 18.48 inf inf
144 4.95 18.73 inf inf
148 5.09 19.27 inf inf
152 5.14 19.46 inf inf
156 4.99 18.91 inf inf
160 5.03 19.06 inf inf
164 5.10 19.32 inf inf
168 4.98 18.86 inf inf
172 5.44 20.62 inf inf
176 5.10 19.32 inf inf
180 5.31 20.10 inf inf
184 5.36 20.30 inf inf

Best regards.

@MaximumProgrammer
Author

I guess it should be possible to change lines like these

ifeq ($(REF_BLAS), OPENBLAS)
LIBS += /opt/openblas/lib/libopenblas.a -pthread -lgfortran -lm
endif

ifeq ($(REF_BLAS), BLIS)
LIBS += /opt/netlib/liblapack.a /opt/blis/lib/libblis.a -lgfortran -lm -fopenmp
endif

ifeq ($(REF_BLAS), NETLIB)
LIBS += /opt/netlib/liblapack.a /opt/netlib/libblas.a -lgfortran -lm
endif

ifeq ($(REF_BLAS), MKL)
LIBS += -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_sequential.a -Wl,--end-group -ldl -lpthread -lm
endif

ifeq ($(REF_BLAS), ATLAS)
LIBS += /opt/atlas/lib/liblapack.a /opt/atlas/lib/libcblas.a /opt/atlas/lib/libf77blas.a /opt/atlas/lib/libatlas.a -lgfortran -lm
endif

in the Makefile in test_problems.

Best regards.

@tmmsartor
Contributor

tmmsartor commented Feb 22, 2018

I know the distinction is currently very blurry, but BLASFEO_TESTING = 1 is for testing,
while I guess you want to benchmark/compare BLASFEO against OpenBLAS or others.

In any case you are right, the CMakeLists.txt was outdated; I should have fixed the problem with #25.

If you clone that branch you can run e.g. cmake -DBLASFEO_BENCHMARKS=ON -DREF_BLAS=OPENBLAS to test against OpenBLAS.

It would be great if you could test this on your system.

@MaximumProgrammer
Author

Not really, the best thing would be to control most of the variables from cmake or cmake-gui.

So the last bugs are fixed, but if I do that I get this output:

sudo cmake -DBLASFEO_BENCHMARKS=ON -DREF_BLAS=OPENBLAS
-- The C compiler identification is GNU 5.4.0
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:

BLASFEO_BENCHMARKS

Best regards.

@tmmsartor
Contributor

tmmsartor commented Feb 23, 2018

Hi, but did you pull my branch?

git remote add tmmsartor https://github.com/tmmsartor/blasfeo.git
git fetch tmmsartor
git checkout cmake_benchmarks

I also tested on an ARM core (A53) against OpenBLAS and it works.

@MaximumProgrammer
Author

Ok here we go:

BLAS performance test - double precision

Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).

Testing BLAS version for NEONv2 instruction set, 64 bit (optimized for ARM Cortex A57): theoretical peak 13.2 Gflops

n dgemm_blasfeo dgemm_blas

n Gflops % Gflops %

4 0.07 0.54 0.02 0.17
8 0.29 2.19 0.07 0.56
12 0.46 3.48 0.12 0.88
16 0.77 5.80 0.14 1.03
20 0.91 6.88 0.16 1.20
24 1.18 8.97 0.18 1.33
28 1.27 9.64 0.19 1.44
32 1.48 11.24 0.20 1.55
36 1.55 11.73 0.34 2.57
40 1.64 12.43 0.42 3.15
44 1.74 13.15 0.43 3.26
48 1.83 13.89 0.51 3.88
52 1.87 14.20 0.49 3.72
56 1.95 14.77 0.59 4.45
60 1.98 14.99 0.56 4.25
64 2.07 15.68 0.66 5.02
68 2.03 15.41 0.61 4.65
72 2.11 15.97 0.70 5.27
76 2.14 16.19 0.65 4.96
80 2.17 16.42 0.75 5.72
84 2.18 16.54 0.71 5.38
88 2.33 17.68 0.79 5.95
92 2.25 17.07 0.75 5.69
96 2.27 17.16 0.85 6.44
100 2.28 17.29 0.79 5.98
104 2.31 17.50 0.87 6.56
108 2.32 17.60 0.83 6.29
112 2.35 17.77 0.91 6.91
116 2.33 17.68 0.87 6.57
120 2.36 17.91 0.92 6.97
124 2.42 18.31 0.86 6.55
128 2.34 17.72 0.98 7.39
132 2.39 18.11 1.04 7.88
136 2.39 18.11 1.13 8.55
140 2.42 18.35 1.05 7.96
144 2.40 18.19 1.17 8.87
148 2.41 18.23 1.11 8.41
152 2.42 18.33 1.19 9.04
156 2.43 18.38 1.11 8.42
160 2.40 18.21 1.28 9.67
164 2.42 18.32 1.20 9.05
168 2.45 18.55 1.29 9.80
172 2.46 18.60 1.21 9.19
176 2.46 18.62 1.31 9.96
180 2.46 18.65 1.25 9.46
184 2.49 18.85 1.34 10.15
188 2.48 18.80 1.26 9.56
192 2.48 18.81 1.42 10.78
196 2.49 18.87 1.34 10.13
200 2.52 19.08 1.42 10.78
204 2.51 19.05 1.33 10.09
208 2.52 19.13 1.43 10.85
212 2.52 19.13 1.37 10.38
216 2.54 19.25 1.45 10.96
220 2.54 19.23 1.36 10.32
224 2.54 19.21 1.52 11.52
228 2.55 19.31 1.44 10.88
232 2.56 19.41 1.53 11.59
236 2.56 19.41 1.44 10.91
240 2.57 19.47 1.54 11.63
244 2.58 19.54 1.48 11.22
248 2.59 19.61 1.56 11.80
252 2.60 19.66 1.47 11.12
256 2.58 19.55 1.61 12.19
260 2.60 19.67 1.53 11.60
264 2.60 19.73 1.62 12.30
268 2.60 19.73 1.53 11.62
272 2.60 19.73 1.63 12.36
276 2.61 19.79 1.58 11.97
280 2.62 19.89 1.66 12.57
284 2.62 19.87 1.57 11.88
288 2.61 19.78 1.72 13.00
292 2.63 19.92 1.63 12.33
296 2.64 19.96 1.71 12.97
300 2.64 19.98 1.62 12.27

I guess there is still something wrong, because this test was done on a Jetson TX2, which has about 1.5 Flops for single precision, so it should be only about half. https://www.aetina.com/products-detail.php?i=210

@MaximumProgrammer
Author

Second test for Nvidia Jetson TX2

BLAS performance test - float precision

Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).

Testing BLAS version for VFPv4 instruction set, 32 bit (optimized for ARM Cortex A15): theoretical peak 26.4 Gflops

n sgemm_blasfeo sgemm_blas

n Gflops % Gflops %

4 0.22 0.83 0.05 0.19
8 0.85 3.22 0.19 0.71
12 1.70 6.44 0.36 1.35
16 2.68 10.13 0.47 1.80
20 3.07 11.63 0.30 1.15
24 1.79 6.79 0.20 0.76
28 2.31 8.76 0.23 0.88
32 2.71 10.25 0.22 0.85
36 2.58 9.77 0.39 1.49
40 3.00 11.35 0.46 1.74
44 3.36 12.71 0.50 1.88
48 3.56 13.49 0.55 2.08
52 3.31 12.54 0.55 2.08
56 3.65 13.84 0.65 2.47
60 3.92 14.84 0.65 2.44
64 3.97 15.05 0.76 2.90
68 3.78 14.31 0.76 2.86
72 4.10 15.53 0.80 3.04
76 4.15 15.72 0.81 3.08
80 4.29 16.26 0.90 3.43
84 4.16 15.74 0.86 3.26
88 4.41 16.72 0.95 3.58
92 4.59 17.40 0.94 3.55
96 4.46 16.90 1.08 4.09
100 4.44 16.83 1.01 3.81
104 4.62 17.50 1.07 4.04
108 4.76 18.02 1.10 4.15
112 4.61 17.47 1.18 4.49
116 4.61 17.47 1.11 4.20
120 4.76 18.03 1.16 4.40
124 4.85 18.38 1.15 4.37
128 4.69 17.78 1.30 4.93
132 4.60 17.43 1.39 5.26
136 4.70 17.82 1.45 5.50
140 4.82 18.28 1.43 5.41
144 4.72 17.88 1.56 5.92
148 4.78 18.12 1.47 5.56
152 4.89 18.52 1.54 5.84
156 4.99 18.88 1.52 5.75
160 4.87 18.45 1.74 6.60
164 4.85 18.36 1.61 6.09
168 4.95 18.75 1.71 6.47
172 5.04 19.07 1.67 6.33
176 4.93 18.69 1.82 6.91
180 4.86 18.42 1.73 6.56
184 4.98 18.85 1.78 6.75
188 5.02 19.02 1.76 6.66
192 4.96 18.80 2.03 7.70
196 4.94 18.73 1.86 7.03
200 5.01 18.98 1.92 7.26
204 5.05 19.11 1.85 7.01
208 4.99 18.89 2.01 7.63
212 4.99 18.92 1.89 7.14
216 5.04 19.11 1.98 7.50
220 5.10 19.33 1.96 7.43
224 5.04 19.10 2.20 8.34
228 5.05 19.11 2.03 7.71
232 5.08 19.26 2.13 8.06
236 5.13 19.41 2.10 7.97
240 5.11 19.35 2.22 8.40
244 5.11 19.35 2.12 8.01
248 5.15 19.51 2.18 8.27
252 5.18 19.63 2.16 8.19
256 5.09 19.28 2.42 9.17
260 5.14 19.47 2.27 8.58
264 5.19 19.65 2.32 8.77
268 5.22 19.76 2.27 8.59
272 5.19 19.66 2.42 9.16
276 5.21 19.73 2.29 8.66
280 5.24 19.84 2.35 8.90
284 5.26 19.92 2.32 8.80
288 5.21 19.75 2.55 9.66
292 5.24 19.84 2.39 9.06
296 5.27 19.96 2.48 9.38
300 5.30 20.06 2.44 9.25

@MaximumProgrammer
Author

Best regards and thank you.

@giaf
Owner

giaf commented Mar 3, 2018

Hey,

first of all, which cores of the TX2 are you running on? ARM Cortex A57 or Denver? If Denver, the code is not optimized for that, I have no clue what the architecture is.

Then, you need to set the processor frequency by hand to get meaningful percentages w.r.t. the theoretical maximum (e.g. it should be 2.0 GHz for the A57). This is done in the file test_param.h, as reported in your printout above.

Also, you need to choose by hand the routine you want to benchmark and the corresponding number of flops.

Last point: the A57 at 2.0 GHz has a theoretical peak of 8 (16) Gflops in double (single) precision.
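
To illustrate the frequency point: the printouts above report peaks of 13.2 and 26.4 Gflops at 3.3 GHz, which implies 4 and 8 flops per cycle, and at 2.0 GHz that gives the 8 and 16 Gflops quoted here. A minimal Python sketch of that arithmetic follows; the helper names are made up for illustration and are not part of BLASFEO, and the measured values are copied from the double-precision table in this thread.

```python
# Hypothetical helpers (not part of BLASFEO): recompute the "% of peak"
# column of the benchmark printout for a different clock frequency.

def flops_per_cycle(peak_gflops, freq_ghz):
    """Flops per cycle implied by a printed theoretical peak."""
    return peak_gflops / freq_ghz


def percent_of_peak(gflops, freq_ghz, per_cycle):
    """Measured Gflops as a percentage of the theoretical peak."""
    return 100.0 * gflops / (freq_ghz * per_cycle)


# Peaks printed by the benchmark with the default 3.3 GHz setting.
dp_per_cycle = flops_per_cycle(13.2, 3.3)  # double precision, about 4
sp_per_cycle = flops_per_cycle(26.4, 3.3)  # single precision, about 8

# dgemm_blasfeo at n = 300 was measured at 2.64 Gflops.
at_default = percent_of_peak(2.64, 3.3, dp_per_cycle)  # frequency as printed
at_a57 = percent_of_peak(2.64, 2.0, dp_per_cycle)      # 2.0 GHz for the A57

print(f"peak at 2.0 GHz: {2.0 * dp_per_cycle:.1f} (double), "
      f"{2.0 * sp_per_cycle:.1f} (single) Gflops")
print(f"n=300 dgemm_blasfeo: {at_default:.1f}% at 3.3 GHz, "
      f"{at_a57:.1f}% at 2.0 GHz")
```

With the frequency set to 2.0 GHz, the same 2.64 Gflops corresponds to roughly a third of the theoretical peak rather than a fifth. The table shows 19.98 instead of 20.0 only because the Gflops column is rounded to two decimals.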

@giaf
Owner

giaf commented Mar 3, 2018

Please also note that, for the ARM Cortex A57 target in BLASFEO, not all routines have been optimized yet. E.g. dgemm_nt is fully optimized, but dgemm_nn is not: it simply falls back to the GENERIC target.

You can check out the source code in the folder kernels/armv8a to see which kernels have already been optimized in assembly for the target architecture.

@RoyiAvital

Could you please specify the MKL version in your tests?
Also, could you use MKL_DIRECT_CALL for the tests?

@giaf
Owner

giaf commented Sep 13, 2019

In the make build system (which is the recommended one), you can specify the path to the installation folder of your chosen MKL version here https://github.com/giaf/blasfeo/blob/master/Makefile.external_blas#L56

When you choose MKL as external BLAS, MKL_DIRECT_CALL_SEQ (for the single-threaded library version) is always set by default, as you can see here: https://github.com/giaf/blasfeo/blob/master/Makefile.rule#L409
If you want to use the parallel version and MKL_DIRECT_CALL, just edit that line accordingly.

@RoyiAvital

RoyiAvital commented Sep 13, 2019

I was talking about the performance graphs in the project website.
I now understand they all use MKL_DIRECT_CALL_SEQ. Yet the MKL version isn't specified.

By the way, it is amazing to see how good the performance is. Bravo!

@giaf
Owner

giaf commented Sep 13, 2019

MKL is version 2019.1.144. The other BLAS implementations are from about the same time.
We should update them with more recent versions; BLASFEO performance has also improved for many routines in the meantime.

@tmmsartor we should add all BLAS versions in there.

@RoyiAvital

RoyiAvital commented May 5, 2020

Could the performance of MKL with multithreading be added as well (using -DMKL_DIRECT_CALL and not only -DMKL_DIRECT_CALL_SEQ)?
It would be interesting to see, since the performance figures on Intel's site suggest that multithreading should help even for those sizes.
