About OpenBLAS #27

Closed · nickwxwu opened this issue Dec 28, 2015 · 35 comments

@nickwxwu

Hi sh1r0:
I'm very interested in your project. This project is wonderful. It works very well with Eigen, but it does not seem to work with OpenBLAS. I ran it on Android, but it crashed in the function "cblas_sgemm".
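To isolate whether such a crash is inside OpenBLAS itself rather than in caffe's use of it, a minimal standalone cblas_sgemm call can be cross-compiled and run on the device (an illustrative sketch, not code from this thread):

#include <cblas.h>  // from the OpenBLAS install
#include <cstdio>

int main() {
    // C = alpha * A * B + beta * C with 2x2 row-major matrices.
    float A[4] = {1, 2, 3, 4};
    float B[4] = {5, 6, 7, 8};
    float C[4] = {0, 0, 0, 0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,     /* M, N, K */
                1.0f, A, 2,  /* alpha, A, lda */
                B, 2,        /* B, ldb */
                0.0f, C, 2); /* beta, C, ldc */
    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  // expect: 19 22 43 50
    return 0;
}

If this binary runs correctly but the same call crashes inside the app, the problem is more likely the JNI/ABI setup than OpenBLAS itself.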

@nickwxwu
Author

Hi sh1r0:
It seems that using Eigen on mobile is more popular than using OpenBLAS. Is Eigen more efficient than OpenBLAS?

@sh1r0
Owner

sh1r0 commented Dec 28, 2015

How did you get it to work? Did you cross-compile the OpenBLAS library with hard-float support?
I tried this outdated pre-built one before, and got it to work, though it was horribly slow. Also, I'm sure that the latest OpenBLAS can be built for Android and works when linked into executables. However, it's troublesome to use in JNI calls (might be related to this). If you or anyone has any idea about dealing with this issue, please feel free to let me know.
Thanks.

@sh1r0
Owner

sh1r0 commented Dec 28, 2015

AFAIK, Eigen can simply be used as a header-only library, and it is quite competitive with other BLAS-like libraries (refer to the benchmark, and note that OpenBLAS is based on GotoBLAS). I'm not going to say that Eigen is the best choice in all cases, but it's a simple and great one, at least in my case.
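Since Eigen is header-only, a matrix product needs nothing linked at all; a minimal sketch (only the Eigen headers have to be on the compiler's include path):

#include <Eigen/Dense>  // header-only: nothing to cross-compile or link
#include <iostream>

int main() {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(64, 64);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(64, 64);
    Eigen::MatrixXf C = A * B;  // GEMM-like product, generated entirely from templates
    std::cout << C(0, 0) << std::endl;
    return 0;
}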

@bhack

bhack commented Dec 28, 2015

There is a specific OpenBLAS branch for "deep learning" at https://github.com/xianyi/OpenBLAS/tree/optimized_for_deeplearning?files=1

@nickwxwu
Author

@sh1r0 I changed the flag "-mfloat-abi=hard" to "softfp" (an error occurred when OpenBLAS was cross-compiled with hard float while caffe was built with softfp).

I tried the outdated pre-built one and https://github.com/xianyi/OpenBLAS/tree/optimized_for_deeplearning?files=1, and both failed.
I wonder whether the way I'm using it is OK...
I linked libopenblas.so to produce libcaffe.so and libcaffe_jni.so. Then I used "System.loadLibrary("caffe"); System.loadLibrary("caffe_jni")" to load these two libraries.

@sh1r0
Owner

sh1r0 commented Dec 28, 2015

To use the pre-built OpenBLAS:

  1. get this and extract to android_lib/
  2. comment out android_lib/openblas-android/include/openblas_config.h:20
  3. remove all *.so* in android_lib/openblas-android/lib
  4. modify scripts/build_caffe.sh as shown below
@@ -19,7 +19,7 @@ OPENCV_ROOT=${ANDROID_LIB_ROOT}/opencv/sdk/native/jni
PROTOBUF_ROOT=${ANDROID_LIB_ROOT}/protobuf
GFLAGS_HOME=${ANDROID_LIB_ROOT}/gflags
BOOST_HOME=${ANDROID_LIB_ROOT}/boost_1.56.0
-export OpenBLAS_HOME=${ANDROID_LIB_ROOT}/openblas
+export OpenBLAS_HOME=${ANDROID_LIB_ROOT}/openblas-android
export EIGEN_HOME=${ANDROID_LIB_ROOT}/eigen3

rm -rf "${BUILD_DIR}"
@@ -40,7 +40,7 @@ cmake -DCMAKE_TOOLCHAIN_FILE="${WD}/android-cmake/android.toolchain.cmake" \
   -DUSE_LMDB=OFF \
   -DUSE_LEVELDB=OFF \
   -DUSE_HDF5=OFF \
-      -DBLAS=eigen \
+      -DBLAS=open \
   -DBOOST_ROOT="${BOOST_HOME}" \
   -DGFLAGS_INCLUDE_DIR="${GFLAGS_HOME}/include" \
   -DGFLAGS_LIBRARY="${GFLAGS_HOME}/lib/libgflags.a" \
  5. re-build caffe

On the other hand, regarding the master or optimized_for_deeplearning branch of OpenBLAS, hard-float support is required. And as I said, it works for native executables but not for JNI libs. If you want to build this project with hard-float support, you can simply set the flag in the shell with export ANDROID_ABI="armeabi-v7a-hard with NEON" and re-build everything.

@nickwxwu
Author

Thank you very much @sh1r0. With your help, it worked with OpenBLAS-0.2.15.tar.gz once I had compiled all the dependencies with hard-float support. It also seemed that OpenBLAS is faster than Eigen in the forward pass of the caffe model (400-800 ms faster). I thought it might be because the Eigen version was 3.2.5, not the latest, while the OpenBLAS was the latest.
Later, I'll test this using the latest Eigen.
Thanks for everything.

@nickwxwu
Author

I used the latest version of Eigen (3.2.7), but got the same result... I wonder whether some flag (like "neon", etc.) needs to be set for Eigen when compiling caffe with Eigen.

@sh1r0
Owner

sh1r0 commented Dec 29, 2015

Hi @wuxuewu , good to know that. Do you mean that you have succeeded in getting JNI to work with hard float? Could you share your experience? Thanks.
BTW, I think the Eigen version probably matters little for performance. :p

@sh1r0
Owner

sh1r0 commented Dec 29, 2015

@wuxuewu
I tried to run the cpp_classification example on my phone, and simply used time to do simple benchmarks. The results below are the best three of each build (both were built with armeabi-v7a-hard with NEON).

=======  OpenBLAS  ======
0m10.57s real     0m4.76s user     0m4.83s system
0m10.68s real     0m4.35s user     0m4.81s system
0m11.03s real     0m4.46s user     0m4.73s system

=======   Eigen    ======
0m10.99s real     0m3.48s user     0m3.48s system
0m10.85s real     0m3.30s user     0m3.70s system
0m10.38s real     0m3.58s user     0m3.18s system

@nickwxwu
Author

Hi sh1r0:
Yes, I have succeeded in getting JNI to work with hard float. I just followed your instructions in the build.sh, compiling everything with "armeabi-v7a-hard with NEON".
The results you showed above seem to indicate that OpenBLAS is a bit slower than Eigen; I did not try the cpp_classification example. (What versions of OpenBLAS and Eigen did you use?)
I used the caffe lib with OpenBLAS and Eigen in the caffe-demo-for-android project; the caffe_mobile.cpp log output is below. I tested several times and the results did not change.
===== Eigen ========
Prediction time: 2043.39ms

===== OpenBLAS =====
Prediction time: 1458.48ms

Note: caffe model, CPU mode, Eigen 3.2.7, OpenBLAS 0.2.15.
Sorry, I also want to know whether Eigen should be compiled separately, or whether some compile flag needs to be set for Eigen in build_caffe.sh?

@sh1r0
Owner

sh1r0 commented Dec 29, 2015

Hi @wuxuewu ,
Wow, that's weird. First, I used OpenBLAS v0.2.15 and Eigen v3.2.5.
Second, did you use build_openblas.sh to build?
In my experience, armeabi-v7a-hard with NEON is okay for building everything. However, at runtime, the results are totally wrong. Could you provide some of your prediction results from JNI calls?
(EDIT: caffe/examples/images/cat.jpg is a good candidate for the tests.)
For the last question, the answer is no. There is no need to build Eigen separately.

@nickwxwu
Author

Hi sh1r0,
Hi sh1r0,
I tested OpenBLAS and Eigen on two phones I have (A and B), and got the results below:

              phone A    phone B
---------- openblas - 8 ----------
              502ms      1330ms
              458ms      1280ms
              584ms      1530ms
              4168ms     1400ms
              4822ms     1420ms

---------- openblas - 4 ----------
              409ms      1300ms
              445ms      1490ms
              385ms      1410ms
              385ms      1360ms
              376ms      1410ms
              365ms      1340ms
              367ms      1440ms

------------- eigen --------------
              539ms      2170ms
              526ms      2100ms
              535ms      2160ms
              564ms      2220ms
              551ms      2160ms
              528ms      2210ms
              537ms      2140ms

phone A: AArch64, Android 6.0, 8 cores
phone B: ARMv7 rev 1, Android 4.4.2, 4 cores
(phone C: ARMv7 rev 5, Android 4.4.2, 8 cores; results same as phone B)
openblas - 8: compiled with TARGET=ARMV7 USE_THREAD=ON NUM_THREADS=8
openblas - 4: compiled with TARGET=ARMV7 USE_THREAD=ON NUM_THREADS=4

@nickwxwu
Author

I counted the time with the following change in caffe_mobile.cpp, because I found that when predicting on phone A, the function clock() was not precise: the log said "Prediction time: 3900ms" while I saw the app return results in less than one second. So I used the following way to count the time (the log is written out, and the time can be read in the logcat window of Eclipse):

VLOG(1) << "wxw";
const vector<Blob<float>*>& result = caffe_net->Forward(dummy_bottom_vec, &loss);
VLOG(1) << "wxw";

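A wall-clock alternative to clock() for the same measurement (a minimal sketch, assuming C++11 is enabled in the NDK build; as noted below, clock() sums CPU time across threads, so multi-threaded runs report inflated times):

#include <chrono>

// Wall-clock timing around the forward pass; unlike clock(), this is
// not inflated by summing CPU time over multiple worker threads.
auto t0 = std::chrono::steady_clock::now();
const vector<Blob<float>*>& result = caffe_net->Forward(dummy_bottom_vec, &loss);
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
LOG(INFO) << "Prediction time: " << ms << "ms (wall clock)";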
@sh1r0
Owner

sh1r0 commented Dec 30, 2015

Hi @wuxuewu , it seems that your prediction results are correct? I mean, for example, caffe/examples/images/cat.jpg is classified as tabby cat (top-1), right? Could you provide your script for building OpenBLAS and possibly your adaptations for building this project? It would be great to integrate them.
Regarding the forwarding time in caffe_mobile.cpp, I think it counts the real CPU time (summing up all the multi-core CPU time) rather than the wall time; I'll try to fix this.
Thanks.

@nickwxwu
Author

Hi @sh1r0:
The script for building OpenBLAS is below:

#!/usr/bin/env sh

if [ -z "$NDK_ROOT" ] && [ "$#" -eq 0 ]; then
    echo 'Either $NDK_ROOT should be set or provided as argument'
    echo "e.g., 'export NDK_ROOT=/path/to/ndk' or"
    echo "      '${0} /path/to/ndk'"
    exit 1
else
    NDK_ROOT="${1:-${NDK_ROOT}}"
fi

#export OPENBLAS_NUM_THREADS=1
TOOLCHAIN_DIR=$NDK_ROOT/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin
WD=$(readlink -f "$(dirname $0)/..")
INSTALL_DIR=${WD}/android_lib
N_JOBS=8

cd OpenBLAS

make clean
make -j${N_JOBS} \
    CC="$TOOLCHAIN_DIR/arm-linux-androideabi-gcc --sysroot=$NDK_ROOT/platforms/android-19/arch-arm" \
    CROSS_SUFFIX=$TOOLCHAIN_DIR/arm-linux-androideabi- \
    HOSTCC=gcc NO_LAPACK=1 TARGET=ARMV7 \
    USE_THREAD=ON NUM_THREADS=4

rm -rf "$INSTALL_DIR/openblas"
make PREFIX="$INSTALL_DIR/openblas" install

I used the "caffe/examples/images/cat.jpg" to predict, but I did not focus on the result of prediction. I modified the last layer of the caffe model with only 4 outputs, but I did not change the synset_words.txt remained 1000 classifications. Does that matter?

@sh1r0
Owner

sh1r0 commented Dec 30, 2015

@wuxuewu
OK, it seems that your script is almost the same as mine.
Did you fine-tune the caffemodel for your own purposes?
I'm curious about the prediction results. Could you provide your results of PredictTopK when feeding caffe/examples/images/cat.jpg (through JNI calls) into the standard caffenet model provided by BVLC?

BTW, what NDK version do you use?
Thanks.

@nickwxwu
Author

Hi @sh1r0 ,
I have not fine-tuned the caffemodel yet. Just now, I tested with the standard caffemodel, and caffe/examples/images/cat.jpg is classified as tabby, tabby cat (top-1). I used the top-1 result of the predict_top_k function in caffe_mobile.cpp.

My NDK version is r10e.

@nickwxwu
Author

And the time it took was almost the same as reported above for phone B (ARMv7 rev 1, Android 4.4.2, 4 cores). I only tested it on phone B.

@sh1r0
Owner

sh1r0 commented Dec 30, 2015

Hi @wuxuewu ,
That's weird; I always get incorrect results. Could you provide your prebuilt libcaffe.so and libcaffe_jni.so so I can check whether my device is the real problem?
Thanks.

@sh1r0
Owner

sh1r0 commented Dec 30, 2015

I just got another phone to test, and the results were (unsurprisingly?) incorrect, too. Perhaps the device is not the problem. My tests follow this ("armeabi-v7a-hard with NEON" is used in the 2nd step).
Did I miss anything needed to reproduce your results? Also, could you try to build with the latest master branch (following the steps in the link above), and let me know if that works for you?
Thanks.

Note: The attached image is my prediction result for caffe/examples/images/cat.jpg using the caffe-android-demo app with the substituted libs.

@nickwxwu
Author

I think maybe the key to the problem is the caffemodel. You could use another caffemodel... I use a caffemodel downloaded from http://dl.caffe.berkeleyvision.org/. Sorry, I cannot upload files because of my company's rules... But I'll try to build with the latest master branch and let you know.

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

@wuxuewu ,
I do not think the problem is the model. The cpp_classification example (executable) works fine with both the armeabi-v7a with NEON and armeabi-v7a-hard with NEON builds. Also, a clean caffe-android-demo (where the libs are built with armeabi-v7a with NEON) works. All I did to my demo app, as I mentioned in the last comment, was to swap in JNI libs built with armeabi-v7a-hard with NEON.
To be specific, there are numeric issues when the native methods are called from Java: the prediction results are "fixed" no matter what the input image is.
(My models are all downloaded using the scripts provided by official caffe.)

cd caffe
./scripts/download_model_binary.py models/bvlc_reference_caffenet

@nickwxwu
Author

I downloaded caffe on Dec 22; the caffe zip name is caffe-462c0b8e6575f72e50307ac61c116ea28c09eaad. I did not find any numeric issues when native methods built with armeabi-v7a-hard with NEON were called from Java, because this branch version does not use jfloat. So I think maybe the problem is in the JNI call...

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

Why did you need to download caffe?

"Because this branch version does not use jfloat."

Sorry, I don't get the idea. jfloat is never used in official caffe. But in this project, I made a JNI wrapper for Java to call native methods. And yes, all the problems should be related to the JNI calls.

So, if possible, let me know the results of your build with the latest master branch.
Thanks.
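For readers unfamiliar with the wrapper idea: the native side exposes C symbols whose JNI-mangled names Java resolves after System.loadLibrary. A minimal sketch (the package, class, and method names here are hypothetical, not the actual ones in this project):

#include <jni.h>
#include <string>

// Hypothetical wrapper: Java declares `native int predictTopClass(String path)`
// in com.example.CaffeWrapper, and the JVM binds it to this symbol after
// System.loadLibrary("caffe_jni").
extern "C" JNIEXPORT jint JNICALL
Java_com_example_CaffeWrapper_predictTopClass(JNIEnv* env, jobject /* thiz */,
                                              jstring img_path) {
    const char* path = env->GetStringUTFChars(img_path, nullptr);
    std::string image(path);
    env->ReleaseStringUTFChars(img_path, path);
    // ... run the caffe forward pass on `image` and take the arg-max class ...
    return 0;  // placeholder top-1 class index
}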

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

Hi @wuxuewu ,
I think I found the problem: the OS! I just had a try on my MacBook, and I got it working. Sorry for bothering you so much, and thanks for your help. Just a quick question: what kind of environment (OS) do you use? All my trials on Ubuntu 14.04 (both real and virtual machines) failed, which made me think that armeabi-v7a-hard with NEON builds did not work at all.
EDIT: I still cannot make OpenBLAS work, while armeabi-v7a-hard with NEON is okay for Eigen to produce correct results. I'm really confused. 😕

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

Hi @wuxuewu ,
I think I eventually found the problem: multi-thread support in OpenBLAS (NUM_THREADS). Therefore, I set NUM_THREADS=1 to make it single-threaded.

I cannot figure out why multi-threading does not work on my devices. Both of my devices are quad-core. It's really a pity that the computation power is not fully utilized.
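Building with NUM_THREADS=1 fixes the thread pool size at compile time; a multi-threaded build can also be throttled at runtime, which makes experimenting easier (a minimal sketch, assuming a threaded OpenBLAS build, which exports openblas_set_num_threads()):

#include <cblas.h>  // OpenBLAS's cblas.h declares openblas_set_num_threads()

int main() {
    // Force single-threaded BLAS at runtime; equivalent in effect to setting
    // the OPENBLAS_NUM_THREADS environment variable before the process starts.
    openblas_set_num_threads(1);
    // ... subsequent cblas_sgemm(...) calls run on one thread ...
    return 0;
}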

@bhack

bhack commented Dec 31, 2015

@sh1r0 Is the issue related to the fact that "The JNI interface pointer (JNIEnv *) is only valid in the current thread."? Have you tested with OpenMP flags? See https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

@bhack According to the reports above from @wuxuewu , I think NUM_THREADS with a value greater than 1 works for him. However, some people mentioned in OpenMathLib/OpenBLAS#363 that OpenBLAS for Android works only when single-threaded (?).
I've never used the OpenMP flag before. I'll probably give it a try later. Thanks.

@bhack

bhack commented Dec 31, 2015

If the native code in caffe called via JNI uses threads, OpenBLAS needs to parallelize with OpenMP.
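To make that concrete: when the caller's own threads (as JNI-attached threads effectively are) each invoke BLAS, a pthread-based OpenBLAS build can misbehave, whereas a build made with make USE_OPENMP=1 shares the caller's OpenMP runtime instead of fighting it with its own thread pool. A sketch of such a multi-threaded caller (illustrative only; compile with -fopenmp):

#include <cblas.h>

int main() {
    float A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8};
    float C[4][4];  // one private 2x2 output buffer per iteration

    // Several caller threads issue BLAS calls concurrently; an OpenMP-enabled
    // OpenBLAS build coordinates with this parallel region.
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C[i], 2);
    }
    return 0;
}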

@sh1r0
Owner

sh1r0 commented Jan 1, 2016

@bhack Thanks for the information. I just updated the master branch to support OpenMP.

@sh1r0 sh1r0 closed this as completed Jan 1, 2016
@bhack

bhack commented Jan 1, 2016

@sh1r0 As a next step, CUDA support on Android Tegra K1 and X1 could be very useful.

@sh1r0
Owner

sh1r0 commented Jan 2, 2016

@bhack
Recently, I got NVIDIA CodeWorks for Android 1R4, which contains the CUDA toolkit for Tegra devices, but I failed to get it to work with CMake on my very first try. I'll investigate more deeply later (probably after #23).

@xianyi

xianyi commented Feb 26, 2016

Great work! I also want to play with caffe on Android :)
