About OpenBLAS #27

Closed · nickwxwu opened this issue Dec 28, 2015 · 35 comments

@nickwxwu

Hi sh1r0:
I'm very interested in your project. This project is wonderful. It works very well with Eigen, but it does not seem to work with OpenBLAS. I ran it on Android, but it crashed in the function "cblas_sgemm".
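To isolate whether such a crash is inside OpenBLAS itself rather than in caffe's use of it, a minimal standalone cblas_sgemm call can be cross-compiled and run on the device (an illustrative sketch, not code from this thread):

#include <cblas.h>  // from the OpenBLAS install
#include <cstdio>

int main() {
    // C = alpha * A * B + beta * C with 2x2 row-major matrices.
    float A[4] = {1, 2, 3, 4};
    float B[4] = {5, 6, 7, 8};
    float C[4] = {0, 0, 0, 0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,     /* M, N, K */
                1.0f, A, 2,  /* alpha, A, lda */
                B, 2,        /* B, ldb */
                0.0f, C, 2); /* beta, C, ldc */
    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  // expect: 19 22 43 50
    return 0;
}

If this binary runs correctly but the same call crashes inside the app, the problem is more likely the JNI/ABI setup than OpenBLAS itself.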

@nickwxwu
Author

Hi sh1r0:
It seems that using Eigen on mobile is more popular than using OpenBLAS. Is Eigen more efficient than OpenBLAS?

@sh1r0
Owner

sh1r0 commented Dec 28, 2015

How did you get it to work? Did you cross-compile the OpenBLAS library with hard-float support?
I tried this outdated pre-built one before, and got it to work, though it was horribly slow. Also, I'm sure that the latest OpenBLAS can be built for Android and works when linked into executables. However, it's troublesome to use in JNI calls (might be related to this). If you or anyone has any idea about dealing with this issue, please feel free to let me know.
Thanks.

@sh1r0
Owner

sh1r0 commented Dec 28, 2015

AFAIK, Eigen can simply be used as a header-only library, and it is quite competitive with other BLAS-like libraries (refer to the benchmark, and note that OpenBLAS is based on GotoBLAS). I'm not going to say that Eigen is the best choice in all cases, but it's a simple and great one, at least in my case.
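Since Eigen is header-only, a matrix product needs nothing linked at all; a minimal sketch (only the Eigen headers have to be on the compiler's include path):

#include <Eigen/Dense>  // header-only: nothing to cross-compile or link
#include <iostream>

int main() {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(64, 64);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(64, 64);
    Eigen::MatrixXf C = A * B;  // GEMM-like product, generated entirely from templates
    std::cout << C(0, 0) << std::endl;
    return 0;
}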

@bhack

bhack commented Dec 28, 2015

There is a specific OpenBLAS branch for "deep learning" at https://github.com/xianyi/OpenBLAS/tree/optimized_for_deeplearning?files=1

@nickwxwu
Author

@sh1r0 I changed the flag "-mfloat-abi=hard" to "softfp" (an error occurred when OpenBLAS was cross-compiled with hard float while caffe was built with softfp).

I tried the outdated pre-built one and https://github.com/xianyi/OpenBLAS/tree/optimized_for_deeplearning?files=1, and both failed.
I wonder whether the way I'm using it is OK...
I linked libopenblas.so to produce libcaffe.so and libcaffe_jni.so. Then I used "System.loadLibrary("caffe"); System.loadLibrary("caffe_jni")" to load these two libraries.

@sh1r0
Owner

sh1r0 commented Dec 28, 2015

To use the pre-built OpenBLAS:

  1. get this and extract to android_lib/
  2. comment out android_lib/openblas-android/include/openblas_config.h:20
  3. remove all *.so* in android_lib/openblas-android/lib
  4. modify scripts/build_caffe.sh as shown below
@@ -19,7 +19,7 @@ OPENCV_ROOT=${ANDROID_LIB_ROOT}/opencv/sdk/native/jni
PROTOBUF_ROOT=${ANDROID_LIB_ROOT}/protobuf
GFLAGS_HOME=${ANDROID_LIB_ROOT}/gflags
BOOST_HOME=${ANDROID_LIB_ROOT}/boost_1.56.0
-export OpenBLAS_HOME=${ANDROID_LIB_ROOT}/openblas
+export OpenBLAS_HOME=${ANDROID_LIB_ROOT}/openblas-android
export EIGEN_HOME=${ANDROID_LIB_ROOT}/eigen3

rm -rf "${BUILD_DIR}"
@@ -40,7 +40,7 @@ cmake -DCMAKE_TOOLCHAIN_FILE="${WD}/android-cmake/android.toolchain.cmake" \
   -DUSE_LMDB=OFF \
   -DUSE_LEVELDB=OFF \
   -DUSE_HDF5=OFF \
-      -DBLAS=eigen \
+      -DBLAS=open \
   -DBOOST_ROOT="${BOOST_HOME}" \
   -DGFLAGS_INCLUDE_DIR="${GFLAGS_HOME}/include" \
   -DGFLAGS_LIBRARY="${GFLAGS_HOME}/lib/libgflags.a" \
  5. re-build caffe

On the other hand, regarding the master or optimized_for_deeplearning branch of OpenBLAS, hard-float support is required. And as I said, it works for native executables but not for JNI libs. If you want to build this project with hard-float support, you can simply set the flag in the shell with export ANDROID_ABI="armeabi-v7a-hard with NEON" and re-build everything.

@nickwxwu
Author

Thank you very much @sh1r0. With your help, it worked with OpenBLAS-0.2.15.tar.gz once I had compiled all the dependencies with hard-float support. It also seemed that OpenBLAS is faster than Eigen in the forward pass of the caffe model (400-800 ms faster). I thought it might be because the Eigen version was 3.2.5, not the latest, while the OpenBLAS was the latest.
Later, I'll test this using the latest Eigen.
Thanks for everything.

@nickwxwu
Author

I used the latest version of Eigen (3.2.7), but got the same result... I wonder whether some flag (like "neon", etc.) needs to be set for Eigen when compiling caffe with Eigen.

@sh1r0
Owner

sh1r0 commented Dec 29, 2015

Hi @wuxuewu , good to know that. Do you mean that you have succeeded in getting JNI to work with hard float? Could you share your experience? Thanks.
BTW, I think the Eigen version probably matters little for performance. :p

@sh1r0
Owner

sh1r0 commented Dec 29, 2015

@wuxuewu
I tried to run the cpp_classification example on my phone, and simply used time to do simple benchmarks. The results below are the best three of each build (both were built with armeabi-v7a-hard with NEON).

=======  OpenBLAS  ======
0m10.57s real     0m4.76s user     0m4.83s system
0m10.68s real     0m4.35s user     0m4.81s system
0m11.03s real     0m4.46s user     0m4.73s system

=======   Eigen    ======
0m10.99s real     0m3.48s user     0m3.48s system
0m10.85s real     0m3.30s user     0m3.70s system
0m10.38s real     0m3.58s user     0m3.18s system

@nickwxwu
Author

Hi sh1r0:
Yes, I have succeeded in getting JNI to work with hard float. I just followed your instructions in the build.sh, compiling everything with "armeabi-v7a-hard with NEON".
The results you showed above seem to indicate that OpenBLAS is a bit slower than Eigen; I did not try the cpp_classification example. (What versions of OpenBLAS and Eigen did you use?)
I used the caffe lib with OpenBLAS and Eigen in the caffe-demo-for-android project; the caffe_mobile.cpp log output is below. I tested several times and the results did not change.
===== Eigen ========
Prediction time: 2043.39ms

===== OpenBLAS =====
Prediction time: 1458.48ms

Note: caffe model, CPU mode, Eigen 3.2.7, OpenBLAS 0.2.15.
Sorry, I also want to know whether Eigen should be compiled separately, or whether some compile flag needs to be set for Eigen in build_caffe.sh?

@sh1r0
Owner

sh1r0 commented Dec 29, 2015

Hi @wuxuewu ,
Wow, that's weird. First, I used OpenBLAS v0.2.15 and Eigen v3.2.5.
Second, did you use build_openblas.sh to build?
In my experience, armeabi-v7a-hard with NEON is okay for building everything. However, at runtime, the results are totally wrong. Could you provide some of your prediction results from JNI calls?
(EDIT: caffe/examples/images/cat.jpg is a good candidate for the tests.)
For the last question, the answer is no. There is no need to build Eigen separately.

@nickwxwu
Author

Hi sh1r0,
Hi sh1r0,
I tested OpenBLAS and Eigen on two phones I have (A and B), and got the results below:

              phone A    phone B
---------- openblas - 8 ----------
              502ms      1330ms
              458ms      1280ms
              584ms      1530ms
              4168ms     1400ms
              4822ms     1420ms

---------- openblas - 4 ----------
              409ms      1300ms
              445ms      1490ms
              385ms      1410ms
              385ms      1360ms
              376ms      1410ms
              365ms      1340ms
              367ms      1440ms

------------- eigen --------------
              539ms      2170ms
              526ms      2100ms
              535ms      2160ms
              564ms      2220ms
              551ms      2160ms
              528ms      2210ms
              537ms      2140ms

phone A: AArch64, Android 6.0, 8 cores
phone B: ARMv7 rev 1, Android 4.4.2, 4 cores
(phone C: ARMv7 rev 5, Android 4.4.2, 8 cores; results same as phone B)
openblas - 8: compiled with TARGET=ARMV7 USE_THREAD=ON NUM_THREADS=8
openblas - 4: compiled with TARGET=ARMV7 USE_THREAD=ON NUM_THREADS=4

@nickwxwu
Author

I counted the time with the following change in caffe_mobile.cpp, because I found that when predicting on phone A, the function clock() was not precise: the log said "Prediction time: 3900ms" while I saw the app return results in less than one second. So I used the following way to count the time (the log is written out, and the time can be read in the logcat window of Eclipse):

VLOG(1) << "wxw";
const vector<Blob<float>*>& result = caffe_net->Forward(dummy_bottom_vec, &loss);
VLOG(1) << "wxw";

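A wall-clock alternative to clock() for the same measurement (a minimal sketch, assuming C++11 is enabled in the NDK build; as noted below, clock() sums CPU time across threads, so multi-threaded runs report inflated times):

#include <chrono>

// Wall-clock timing around the forward pass; unlike clock(), this is
// not inflated by summing CPU time over multiple worker threads.
auto t0 = std::chrono::steady_clock::now();
const vector<Blob<float>*>& result = caffe_net->Forward(dummy_bottom_vec, &loss);
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
LOG(INFO) << "Prediction time: " << ms << "ms (wall clock)";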
@sh1r0
Owner

sh1r0 commented Dec 30, 2015

Hi @wuxuewu , it seems that your prediction results are correct? I mean, for example, caffe/examples/images/cat.jpg is classified as tabby cat (top-1), right? Could you provide your script for building OpenBLAS and possibly your adaptations for building this project? It would be great to integrate them.
Regarding the forwarding time in caffe_mobile.cpp, I think it counts the real CPU time (summing up all the multi-core CPU time) rather than the wall time; I'll try to fix this.
Thanks.

@nickwxwu
Author

Hi @sh1r0:
The script for building OpenBLAS is below:

#!/usr/bin/env sh

if [ -z "$NDK_ROOT" ] && [ "$#" -eq 0 ]; then
    echo 'Either $NDK_ROOT should be set or provided as argument'
    echo "e.g., 'export NDK_ROOT=/path/to/ndk' or"
    echo "      '${0} /path/to/ndk'"
    exit 1
else
    NDK_ROOT="${1:-${NDK_ROOT}}"
fi

#export OPENBLAS_NUM_THREADS=1
TOOLCHAIN_DIR=$NDK_ROOT/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin
WD=$(readlink -f "$(dirname $0)/..")
INSTALL_DIR=${WD}/android_lib
N_JOBS=8

cd OpenBLAS

make clean
make -j${N_JOBS} \
    CC="$TOOLCHAIN_DIR/arm-linux-androideabi-gcc --sysroot=$NDK_ROOT/platforms/android-19/arch-arm" \
    CROSS_SUFFIX=$TOOLCHAIN_DIR/arm-linux-androideabi- \
    HOSTCC=gcc NO_LAPACK=1 TARGET=ARMV7 \
    USE_THREAD=ON NUM_THREADS=4

rm -rf "$INSTALL_DIR/openblas"
make PREFIX="$INSTALL_DIR/openblas" install

I used the "caffe/examples/images/cat.jpg" to predict, but I did not focus on the result of prediction. I modified the last layer of the caffe model with only 4 outputs, but I did not change the synset_words.txt remained 1000 classifications. Does that matter?

@sh1r0
Owner

sh1r0 commented Dec 30, 2015

@wuxuewu
OK, it seems that your script is almost the same as mine.
Did you fine-tune the caffemodel for your own purposes?
I'm curious about the prediction results. Could you provide your results of PredictTopK when feeding caffe/examples/images/cat.jpg (through JNI calls) into the standard caffenet model provided by BVLC?

BTW, what NDK version do you use?
Thanks.

@nickwxwu
Author

Hi @sh1r0 ,
I have not fine-tuned the caffemodel yet. Just now, I tested with the standard caffemodel, and caffe/examples/images/cat.jpg is classified as tabby, tabby cat (top-1). I used the top-1 result of the predict_top_k function in caffe_mobile.cpp.

My NDK version is r10e.

@nickwxwu
Author

And the time it took was almost the same as reported above for phone B (ARMv7 rev 1, Android 4.4.2, 4 cores). I only tested it on phone B.

@sh1r0
Owner

sh1r0 commented Dec 30, 2015

Hi @wuxuewu ,
That's weird; I always get incorrect results. Could you provide your prebuilt libcaffe.so and libcaffe_jni.so so I can check whether my device is the real problem?
Thanks.

@sh1r0
Owner

sh1r0 commented Dec 30, 2015

I just got another phone to test, and the results were (unsurprisingly?) incorrect, too. Perhaps the device is not the problem. My tests follow this ("armeabi-v7a-hard with NEON" is used in the 2nd step).
Did I miss anything needed to reproduce your results? Also, could you try to build with the latest master branch (following the steps in the link above), and let me know if that works for you?
Thanks.

Note: The attached image is my prediction result for caffe/examples/images/cat.jpg using the caffe-android-demo app with the substituted libs.

@nickwxwu
Author

I think maybe the key to the problem is the caffemodel. You could use another caffemodel... I use a caffemodel downloaded from http://dl.caffe.berkeleyvision.org/. Sorry, I cannot upload files because of my company's rules... But I'll try to build with the latest master branch and let you know.

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

@wuxuewu ,
I do not think the problem is the model. The cpp_classification example (executable) works fine with both the armeabi-v7a with NEON and armeabi-v7a-hard with NEON builds. Also, a clean caffe-android-demo (where the libs are built with armeabi-v7a with NEON) works. All I did to my demo app, as I mentioned in the last comment, was to swap in JNI libs built with armeabi-v7a-hard with NEON.
To be specific, there are numeric issues when the native methods are called from Java: the prediction results are "fixed" no matter what the input image is.
(My models are all downloaded using the scripts provided by official caffe.)

cd caffe
./scripts/download_model_binary.py models/bvlc_reference_caffenet

@nickwxwu
Author

I downloaded caffe on Dec 22; the caffe zip name is caffe-462c0b8e6575f72e50307ac61c116ea28c09eaad. I did not find any numeric issues when native methods built with armeabi-v7a-hard with NEON were called from Java, because this branch version does not use jfloat. So I think maybe the problem is in the JNI call...

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

Why did you need to download caffe?

"Because this branch version does not use jfloat."

Sorry, I don't get the idea. jfloat is never used in official caffe. But in this project, I made a JNI wrapper for Java to call native methods. And yes, all the problems should be related to the JNI calls.

So, if possible, let me know the results of your build with the latest master branch.
Thanks.
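For readers unfamiliar with the wrapper idea: the native side exposes C symbols whose JNI-mangled names Java resolves after System.loadLibrary. A minimal sketch (the package, class, and method names here are hypothetical, not the actual ones in this project):

#include <jni.h>
#include <string>

// Hypothetical wrapper: Java declares `native int predictTopClass(String path)`
// in com.example.CaffeWrapper, and the JVM binds it to this symbol after
// System.loadLibrary("caffe_jni").
extern "C" JNIEXPORT jint JNICALL
Java_com_example_CaffeWrapper_predictTopClass(JNIEnv* env, jobject /* thiz */,
                                              jstring img_path) {
    const char* path = env->GetStringUTFChars(img_path, nullptr);
    std::string image(path);
    env->ReleaseStringUTFChars(img_path, path);
    // ... run the caffe forward pass on `image` and take the arg-max class ...
    return 0;  // placeholder top-1 class index
}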

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

Hi @wuxuewu ,
I think I found the problem: the OS! I just had a try on my MacBook, and I got it working. Sorry for bothering you so much, and thanks for your help. Just a quick question: what kind of environment (OS) do you use? All my trials on Ubuntu 14.04 (both real and virtual machines) failed, which made me think that armeabi-v7a-hard with NEON builds did not work at all.
EDIT: I still cannot make OpenBLAS work, while armeabi-v7a-hard with NEON is okay for Eigen to produce correct results. I'm really confused. 😕

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

Hi @wuxuewu ,
I think I eventually found the problem: multi-thread support in OpenBLAS (NUM_THREADS). Therefore, I set NUM_THREADS=1 to make it single-threaded.

I cannot figure out why multi-threading does not work on my devices. Both of my devices are quad-core. It's really a pity that the computation power is not fully utilized.
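Building with NUM_THREADS=1 fixes the thread pool size at compile time; a multi-threaded build can also be throttled at runtime, which makes experimenting easier (a minimal sketch, assuming a threaded OpenBLAS build, which exports openblas_set_num_threads()):

#include <cblas.h>  // OpenBLAS's cblas.h declares openblas_set_num_threads()

int main() {
    // Force single-threaded BLAS at runtime; equivalent in effect to setting
    // the OPENBLAS_NUM_THREADS environment variable before the process starts.
    openblas_set_num_threads(1);
    // ... subsequent cblas_sgemm(...) calls run on one thread ...
    return 0;
}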

@bhack

bhack commented Dec 31, 2015

@sh1r0 Is the issue related to the fact that "The JNI interface pointer (JNIEnv *) is only valid in the current thread."? Have you tested with OpenMP flags? See https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded

@sh1r0
Owner

sh1r0 commented Dec 31, 2015

@bhack According to the reports above from @wuxuewu , I think NUM_THREADS with a value greater than 1 works for him. However, some people mentioned in OpenMathLib/OpenBLAS#363 that OpenBLAS for Android works only when single-threaded (?).
I've never used the OpenMP flag before. I'll probably give it a try later. Thanks.

@bhack

bhack commented Dec 31, 2015

If the native code in caffe called via JNI uses threads, OpenBLAS needs to parallelize with OpenMP.
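To make that concrete: when the caller's own threads (as JNI-attached threads effectively are) each invoke BLAS, a pthread-based OpenBLAS build can misbehave, whereas a build made with make USE_OPENMP=1 shares the caller's OpenMP runtime instead of fighting it with its own thread pool. A sketch of such a multi-threaded caller (illustrative only; compile with -fopenmp):

#include <cblas.h>

int main() {
    float A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8};
    float C[4][4];  // one private 2x2 output buffer per iteration

    // Several caller threads issue BLAS calls concurrently; an OpenMP-enabled
    // OpenBLAS build coordinates with this parallel region.
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C[i], 2);
    }
    return 0;
}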

@sh1r0
Owner

sh1r0 commented Jan 1, 2016

@bhack Thanks for the information. I just updated the master branch to support OpenMP.

@sh1r0 sh1r0 closed this as completed Jan 1, 2016
@bhack

bhack commented Jan 1, 2016

@sh1r0 As a next step, CUDA support on Android Tegra K1 and X1 could be very useful.

@sh1r0
Owner

sh1r0 commented Jan 2, 2016

@bhack
Recently, I got NVIDIA CodeWorks for Android 1R4, which contains the CUDA toolkit for Tegra devices, but I failed to get it to work with CMake on my very first try. I'll investigate more deeply later (probably after #23).

@xianyi

xianyi commented Feb 26, 2016

Great work! I also want to play with caffe on Android :)
