ATL_dJIK0x0x48NN0x0x0_aX_bX () from /usr/lib/libblas.so.3 #1144

Closed
root-sudip opened this issue Apr 6, 2017 · 25 comments

Comments

@root-sudip

I am trying to learn a dictionary with scikit-learn. It works fine for 100 images, but with 200 images I get a segmentation fault.

I debugged my code with GDB and got this error: 0x00007ffff3059f50 in ATL_dJIK0x0x48NN0x0x0_aX_bX () from /usr/lib/libblas.so.3

How can I fix this problem?

@brada4
Contributor

brada4 commented Apr 6, 2017

You are using ATLAS, not OpenBLAS.

@martin-frbg
Collaborator

... as evidenced by the ATL prefix. Try sudo update-alternatives --list libblas.so.3 to confirm

@brada4
Contributor

brada4 commented Apr 6, 2017

@root-sudip can you check whether the same problem occurs with OpenBLAS and/or Netlib BLAS?
If OpenBLAS joins the crash party we can get into deeper debugging. Otherwise, capture at least a 'bt' inside gdb along with your CPU, OS and compiler details, and post them in a bug report to the respective library.

@root-sudip
Author

@martin-frbg I did that, but I am still getting the same error.

@martin-frbg
Collaborator

The update-alternatives command is just to show what libblas.so.3 actually is - there are several implementations of BLAS around, such as the non-optimized original "netlib" version, ATLAS BLAS (what you appear to be using now), and OpenBLAS. So far your problem appears to have nothing to do with OpenBLAS, and maybe not even with ATLAS - it could be simply that you are calling some BLAS function with the wrong arguments.
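To see what libblas.so.3 actually resolves to without going through update-alternatives, the symlink chain can be followed directly. This is only a minimal sketch: the function name `resolve_library` is made up for illustration, and it assumes a Debian/Ubuntu layout where /usr/lib/libblas.so.3 is a symlink managed by the alternatives system.

```python
import os


def resolve_library(path):
    """Return every hop of the symlink chain that starts at path."""
    chain = [path]
    seen = {path}
    while os.path.islink(chain[-1]):
        link = chain[-1]
        target = os.readlink(link)
        # Relative link targets are resolved against the link's own directory.
        target = os.path.normpath(os.path.join(os.path.dirname(link), target))
        if target in seen:  # guard against symlink loops
            break
        seen.add(target)
        chain.append(target)
    return chain


if __name__ == "__main__":
    # Path taken from the gdb output in this thread.
    for hop in resolve_library("/usr/lib/libblas.so.3"):
        print(hop)
```

The last line printed is the file that the dynamic loader would actually map for that path. Note that this says nothing about libraries bundled inside Python wheels, which do not go through this path at all.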

@brada4
Contributor

brada4 commented Apr 7, 2017

@root-sudip can you run the same test with a non-ATLAS BLAS to sort out whether the problem is a resource leak in ATLAS or in the calling code (anything between where you open the image files and where sklearn.so calls libblas.so)?

@martin-frbg
Collaborator

Btw in case you have not already done this, add "-g" to the compiler options when you build your own code so that gdb can show you (with the "bt" command to get a backtrace) which line in your code, i.e. which invocation of a BLAS function causes the problem. Perhaps this will already be sufficient to identify the error - it could be something as simple as an array that is too small to hold the result.

@root-sudip
Author

root-sudip commented Apr 10, 2017

@martin-frbg sorry for the delayed reply, I could not get access to the server.
After running the sudo update-alternatives --list libblas.so.3 command, I get this output:

/usr/lib/atlas-base/atlas/libblas.so.3
/usr/lib/libblas/libblas.so.3

What should I do now? @martin-frbg

@martin-frbg
Collaborator

This confirms that your problem has nothing to do specifically with OpenBLAS. What you could try is to run
sudo update-alternatives --config libblas.so.3 to switch to something other than ATLAS, provided either OpenBLAS or the netlib reference implementation of BLAS and LAPACK is already on your server. (Use something like `sudo apt-get install libopenblas-base` to install it first if necessary; see for instance http://blog.nguyenvq.com/blog/2014/11/10/optimized-r-and-python-standard-blas-vs-atlas-vs-openblas-vs-mkl/ for details.) Then you could check if the error persists or if it was some fault of ATLAS.

I see now that Sci-Kit appears to be a Python package, so my comment about recompiling with "-g" to get better backtraces is probably pointless. Still, perhaps the "bt" command in gdb (or "up" to traverse the call stack) can point to the line in your code that called the crashing function?
And apparently you opened an issue ticket at scikit-learn/scikit-learn#8717 as well, which is probably the best course of action as people there will be more familiar with scikit and any preliminary setup necessary for training with larger sets. (The only thing in your code that stands out to me - a total stranger to scikit - is that you have n_components=100 in the "dico = MiniBatchDictionaryLearning..." line, but this is probably not related to the number of input images anyway...)

@brada4
Contributor

brada4 commented Apr 10, 2017

You run

sudo update-alternatives --config libblas.so.3

and change the system default BLAS implementation away from ATLAS.
Still no OpenBLAS involved.

@root-sudip
Author

OK, thanks for the quick reply. I will try your suggestion as soon as possible. @martin-frbg and @brada4

@root-sudip
Author

@brada4 with that command I get these options:

  Selection    Path                                      Priority   Status
* 0            /usr/lib/atlas-base/atlas/libblas.so.3    35         auto mode
  1            /usr/lib/atlas-base/atlas/libblas.so.3    35         manual mode
  2            /usr/lib/libblas/libblas.so.3             10         manual mode

Press enter to keep the current choice[*], or type selection number:

Which option should I select?

@brada4
Contributor

brada4 commented Apr 10, 2017

One that has no ATLAS? Please don't make more blank posts. They are a burden to read, especially when you are not reporting an issue with OpenBLAS and we are just making an effort to help you sort out who else is at fault.

@martin-frbg
Collaborator

Looks like you would need to install OpenBLAS (or any other implementation except the ATLAS you currently have) first. Or just wait for some response to your Sci-kit ticket, as it seems very likely that the problem is in your code and not in the libraries it calls.

@root-sudip
Author

How do I install OpenBLAS on Ubuntu 14.04? If there were a problem in my code, it would not work for 100 images either. I only face the problem with more than 100 images. @martin-frbg

@brada4
Contributor

brada4 commented Apr 10, 2017

You can install an old version from Ubuntu using 'apt install libopenblas-dev', then fix up the alternatives as mentioned above.
And please produce a backtrace.

@root-sudip
Author

Yes, I have installed it and configured it properly, but I am still getting the same error. Sorry, I don't understand: what is a backtrace? @brada4

@martin-frbg
Collaborator

The list of function calls leading up to the segmentation fault - this should tell you the line in your SciKit program that made the fatal call. (Do you get "the same error" including a function name that starts with ATL, or is the function named differently now ?)

@martin-frbg
Collaborator

By the way I see no error handling (in the test code you posted on the SciKit ticket) for the case that reading from an image fails. (For instance, does Image.open() work without a corresponding Image.close(), or would you run out of available file descriptors at some point ?)
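A loading loop that closes every file and records failures could look roughly like the sketch below. The names `load_all`, `opener` and `convert` are hypothetical and are not taken from the test code on the SciKit ticket. With Pillow, `opener` would be `Image.open`, whose result can be used in a `with` statement so the underlying file is closed when the block exits.

```python
def load_all(paths, opener, convert):
    """Open each path, convert it while it is open, and always close it.

    opener(path) must return a context manager (for example a file object).
    convert(handle) turns the open handle into the data to keep.
    Returns (loaded, failed), where failed holds (path, message) pairs.
    """
    loaded, failed = [], []
    for path in paths:
        try:
            # The with block closes the handle even if convert() raises.
            with opener(path) as handle:
                loaded.append(convert(handle))
        except OSError as exc:
            # Missing, unreadable or unidentifiable files end up here
            # instead of silently producing bad input for later steps.
            failed.append((path, str(exc)))
    return loaded, failed
```

Checking that `failed` is empty before training makes a read error visible immediately, and the `with` block rules out descriptor exhaustion as a cause when the number of images grows.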

@brada4
Contributor

brada4 commented Apr 11, 2017

When GDB has caught the crash of your program, type 'backtrace' or its short form, 'bt'.
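For a Python program there is a complementary option that does not need gdb: the standard library's faulthandler module (available since Python 3.3) prints the Python-level traceback when the process receives a fatal signal such as SIGSEGV. A minimal sketch, where the helper name `enable_crash_tracebacks` is made up for illustration:

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump Python tracebacks of all threads on fatal signals such as SIGSEGV.

    stream must be a real file with a file descriptor; it defaults to stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling this at the top of the script, or starting the interpreter with `python3 -X faulthandler noise.py`, should show which Python line was executing when the native library crashed. It does not replace the gdb backtrace, which is still needed to see the native frames.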

@root-sudip
Author

root-sudip commented Apr 11, 2017

#0  0x00007fff4b70ef50 in ATL_dJIK0x0x48NN0x0x0_aX_bX ()
   from /usr/lib/libblas.so.3
#1  0x00007fff4b72cc82 in ATL_dNCmmJIK () from /usr/lib/libblas.so.3
#2  0x00007fff4b75185a in ATL_dgemmNN () from /usr/lib/libblas.so.3
#3  0x00007fff4b7f075a in ATL_rtrsmLLN () from /usr/lib/libblas.so.3
#4  0x00007fff4b79a375 in ATL_dtrsm () from /usr/lib/libblas.so.3
#5  0x00007fff4bef1da9 in dtrsm_ () from /usr/lib/libf77blas.so.3
#6  0x00007fff4c2b1f36 in dgetrf_ () from /usr/lib/liblapack.so.3
#7  0x00007fff49d6d154 in dlu_c (p=..., l=..., u=..., a=..., m=51515050, 
    n=110, k=110, piv=..., info=0, permute_l=1, m1=1)
    at scipy/linalg/src/lu.f:27
#8  0x00007fff49d69ca5 in f2py_rout__flinalg_dlu_c (capi_self=<optimized out>, 
    capi_args=<optimized out>, capi_keywds=<optimized out>, f2py_func=
    0x7fff49d6d0c0 <dlu_c>)
    at build/src.linux-x86_64-3.4/scipy/linalg/_flinalgmodule.c:1286
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#9  0x000000000048a487 in PyObject_Call (kw=, arg=
    (<numpy.ndarray at remote 0x7fff39476170>,), 
    func=<fortran at remote 0x7fff4a7079b8>) at ../Objects/abstract.c:2040
#10 do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffffd750, 
    func=<fortran at remote 0x7fff4a7079b8>) at ../Python/ceval.c:4466
#11 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd750)
    at ../Python/ceval.c:4264
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#12 PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2838
#13 0x000000000048e45b in PyEval_EvalCodeEx (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=0x7fff517e7e18, kwcount=1, 
    defs=0x7fff4a6eff00, defcount=3, kwdefs=0x0, closure=0x0)
    at ../Python/ceval.c:3588
#14 0x000000000048a673 in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffd960, 
    func=<function at remote 0x7fff4a71a9d8>) at ../Python/ceval.c:4344
#15 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd960)
    at ../Python/ceval.c:4262
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#16 PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2838
#17 0x000000000048e45b in PyEval_EvalCodeEx (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=0x1d41508, kwcount=0, defs=0x7fff43b0aaa0, 
    defcount=2, kwdefs=0x0, closure=0x0) at ../Python/ceval.c:3588
#18 0x000000000048a673 in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffdb70, 
    func=<function at remote 0x7fff43b18488>) at ../Python/ceval.c:4344
#19 call_function (oparg=<optimized out>, pp_stack=0x7fffffffdb70)
    at ../Python/ceval.c:4262
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#20 PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2838
#21 0x000000000048e45b in PyEval_EvalCodeEx (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=0x1d41268, kwcount=1, defs=0x7fff43d51b40, 
    defcount=6, kwdefs=0x0, closure=0x0) at ../Python/ceval.c:3588
#22 0x000000000048a673 in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffdd80, 
    func=<function at remote 0x7fff43b18510>) at ../Python/ceval.c:4344
#23 call_function (oparg=<optimized out>, pp_stack=0x7fffffffdd80)
    at ../Python/ceval.c:4262
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#24 PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2838
#25 0x000000000048e45b in PyEval_EvalCodeEx (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=0x1d40ed8, kwcount=11, defs=0x7fff416993d0, 
    defcount=16, kwdefs=0x0, closure=0x0) at ../Python/ceval.c:3588
#26 0x000000000048a673 in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffdf90, 
    func=<function at remote 0x7fff40d7b400>) at ../Python/ceval.c:4344
#27 call_function (oparg=<optimized out>, pp_stack=0x7fffffffdf90)
    at ../Python/ceval.c:4262
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#28 PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2838
#29 0x000000000048e45b in PyEval_EvalCodeEx (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=0x7ffff7f3e5c0, kwcount=0, 
    defs=0x7fff40d70990, defcount=1, kwdefs=0x0, closure=0x0)
    at ../Python/ceval.c:3588
#30 0x000000000048a673 in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffe1a0, 
    func=<function at remote 0x7fff40d7b8c8>) at ../Python/ceval.c:4344
#31 call_function (oparg=<optimized out>, pp_stack=0x7fffffffe1a0)
    at ../Python/ceval.c:4262
Python Exception <class 'RuntimeError'> Type does not have a target.: 
#32 PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2838
#33 0x000000000048e45b in PyEval_EvalCodeEx (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0) at ../Python/ceval.c:3588
#34 0x000000000048f15b in PyEval_EvalCode (
Python Exception <class 'RuntimeError'> Type does not have a target.: 
    co=co@entry=<code at remote 0x7ffff7f00390>, globals=globals@entry=, 
Python Exception <class 'RuntimeError'> Type does not have a target.: 
    locals=locals@entry=) at ../Python/ceval.c:775
#35 0x0000000000559730 in run_mod.31601 (mod=mod@entry=0xa777b0, 
Python Exception <class 'RuntimeError'> Type does not have a target.: 
Python Exception <class 'RuntimeError'> Type does not have a target.: 
Python Exception <class 'RuntimeError'> Type does not have a target.: 
    filename=filename@entry=, globals=globals@entry=, locals=locals@entry=, 
    flags=flags@entry=0x7fffffffe460, arena=arena@entry=0x9b75f0)
    at ../Python/pythonrun.c:2180
#36 0x00000000004793c5 in PyRun_FileExFlags (fp=fp@entry=0x9ef830, 
    filename_str=filename_str@entry=0x7ffff7f40050 "noise.py", 
Python Exception <class 'RuntimeError'> Type does not have a target.: 
Python Exception <class 'RuntimeError'> Type does not have a target.: 
    start=start@entry=257, globals=globals@entry=, locals=locals@entry=, 
    closeit=closeit@entry=1, flags=flags@entry=0x7fffffffe460)
    at ../Python/pythonrun.c:2133
#37 0x00000000004797a2 in PyRun_SimpleFileExFlags (fp=fp@entry=0x9ef830, 
    filename=<optimized out>, closeit=closeit@entry=1, 
    flags=flags@entry=0x7fffffffe460) at ../Python/pythonrun.c:1606
#38 0x000000000047989c in PyRun_AnyFileExFlags (fp=fp@entry=0x9ef830, 
    filename=<optimized out>, closeit=closeit@entry=1, 
    flags=flags@entry=0x7fffffffe460) at ../Python/pythonrun.c:1292
#39 0x00000000005bfaa0 in run_file (p_cf=0x7fffffffe460, 
    filename=0x9a3280 L"noise.py", fp=0x9ef830) at ../Modules/main.c:319
#40 Py_Main (argc=argc@entry=2, argv=argv@entry=0x9a2010)
    at ../Modules/main.c:751
#41 0x000000000047d9f4 in main (argc=2, argv=<optimized out>)
    at ../Modules/python.c:69

@brada4

@martin-frbg
Collaborator

Seems the backtrace does not tell as much as we hoped, but there appear to be ways to make gdb show actual python code and data in backtraces, see e.g. http://grapsus.net/blog/post/Low-level-Python-debugging-with-GDB
Please follow the suggestions on your SciKit ticket first, and if that does not lead anywhere try posting that backtrace there. (Seems they suspect there is something wrong with your installation of SciKit).
(And from the top of the backtrace it seems you are still using ATLAS, so again, whatever is at the root of your segmentation fault, it is not an OpenBLAS problem.)
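The check "which library do the top frames come from" can be done mechanically on a pasted backtrace. The sketch below is only an illustration: the function names are made up, and the regular expression covers just the frames that gdb prints without arguments and attributes to a shared library (such as frames #0 to #6 above), not frames that carry source information.

```python
import re

# Matches e.g. "#5  0x00007fff4bef1da9 in dtrsm_ () from /usr/lib/libf77blas.so.3".
# \s+ also spans the line break gdb inserts before "from" on long lines.
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in (\S+) \(\)\s+from (\S+)")


def library_frames(backtrace):
    """Return (frame number, function, library path) for matching frames."""
    return [(int(num), func, lib) for num, func, lib in FRAME_RE.findall(backtrace)]


def atlas_frames(backtrace):
    """Return only the frames whose function name carries the ATL_ prefix."""
    return [frame for frame in library_frames(backtrace) if frame[1].startswith("ATL_")]
```

If `atlas_frames` returns anything for a backtrace taken after switching alternatives, the process is still running ATLAS code, whichever file that code was loaded from.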

@glemaitre

FYI, we are still trying to figure out the issue. But it seems that he is using a scikit-learn wheel which is built using a static version of ATLAS. So it should not be related to OpenBLAS, and it would also explain why switching alternatives does not change anything ;)

@brada4
Contributor

brada4 commented Apr 13, 2017

@glemaitre thanks for sharing. Actually, here the offending call comes from scipy lu (backtrace element eight, frame #7) and hits the alternatives-managed library that I tried to replace with OpenBLAS. It could happen that two different BLAS and/or LAPACK libraries (including minor version differences) are loaded in the same process, leading to a certain crash. Another puzzling thing is the source of libblas.so.3gf, which is not to be found in the Ubuntu packages.
Probably tracing ld.so can shed some light on this self-inflicted DLL hell. Also, LD_PRELOAD-ing one implementation may or may not keep the others out.
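One way to check for several BLAS or LAPACK libraries in the same process, short of tracing ld.so, is to read the process's memory map on Linux. This is a rough sketch under stated assumptions: it relies on the /proc/<pid>/maps layout (six whitespace-separated fields, the last being the path), the function names are invented for illustration, and the name filter is a simple heuristic.

```python
import os
import re

# Heuristic: any mapped file whose name mentions blas, lapack or atlas.
BLAS_HINT = re.compile(r"blas|lapack|atlas", re.IGNORECASE)


def blas_like_libraries(maps_text):
    """Return the sorted, de-duplicated paths of BLAS-like mapped files."""
    found = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5].strip()
        if path.startswith("/") and BLAS_HINT.search(os.path.basename(path)):
            found.add(path)
    return sorted(found)


def loaded_blas_libraries(pid="self"):
    """Read /proc/<pid>/maps (Linux only) and list BLAS-like libraries."""
    with open("/proc/%s/maps" % pid) as handle:
        return blas_like_libraries(handle.read())
```

Calling `loaded_blas_libraries()` right after the script's imports lists what was actually mapped. Extension modules with similar names will show up too, so the output needs reading rather than counting, but two different BLAS implementations in the list would match the scenario described above.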

@glemaitre

@brada4 Thanks a lot. You put us on the right track. There was a big mismatch of scipy versions. @root-sudip installed it from the Debian repo (version 0.13.3), which uses the system BLAS, while Python was reporting a 0.19 version installed from the wheels built with OpenBLAS. I don't really know how things can get jammed that way, but problem solved.
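A mismatch like this can be spotted by comparing where a module would be imported from with the version that the installed package metadata reports. The sketch below assumes Python 3.8 or newer for importlib.metadata (the setup in this thread used Python 3.4, so it would not run there as written), and the helper name `describe_module` is made up. It also assumes the module name equals the distribution name, which holds for scipy and numpy but not for every package.

```python
import importlib.util
from importlib import metadata


def describe_module(name):
    """Report the file a top-level module would load from and its metadata version.

    Returns None if the module cannot be found at all.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None  # importable, but no distribution metadata under this name
    return {"name": name, "origin": spec.origin, "version": version}


if __name__ == "__main__":
    for module in ("numpy", "scipy"):
        print(describe_module(module))
```

If the reported path points at a system package directory while the version belongs to a pip-installed wheel, or the other way round, two installations are shadowing each other.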
