Travis is having a bad day (with openblas 0.2.7) [PILEDRIVER codes broken] #3834

Closed
staticfloat opened this issue Jul 25, 2013 · 12 comments

@staticfloat
Member

I've run the test suite on every Linux box I have, and I can't reproduce the error that Travis is hitting on all the current builds. I have a sneaking suspicion it has to do with the new openblas 0.2.7 deb I pushed to my PPA, but all my local tests pass with it, so I'm having a hard time tracking down exactly where the failure is.

Can anyone else reproduce these errors?

@staticfloat
Member Author

I've gotten SSH access to a Travis worker. Let's see if we can't nail down what exactly is going on here, and also debug the causes of any other strange Travis errors we encounter. I have access for 48 hours, so let's make the most of it!

My first question is for @aviks and perhaps @JeffBezanson, as I'm seeing some really strange behavior regarding multiprocessing. When I start up a worker (even if it's just addprocs(1)), my memory consumption jumps through the roof (~200MB per process, so ~400MB including the master process). Travis workers apparently have 32 CPUs, so when we start up 8 workers on a machine with 3 GB of memory, things get a little hairy.

Running pmap against the process yields this, which shows an awful lot of memory that isn't due to shared libraries or anything (although at least 80MB is).

I should also mention that when I start up, say, 5 processes via addprocs(5), all 32 CPUs spike to 100%, absolutely consumed by kernel work. Each Julia process also spins up a bunch of threads. It looks to me like each worker spins up NCPUs threads, which could explain the spike in CPU usage: with 8 workers on 32 CPUs, that is 256 threads!
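
The totals quoted above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch in Python: the function name `estimate` is made up for illustration, the inputs are the approximate figures from this comment, and it assumes each worker really does start one thread per CPU, which is a guess rather than something confirmed here.

```python
# Back-of-the-envelope totals using the approximate figures quoted in this thread.
NCPUS = 32              # CPUs reported on the Travis worker
WORKERS = 8             # workers started during the test run
MB_PER_PROCESS = 200    # rough memory footprint of one Julia process
TOTAL_MB = 3 * 1024     # 3 GB of memory on the machine


def estimate(workers, ncpus, mb_per_process):
    """Return (thread_count, memory_mb) for `workers` workers plus one master."""
    # Assumption: every worker starts one thread per CPU.
    threads = workers * ncpus
    # The master process is counted alongside the workers.
    memory_mb = (workers + 1) * mb_per_process
    return threads, memory_mb


threads, memory_mb = estimate(WORKERS, NCPUS, MB_PER_PROCESS)
print(f"{threads} threads, about {memory_mb} MB of {TOTAL_MB} MB")
```

With these inputs the estimate is 256 threads and roughly 1800 MB, which is more than half of the 3 GB available before any test data is allocated.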

I can confirm that the error only occurs when OpenBLAS 0.2.7 is installed, but it's strange that it only happens on Travis machines. @dmbates, @andreasnoackjensen, @ViralBShah, you guys are my go-to people for linalg.jl errors. It looks like this is happening at the @test_approx_eq q'*full(q, false) eye(elty, 5) line in linalg.jl when elty is Float64. Any ideas why this would be? This is being compiled on a 64-bit machine, but with USE_BLAS64=0.
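
For anyone unfamiliar with that assertion: it appears to check that the Q factor of a QR factorization has orthonormal columns, i.e. that Q'Q is approximately the 5x5 identity. The sketch below shows the same kind of check in pure Python. It is not the Julia test and does not call LAPACK; it uses modified Gram-Schmidt as a stand-in, and the helper names (`gram_schmidt_q`, `is_identity`) and the tolerance are assumptions made for illustration.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gram_schmidt_q(a):
    """Return Q with orthonormal columns via modified Gram-Schmidt on the columns of `a`."""
    q_cols = []
    for col in transpose(a):
        v = list(col)
        for q in q_cols:
            # Remove the component of v along each previously computed column.
            proj = sum(x * y for x, y in zip(q, v))
            v = [x - proj * y for x, y in zip(v, q)]
        norm = sum(x * x for x in v) ** 0.5
        q_cols.append([x / norm for x in v])
    return transpose(q_cols)


def is_identity(m, tol=1e-10):
    """Check that a square matrix equals the identity to within `tol`."""
    n = len(m)
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))


random.seed(0)
n = 5
# Random matrix with a boosted diagonal so the example stays well conditioned.
a = [[random.uniform(-1.0, 1.0) + (5.0 if i == j else 0.0) for j in range(n)]
     for i in range(n)]
q = gram_schmidt_q(a)
print(is_identity(matmul(transpose(q), q)))
```

A correct factorization passes this check on any machine, so a failure that only shows up on one CPU type points at the numerical kernel rather than at the test.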

@staticfloat
Member Author

Oh, and if any of you guys want to SSH into this thing, email me and I'll give you access to poke around. I can replicate the memory behavior on other machines, which makes me think it might not be a bug (although at least one of my machines doesn't exhibit the symptoms, and I can't quite figure out why not). The linalg error, however, I've only been able to reproduce on Travis machines.

@aviks
Member

aviks commented Jul 26, 2013

You probably meant @amitmurthy to ask about the multiprocessing issues.

@amitmurthy
Contributor

addprocs() calls blas_set_num_threads(1) on both the master process as well as the workers.

Is it possible that this isn't working with OpenBLAS 0.2.7 on specifically the Travis machines?
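
One way to test that independently of the runtime call is to cap the thread count from the environment instead, since OpenBLAS reads the OPENBLAS_NUM_THREADS variable. Note that this is a different mechanism from blas_set_num_threads(1), not what addprocs() does. The sketch below is in Python, and the helper name `single_threaded_blas_env` is hypothetical.

```python
import os


def single_threaded_blas_env(base_env=None):
    """Return a copy of the environment with OpenBLAS capped at one thread."""
    env = dict(os.environ if base_env is None else base_env)
    # OPENBLAS_NUM_THREADS is the environment variable OpenBLAS consults.
    env["OPENBLAS_NUM_THREADS"] = "1"
    return env


# Example with a small explicit environment so the result is easy to inspect.
env = single_threaded_blas_env({"PATH": "/usr/bin"})
print(env["OPENBLAS_NUM_THREADS"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` when launching a worker by hand. If the thread count still balloons with the variable set, the runtime call is probably not the culprit.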

@ViralBShah
Member

I have reverted 0.2.7 until the lapack symbols in openblas are fixed.

@staticfloat
Member Author

Issue filed as OpenMathLib/OpenBLAS#263, identifying the bad guy as dgeqp3_. I will roll back my PPA somehow. (PPAs don't natively support rollback! It's such a pain to do it every time. I guess it just goes to show that I need to test on Travis before pushing to the PPA.)

@xianyi

xianyi commented Jul 28, 2013

I plan to roll back bulldozer & piledriver kernels to barcelona kernel.

@staticfloat
Member Author

I've patched OpenBLAS to disable AVX kernels for now, as I wasn't able to quickly figure out how to remove the Piledriver code from the dynamic arch binary. Once xianyi rolls the kernels back, I'll remove the patch, and we should be good to go.

@staticfloat
Member Author

It looks like the OpenBLAS developers are eventually going to release 0.2.8 to address some of these issues. @xianyi, if you like, just let me know when 0.2.8 is ready to test, and I'll make sure it works on all of my systems (Linux with AMD, Linux with Intel, OSX, etc.).

@ViralBShah
Member

@staticfloat Can you test 0.2.8-rc1 on AMD?

@staticfloat
Member Author

Yes, I am doing that right now. Unfortunately, I had to wait for the space requirements for the PPA to be increased before I could get OpenBLAS 0.2.8-rc1 uploaded. I should know whether it works or not before I go to bed tonight.

@staticfloat
Member Author

All tests pass now on 0.2.8-rc1. Once 0.2.8 proper is released, I will update the PPA, and we can move to 0.2.8 in deps/Versions.make.

5 participants