Compiling scs-gpu from source on Python, using Windows. #186

Open
b-grimaud opened this issue Nov 23, 2021 · 13 comments

@b-grimaud

Specifications

  • OS: Windows 10
  • SCS Version: 3.0.0
  • Compiler: ?

Description

I am trying to make use of a GPU to speed up SCS, but unfortunately the GPU-equipped machine I have access to is shared, and I have to install it on Windows.

It seems that Visual Studio C++ and the Windows 10 SDK are required to compile, but with both installed the build still fails.
The only answer I could find on the subject suggested removing Visual Studio entirely, which, unsurprisingly, doesn't work either.
Building from source without any options (python setup.py install) seems to work, so the issue might be GPU-related.

How to reproduce

As instructed in the docs:

git clone --recursive https://github.com/bodono/scs-python.git
cd scs-python
python setup.py install --scs --gpu

Additional information

I understand that SCS probably hasn't been tested or used on Windows that much, especially for GPU uses. I am asking in case someone did manage to compile from source, with GPU, outside of Linux.
The environment I'm using is Python 3.8.12 with cudatoolkit 10.1.243 and cudnn 7.6.5, installed as part of tensorflow-gpu.
CUDA works fine with ML uses in that environment.

Output

The entire output is (very) verbose, but here's the final part:
error: Command "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DPYTHON -DCTRLC=1 -DCOPYAMATRIX -DGPU_TRANSPOSE_MAT=1 -DPY_GPU -DINDIRECT=1 -Iscs/include -Iscs/linsys -IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5/include -Iscs/linsys/gpu/ -Iscs/linsys/gpu/indirect -IC:\Users\M T\anaconda3\envs\scs_gpu\lib\site-packages\numpy\core\include -IC:\Users\M T\anaconda3\envs\scs_gpu\include -IC:\Users\M T\anaconda3\envs\scs_gpu\include -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt /Tcscs/linsys/gpu\gpu.c /Fobuild\temp.win-amd64-3.8\Release\scs/linsys/gpu\gpu.obj -O3" failed with exit status 2

@bodono
Member

bodono commented Nov 24, 2021

I don't see much in that error to tell us what's happening. Can you post a little more of the output so I can take a look?

Also, the windows special case for gpu is handled here: https://github.com/bodono/scs-python/blob/master/setup.py#L206

First, is the 'CUDA_PATH' env variable set? If so, you should make sure that the include and lib directories are where SCS expects them (otherwise you can edit the code to point to the right place; if it's a general fix I would be happy to accept a PR).
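
A minimal sketch of that check, for anyone following along. The sub-directory names (include, lib/x64) are assumptions about a typical Windows CUDA toolkit layout, not something confirmed in this thread, and describe_cuda_layout is a hypothetical helper name; compare against what setup.py actually looks for.

```python
import os
from pathlib import Path


def describe_cuda_layout(cuda_path):
    """Report which of the expected sub-directories exist under cuda_path.

    The names below are assumed, not taken from setup.py; adjust as needed.
    """
    root = Path(cuda_path)
    candidates = {
        "include": root / "include",
        "lib_x64": root / "lib" / "x64",
    }
    return {name: path.is_dir() for name, path in candidates.items()}


if __name__ == "__main__":
    cuda_path = os.environ.get("CUDA_PATH")
    if cuda_path is None:
        print("CUDA_PATH is not set")
    else:
        print("CUDA_PATH =", cuda_path)
        for name, found in describe_cuda_layout(cuda_path).items():
            print(f"  {name}: {'found' if found else 'missing'}")
```

Running it inside the same environment used for the build shows which CUDA install the build will actually pick up.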

@bodono
Member

bodono commented Nov 24, 2021

By the way, it is often the case that the GPU version is not actually faster than the vanilla direct version, so bear that in mind.

@b-grimaud
Author

So, the dirty way to fix this was to point directly to an environment install of CUDA, as CUDA_PATH otherwise points to the regular Windows install. I don't know if there's a way to make it work otherwise, because both os.environ['CUDA_PATH'] and os.getenv('CUDA_PATH') point to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5
I manually edited include_dirs and library_dirs to point to 'C:\\Users\_user_\\anaconda3\\envs\\_myenv_\\include' and 'C:\\Users\_user_\\anaconda3\\envs\\_myenv_\\Lib\\x64'.
I then hit fatal error LNK1158: cannot run ‘rc.exe’, which I solved by following this, and that was enough to get SCS installed in this environment.

Now, trying to solve with gpu=True, I first encountered NotImplementedError: GPU direct solver not yet available, pass use_indirect=True, which was indeed solved by passing that argument.
And now I get ImportError: DLL load failed while importing _scs_gpu: The specified module could not be found. There is no further information on which DLL might be missing, but the install might not be as complete as I thought. Solving on CPU works just fine otherwise.

I assume this is still related to the compiling process, but I can move it to a new issue if needed.

I will definitely benchmark CPU and GPU performances once it is working, I'll keep you updated.
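
Since the ImportError does not name the missing DLL, one way to narrow it down is to check whether the CUDA libraries are visible on PATH at all. The sketch below is only an illustration: find_on_path is a hypothetical helper, and the file patterns are guesses at what _scs_gpu links against, not a confirmed dependency list.

```python
import fnmatch
import os


def find_on_path(pattern, path_value=None):
    """Return full paths of files on the search path whose names match pattern."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    matches = []
    for directory in path_value.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            # Unreadable directory: skip it rather than abort the search.
            continue
        for name in names:
            if fnmatch.fnmatch(name.lower(), pattern.lower()):
                matches.append(os.path.join(directory, name))
    return matches


if __name__ == "__main__":
    # Assumed patterns; the real dependencies may differ.
    for pattern in ("cusparse*.dll", "cublas*.dll"):
        hits = find_on_path(pattern)
        print(pattern, "->", hits if hits else "not found on PATH")
```

An empty result for every pattern would suggest the loader cannot see the CUDA binaries from this environment.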

@bodono
Member

bodono commented Nov 25, 2021

It sounds like you got the paths right for the install, so you probably need to add the directory where the CUDA binaries live to the PATH variable (or whatever the equivalent is for Windows), e.g.:

set PATH=C:\Users\_user_\anaconda3\envs\_myenv_\Lib\x64;%PATH%
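
The same idea can be sketched from inside Python, for the current process only and before importing scs. This is an assumption-laden illustration: expose_dll_directory is a hypothetical helper, and whether PATH alone or os.add_dll_directory is what the loader honours depends on the Python build, so the sketch does both.

```python
import os


def expose_dll_directory(directory, environ=None):
    """Put directory at the front of PATH and return the resulting PATH value."""
    if environ is None:
        environ = os.environ
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        environ["PATH"] = os.pathsep.join([directory] + parts)
    # os.add_dll_directory exists only on Windows (Python 3.8+), so guard it.
    if (
        environ is os.environ
        and hasattr(os, "add_dll_directory")
        and os.path.isdir(directory)
    ):
        os.add_dll_directory(directory)
    return environ["PATH"]
```

It would be called with the directory from the command above (left with its placeholders here) ahead of any `import scs`.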

@b-grimaud
Author

That did the trick! I added C:\Users\_user_\anaconda3\envs\_myenv_\Lib\x64 to the (USER, not SYSTEM) PATH variable.

The solver now works on some problems, but crashes on others:

===============================================================================
                                     CVXPY
                                    v1.1.17
===============================================================================
(CVXPY) Dec 03 08:52:03 AM: Your problem has 500 variables, 1 constraints, and 0 parameters.
(CVXPY) Dec 03 08:52:03 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Dec 03 08:52:03 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Dec 03 08:52:03 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
-------------------------------------------------------------------------------
                                  Compilation
-------------------------------------------------------------------------------
(CVXPY) Dec 03 08:52:03 AM: Compiling problem (target solver=SCS).
(CVXPY) Dec 03 08:52:03 AM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCS
(CVXPY) Dec 03 08:52:03 AM: Applying reduction Dcp2Cone
(CVXPY) Dec 03 08:52:03 AM: Applying reduction CvxAttr2Constr
(CVXPY) Dec 03 08:52:03 AM: Applying reduction ConeMatrixStuffing
(CVXPY) Dec 03 08:52:03 AM: Applying reduction SCS
(CVXPY) Dec 03 08:52:03 AM: Finished problem compilation (took 3.290e-02 seconds).
-------------------------------------------------------------------------------
                                Numerical solver
-------------------------------------------------------------------------------
(CVXPY) Dec 03 08:52:03 AM: Invoking solver SCS  to obtain a solution.
------------------------------------------------------------------
               SCS v3.0.0 - Splitting Conic Solver
        (c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem:  variables n: 998, constraints m: 1497
cones:    l: linear vars: 993
          q: soc vars: 504, qsize: 2
settings: eps_abs: 1.0e-05, eps_rel: 1.0e-05, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
          max_iters: 10000, normalize: 1, warm_start: 0
          acceleration_lookback: 0, acceleration_interval: 0
lin-sys:  sparse-indirect GPU
          nnz(A): 4473, nnz(P): 0
 ** On entry to cusparseCreate(): CUDA context cannot be initialized

 ** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: NULL pointer

 ** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: NULL pointer

 ** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: NULL pointer

 ** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: NULL pointer

 ** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: NULL pointer

scs/linsys/gpu/indirect\private.c:346:scs_init_lin_sys_work
ERROR_CUDA (*): an illegal memory access was encountered
ERROR: init_lin_sys_work failure
Failure:could not initialize work

I repeatedly solve problems of varying complexity, using data of varying sizes, so I would understand if GPU support is better suited to solving single, larger problems.
I also recently found out about the warm_start argument in CVXPY; could that apply here?

@bodono
Member

bodono commented Dec 3, 2021

Great, glad it's (kind of) working for you!

It sounds like the solver has a GPU memory leak if this is happening after some number of solves. Is it easy enough to send me the script that runs this?

@b-grimaud
Author

Sure! The script itself involves several modules, but the actual problem-solving part is as follows:

import cvxpy as cp


def acceleration_minimization_norm1(measure, sigma0, px, nn=0):
    """
    Parameters
    ----------
    measure : array (n, 2)
        experimental data (noisy)
    sigma0 : int
        estimated precision of localization (in nanometers)
    px : int
        pixel size (in micrometers)
    nn : int, optional
        number of data points discarded at the extremities of the solution

    Returns
    -------
    solution : array (n-2*nn, 2)
        filtered solution minimizing the 1-norm of the acceleration, with the difference between measured data and solution less than or equal to the theoretical noise.
    """
    measure = px * measure
    n = len(measure)
    variable = cp.Variable((n, 2))
    # Second-order finite difference (discrete acceleration) of each coordinate.
    accel_x = variable[2:, 0] + variable[:-2, 0] - 2 * variable[1:-1, 0]
    accel_y = variable[2:, 1] + variable[:-2, 1] - 2 * variable[1:-1, 1]
    objective = cp.Minimize(cp.norm1(accel_x) + cp.norm1(accel_y))
    # Squared residual bounded by the expected noise; the 10**-6 factor
    # presumably converts sigma0**2 from nm^2 to um^2.
    constraints = [cp.norm(variable - measure, 'fro')**2 <= n * sigma0**2 * 10**-6]
    prob = cp.Problem(objective, constraints)

    prob.solve(solver='SCS', verbose=True, gpu=True, use_indirect=True, max_iters=10000)
    solution = variable.value
    if nn == 0:
        return solution
    return solution[nn:n - nn]

That function is then called as part of another module that loops over CSV files containing data.
Thanks a lot for the help!

@bodono
Member

bodono commented Dec 6, 2021

I've profiled the code quite deeply now and I don't see a memory leak anywhere. Does it always crash on the same problem? Does it crash if the problem is called outside of the loop?

@bodono
Member

bodono commented Dec 6, 2021

Also, could you cd into the scs directory (inside scs-python) and tell me which commit hash you are at, using git log?

@b-grimaud
Author

Here's the full output of git log:

commit 807a79e6a36079d11da4db1dff54aeb56b1beb21 (HEAD -> master, tag: 3.0.0, origin/master, origin/HEAD)
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date:   Sun Oct 10 13:56:44 2021 +0100

    pull in gpu fixes

commit 6a7bfb43307efcae40c75a713b95a8fd93136ba2
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date:   Sun Oct 3 23:58:15 2021 +0100

    fix seg fault

commit a52260ee635977dd59653eed35e610a46db55bd3
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date:   Sun Oct 3 23:43:44 2021 +0100

    update to latest scs

commit da854af76e04d0dcb7a56de876e1451c0749d968
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date:   Sun Oct 3 13:55:10 2021 +0100

    update badge link

commit b9972fe8400e6cdee5e85be251ac8029102db2ec
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date:   Sun Oct 3 13:52:35 2021 +0100

    update to latest scs

Running problems independently, out of the loop, seems to prevent such crashes.
Some of them did cause similar crashes after the solve rather than before it, when I let them run for a very large number of iterations, but I haven't been able to reproduce this reliably.
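
One way to test whether the crashes depend on state carried over between solves is to run each solve in a fresh interpreter, so that no GPU context survives from one problem to the next. A minimal sketch, assuming the per-file solve can be expressed as a standalone snippet or script; run_isolated is a hypothetical helper name.

```python
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run a Python snippet in a fresh interpreter and report how it ended."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    returncode, out, err = run_isolated("print('solve would run here')")
    print("exit code:", returncode)
    print(out, end="")
```

A non-zero exit code for the same input both inside and outside the loop would point at the problem data rather than at accumulated state.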

While looking at the verbose output in detail, I noticed that the problems do not actually seem to get solved: some metrics stay the same even when I raise the maximum number of iterations and let it run for a while, and CVXPY always ends up reporting the problem as "unbounded".

------------------------------------------------------------------
 iter | pri res | dua res |   gap   |   obj   |  scale  | time (s)
------------------------------------------------------------------
     0| 3.16e+04  4.93e+02  1.50e+06 -8.08e+05  1.00e-01  2.22e-01
   250| 8.59e+16  0.00e+00  3.47e-01  1.73e-01  1.00e+06  6.51e-01 
   500| 8.59e+16  0.00e+00  5.20e-01  2.60e-01  1.00e+06  1.07e+00 
   750| 8.59e+16  0.00e+00  3.87e-01  1.94e-01  1.00e+06  1.50e+00 
  1000| 8.59e+16  0.00e+00  3.97e-01  1.98e-01  1.00e+06  1.92e+00 
                                |
 90000| 8.59e+16  0.00e+00  5.56e-01  2.78e-01  1.00e+06  1.52e+02 
 90250| 8.59e+16  0.00e+00  4.44e-01  2.22e-01  1.00e+06  1.52e+02 
 90500| 8.59e+16  0.00e+00  5.06e-01  2.53e-01  1.00e+06  1.53e+02
                                |
 99000| 8.59e+16  0.00e+00  4.35e-01  2.18e-01  1.00e+06  1.65e+02 
 99250| 8.59e+16  0.00e+00  5.24e-01  2.62e-01  1.00e+06  1.66e+02 
 99500| 8.59e+16  0.00e+00  2.32e-02  1.16e-02  1.00e+06  1.66e+02 
 99750| 8.59e+16  0.00e+00  5.52e-01  2.76e-01  1.00e+06  1.66e+02 
100000| 8.59e+16  0.00e+00  1.59e-01 -7.94e-02  1.00e+06  1.67e+02 
------------------------------------------------------------------
status:  unbounded (inaccurate - reached max_iters)
timings: total: 1.68e+02s = setup: 9.60e-01s + solve: 1.67e+02s
         lin-sys: 1.02e+02s, cones: 1.95e+01s, accel: 0.00e+00s
------------------------------------------------------------------
objective = -inf (inaccurate)
------------------------------------------------------------------

Or, on some problems:

-------------------------------------------------------------------------------
                                    Summary
-------------------------------------------------------------------------------
(CVXPY) Dec 15 05:27:32 PM: Problem status: optimal_inaccurate
(CVXPY) Dec 15 05:27:32 PM: Optimal value: 0.000e+00
(CVXPY) Dec 15 05:27:32 PM: Compilation took 3.780e-02 seconds
(CVXPY) Dec 15 05:27:32 PM: Solver (including time spent in interface) took 4.099e+02 seconds

The problems that result in "unbounded" rather than just "inaccurate" are also noticeably slower to reach the same number of iterations.
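
The stalled runs above can be spotted without waiting for max_iters, because the pri res column stops changing (8.59e+16 on every row). The sketch below parses rows in the format of the verbose SCS output shown in this thread; the function names and the "last three rows identical" rule are illustrative assumptions, not part of SCS or CVXPY.

```python
import re

# One iteration row: iter| pri res  dua res  gap  obj  scale  time
ROW = re.compile(
    r"^\s*(\d+)\|\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def parse_iterations(log_text):
    """Return (iter, pri_res, dua_res, gap, obj, scale, time) tuples."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            numbers = tuple(float(x) for x in match.groups()[1:])
            rows.append((int(match.group(1)),) + numbers)
    return rows


def primal_residual_stalled(rows, min_rows=3):
    """True if the last min_rows primal residuals are all identical."""
    tail = [row[1] for row in rows[-min_rows:]]
    return len(tail) >= min_rows and len(set(tail)) == 1
```

Header lines, separators and the elided `|` rows do not match the pattern and are skipped.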

@bodono
Member

bodono commented Dec 17, 2021

Thanks for sending this. I ran your code for about a week continuously on randomly generated data on my own GPU machine and was unable to reproduce this. However, examining your output, it looks like the data types are getting confused, e.g. the GPU is expecting a particular integer or floating-point width and is being passed something different.

For one of these problem instances that takes a very long time to solve, could you pass the argument write_data_filename=tmp to the solver and then email me the dumped tmp file? It contains all the data that SCS needs to solve the problem.
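
To illustrate what a width mismatch of this kind does to data (this is a generic demonstration of the hypothesis, not a diagnosis of the SCS code), the sketch below writes values at one width and reads the same bytes back at another. The helper name reinterpret is hypothetical.

```python
import struct


def reinterpret(values, written_as, read_as):
    """Pack values with one struct format code and unpack the bytes with another.

    Little-endian, standard sizes: 'q' = int64, 'i' = int32,
    'd' = float64, 'f' = float32.
    """
    raw = struct.pack("<" + written_as * len(values), *values)
    count = len(raw) // struct.calcsize("<" + read_as)
    return struct.unpack("<" + read_as * count, raw)


if __name__ == "__main__":
    # 64-bit indices read as 32-bit: every other entry becomes 0.
    print(reinterpret([1, 2, 3], "q", "i"))   # (1, 0, 2, 0, 3, 0)
    # A 64-bit float read as two 32-bit floats.
    print(reinterpret([1.0], "d", "f"))       # (0.0, 1.875)
```

Index arrays corrupted this way would be consistent with the NULL-pointer and illegal-value messages earlier in the thread, though that link remains a guess.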

@b-grimaud
Author

I'll try running it with different data on my side and see how it goes!

I'll send the tmp file by email. I let it run for 10,000 iterations to shorten the wait.

@bodono
Member

bodono commented Dec 23, 2021

Running the data you sent me using my GPU I get:

Reading data from /usr/local/google/home/bodonoghue/Downloads/tmp
------------------------------------------------------------------
               SCS v3.0.0 - Splitting Conic Solver
        (c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem:  variables n: 41586, constraints m: 62379
cones:    l: linear vars: 41581
          q: soc vars: 20798, qsize: 2
settings: eps_abs: 1.0e-05, eps_rel: 1.0e-05, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
          max_iters: 10000, normalize: 1, warm_start: 0
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-indirect GPU
          nnz(A): 187119, nnz(P): 0
------------------------------------------------------------------
 iter | pri res | dua res |   gap   |   obj   |  scale  | time (s)
------------------------------------------------------------------
     0| 1.73e+02  1.00e+00  6.58e+05 -3.29e+05  1.00e-01  2.82e-01
   250| 1.20e+00  3.74e-02  3.95e+01  2.24e+02  1.00e-01  8.04e+00
   500| 5.62e-01  1.82e-02  2.11e+01  3.01e+02  1.00e-01  1.61e+01
   750| 3.60e-01  4.69e-03  7.87e+00  3.08e+02  1.00e-01  2.41e+01
  1000| 2.68e-01  3.36e-03  1.25e+01  3.14e+02  1.00e-01  3.18e+01
  1250| 2.13e-01  1.91e-03  1.54e+01  3.19e+02  1.00e-01  3.98e+01
  1500| 1.75e-01  1.63e-03  1.07e+01  3.27e+02  1.00e-01  4.83e+01
  1750| 1.52e-01  1.55e-03  1.08e+01  3.30e+02  1.00e-01  5.66e+01
  2000| 2.38e-01  1.53e-02  6.32e+00  3.36e+02  1.00e-01  6.48e+01
  2250| 1.14e-01  4.21e-03  3.65e+00  3.39e+02  1.00e-01  7.29e+01
  2500| 1.03e-01  1.07e-03  1.12e+01  3.37e+02  1.00e-01  8.11e+01
  2750| 9.39e-02  9.07e-04  7.41e+00  3.40e+02  1.00e-01  8.92e+01
  3000| 9.06e-02  9.33e-04  7.44e+00  3.40e+02  1.00e-01  9.72e+01
  3250| 8.05e-02  9.27e-04  7.37e+00  3.41e+02  1.00e-01  1.05e+02
  3500| 6.97e-02  9.29e-04  3.62e+00  3.44e+02  1.00e-01  1.14e+02
  3750| 6.31e-02  6.35e-04  3.77e+00  3.45e+02  1.00e-01  1.22e+02
  4000| 5.75e-02  6.22e-04  3.42e+00  3.46e+02  1.00e-01  1.30e+02
  4250| 5.13e-02  5.53e-04  4.41e+00  3.46e+02  1.00e-01  1.39e+02
  4500| 4.67e-02  4.88e-04  3.52e+00  3.47e+02  1.00e-01  1.47e+02
  4750| 4.29e-02  4.99e-04  2.90e+00  3.48e+02  1.00e-01  1.55e+02
  5000| 3.91e-02  5.19e-04  3.93e+00  3.47e+02  1.00e-01  1.64e+02
  5250| 3.57e-02  4.60e-04  2.54e+00  3.49e+02  1.00e-01  1.72e+02
  5500| 8.90e-02  5.66e-03  2.69e+00  3.49e+02  1.00e-01  1.80e+02
  5750| 2.93e-02  1.90e-03  2.50e+00  3.49e+02  1.00e-01  1.89e+02
  6000| 2.69e-02  1.34e-03  2.74e+00  3.49e+02  1.00e-01  1.97e+02
  6250| 2.49e-02  5.10e-04  1.95e+00  3.50e+02  1.00e-01  2.05e+02
  6500| 2.32e-02  2.78e-04  7.21e-01  3.50e+02  1.00e-01  2.13e+02
  6750| 2.33e-01  1.49e-02  1.38e+00  3.50e+02  1.00e-01  2.21e+02
  7000| 1.98e-02  2.44e-04  8.66e-01  3.51e+02  1.00e-01  2.30e+02
  7250| 1.86e-02  2.20e-04  8.51e-01  3.51e+02  1.00e-01  2.38e+02
  7500| 1.72e-02  2.37e-04  1.76e+00  3.50e+02  1.00e-01  2.47e+02
  7750| 1.61e-02  1.76e-04  1.55e+00  3.50e+02  1.00e-01  2.55e+02
  8000| 1.48e-02  1.65e-04  1.33e+00  3.51e+02  1.00e-01  2.63e+02
  8250| 1.39e-02  1.57e-04  6.87e-01  3.51e+02  1.00e-01  2.71e+02
  8500| 1.30e-02  1.61e-04  6.60e-01  3.51e+02  1.00e-01  2.79e+02
  8750| 1.28e-02  1.92e-04  9.15e-01  3.51e+02  1.00e-01  2.87e+02
  9000| 1.23e-02  1.61e-04  3.27e-01  3.51e+02  1.00e-01  2.95e+02
  9250| 1.16e-02  1.59e-04  1.04e+00  3.51e+02  1.00e-01  3.03e+02
  9500| 1.08e-02  1.40e-04  7.23e-01  3.51e+02  1.00e-01  3.11e+02
  9750| 1.01e-02  1.42e-04  8.37e-01  3.51e+02  1.00e-01  3.20e+02
 10000| 9.33e-03  1.40e-04  6.07e-01  3.51e+02  1.00e-01  3.28e+02
------------------------------------------------------------------
status:  solved (inaccurate - reached max_iters)
timings: total: 3.29e+02s = setup: 7.37e-01s + solve: 3.28e+02s
         lin-sys: 3.06e+02s, cones: 4.48e+00s, accel: 3.43e+00s
------------------------------------------------------------------
objective = 351.298981 (inaccurate)
------------------------------------------------------------------

In other words, it's clearly different from what you're getting and appears to be working correctly. My guess is that something is wrong in the types we assume CUDA is using, but only for some versions of CUDA or some GPUs; see a similar issue here: bodono/scs-python#54.

I would recommend you stick to the cpu direct version for now.
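
Switching between the two back ends can be kept to one place in the calling code. A minimal sketch, using only the keyword arguments that already appear in this thread (gpu, use_indirect, max_iters, verbose); the helper name scs_solve_options is hypothetical, and passing use_indirect=False to select the CPU direct solver is an assumption about the CVXPY interface.

```python
def scs_solve_options(use_gpu, max_iters=10000, verbose=True):
    """Build keyword arguments for prob.solve() for either back end."""
    options = {"solver": "SCS", "max_iters": max_iters, "verbose": verbose}
    if use_gpu:
        # The GPU build only offers the indirect solver (see the
        # NotImplementedError earlier in the thread).
        options.update(gpu=True, use_indirect=True)
    else:
        # Assumed to select the CPU direct linear-system solver.
        options.update(use_indirect=False)
    return options


# Intended usage with an existing CVXPY problem `prob`:
#     prob.solve(**scs_solve_options(use_gpu=False))
```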
