
[FEA] Implementation of MATLAB's find() #147

Closed
akfite opened this issue Mar 14, 2022 · 18 comments
@akfite

akfite commented Mar 14, 2022

First of all, let me just say that I'm excited to see this project. I'm a long-time user of MATLAB but very new to CUDA, and I love that you are lowering the barrier of entry to GPU programming.

Is your feature request related to a problem? Please describe.
Slicing a matrix is often paired with MATLAB's built-in find function. I think it would be a valuable addition to this project.

Describe the solution you'd like
As with your implementations of linspace, meshgrid, etc., it would be great to see syntax and functionality similar to MATLAB's find.

Describe alternatives you've considered
I just stumbled upon this library today--I read through all the help docs and didn't see this anywhere. Hopefully I didn't miss it!

@cliffburdick
Collaborator

cliffburdick commented Mar 14, 2022

Hi @akfite, you're in luck, and I just haven't updated the documentation yet. We do have a find function, and you can see how it's used in the unit tests here:

https://github.com/NVIDIA/MatX/blob/main/test/00_operators/ReductionTests.cu#L520

Because it's C++, it's slightly clunkier than MATLAB's find since we don't do any dynamic memory allocations for output tensors. Instead, you have to provide an output tensor at least as large as the maximum number of elements you expect find() to return, or, in the worst case, the same size as the input tensor. The num_found output variable will then tell you how many values in that tensor are populated for you to use. This API is subject to change to make it easier, but it likely won't be too different, and you can use it now.

@akfite
Author

akfite commented Mar 14, 2022

Awesome, thanks for the quick reply. I'll use your unit tests as a guide. Looking forward to using this code & I appreciate your help!

@akfite akfite closed this as completed Mar 14, 2022
@akfite
Author

akfite commented Mar 17, 2022

As far as I can tell, this implementation of find acts like a mask to return the values that match the selection operator. But what is really desired is the index to the values that were matched--that is how MATLAB's find works.

The desired functionality here is to get all the indices that match some selector, and then to be able to use those indices together with the input tensor to create a new tensor or view. That would open up all kinds of useful ways to slice & dice tensors.

@akfite akfite reopened this Mar 17, 2022
@cliffburdick
Collaborator

Hi @akfite, understood. This should be doable using different iterator types in CUB. We'll get back to you.

@cliffburdick
Collaborator

@akfite I had previously opened this same issue, but I'll leave both open for tracking purposes: #135

@akfite
Author

akfite commented Mar 17, 2022

@cliffburdick Sorry, you're right; I should have looked closer at the outstanding issues.

I will say that #135 describes returning the indices as an optional output, but I think it would be more general to return only the indices. If one wanted the masked values (as it returns now), they would be easily obtained by applying the indices to the input matrix. find is such a foundational operator in MATLAB that it would be wise to align as closely as possible with its pattern of usage--especially given the parallels that already exist.

@cliffburdick
Collaborator

Understood. I have it working and will post the code/examples shortly for you. I was going to default to calling the functions find and find_idx, but haven't settled on that yet.

@cliffburdick
Collaborator

Hi @akfite , you can find the test for it here: https://github.com/NVIDIA/MatX/blob/main/test/00_operators/ReductionTests.cu#L553

The one caveat is we'll be adding a way to select these items shortly. It looks like MATLAB's find returns an absolute (linear) index regardless of the rank of the array, so it's reasonable for us to do that as well. For example, if you have a 10x5 tensor, index 49 will be the last element rather than the coordinate (9, 4). If you need to translate an absolute index into a coordinate index, we have the utility function GetIdxFromAbs to return this. I'll post an example soon.

@akfite
Author

akfite commented Mar 18, 2022

Excellent! I think returning an absolute index is perfectly reasonable, especially if you have a function to do the conversion. Thanks for turning this around so quickly.

@akfite akfite closed this as completed Mar 18, 2022
@cliffburdick
Collaborator

cliffburdick commented Mar 18, 2022

Hi @akfite, please see the unit test here:
https://github.com/NVIDIA/MatX/blob/main/test/00_operators/ReductionTests.cu#L588

This shows a complete example of both the find() and the find_idx() functions. find_idx() can be used with a new select() operator, which will index a tensor with a list of indices in another tensor:

    find_idx(t1o_idx, num_found, t1, GT{thresh});
    cudaStreamSynchronize(0);

    auto t1o_slice = t1o.Slice({0}, {num_found()});
    auto t1o_idx_slice = t1o_idx.Slice({0}, {num_found()});
    (t1o_slice = select(t1o_slice, t1o_idx_slice)).run();

The slices are here to select only num_found values. Remember that you need to synchronize the stream before taking the slices, since Slice executes on the host while num_found is computed asynchronously on the device. Without the synchronization it would be a race condition.

@akfite
Author

akfite commented Mar 18, 2022

The example makes sense--thank you for that. I'm looking forward to trying it out. Unfortunately I still can't compile the unit tests. I've been at it for a few hours, and at this point I'm just going to have to pick it up again tomorrow. I'm trying to get it to work with VS2022 on Windows 10 but have hit some weird issues. I tried following the build instructions, but it wasn't working for me--specifically, I used:

cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=ON -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_DOCS=OFF ..

which fails in cmake because "No CMAKE_CUDA_COMPILER could be found", even though nvcc is on the path. If I specify the architecture:

cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=ON -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_DOCS=OFF -DCMAKE_CUDA_ARCHITECTURES=52 ..

it now successfully finds the CUDA compiler, but now cmake gives another error that:

CMake Error at CMakeLists.txt:110 (if):
  if given arguments:

    "NOT" "CMAKE_BUILD_TYPE" "OR" "STREQUAL" "Debug"

  Unknown arguments specified
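This error pattern typically appears when an if() dereferences an unset variable with ${}. A hypothetical reconstruction of the failing check and a likely fix (the actual line at CMakeLists.txt:110 may differ):

```cmake
# If the variable is dereferenced with ${}, an unset CMAKE_BUILD_TYPE
# expands to nothing and the if() argument list collapses into the
# "Unknown arguments" error reported above:
if(NOT CMAKE_BUILD_TYPE OR ${CMAKE_BUILD_TYPE} STREQUAL "Debug")  # breaks when unset
endif()

# Letting if() dereference the variable name itself (or quoting the
# expansion) keeps the argument list well-formed even when it is unset:
if(NOT CMAKE_BUILD_TYPE OR CMAKE_BUILD_TYPE STREQUAL "Debug")
endif()
```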

So I tried

cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=ON -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_DOCS=OFF -DCMAKE_CUDA_ARCHITECTURES=52 -DCMAKE_BUILD_TYPE=Debug ..

and that got me a .sln, but ultimately only 9 of 20 projects compile in VS2022. It seems some projects are failing because of -fvisibility=hidden being passed to nvcc, which Google leads me to believe may be caused by a pybind11 config? I don't know if I'm doing something dumb or if something is weird in the CMake. I attached the output from VS2022 for reference.

Hopefully tomorrow I'll figure it out so I can actually test this out! 🤞

buildlog.txt

@cliffburdick
Collaborator

Hi @akfite, officially we don't support Windows, mostly because it's not tested at all. That said, I can try some basic troubleshooting. Which version of cmake is this?

@akfite
Author

akfite commented Mar 18, 2022

Hey, @cliffburdick I'm on cmake 3.22.1.

I commented out # target_compile_options(matx INTERFACE -fvisibility=hidden) and it gave me a new error, at least--now I get errors like:

12>C:\dev\cuda\repos\MatX\build\test>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin\nvcc.exe" -gencode=arch=compute_52,code=\"compute_52,compute_52\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.30.30705\bin\HostX64\x64" -x cu   -IC:\dev\cuda\repos\MatX\include -IC:\dev\cuda\repos\MatX\include\kernels -IC:\dev\cuda\repos\MatX\test\include -IC:\dev\cuda\repos\MatX\examples -IC:\Users\Austin\AppData\Local\Programs\Python\Python310\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\include" -I"C:\dev\cuda\repos\MatX\build\_deps\gtest-src\googletest\include" -I"C:\dev\cuda\repos\MatX\build\_deps\gtest-src\googletest" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\include"     --keep-dir x64\Debug  -maxrregcount=0  --machine 64 --compile -cudart static /W4 /WX --expt-relaxed-constexpr "-isystem=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/include" -std=c++17 -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -D"MATX_ROOT=\"C:/dev/cuda/repos/MatX\"" -DINDEX_64_BIT -DMATX_ENABLE_CUTLASS=0 -DMATX_ENABLE_CUTENSOR=0 -DMATX_ENABLE_PYBIND11=1 -DMATX_ENABLE_FILEIO=1 -DMATX_ENABLE_VIZ=0 -D"CMAKE_INTDIR=\"Debug\"" -Dmatx_test_EXPORTS -D_MBCS -DWIN32 -D_WINDOWS -D"MATX_ROOT=\"C:/dev/cuda/repos/MatX\"" -DINDEX_64_BIT -DMATX_ENABLE_CUTLASS=0 -DMATX_ENABLE_CUTENSOR=0 -DMATX_ENABLE_PYBIND11=1 -DMATX_ENABLE_FILEIO=1 -DMATX_ENABLE_VIZ=0 -D"CMAKE_INTDIR=\"Debug\"" -Dmatx_test_EXPORTS -Xcompiler "/EHsc /W4 /nologo /Od /Fdmatx_test.dir\Debug\vc143.pdb /FS /Zi /RTC1 /MDd /GR" -o matx_test.dir\Debug\SVD.obj "C:\dev\cuda\repos\MatX\test\00_solver\SVD.cu"
12>nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified

I'll keep fiddling with it. Worst case, I think I'd at least be able to use your code as a guide to implement the same calls to cub to solve my specific problem if I can't get this to work on Windows. But if you have any ideas, that would be welcome.

@akfite
Author

akfite commented Mar 18, 2022

My best guess at the moment is that this is the root cause: pybind/pybind11#1710

@cliffburdick
Collaborator

@akfite I'll try that out, and as long as it doesn't affect the Linux builds, you can see if it works for you. Just out of curiosity, are you trying to build a native Windows app, or could you use something like WSL instead? MatX compiles just fine on WSL in our testing.

@akfite
Author

akfite commented Mar 18, 2022

Alright, thank you! My end goal is to use this library to accelerate some CPU-bound compute code in MATLAB with a MEX interface. I was able to compile in WSL but I can't get WSL to detect my GPU with lspci, and I think that's why the unit tests immediately error (out of memory error) for me. In any case, it's code that would be distributed to other engineers working in mixed Windows & Linux environments, so I need the solution to support both.

For the most compute-heavy task I couldn't figure out how to return a dynamically-sized array of indices, so I think you've solved the hard part for me. But if I could get the library working I think it would let me do a lot more compute work in CUDA than I would normally do, so I'm still interested in figuring it out.

@cliffburdick
Collaborator

Hi @akfite, understood. WSL seems like a separate issue, but it should work there.

That being said, I looked at your issue and the solution proposed there doesn't work. I think the answer is that all of this should work if we provide the correct flags for Windows, which that issue is supposed to do. The problem is not that nvcc is being passed an incorrect flag, but rather that -fvisibility is not valid on VS. We use the option -forward-unknown-to-host-compiler, so nvcc automatically forwards any unknown parameter (of which this is one) to VS.

It also appears that VS doesn't export symbols in the same way gcc does, so it's not as straightforward as simply passing a different flag to the compiler. I think to get proper Windows support we'll need to comb through all the compiler options and make sure they are appropriate for both Windows and Linux. I can't think of anything in the code that should be bound to Linux, so we shouldn't have to change much there.

Do you mind opening an issue describing the use case?

@akfite
Author

akfite commented Mar 18, 2022

@cliffburdick Gotcha. Thanks for looking into this. Please see #153
