Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect GPU grid dimension #338

Closed
roastduck opened this issue Dec 5, 2020 · 0 comments
Closed

Incorrect GPU grid dimension #338

roastduck opened this issue Dec 5, 2020 · 0 comments

Comments

@roastduck
Copy link
Contributor

I tiled a n x m matrix into 32 x 32 tiles, with each tile mapped to a GPU block, and each element in a tile mapped to a GPU thread. There should be ceil(n / 32) * ceil(m / 32) = ((n + 31) / 32) * (m + 31) / 32)) blocks. However, TACO generates (n + 31) / 32) * (m + 31) / 32) blocks. Please note those parentheses: ((1024 + 31) / 32) * (1024 + 31) / 32)) = 1024, while (1024 + 31) / 32 * (1024 + 31) / 32 = 1055.

The following code is generated using the latest commit (ed3488f).

// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C){
  int A1_dimension = (int)(A->dimensions[0]);
  int A2_dimension = (int)(A->dimensions[1]);
  double* __restrict__ A_vals = (double*)(A->vals);
  int B1_dimension = (int)(B->dimensions[0]);
  int B2_dimension = (int)(B->dimensions[1]);
  double* __restrict__ B_vals = (double*)(B->vals);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);

  int32_t f = blockIdx.x;
  int32_t g = (threadIdx.x % (1024));
  if (threadIdx.x >= 1024) {
    return;
  }

  int32_t i0 = f / (B2_dimension + 31) / 32;
  int32_t j0 = f % (B2_dimension + 31) / 32;
  int32_t i1 = g / 32;
  int32_t i = i0 * 32 + i1;
  if (i >= A1_dimension)
    return;

  int32_t j1 = g % 32;
  int32_t j = j0 * 32 + j1;
  if (j >= B2_dimension)
    return;

  int32_t jC = i * C2_dimension + j;
  double tkC_val = 0.0;
  for (int32_t k = 0; k < B1_dimension; k++) {
    int32_t kA = i * A2_dimension + k;
    int32_t jB = k * B2_dimension + j;
    tkC_val = tkC_val + A_vals[kA] * B_vals[jB];
  }
  C_vals[jC] = C_vals[jC] + tkC_val;
}

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B) {
  int C1_dimension = (int)(C->dimensions[0]);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);
  int A1_dimension = (int)(A->dimensions[0]);
  int B2_dimension = (int)(B->dimensions[1]);

  for (int32_t pC = 0; pC < (C1_dimension * C2_dimension); pC++) {
    C_vals[pC] = 0.0;
  }

  computeDeviceKernel0<<<(A1_dimension + 31) / 32 * (B2_dimension + 31) / 32, 1024>>>(A, B, C);  // <----- LOOK AT HERE
  cudaDeviceSynchronize();
  return 0;
}

Using the following command:

taco \
    "C(i, j) = A(i, k) * B(k, j)" \
    -s="split(i,i0,i1,32)" \
    -s="split(j,j0,j1,32)" \
    -s="reorder(i0,j0,i1,j1,k)" \
    -s="fuse(i0,j0,f)" \
    -s="fuse(i1,j1,g)" \
    -s="parallelize(f,GPUBlock,NoRaces)" \
    -s="parallelize(g,GPUThread,NoRaces)"
guilhermeleobas pushed a commit to Quansight-Labs/taco that referenced this issue May 27, 2021
* Reimplemented parts of attribute query lowering and some optimizations

* Add a test case for mixing sparse/dense formats in matrix multiply

The test case does A=BC, and tries all permutations of Dense, CSR, CSC, and COO.
It is disabled for now, enable it once sparse output works.

* Add in hoisted workspace reuse and remove guard for divisible bound and split

* Fix some workspaces tests

* Use CUDA_LIBRARIES instead of hardcoding the path to libcudart

Hardcoded paths don't work when using Debian's packaged version of cuda,
as the library paths don't match.  CMake's find_package(CUDA) sets
CUDA_LIBRARIES to the path of libcudart, so just use that instead.

* Add TACO_NVCC var to complement TACO_NVCCFLAGS

This is useful for passing specific arguments to nvcc.  In my case,
I wanted to force nvcc to use a specific version of g++.

* Updated automated test workflow

* Updated automated test workflow

* fix -s arg parser

* Prototypes automatically generating code to to have sparse iteration over a dense workspace

* don't run autoscheduling commands if manual schedule is provided in command line tool fixes tensor-compiler#336

* fix fuse bound calculation, which was unnecessarily enlarged. Fixes tensor-compiler#337

* Fixes bugs in check for accelerating workspace

* Fixes bug in concreteNotation check. All workspace tests pass.

* Removes print statements

* fix handling of operator precedence in CUDA backend. Fixes tensor-compiler#338

* Only hoists out malloc + free from where statement when possible. Emits loop to zero every element in a temporary when it is hoisted before the producer is called. Changes the codegens to keep pointer names constant

* Fix build failures on ubuntu 16.04

* Fix python bindings when building with clang++-10

Fix a few instances of this build error in pytaco:
.../python_bindings/src/pyTensor.cpp:406:53: error: unknown type name 'nullptr_t'; did you mean 'std::nullptr_t'?

* Use exceptions for error reporting in all cases

Previously, exceptions were used only when the Python bindings were
enabled.  This meant that C++ applications could only handle errors
gracefully when the Python bindings were enabled.

Change it to consistently use exceptions in all cases.

* Adds negation to pytaco tensor interface

* Removes initialization loop from before producer when accelerating a dense workspacE

* Places index list size above the producer loop when accelerating a dense workspace. This should make the transition to multithreading easier and fixes a bug in the original code

* Fixes workspace reset

* If underived variables are used to index a workspace, we allocate space for the workspace based on the size of the sizes of the input tensors

* Relaxes requirements for spmm transformation

* Checks if first mode of last tensor has locate for spmm transform

* Changes SPMM tranform requirement. Unsure about this

* Fix whitespace in tools/taco.cpp.  (No functional changes)

* Report an error properly in the taco CLI tool.

* Use the existing Lexer to parse scheduling directives

Add a schedule parser function.
Add test cases for the schedule parser function.
Use the function in the taco command-line tool.
Return usage messages when the user passes in the wrong number of parameters.

* Silence a warning about cmake policy CMP0054.

* lower,index_notation: fix compilation warnings

Fix a few compilation warnings caused by taking copies of loop variables
instead of references.

* index_notation,error: deduplicate dimension checking routines

Currently, there are two dimension checking methods in TACO. The first
returned a boolean, and the second returned a user readable string
detailing the error. Both methods had nearly identical code. Therefore,
this commit merges them into a single function that returns a boolean
and the error, if it exists.

* lower: fix a bug causing undefined variables when applying fuse

Fixes tensor-compiler#355.

This commit fixes a bug where the fuse transformation would not generate
necessary locator variables when applied to iteration over two dense
variables.

* Revert "lower: fix a bug causing undefined variables when applying fuse"

* Add -help and -help=schedule parameters to CLI

* lower: fix a bug causing undefined variables when applying fuse

Fixes tensor-compiler#355.

This commit fixes a bug where the fuse transformation would not generate
necessary locator variables when applied to iteration over two dense
variables.

Additionally, this commit adds a test for when a dense iteration results
in a transposition of a tensor.

* Emit unsequenced insertion code

* Zeroelss updates

* Emit code to use attribute query results during assembly

* include,src: introduce a true break statement, rename current to continue

The current `ir::Break` statement actually translates to a `continue`.
This commit renames this to `ir::Continue`, and adds a new `ir::Break`
node that actually translates to a `break`. This new node will be used
by upcoming windowing work.

* Don't emit append code if using ungrouped insertion

* Clear the needsCompile flag in tensor->compileSource()

Fixes tensor-compiler#366.

* Add an error message for invalid input tensor names.

* Fix warnings in python bindings

* tensor,codegen: fix a bug where kernel cache could be modified

This commit fixes a bug where upon recompilation of an index statement,
entries in the kernel cache could be inadvertently modified, leading to
confusing segfaults.

An example of the bug is included in the added test, where the second
call to `c(i, j) = a(i, j)` would hit the cache, but then find a module
that had code that corresponded to `c(i, j) = a(i, j) + b(i, j)`.

* Implemented assemble scheduling command + don't sort sparse accelerator if performing reduction

* Assume inputs are zeroless when computing attribute queries

* Replace workspaces in attribute queries

* Enable parallelization of forall statements with results assembled by ungrouped insertion

* Fixed various bugs

* Fixed various bugs

* Deleted redundant code

* Fix workspaces test on ubuntu 16.04

Fixes: tensor-compiler#380

* Add code coverage targets to cmake

* Fix warnings in Release builds

* Fixed attribute query compute code not being emitted + optimize computation of Boolean temporaries when always assigned true

* Emit init_edges code

* Added parallel SpGEMM test

* Fixed heuristic for inserting accelerators for workspaces indexed by derived index variables

* Removed debug print statements

* Updated CMake requirements

* Added correctness checks for ungrouped insertion

* Fix a bug in CLI parsing of bound()

This bug was introduced in tensor-compiler#352.

* Strengthened precondition for assemble command

* Remove pybind11

* Make pybind11 a submodule

* Modify cmake

* fix cmake

* Removes forcecast in function overload

* Add comment to python code explaining when conversion happen

* Don't emit atomic pragma for non-reduction assignments

* *: add support for windowing of tensors

This commit adds support for windowing of tensors in the existing index
notation DSL. For example:

```
A(i, j) = B(i(1, 4), j) * C(i, j(5, 10))
```

causes `B` to be windowed along its first mode, and `C` to be windowed
along its second mode. In this commit any mix of windowed and
non-windowed modes are supported, along with windowing the same tensor
in different ways in the same expression. The windowing expressions
correspond to the `:` operator to slice dimensions in `numpy`.

Currently, only windowing by integers is supported.

Windowing is achieved by tying windowing information to particular
`Iterator` objects, as these are created for each `Tensor`-`IndexVar`
pair. When iterating over an `Iterator` that may be windowed, extra
steps are taken to either generate an index into the windowed space, or
to recover an index from a point in the windowed space.

* Update Cmake to pull python binding during any build

* Add a SpTV+openmp+atomics test case for tensor-compiler#316

* Improve CI test coverage

Add a build step that covers the OpenMP and Python features.

Make it run `make test` to run all available test suites.

* Raise internal error if trying to generate code to assemble sparse accelerator in parallel

* *: add the ability to stride window access

This commit extends the windowing syntax to include an optional third
parameter to a window expression on an index variable:

```
a(i) = b(i(0, n, 5 /* stride */))
```

This stride parameters means that the window should be accessed along
the provided stride, which defaults to 1.

Striding is implemented with a similar idea as windowing, where
coordinates in the stride are mapped to a canonical index space of `[0,n)`.
For compressed modes, coordinates that don't match the stride are
skipped.

* Fixed various bugs

* Fixed removal of redundant loops

* lower: fix a bug when using OpenMP and windowing

Fixes tensor-compiler#409.

This commit fixes a bug where position loops parallelized with OpenMP
that operated over windowed tensor modes would fail to compile.

This commit also fixes some compilation errors compiling windowing tests
on Ubuntu.

* Unbreak cmake build of python bindings

* Remove redundant allocation

* *: add support for using arbitrary indexing sets to window tensors

This commit adds support for using vectors to index arbitrary dimensions
of tensors. It works by packing the vector into a sparse tensor, and
coiterating over the sparse tensor to efficiently filter the chosen
dimensions. The syntax of indexing sets look as follows:

```
A(i) = B(i({1, 3, 5}))
```

which means that only elements 1, 3, and 5 from `B` will be used in the
computation.

* index_notation: implement the `divide` transformation

The divide transformation divides a loop up into `n` equal components,
whereas split breaks a loop up into a components of size `n`.

It also enables support for the transformation in the TACO CLI.

* Enable CI tests for array_algebra branch

* Suppress GCC warnings

* Fixed heuristic for inserting sparse accelerator

* Revert "Fixed heuristic for inserting sparse accelerator"

This reverts commit 4e264ce.

* Fixed heuristic for inserting sparse accelerator

* Fix package version issue in CI tests

Run "apt-get update" to update the package list.

* cuda: fix windowing test with cuda

Fixes tensor-compiler#422.

This commit ensures that the allocation clearing logic is applied to
the CUDA backend as well. The windowing test caught this because TACO
was automatically parallelizing the loop onto the GPU.

* index_notation,tensor: small bugfixes for index sets

* Fixes a runtime error when using index sets on tensors not of integer types
* Fixes a compile error when using a vector typed variable as argument
  for an index set.

* Allow CLI precompute() to specify the workspace name

* Add tracking/reporting of build info

* CLI tool treats double hyphens as a single hyphen

* lowerer_impl: fix some striding bugs

Fixes some formulaic errors in generated striding code along with a test
that revealed them.

* Better error message for guarding unguardable loops

* Use full precision when IR printing float constants

The default precision when printing a floating point value is 6 digits.
This causes a lot of double values to get truncated. Print these with
full precision to avoid losing data.

* Don't emit redundant code to append edges when inserting into result

* index_notation: fix a bug where windows would be dropped through `+=`

Fixes tensor-compiler#451.

This commit fixes a bug where windows applied to index variables would
be dropped when assigned to via the `+=` operator.

* Fixed printing of scheduling commands in command-line tool output

* Fixed precompute transformation and attempt at fixing tensor-compiler#389. Also generate more optimized attribute query code for parallel sparse tensor addition

* Modified MTTKRP test to use schedule with precompute

* assemble command now no longer uses fresh index variables in inserted attribute query computations by default

* Fixed typo in command-line tool usage

* Fixed assemble command with dense arrays + improved heuristics for determining whether result needs to be explicitly zero-initialized

* Fixed how parallelize command checks for races

* Fixing merge issues

Co-authored-by: Stephen Chou <s3chou@csail.mit.edu>
Co-authored-by: Mark Glines <mark@glines.org>
Co-authored-by: Olivia Hsu <owhsu@stanford.edu>
Co-authored-by: Stephen Chou <stephenchouca@users.noreply.github.com>
Co-authored-by: roastduck <rd0x01@gmail.com>
Co-authored-by: Rawn <rawnhenry@gmail.com>
Co-authored-by: Ryan Senanayake <rsen@mit.edu>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Changwan Hong <changwan@lanka.csail.mit.edu>
Co-authored-by: Rohan Yadav <rohany@cs.stanford.edu>
Co-authored-by: Sam Kaplan <sam@extreme-scale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant