Dynamic @exclusive sizes #121
Comments
What do you do when there is more than one I am wondering if |
Right now, I agree there should be a runtime check before launching kernels to verify the sizes. There isn't any thread-local-storage-like memory in OCCA; I'm not sure how that would map to GPU memory, but maybe something to keep in mind :) |
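A minimal sketch of what such a pre-launch check could look like on the host side. The helper name `checkExclusiveCapacity` and its parameters are hypothetical and not part of the OCCA API; it only assumes that the launcher knows both the requested inner loop size and the storage reserved per exclusive variable.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an OCCA function): throws if the requested
// @inner loop extent does not fit in the storage reserved for
// @exclusive variables.
inline void checkExclusiveCapacity(std::size_t innerSize,
                                   std::size_t capacity) {
  if (innerSize > capacity) {
    throw std::runtime_error(
        "@inner loop size " + std::to_string(innerSize) +
        " exceeds @exclusive capacity " + std::to_string(capacity));
  }
}
```

Calling it right before the launch turns a silent out-of-bounds write into an explicit error, for example `checkExclusiveCapacity(innerSize, 256)` with the current hard-coded limit.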
Maybe only require (emit verification code) that all |
@pdhahn Make sure to use
to escape |
oops :-) |
np, I was accidentally emailing |
The |
Yes that is partly the motivation for my earlier allusion to a thread-oriented meaning for exclusive vs. an iteration-oriented meaning. But the loop variable lower and upper bounds, at least as specified by the OKL programmer, are arbitrary and are what is proposed to determine the size of the exclusive variable array, correct? |
Yeah, based on the number of iterations and like the docs say
so it's more like TLS than iterations |
OK. I think I misinterpreted your first comment about allocating the exclusive memory array based on "full inner loop size", where I thought you meant the latter was defined by the loop index variable bounds at the logical OKL program code level, as specified by the OKL programmer, so there would always be one array element per iteration (unrelated to threads). But one element per thread (TLS) makes total sense, at least when the inner loop index variable does not exceed the maximum number of threads per block. Like you said, the latter can be readily determined, e.g. as the device work group size. BTW it would be ideal if the OKL programmer did not have to consider any issues related to physical device constraints on the granularity of the parallelization in the outer/inner loops (i.e., how, for the ubiquitous block-oriented topology assumed by OCCA, the device hardware dimensions map to logical dimensions), such as max threads per work group. Ideally, that is all abstracted away completely, and the programmer is free to specify outer/inner dimensions based on the raw, ungrouped extent of the data to be processed (e.g., like we can do using OpenMP |
I think your first interpretation was right, I meant the concept was similar
👍 I agree. It might mean OKL auto-tiles outer and inner loops if the loops go out of the device bounds (like too many threads or too many iterations for exclusives) |
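To make the auto-tiling idea concrete, here is a rough sketch of splitting one logical iteration range into chunks that each respect a device limit. The function `tileRange` and the `maxTile` parameter are hypothetical illustrations, not something OKL generates today.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: splits [0, total) into half-open tiles
// [begin, end) of at most maxTile iterations each.
std::vector<std::pair<std::size_t, std::size_t>>
tileRange(std::size_t total, std::size_t maxTile) {
  std::vector<std::pair<std::size_t, std::size_t>> tiles;
  if (maxTile == 0) {
    return tiles;  // no valid tiling; caller decides how to report this
  }
  for (std::size_t begin = 0; begin < total; begin += maxTile) {
    // The last tile is clipped to the end of the range.
    tiles.emplace_back(begin, std::min(begin + maxTile, total));
  }
  return tiles;
}
```

With a limit of 256, a logical loop of 600 iterations would be split into three tiles, so each tile could run as one inner loop that stays within bounds.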
A note - I've run into memory errors related to this limitation when the size of inner(0) > 256. |
@jlchan sorry about that! Maybe we should increase the number as a temporary fix |
no worries. I don't need it at the moment, but would it be useful to just add a warning flag during the OKL build? |
@exclusive array sizes for CPU modes are hard-coded to 256.
We should allocate the size depending on the full @inner loop size.
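As a rough illustration of the proposal, the sketch below emulates one outer iteration in plain C++. It assumes that CPU modes keep each exclusive variable as an array with one slot per inner iteration; the function `runOuterIteration` is hypothetical and is not generated OCCA code. The only point is that the array is sized from the inner loop extent rather than a fixed 256.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of one @outer iteration containing two
// consecutive @inner loops that share a per-iteration "exclusive" value.
std::vector<int> runOuterIteration(std::size_t innerSize) {
  // One slot per @inner iteration, sized from the loop extent
  // instead of a hard-coded constant.
  std::vector<int> exclusive(innerSize);
  std::vector<int> out(innerSize);

  // First @inner loop: each iteration writes its own slot.
  for (std::size_t i = 0; i < innerSize; ++i) {
    exclusive[i] = static_cast<int>(2 * i);
  }
  // Second @inner loop: each iteration reads back its own slot.
  for (std::size_t i = 0; i < innerSize; ++i) {
    out[i] = exclusive[i] + 1;
  }
  return out;
}
```

With this sizing, an inner extent of 300 works the same way as one of 256 or less, which is the case that overflows a fixed-size array.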