Wrong handling of workspace in multiple numpy operators #19458
Comments
Good catch.
@hzfan @reminisce @haojin2 I think we had a discussion about multiple workspaces before. Would you help provide some context on this?
@szha I don't think one would need a stateful op there; just allocate a larger workspace that can hold both the intermediate result and the workspace for the reduction.
Same problem in
There is a lot of copy-paste in the implementation of various numpy ops; they should be reviewed. Maybe we could make a test pipeline with a warning if someone tries to access
I raised this concern to @reminisce and @haojin2 before, but there were some discussions that mentioned that we can have multiple workspaces.
@ptrendx Please refer to the new numpy implementation, which uses a pre-allocated workspace to avoid this problem: https://github.com/apache/incubator-mxnet/blob/master/src/operator/numpy/np_broadcast_reduce_op.h#L666-L694. the
@haojin2 Sure (I'm actually folding that implementation back into the original for code reuse, but anyway), but this function is not used in the examples I showed. I did see multiple examples of this, including some that I am fixing in my #19426 PR (like this one in numpy kron: https://github.com/apache/incubator-mxnet/blob/master/src/operator/numpy/np_kron-inl.h#L229-L234). It just feels that whatever I fix in that PR is patching the things I noticed instead of making sure the issue is fully fixed. That is the purpose of this issue.
I've checked the implementation; the way to solve the problem is to create one large workspace: pre-compute the size of the workspace required by the reduce, and then slice the large workspace. Example code in LayerNorm: https://github.com/apache/incubator-mxnet/blob/33d94f1d59335f504ed5b9a7b32f0e81a5d5da56/src/operator/nn/layer_norm-inl.h#L247-L271
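For reference, a minimal sketch of that pattern, modeled on the linked LayerNorm code. The surrounding variables (`ctx`, `red_src_shape`, `red_dst_shape`, `intermediate_shape`, the `DType`/`xpu` template parameters) are assumed to be in scope inside the usual type/ndim switch macros, and the exact signatures of `ReduceWorkspaceSize` and `get_space_typed` may differ between MXNet versions:

```cpp
// Sketch only: modeled on the linked LayerNorm code; exact signatures of
// ReduceWorkspaceSize / get_space_typed may differ between MXNet versions.
using namespace mshadow;
Stream<xpu> *s = ctx.get_stream<xpu>();

// 1. Ask the reduce implementation how much scratch space it will need.
size_t reduce_workspace_bytes = 0;
BROADCAST_NDIM_SWITCH(red_dst_shape.ndim(), NDim, {
  reduce_workspace_bytes = broadcast::ReduceWorkspaceSize<NDim, DType>(
      s, red_dst_shape, kWriteTo, red_src_shape);
});

// 2. Add the space this operator needs for its own intermediate result.
const size_t intermediate_bytes = intermediate_shape.Size() * sizeof(DType);
const size_t total_bytes = intermediate_bytes + reduce_workspace_bytes;

// 3. Request the temp space from the context exactly once.
Tensor<xpu, 1, char> workspace =
    ctx.requested[0].get_space_typed<xpu, 1, char>(Shape1(total_bytes), s);

// 4. Slice it: the first chunk holds the intermediate data, the remainder is
//    handed to the reduce kernel, so the reduce never has to request the
//    (same) workspace again and clobber the intermediate values.
Tensor<xpu, 1, DType> intermediate(
    reinterpret_cast<DType *>(workspace.dptr_),
    Shape1(intermediate_shape.Size()), s);
Tensor<xpu, 1, char> reduce_workspace(
    workspace.dptr_ + intermediate_bytes,
    Shape1(reduce_workspace_bytes), s);
```

The reduce is then given `reduce_workspace` explicitly instead of grabbing the temp space from the context itself.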
I discussed this offline with @reminisce and he mentioned the issue in #15732, in which 6 tests are disabled due to this.
@sxjscience I don't think that is relevant -
Okay, I was not aware of this. If that's the case, we definitely need to fix these operators.
@ptrendx I believe @sxjscience was referring to something that advocated for supporting the development of operators that require a series of calls to other operators, each with its own temp space. In fact, assigning multiple temp spaces to one operator is currently restricted. @hzfan and I investigated the feasibility of lifting this restriction, but GPU CI could not pass at the time. We narrowed the possible root cause down to six unit tests in #15732. Unfortunately, the real culprit has still not been identified. It would be great if someone could take this forward and get the problem resolved, so that a developer focusing on implementing one operator does not need full knowledge of the callees' implementation details, i.e. calculating a lump-sum workspace.
I agree that the developer experience is lacking here. That said, using multiple workspaces to solve this is not great: you end up with memory fragmentation and potentially much higher overall memory consumption. I would argue that the proper way to handle the developer-experience problem is better code structure (so that you do not take a random piece of another operator without thinking about its prerequisites) and possibly some API limitations (e.g. in this case, maybe you should only be allowed to take the workspace from the context once, and be able to return it if you want a reset of that counter, e.g. to resize the amount of workspace you need).
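To make the idea concrete, here is a purely hypothetical sketch of such an API limitation; none of these names (`WorkspaceGuard`, `Take`, `Reset`) exist in MXNet, and this only illustrates the "take the workspace once, explicitly reset to invalidate it" idea:

```cpp
// Hypothetical sketch -- not an existing MXNet API.
class WorkspaceGuard {
 public:
  explicit WorkspaceGuard(const OpContext &ctx) : ctx_(ctx), taken_(false) {}

  // Hand out the temp space at most once per guard; a second request without
  // an explicit Reset() means the developer may be clobbering earlier contents.
  template <typename xpu>
  mshadow::Tensor<xpu, 1, char> Take(size_t bytes, mshadow::Stream<xpu> *s) {
    CHECK(!taken_) << "Workspace already taken; call Reset() if the previous "
                      "contents are no longer needed.";
    taken_ = true;
    return ctx_.requested[0].get_space_typed<xpu, 1, char>(
        mshadow::Shape1(bytes), s);
  }

  // Explicitly invalidate the previously taken workspace (no memory is freed;
  // the backend keeps handling the storage exactly as it does today).
  void Reset() { taken_ = false; }

 private:
  const OpContext &ctx_;
  bool taken_;
};
```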
@ptrendx If I understand your argument correctly, the mechanism of returning the workspace to the context and resetting the counter would similarly lead to the potential problem of memory fragmentation? Also, resetting the temp space in a callee would require copying the data from the original space to the bigger one and would consequently cause performance degradation. I believe it's a decision about the trade-off between developer experience and ultimate performance, comparing the multi-temp-space solution with one lump-sum temp space. Anyway, I'm just speaking from my past experience of operator development. It's up to the community to bring up an RFC to tackle this.
No no, by returning the workspace I did not mean actually freeing the memory, just an indicator to the developer that whatever workspace they were using before is no longer valid; the actual handling of the workspace by the backend would stay the same.
Attention needs to be paid to memory alignment when the dtypes of the different chunks differ. It didn't matter in this example.
Yes, I think we should add an overload to
One example of handling alignment (using PadBytes for padding): https://github.com/apache/incubator-mxnet/blob/ea222a355005fb3f13fe422fcab7caab53999dfd/src/operator/tensor/ordering_op-inl.h#L579-L617
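A condensed sketch of what that looks like when one char workspace is sliced into chunks of different dtypes. The padding helper is written out here for illustration and may not match the exact PadBytes in the linked file; `num_elements`, `DType`, `index_t`, `ctx` and `s` are assumed to be in scope:

```cpp
// Illustration only: round each chunk up so the next chunk starts on a
// boundary valid for every dtype involved.
inline size_t PadBytes(size_t num_bytes, size_t alignment) {
  // Round num_bytes up to the next multiple of alignment.
  return num_bytes + (alignment - num_bytes % alignment) % alignment;
}

// Align every chunk to the strictest requirement among the dtypes used.
const size_t alignment = std::max(sizeof(DType), sizeof(index_t));
const size_t idx_bytes = PadBytes(sizeof(index_t) * num_elements, alignment);
const size_t val_bytes = PadBytes(sizeof(DType) * num_elements, alignment);

mshadow::Tensor<xpu, 1, char> workspace =
    ctx.requested[0].get_space_typed<xpu, 1, char>(
        mshadow::Shape1(idx_bytes + val_bytes), s);

// Because idx_bytes was padded up to `alignment`, the DType chunk starts on a
// boundary that is also valid for DType.
index_t *indices = reinterpret_cast<index_t *>(workspace.dptr_);
DType *values = reinterpret_cast<DType *>(workspace.dptr_ + idx_bytes);
```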
Yup, that padding logic should be handled by the call that takes the workspace, not by each operator separately.
Description
The backward pass of the tensordot operator (both the regular and the integer version) uses the workspace to store intermediate values, but then calls ReduceAxesComputeImpl, which also uses the workspace. See e.g. here: https://github.com/apache/incubator-mxnet/blob/bd55002/src/operator/numpy/np_tensordot_op-inl.h#L420-L428
Since there is only a single workspace storage in MXNet, this means that it is possible for the snippet linked above to have its intermediate values overwritten when ReduceAxesComputeImpl requests the same workspace.
@szha @leezu
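To illustrate the failure mode, a simplified sketch of the pattern described above (not a verbatim copy of the linked tensordot code; `n`, `intermediate_blob`, `reduced_shape` and the template arguments are placeholders):

```cpp
// Simplified illustration of the problematic pattern.
mshadow::Stream<xpu> *s = ctx.get_stream<xpu>();

// First request: scratch space for an intermediate result.
mshadow::Tensor<xpu, 1, DType> intermediate =
    ctx.requested[0].get_space_typed<xpu, 1, DType>(mshadow::Shape1(n), s);
// ... fill `intermediate` with partial results ...

// ReduceAxesComputeImpl internally requests ctx.requested[0] again. Since there
// is only one workspace per operator, it gets the same underlying storage, so
// the data written to `intermediate` above can be clobbered before it is used.
ReduceAxesComputeImpl<xpu, mshadow_op::sum, false>(
    ctx, {intermediate_blob}, req, outputs, reduced_shape);
```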