This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Wrong handling of workspace in multiple numpy operators #19458

Open

ptrendx opened this issue Oct 30, 2020 · 20 comments

Comments

@ptrendx
Member

ptrendx commented Oct 30, 2020

Description

The backward pass of the tensordot operator (both the regular and the integer version) uses the workspace to store an intermediate value, but then calls ReduceAxesComputeImpl, which also uses the workspace. See e.g. here:
https://github.com/apache/incubator-mxnet/blob/bd55002/src/operator/numpy/np_tensordot_op-inl.h#L420-L428

Since there is only a single workspace storage in MXNet, it is possible for the snippet linked above to

  • deallocate dtypespace (if the workspace needs to be reallocated to fit the workspace required by the reduce), leading to a crash
  • use the same pointer as both input and workspace, leading to silently wrong results
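The failure mode can be sketched in a few lines. This is a toy model in Python (MXNet itself is C++, and `TempSpacePool` is a made-up stand-in, not MXNet code): a single grow-only buffer handed out to every caller reproduces both hazards above.

```python
class TempSpacePool:
    """Toy single-slot temp-space pool: every caller shares one buffer,
    and growing it discards the old storage (loosely mimicking a
    per-context workspace)."""
    def __init__(self):
        self.buf = None

    def request(self, nbytes):
        if self.buf is None or len(self.buf) < nbytes:
            self.buf = bytearray(nbytes)  # old buffer is discarded
        return self.buf

pool = TempSpacePool()
intermediate = pool.request(64)        # operator stashes an intermediate here
intermediate[:4] = b"data"
reduce_ws = pool.request(64)           # a callee (e.g. a reduce impl) asks again
aliased = reduce_ws is intermediate    # same size -> same storage: silent aliasing
bigger = pool.request(4096)            # larger request -> reallocation
invalidated = bigger is not intermediate  # the old view is now stale
```

In Python the stale view merely stops tracking the pool; in C++ the analogous pointer is freed memory, hence the crash.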

@szha @leezu

@szha
Member

szha commented Oct 31, 2020

Good catch. It should be refactored to use a stateful op instead.

@sxjscience
Member

@hzfan @reminisce @haojin2 I think we had a discussion about multiple workspaces before. Could you help provide some context on this?

@ptrendx
Member Author

ptrendx commented Nov 2, 2020

@szha I don't think a stateful op is needed there; just allocate a larger workspace that can hold both the intermediate result and the workspace for the reduction.
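The fix ptrendx describes can be sketched as follows (Python stand-in, not MXNet code; `request` plays the role of the context's single grow-only workspace, and the sizes are made up): request the combined size once, then slice non-overlapping views out of it.

```python
def split_workspace(request, intermediate_bytes, reduce_bytes):
    """One request for the combined size, then non-overlapping views:
    the callee's scratch space can no longer invalidate or alias the
    caller's intermediate."""
    ws = request(intermediate_bytes + reduce_bytes)
    view = memoryview(ws)
    intermediate = view[:intermediate_bytes]
    reduce_ws = view[intermediate_bytes:intermediate_bytes + reduce_bytes]
    return intermediate, reduce_ws

_buf = bytearray(0)
def request(nbytes):
    # stand-in for the context's single grow-only workspace
    global _buf
    if len(_buf) < nbytes:
        _buf = bytearray(nbytes)
    return _buf

intermediate, reduce_ws = split_workspace(request, 64, 128)
intermediate[:] = b"\xff" * 64                  # writing one slice...
untouched = bytes(reduce_ws) == b"\x00" * 128   # ...leaves the other intact
```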

@ptrendx ptrendx changed the title Wrong handling of workspace in backward of numpy tensordot operator Wrong handling of workspace in multiple numpy operators Nov 2, 2020
@ptrendx
Member Author

ptrendx commented Nov 2, 2020

There is a lot of copy-paste in the implementation of the various numpy ops - they should all be reviewed. Maybe we could make the test pipeline emit a warning if someone calls ctx.requested[0].get_space_typed more than once in a given operator, and review those usages?

@sxjscience
Member

I raised this concern to @reminisce and @haojin2 before, but there were some discussions suggesting that we can have multiple workspaces.

@haojin2
Contributor

haojin2 commented Nov 2, 2020

@ptrendx Please refer to the new impl for numpy, which uses a pre-allocated workspace to avoid this problem, at: https://github.com/apache/incubator-mxnet/blob/master/src/operator/numpy/np_broadcast_reduce_op.h#L666-L694. The MomentsForwardImpl you posted is not serving any numpy operator.

@ptrendx
Member Author

ptrendx commented Nov 2, 2020

@haojin2 Sure (I'm actually folding that implementation back into the original for code reuse, but anyway), but this function is not used in the examples I showed.

I did see multiple examples of this, including some that I am fixing in my #19426 PR (like this one in numpy kron: https://github.com/apache/incubator-mxnet/blob/master/src/operator/numpy/np_kron-inl.h#L229-L234). It just feels that whatever I fix in that PR is merely patching the things I noticed instead of making sure the issue is fully fixed. That is the purpose of this issue.

@sxjscience
Member

I've checked the implementation; the way to solve the problem is to create one large workspace: pre-compute the size of the workspace required by the reduce and then slice the large workspace.

Example code in LayerNorm: https://github.com/apache/incubator-mxnet/blob/33d94f1d59335f504ed5b9a7b32f0e81a5d5da56/src/operator/nn/layer_norm-inl.h#L247-L271

@sxjscience
Member

sxjscience commented Nov 2, 2020

Discussed offline with @reminisce and he mentioned that there's the issue in #15732, in which 6 tests are disabled due to this.

@ptrendx
Member Author

ptrendx commented Nov 2, 2020

@sxjscience I don't think that is relevant - MXNET_GPU_TEMP_SPACE does not let you request multiple allocations. AFAIK, what it does is enable multiple workspace resources, so that more than one operator that needs a workspace can run at the same time (but each operator still gets only one workspace resource out of the pool).
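The distinction ptrendx draws can be sketched like this (illustrative Python, not MXNet's actual pooling code): the setting he mentions adds copies of the single-slot workspace so concurrent operators don't serialize on one buffer, but each operator still checks out exactly one resource.

```python
class WorkspacePool:
    """Hypothetical pool of `copies` single-slot workspaces; each
    running operator checks out exactly one."""
    def __init__(self, copies):
        self.free = [bytearray(16) for _ in range(copies)]

    def acquire(self):
        # one resource per running operator, or nothing if exhausted
        return self.free.pop() if self.free else None

    def release(self, ws):
        self.free.append(ws)

pool = WorkspacePool(copies=2)
op1_ws = pool.acquire()   # two operators may run concurrently...
op2_ws = pool.acquire()
op3_ws = pool.acquire()   # ...but a third gets nothing until a release
```

Note that nothing here lets a single operator hold two workspaces at once, which is why it doesn't address this issue.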

@sxjscience
Member

Okay, I was not aware of this. If that's the case, we definitely need to fix these operators.

@reminisce
Contributor

@ptrendx I believe @sxjscience was referring to a discussion that advocated developing operators that make a series of calls to other operators using multiple temp spaces. In fact, assigning multiple kTempSpace entries to an op's FResourceRequest attribute should allow you to request multiple allocations. It's just that this is disallowed by a hard-coded check here for historical reasons:
https://github.com/apache/incubator-mxnet/blob/33d94f1d59335f504ed5b9a7b32f0e81a5d5da56/src/imperative/imperative_utils.h#L368

@hzfan and I investigated the feasibility of lifting this restriction, but GPU CI could not pass at the time. We narrowed the possible root cause down to six unit tests in #15732. Unfortunately, the real culprit is still not identified. It would be great if someone could take this forward and get the problem resolved, so that a developer focusing on implementing one operator does not need full knowledge of their callees' implementation details, i.e. calculating a lump-sum workspace.

@ptrendx
Member Author

ptrendx commented Nov 2, 2020

I agree that the developer experience is lacking here. That said, using multiple workspaces to solve this is not great: you end up with memory fragmentation and potentially much higher overall memory consumption because of it.

I would argue that the proper way of handling the developer experience is better code structure (so that you do not take a random piece of another operator without thinking about its prerequisites) and possibly some API limitations. For example, in this case, maybe you should only be allowed to take the workspace from the context once, and potentially return it if you want to reset that counter (e.g. to resize the amount of workspace you need).

@reminisce
Contributor

reminisce commented Nov 3, 2020

@ptrendx If I understand your argument correctly, the mechanism of returning the workspace to the context and resetting the counter similarly leads to the potential problem of memory fragmentation. Also, resizing the temp space in a callee would require copying the data from the original buffer to the bigger one, and would consequently cause performance degradation. I believe it's a decision about the tradeoff between developer experience and ultimate performance, comparing the multi-temp-space solution with one lump-sum temp space.

Anyway, I'm just speaking from my past experience of operator development. It's up to the community to bring up some RFC to tackle this.

@ptrendx
Member Author

ptrendx commented Nov 3, 2020

No no, by returning the workspace I did not mean actually freeing the memory - just an indicator to the developer that whatever workspace they were using before is no longer valid; the actual handling of the workspace by the backend would stay the same.
Basically there are 2 scenarios here:

  1. Where a limit on the number of times workspace can be requested from the context is enough:
  • I asked for some memory
  • then I (or a callee I was not aware of) ask a second time
  • this fails -> I am now aware that there is a bug in my code and I fix it
  2. Where I actually need to be able to grow my workspace:
  • I asked for some memory for an auxiliary computation
  • based on that computation I now know how much memory I will really need (but I can't just ask for it now, because it would fail due to the limit)
  • I "return" my previous workspace, acknowledging that I will not be able to use that memory again, which resets the limit
  • I take a bigger slice (which is probably the same memory I had before)
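The two scenarios above can be sketched with a hypothetical take-once handle (Python stand-in; `GuardedWorkspace` and its methods are made up for illustration, not an MXNet API): a second request without an explicit release fails loudly, and release() only resets the counter while keeping the backing memory.

```python
class GuardedWorkspace:
    """Hypothetical take-once workspace handle: a second request without
    an explicit release fails; release() invalidates old views logically
    but keeps the backing memory (return = invalidate, not free)."""
    def __init__(self):
        self._buf = bytearray(0)
        self._taken = False

    def request(self, nbytes):
        if self._taken:
            raise RuntimeError("workspace already taken; release() first")
        if len(self._buf) < nbytes:
            self._buf = bytearray(nbytes)
        self._taken = True
        return self._buf

    def release(self):
        self._taken = False  # old views are now logically invalid

ws = GuardedWorkspace()
aux = ws.request(16)           # scenario 2: small auxiliary buffer first
try:
    ws.request(256)            # scenario 1: a hidden second take...
    double_take_caught = False
except RuntimeError:
    double_take_caught = True  # ...surfaces as a bug instead of corruption
ws.release()                   # acknowledge the old view is dead
big = ws.request(256)          # then grow to the size actually needed
```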

@szha
Member

szha commented Nov 3, 2020

I've checked the implementation, the way to solve the problem is to create a large workspace, pre-compute the size of the workspace required by reduce and then slice the large workspace:

Example code in LayerNorm:

https://github.com/apache/incubator-mxnet/blob/33d94f1d59335f504ed5b9a7b32f0e81a5d5da56/src/operator/nn/layer_norm-inl.h#L247-L271

Attention needs to be paid to memory alignment when the dtypes of the different chunks differ. It didn't matter in this example.
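The alignment point can be sketched as a small layout helper (illustrative Python; the alignment rule of rounding each chunk up to its element size is an assumption for the sketch, not MXNet's actual policy): when slicing one raw workspace into chunks of different dtypes, each chunk's byte offset must be rounded up to a suitable alignment.

```python
def align_up(offset, alignment):
    # round `offset` up to the next multiple of `alignment`
    return (offset + alignment - 1) // alignment * alignment

def layout(chunks):
    """chunks: list of (num_elements, itemsize_bytes).
    Returns per-chunk byte offsets and the total bytes to request,
    aligning each chunk to its element size."""
    offsets, off = [], 0
    for nelems, itemsize in chunks:
        off = align_up(off, itemsize)
        offsets.append(off)
        off += nelems * itemsize
    return offsets, off

# e.g. 3 fp16 elements followed by 5 fp64 elements in one raw workspace:
# the fp64 chunk cannot start at byte 6, so it is pushed to byte 8.
offsets, total = layout([(3, 2), (5, 8)])
```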

@ptrendx
Member Author

ptrendx commented Nov 3, 2020

Yes, I think we should add an overload of get_space_typed that takes "multiple" workspaces (as in, still a single allocation, but with all the alignment etc. taken care of for the operator developer).

@sxjscience
Member

@ptrendx
Member Author

ptrendx commented Nov 3, 2020

Yup, that padding logic should be handled by the call that hands out the workspace, not by each operator separately.
