Wrong handling of workspace in multiple numpy operators #19458
Comments
Good catch.
@hzfan @reminisce @haojin2 I think we had a discussion about multiple workspaces before. Would you help provide some context on this?
@szha I don't think one would need a stateful op there; just allocate a larger workspace that can hold both the intermediate result and the workspace for the reduction.
Same problem in
There is a lot of copy-paste in the implementation of various numpy ops; they should be reviewed. Maybe we could make a test pipeline with a warning if someone tries to access
I raised this concern to @reminisce and @haojin2 before, but there were some discussions that mentioned that we can have multiple workspaces.
@ptrendx Please refer to the new numpy implementation, which uses a pre-allocated workspace to avoid this problem: https://github.com/apache/incubator-mxnet/blob/master/src/operator/numpy/np_broadcast_reduce_op.h#L666-L694. the
@haojin2 Sure (I'm actually folding that implementation back into the original for code reuse, but anyway), but this function is not used in the examples I showed. I did see multiple examples of this, including some that I am fixing in my #19426 PR (like this one in numpy kron: https://github.com/apache/incubator-mxnet/blob/master/src/operator/numpy/np_kron-inl.h#L229-L234). It just feels that whatever I fix in that PR is patching the things I noticed instead of making sure the issue is fully fixed. That is the purpose of this issue.
I've checked the implementation; the way to solve the problem is to create one large workspace: pre-compute the size of the workspace required by the reduce, and then slice the large workspace. Example code in LayerNorm: https://github.com/apache/incubator-mxnet/blob/33d94f1d59335f504ed5b9a7b32f0e81a5d5da56/src/operator/nn/layer_norm-inl.h#L247-L271
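For reference, a minimal sketch of that pattern, modeled on the linked LayerNorm code. The surrounding variables (`ctx`, `red_src_shape`, `red_dst_shape`, `intermediate_shape`, the `DType`/`xpu` template parameters) are assumed to be in scope inside the usual type/ndim switch macros, and the exact signatures of `ReduceWorkspaceSize` and `get_space_typed` may differ between MXNet versions:

```cpp
// Sketch only: modeled on the linked LayerNorm code; exact signatures of
// ReduceWorkspaceSize / get_space_typed may differ between MXNet versions.
using namespace mshadow;
Stream<xpu> *s = ctx.get_stream<xpu>();

// 1. Ask the reduce implementation how much scratch space it will need.
size_t reduce_workspace_bytes = 0;
BROADCAST_NDIM_SWITCH(red_dst_shape.ndim(), NDim, {
  reduce_workspace_bytes = broadcast::ReduceWorkspaceSize<NDim, DType>(
      s, red_dst_shape, kWriteTo, red_src_shape);
});

// 2. Add the space this operator needs for its own intermediate result.
const size_t intermediate_bytes = intermediate_shape.Size() * sizeof(DType);
const size_t total_bytes = intermediate_bytes + reduce_workspace_bytes;

// 3. Request the temp space from the context exactly once.
Tensor<xpu, 1, char> workspace =
    ctx.requested[0].get_space_typed<xpu, 1, char>(Shape1(total_bytes), s);

// 4. Slice it: the first chunk holds the intermediate data, the remainder is
//    handed to the reduce kernel, so the reduce never has to request the
//    (same) workspace again and clobber the intermediate values.
Tensor<xpu, 1, DType> intermediate(
    reinterpret_cast<DType *>(workspace.dptr_),
    Shape1(intermediate_shape.Size()), s);
Tensor<xpu, 1, char> reduce_workspace(
    workspace.dptr_ + intermediate_bytes,
    Shape1(reduce_workspace_bytes), s);
```

The reduce is then given `reduce_workspace` explicitly instead of grabbing the temp space from the context itself.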
I discussed this offline with @reminisce and he mentioned the issue in #15732, in which 6 tests are disabled due to this.
@sxjscience I don't think that is relevant -
Okay, I was not aware of this. If that's the case, we definitely need to fix these operators.
@ptrendx I believe @sxjscience was referring to something that advocated for supporting the development of operators that require a series of calls to other operators, each with its own temp space. In fact, assigning multiple temp spaces to one operator is currently restricted. @hzfan and I investigated the feasibility of lifting this restriction, but GPU CI could not pass at the time. We narrowed the possible root cause down to six unit tests in #15732. Unfortunately, the real culprit has still not been identified. It would be great if someone could take this forward and get the problem resolved, so that a developer focusing on implementing one operator does not need full knowledge of the callees' implementation details, i.e. calculating a lump-sum workspace.
I agree that the developer experience is lacking here. That said, using multiple workspaces to solve this is not great: you end up with memory fragmentation and potentially much higher overall memory consumption. I would argue that the proper way to handle the developer-experience problem is better code structure (so that you do not take a random piece of another operator without thinking about its prerequisites) and possibly some API limitations (e.g. in this case, maybe you should only be allowed to take the workspace from the context once, and be able to return it if you want a reset of that counter, e.g. to resize the amount of workspace you need).
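To make the idea concrete, here is a purely hypothetical sketch of such an API limitation; none of these names (`WorkspaceGuard`, `Take`, `Reset`) exist in MXNet, and this only illustrates the "take the workspace once, explicitly reset to invalidate it" idea:

```cpp
// Hypothetical sketch -- not an existing MXNet API.
class WorkspaceGuard {
 public:
  explicit WorkspaceGuard(const OpContext &ctx) : ctx_(ctx), taken_(false) {}

  // Hand out the temp space at most once per guard; a second request without
  // an explicit Reset() means the developer may be clobbering earlier contents.
  template <typename xpu>
  mshadow::Tensor<xpu, 1, char> Take(size_t bytes, mshadow::Stream<xpu> *s) {
    CHECK(!taken_) << "Workspace already taken; call Reset() if the previous "
                      "contents are no longer needed.";
    taken_ = true;
    return ctx_.requested[0].get_space_typed<xpu, 1, char>(
        mshadow::Shape1(bytes), s);
  }

  // Explicitly invalidate the previously taken workspace (no memory is freed;
  // the backend keeps handling the storage exactly as it does today).
  void Reset() { taken_ = false; }

 private:
  const OpContext &ctx_;
  bool taken_;
};
```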
@ptrendx If I understand your argument correctly, the mechanism of returning the workspace to the context and resetting the counter would similarly lead to the potential problem of memory fragmentation? Also, resetting the temp space in a callee would require copying the data from the original space to the bigger one and would consequently cause performance degradation. I believe it's a decision about the trade-off between developer experience and ultimate performance, comparing the multi-temp-space solution with one lump-sum temp space. Anyway, I'm just speaking from my past experience of operator development. It's up to the community to bring up an RFC to tackle this.
No no, by returning the workspace I did not mean actually freeing the memory, just an indicator to the developer that whatever workspace they were using before is no longer valid; the actual handling of the workspace by the backend would stay the same.
Attention needs to be paid to memory alignment when the dtypes of the different chunks differ. It didn't matter in this example.
Yes, I think we should add an overload to
One example of handling alignment (using PadBytes for padding): https://github.com/apache/incubator-mxnet/blob/ea222a355005fb3f13fe422fcab7caab53999dfd/src/operator/tensor/ordering_op-inl.h#L579-L617
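A condensed sketch of what that looks like when one char workspace is sliced into chunks of different dtypes. The padding helper is written out here for illustration and may not match the exact PadBytes in the linked file; `num_elements`, `DType`, `index_t`, `ctx` and `s` are assumed to be in scope:

```cpp
// Illustration only: round each chunk up so the next chunk starts on a
// boundary valid for every dtype involved.
inline size_t PadBytes(size_t num_bytes, size_t alignment) {
  // Round num_bytes up to the next multiple of alignment.
  return num_bytes + (alignment - num_bytes % alignment) % alignment;
}

// Align every chunk to the strictest requirement among the dtypes used.
const size_t alignment = std::max(sizeof(DType), sizeof(index_t));
const size_t idx_bytes = PadBytes(sizeof(index_t) * num_elements, alignment);
const size_t val_bytes = PadBytes(sizeof(DType) * num_elements, alignment);

mshadow::Tensor<xpu, 1, char> workspace =
    ctx.requested[0].get_space_typed<xpu, 1, char>(
        mshadow::Shape1(idx_bytes + val_bytes), s);

// Because idx_bytes was padded up to `alignment`, the DType chunk starts on a
// boundary that is also valid for DType.
index_t *indices = reinterpret_cast<index_t *>(workspace.dptr_);
DType *values = reinterpret_cast<DType *>(workspace.dptr_ + idx_bytes);
```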
Yup, that padding logic should be handled by the call that takes the workspace, not by each operator separately.
Description
The backward pass of the tensordot operator (both the regular and the integer version) uses the workspace to store intermediate values, but then calls ReduceAxesComputeImpl, which also uses the workspace. See e.g. here: https://github.com/apache/incubator-mxnet/blob/bd55002/src/operator/numpy/np_tensordot_op-inl.h#L420-L428
Since there is only a single workspace storage in MXNet, this means that it is possible for the snippet linked above to have its intermediate values overwritten when ReduceAxesComputeImpl requests the same workspace.
@szha @leezu
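To illustrate the failure mode, a simplified sketch of the pattern described above (not a verbatim copy of the linked tensordot code; `n`, `intermediate_blob`, `reduced_shape` and the template arguments are placeholders):

```cpp
// Simplified illustration of the problematic pattern.
mshadow::Stream<xpu> *s = ctx.get_stream<xpu>();

// First request: scratch space for an intermediate result.
mshadow::Tensor<xpu, 1, DType> intermediate =
    ctx.requested[0].get_space_typed<xpu, 1, DType>(mshadow::Shape1(n), s);
// ... fill `intermediate` with partial results ...

// ReduceAxesComputeImpl internally requests ctx.requested[0] again. Since there
// is only one workspace per operator, it gets the same underlying storage, so
// the data written to `intermediate` above can be clobbered before it is used.
ReduceAxesComputeImpl<xpu, mshadow_op::sum, false>(
    ctx, {intermediate_blob}, req, outputs, reduced_shape);
```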