
2.0 migration guide? #20109

Open
matteosal opened this issue Mar 31, 2021 · 6 comments

@matteosal
Contributor

Is there a resource or set of resources for migrating a project to v2.0? I am specifically interested in the C API. I have checked #17676, but it isn't very specific apart from the removal/renaming of the *Ex functions.

For example, everything related to executors, on which our code relies heavily, is gone in v2.0. PR #18598 mentions "we will use cached_op_create/invoke in favor of graph executor c APIs", but I couldn't find detailed information on how to migrate code from the executor API to cached ops.

@szha
Member

szha commented Mar 31, 2021

Thanks for bringing this up. Yes, I think we definitely need a migration guide.

@matteosal
Contributor Author

matteosal commented Apr 2, 2021

OK, I went looking at how the Python wrappers (symbol.py and executor.py) were adapted to work with the cached op API, and I'm starting to get a picture of the current situation. Some questions:

  • One major difference is that, while the old C executor required arrays at construction time, the cached op requires them at evaluation time (MXInvokeCachedOp). The cached op also has a flag called static_alloc. This suggests that when this flag is activated the graph is actually "compiled" and shipped to the execution device by MXInvokeCachedOp, similarly to what MXExecutorBind used to do (except that MXInvokeCachedOp also immediately runs the graph). Is this reasoning correct? (See the sketch after this list for how I picture the new flow.)
  • Following the reasoning of the previous point, calling MXInvokeCachedOp twice in a row specifying CPU and GPU devices (with static_alloc = true) will allocate the graph on both system memory and GPU memory. Is there a way to selectively free CPU or GPU memory?
  • Again on the same reasoning, what happens if MXInvokeCachedOp is called multiple times with NDArrays of different shapes? Will the arrays share memory?
  • What is the purpose of the forward_bulk_size and backward_bulk_size flags?
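
For reference, here is roughly how I picture the new flow on the Python side after reading symbol.py. This is only a sketch of my understanding: I'm assuming that mx.nd.CachedOp is the frontend wrapper over MXCreateCachedOp/MXInvokeCachedOp and that the flags are passed as key/value string pairs, as they appear to be in the frontend code.

import mxnet as mx

# Toy graph with two inputs: convolution followed by relu
data = mx.sym.Variable('in')
w = mx.sym.Variable('w')
sym = mx.sym.relu(mx.sym.Convolution(
	data, w, num_filter=16, kernel=(3, 3), no_bias=True))

# v1.x: arrays were supplied at bind time and the executor owned the allocations.
# v2.0 (as I understand it): create the cached op once with its flags, and
# supply the arrays at every invocation (MXInvokeCachedOp under the hood).
op = mx.nd.CachedOp(sym, flags=[('static_alloc', 'true')])

args = [mx.nd.ones([64, 3, 300, 300]), mx.nd.ones([16, 3, 3, 3])]
out = op(*args)  # first call allocates/"compiles", later calls reuse the setup
print(out.shape)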

@szha
Member

szha commented Apr 6, 2021

Hi @matteosal. Thanks for looking into this.

Is this reasoning correct?

Yes

Is there a way to selectively free CPU or GPU memory?

Not yet, though this is indeed a useful functionality.

what happens if MXInvokeCachedOp is called multiple times with NDArrays of different shapes? Will the arrays share memory?

When static_alloc is on and multiple calls happen with different shapes, the cached op will enlarge the buffers when necessary.

What is the purpose of flags forward_bulk_size and backward_bulk_size?

These are flags for bulk execution. When bulk execution happens with forward_bulk_size=x, x operators will be executed in one call to the engine (scheduler) instead of x separate calls, so that the scheduling overhead is amortized.
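
As a rough illustration (reusing the Python-side CachedOp wrapper assumed in the sketch above, with placeholder values), they are passed like any other cached op flag at creation time:

import mxnet as mx

sym = mx.sym.relu(mx.sym.Variable('in'))  # any symbol
op = mx.nd.CachedOp(sym, flags=[
	('static_alloc', 'true'),
	('forward_bulk_size', '15'),   # execute up to 15 ops per engine call in the forward pass
	('backward_bulk_size', '15'),  # same for the backward pass
])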

@matteosal
Contributor Author

Another question: in v1.x it was possible to query executors for their allocated memory:

import mxnet as mx

sym = mx.sym.Convolution(
	mx.sym.Variable('in'),
	mx.sym.Variable('w'),
	num_filter=16,
	kernel=(3, 3),
	no_bias=True
)
sym = mx.sym.relu(sym)
args = {'in': mx.nd.ones([64, 3, 300, 300]), 'w': mx.nd.ones([16, 3, 3, 3])}
ex = sym.bind(mx.cpu(), args)

print(ex.debug_str())
Symbol Outputs:
	output[0]=relu0(0)
Variable:in
Variable:w
--------------------
Op:Convolution, Name=convolution0
Inputs:
	arg[0]=in(0) version=0
	arg[1]=w(0) version=0
Attrs:
	kernel=(3, 3)
	no_bias=True
	num_filter=16
--------------------
Op:relu, Name=relu0
Inputs:
	arg[0]=convolution0(0)
Total 346 MB allocated
Total 11 TempSpace resource requested

Is there a way to get the equivalent of the line "Total 346 MB allocated" with cached ops?

@szha
Member

szha commented Apr 14, 2021

Not yet. It would be great to add that feature back.

@matteosal
Contributor Author

More on the cached op flags:

These are flags for bulk execution. When bulk execution happens with forward_bulk_size=x, x operators will be executed in one call to the engine (scheduler) instead of x separate calls, so that the scheduling overhead is amortized.

Since this is exposed as an option, I guess there must be a downside to running a large number of operators in one call. What is it? Increased memory usage? Is there a similar tradeoff for inline_limit?
