-
Notifications
You must be signed in to change notification settings - Fork 699
[RFC] Refactor CPUFunction and InterpreterFunction to remove per-run state #2274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
auto symbolInfo = runtimeBundle_.getSymbolInfo(v); | ||
auto addr = runtimeBundle_.getConstants() + symbolInfo.offset; | ||
auto tensor = new Tensor(addr, &symbolInfo.type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment that this creates an unowned tensor.
Refactored based on @opti-mix's suggestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think separating the execution state from the CompiledFunction is the right direction. I've one high-level question before I get too deep into the review though: the original intent of "setupRuns" was to prepare the device for execution of a particular model (loading the code/weights, etc.). We don't want to do that stuff on every execute
, so where should that happen with this approach?
@bertmaher That device preparation stuff should happen in the DeviceManager, I think. E.g. for the case of moving constants to the device the DeviceManager should do it in the |
Worth calling out that this PR changes the |
15cc01c
to
2e6a7cd
Compare
Can I get a review on this? I've got some follow ups piling up. |
|
||
updatePlaceholders(ctx, baseMutableWeightVarsAddress); | ||
|
||
alignedFree(baseMutableWeightVarsAddress); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be fine for now (alloc and dealloc every inference request). But technically we could store this in thread local and reuse buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if we run into CPU backend perf concerns we should add a memory pool here.
@@ -38,21 +38,21 @@ class CompiledFunction { | |||
virtual ~CompiledFunction() = default; | |||
/// Execute the network and allocate Placeholder memory with given | |||
/// \p ctx providing mapping between Placeholder and populated tensor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic :) comment about ctx was already in place
///@} | ||
private: | ||
/// Load constant tensors from \p ctx into \p weightsAddress, as defined by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, I'm a bit confused. Placeholders are for inputs/outputs but not for constant tensors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm maintaining existing behaviour of CPUFunction in this diff, which is that all constants & placeholders have their space allocated in the RuntimeBundle and then we copy them into the per-run memory block for execution. The memory should be uninitialized so we don't need to memcpy it, but figured we could fix that when we get to it. It is a known issue with the RuntimeBundle.
@@ -22,29 +22,26 @@ | |||
|
|||
#include "llvm/Support/Casting.h" | |||
|
|||
#include "llvm/Support/raw_ostream.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used.
@@ -621,7 +621,9 @@ static void topK(Tensor &outW, Tensor &indW, Tensor &inW, size_t k) { | |||
} | |||
} | |||
|
|||
void OpenCLFunction::execute() { | |||
void OpenCLFunction::execute(Context *ctx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do: execute (Context *) and remove (void)ctx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh, @opti-mix has a comment asking for the reverse above. Personally, not worried either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, does not matter indeed.
alignedFree(baseMutableWeightVarsAddress_); | ||
baseMutableWeightVarsAddress_ = nullptr; | ||
void CPUFunction::execute(Context *ctx) { | ||
/// Base address for Activations memory block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd remove comment, does not provide any additional info on top of the var name.
@@ -96,7 +96,7 @@ class OpenCLFunction final : public CompiledFunction { | |||
///@{ | |||
~OpenCLFunction() override; | |||
|
|||
void execute() override; | |||
void execute(Context *ctx) override; | |||
///@} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to PR but ///@? needs to be closed after void tearDownRuns() override;
ah damn I pushed but didn't add the changes, i'll get em in the next one |
sounds good |
Description: To support running a compiled function multiple times, particularly concurrently on different devices, we must remove per-run state from the CompiledFunction.
This is a suggested solution for the CPU and Interpreter backend:
The side effect of these changes is to make the non execute() members of CompiledFunction (setupRuns/tearDownRuns and beforeRun/afterRun) empty. The multi-stage execution flow is inherently stateful so I think we should remove them for all backends and move their logic to the various DeviceManagers.
Testing: Unit tests in debug, release & asan.
Documentation: Will need to update, but interested in peoples thoughts.