
[RFC] Refactor CPUFunction and InterpreterFunction to remove per-run state #2274


Merged: 1 commit merged into pytorch:master from the compileFunc branch on Jan 23, 2019

Conversation

@nickgg (Contributor) commented Jan 16, 2019

Description: To support running a compiled function multiple times, particularly concurrently on different devices, we must remove per-run state from the CompiledFunction.

This is a suggested solution for the CPU and Interpreter backend:

  • For the CPUFunction: we stored the contiguous buffers for activations and weights in the function itself, so we could only have one run at a time. These buffers need to live only as long as the execution, so they have moved into the scope of the execute() method. This also means delaying the filling of those buffers until execute().
  • For the InterpreterFunction: we use various Tensor objects rather than a single weights/activations buffer. These Tensors were stored in the InterpreterFunction itself, which meant concurrent runs would overwrite each other's intermediate values. The sensible place for these is the Context, so I've added the ability to store Tensors keyed by name to the Context and removed the three Tensor maps from InterpreterFunction (see the sketch after this list). This is a general interface, but I'm not sure if it will be useful outside of the interpreter.
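
For illustration, a minimal sketch of the name-keyed tensor storage idea; the method and member names below are hypothetical and simplified, not the exact API added in this PR:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

class Tensor {}; // Stand-in for glow::Tensor, only to keep this sketch self-contained.

// Hypothetical sketch: the intermediate tensors for one run live in the
// Context, keyed by name, so concurrent runs of the same InterpreterFunction
// never overwrite each other's intermediate values.
class Context {
public:
  /// Take ownership of \p tensor and store it under \p name.
  void insertIntermediateTensor(const std::string &name, Tensor *tensor) {
    intermediates_[name].reset(tensor);
  }

  /// Return the tensor registered under \p name, or nullptr if absent.
  Tensor *getIntermediateTensor(const std::string &name) const {
    auto it = intermediates_.find(name);
    return it == intermediates_.end() ? nullptr : it->second.get();
  }

private:
  /// Tensors owned by this run, keyed by symbol name.
  std::unordered_map<std::string, std::unique_ptr<Tensor>> intermediates_;
};
```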

A side effect of these changes is that the non-execute() members of CompiledFunction (setupRuns/tearDownRuns and beforeRun/afterRun) become empty. The multi-stage execution flow is inherently stateful, so I think we should remove those members for all backends and move their logic into the various DeviceManagers.

Testing: Unit tests in debug, release & asan.
Documentation: Will need to update, but I'm interested in people's thoughts.

}
auto symbolInfo = runtimeBundle_.getSymbolInfo(v);
auto addr = runtimeBundle_.getConstants() + symbolInfo.offset;
auto tensor = new Tensor(addr, &symbolInfo.type);
Contributor

Maybe add a comment that this creates an unowned tensor.
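For example, the suggested comment could read something like this (illustrative wording only, not what ultimately landed):

```cpp
// Create an unowned Tensor that aliases the constant's memory inside the
// RuntimeBundle; the Tensor must not free this memory when destroyed.
auto tensor = new Tensor(addr, &symbolInfo.type);
```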

@nickgg (Contributor Author) commented Jan 18, 2019

Refactored based on @opti-mix's suggestion.

@bertmaher (Contributor) left a comment

I think separating the execution state from the CompiledFunction is the right direction. I've one high-level question before I get too deep into the review though: the original intent of "setupRuns" was to prepare the device for execution of a particular model (loading the code/weights, etc.). We don't want to do that stuff on every execute, so where should that happen with this approach?

@nickgg (Contributor Author) commented Jan 18, 2019

@bertmaher That device preparation stuff should happen in the DeviceManager, I think. E.g. for the case of moving constants to the device the DeviceManager should do it in the addNetwork() call.
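Very roughly, that division of responsibility could look like the sketch below; the class and method shapes are illustrative and simplified, not the actual DeviceManager interface:

```cpp
#include <vector>

class Module;           // Stand-ins for the real Glow types, only to keep
class CompiledFunction; // this sketch self-contained.

// Illustrative sketch: one-time, per-network setup (e.g. moving constants to
// the device) happens when the network is added to a DeviceManager, not on
// every execute() call.
class DeviceManager {
public:
  /// Called once per compiled network; does the work that used to live in
  /// CompiledFunction::setupRuns().
  void addNetwork(Module *module, CompiledFunction *func) {
    copyConstantsToDevice(func); // one-time weight/constant transfer
    functions_.push_back(func);
    (void)module;
  }

private:
  void copyConstantsToDevice(CompiledFunction *func) {
    // Allocate device memory for the weights and copy the constants over.
    // (Backend-specific; elided here.)
    (void)func;
  }

  std::vector<CompiledFunction *> functions_;
};
```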

@nickgg (Contributor Author) commented Jan 18, 2019

Worth calling out that this PR changes the CompiledFunction interface by adding a Context * argument to execute(). We'll need to update all backends.
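
In short, the per-run state now travels through the call itself; the base interface looks roughly like this (other members elided):

```cpp
class Context; // Holds the Placeholder-to-tensor mapping for one run.

class CompiledFunction {
public:
  virtual ~CompiledFunction() = default;
  /// Execute the network; \p ctx provides the mapping between Placeholders
  /// and the tensors that back them for this particular run.
  virtual void execute(Context *ctx) = 0;
};
```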

@nickgg nickgg force-pushed the compileFunc branch 2 times, most recently from 15cc01c to 2e6a7cd on January 22, 2019 at 21:46
@nickgg (Contributor Author) commented Jan 22, 2019

Can I get a review on this? I've got some follow-ups piling up.


updatePlaceholders(ctx, baseMutableWeightVarsAddress);

alignedFree(baseMutableWeightVarsAddress);
Contributor

This should be fine for now (alloc and dealloc on every inference request), but technically we could store this in a thread-local and reuse the buffers.

Contributor Author

Yeah, if we run into CPU backend perf concerns we should add a memory pool here.
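
For reference, the thread-local reuse idea mentioned above could look roughly like this. It is purely illustrative and not part of this PR; the alignment-helper names and signatures are assumed (the PR only shows alignedFree being used), and the buffer is intentionally never released, for simplicity:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed helpers with assumed signatures; alignedFree appears in this PR.
void *alignedAlloc(size_t size, size_t align);
void alignedFree(void *ptr);

// Illustrative sketch: reuse a per-thread scratch buffer for the mutable
// weights block instead of allocating and freeing on every inference.
// The buffer grows as needed and is never released here (a simplification).
static uint8_t *getScratchBuffer(size_t bytes, size_t alignment) {
  thread_local uint8_t *buffer = nullptr;
  thread_local size_t capacity = 0;
  if (bytes > capacity) {
    if (buffer) {
      alignedFree(buffer);
    }
    buffer = static_cast<uint8_t *>(alignedAlloc(bytes, alignment));
    capacity = bytes;
  }
  return buffer;
}
```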

@@ -38,21 +38,21 @@ class CompiledFunction {
virtual ~CompiledFunction() = default;
/// Execute the network and allocate Placeholder memory with given
/// \p ctx providing mapping between Placeholder and populated tensor.
Contributor

magic :) the comment about ctx was already in place

///@}
private:
/// Load constant tensors from \p ctx into \p weightsAddress, as defined by
Contributor

hm, I'm a bit confused. Placeholders are for inputs/outputs but not for constant tensors.

Contributor Author

I'm maintaining existing behaviour of CPUFunction in this diff, which is that all constants & placeholders have their space allocated in the RuntimeBundle and then we copy them into the per-run memory block for execution. The memory should be uninitialized so we don't need to memcpy it, but figured we could fix that when we get to it. It is a known issue with the RuntimeBundle.

@@ -22,29 +22,26 @@

#include "llvm/Support/Casting.h"

#include "llvm/Support/raw_ostream.h"
Contributor

not used.

@@ -621,7 +621,9 @@ static void topK(Tensor &outW, Tensor &indW, Tensor &inW, size_t k) {
}
}

void OpenCLFunction::execute() {
void OpenCLFunction::execute(Context *ctx) {
Contributor

you can just do execute(Context *) and remove the (void)ctx.

Contributor Author

heh, @opti-mix has a comment asking for the reverse above. Personally, not worried either way.

Contributor

ok, does not matter indeed.
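
For reference, the two styles being discussed, shown against a stub declaration just to keep the snippet self-contained (only one of them would appear in the actual source):

```cpp
class Context;

class OpenCLFunction {
public:
  void execute(Context *ctx);
};

// Alternative A: omit the parameter name entirely, so nothing is "unused".
//   void OpenCLFunction::execute(Context *) { /* ... */ }

// Alternative B: keep the name and explicitly discard it.
void OpenCLFunction::execute(Context *ctx) {
  (void)ctx; // The OpenCL backend does not read from the Context yet.
}
```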

alignedFree(baseMutableWeightVarsAddress_);
baseMutableWeightVarsAddress_ = nullptr;
void CPUFunction::execute(Context *ctx) {
/// Base address for Activations memory block.
Contributor

I'd remove the comment; it doesn't provide any additional info on top of the var name.

@@ -96,7 +96,7 @@ class OpenCLFunction final : public CompiledFunction {
///@{
~OpenCLFunction() override;

void execute() override;
void execute(Context *ctx) override;
///@}
Contributor

not related to PR but ///@? needs to be closed after void tearDownRuns() override;

@nickgg nickgg merged commit 73479f5 into pytorch:master Jan 23, 2019
@nickgg nickgg deleted the compileFunc branch January 23, 2019 17:58
@nickgg (Contributor Author) commented Jan 23, 2019

ah damn I pushed but didn't add the changes, i'll get em in the next one

@rdzhabarov (Contributor)

> ah damn I pushed but didn't add the changes, i'll get em in the next one

sounds good
