Select function on CPU takes 10% of time on tiny students, can be optimized #684
I think the Element(...) function might be a good template for how to approach this. There I am using template meta-programming to unroll over N (a template parameter) dimensions. Computation of the current index is progressive, and therefore a lot cheaper than re-calculating it from the elemental dimensions at each step. https://github.com/marian-nmt/marian-dev/blob/master/src/tensors/cpu/element.h Using that approach, it should also be possible to write one solution for all possibilities.
If I'm allowed to assume indices is 1x1x...x(axis)x1x1... and consecutive memory, then:

```cpp
void Select(Tensor out,
            const Tensor in,
            const Tensor indices,
            int axis) {
  matchOrAbort<IndexType>(indices->type());

  functional::Shape outShape = out->shape();
  functional::Shape inShape = in->shape();
  functional::Shape idxShape = indices->shape();

  int axisCPU = (int)(axis + functional::Shape::size() - out->shape().size());

  // Reduce the problem to beforeAxis x idxShape[axisCPU] x afterAxis.
  int beforeAxis = 1;
  for (int i = 0; i < axisCPU; ++i) {
    beforeAxis *= outShape[i];
  }
  int afterAxis = 1;
  for (int i = axisCPU + 1; i < functional::Shape::size(); ++i) {
    afterAxis *= outShape[i];
  }

  // Stride to use for the beforeAxis dimension in the input and output tensors.
  int inBeforeStride = axisCPU ? inShape.stride(axisCPU - 1) : inShape.elements();
  int outBeforeStride = axisCPU ? outShape.stride(axisCPU - 1) : outShape.elements();

  for (int beforeIdx = 0; beforeIdx < beforeAxis; ++beforeIdx) {
    for (int axisIdx = 0; axisIdx < idxShape[axisCPU]; ++axisIdx) {
      // The index to read from along the axis.
      int index = indices->data<IndexType>()[axisIdx];
      auto inBase = in->data() + beforeIdx * inBeforeStride + index * afterAxis;
      auto outBase = out->data() + beforeIdx * outBeforeStride + axisIdx * afterAxis;
      std::copy(inBase, inBase + afterAxis, outBase);
    }
  }
}
```

In the profiler it reduces to:
What assumptions are allowed here?
Fixes #684

Measured: enes.student.tiny11, `xzcat sources.shuf.xz | head -n 10000`, var (Cascade Lake), single core, based on the intgemm_reintegrated_computestats branch.

Before: Total time: 66.69077s wall
After: Total time: 61.20206s wall
Just ran a profiler on the enes student from http://statmt.org/bergamot/models/ . This was in the intgemm_reintegrated_computestats branch, but the Select function is the same in master. That Select function is 10.79% of the time:
marian-dev/src/tensors/cpu/tensor_operators.cpp, line 692 in b28905a
And as the comments say, an optimized version is TODO. All of the calls in the student are axis=2. There was an attempt at an optimization for this case, but it has been disabled and the comments claim it doesn't work:
marian-dev/src/tensors/cpu/tensor_operators.cpp, lines 708 to 711 in b28905a
Also, in all the student calls, outShape != inShape, so indeed that version wouldn't be called.