-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modified Vector STL bind initialization from a buffer type with optimization for simple arrays #2298
Conversation
…ization for simple arrays
Do you have any numbers to show that this makes things actually faster? I wonder how much a smart compiler could already optimize this case. |
I ran it with a large (600k) Python array of bytes (unsigned char) and observed a speedup in the conversion of approximately 10x on my machine when compiled with Apple clang 11. |
Can you post the code, so we can reproduce the benchmarks? |
That does sound interesting. |
Just using the test files:
|
On my machine change is not that dramatic, it's "only" factor of 3.6 which is still quite good (ubuntu + gcc-8.3). |
include/pybind11/stl_bind.h
Outdated
if (step == 1) { | ||
auto vec = std::unique_ptr<Vector>(new Vector(p, end)); | ||
return vec.release(); | ||
} | ||
else { | ||
auto vec = std::unique_ptr<Vector>(new Vector()); | ||
vec->reserve((size_t) info.shape[0]); | ||
for (; p != end; p += step) | ||
vec->push_back(*p); | ||
return vec.release(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: can we get rid of std::unqiue_ptr
? I'm not aware of any performance penalties of return-by-value alternative which looks much cleaner.
if (step == 1) { | |
auto vec = std::unique_ptr<Vector>(new Vector(p, end)); | |
return vec.release(); | |
} | |
else { | |
auto vec = std::unique_ptr<Vector>(new Vector()); | |
vec->reserve((size_t) info.shape[0]); | |
for (; p != end; p += step) | |
vec->push_back(*p); | |
return vec.release(); | |
} | |
if (step == 1) { | |
return Vector(p, end); | |
} | |
else { | |
Vector vec; | |
vec.reserve((size_t) info.shape[0]); | |
for (; p != end; p += step) | |
vec.push_back(*p); | |
return vec; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any issues with doing that, performance looks the same to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely looks cleaner, but there's a possibility that only one of the two branches will have return value optimization applied. In theory, it should be possible to apply RVO to both branches, but I don't think compilers do that (yet). Can we check what are the compilers actually doing?
I mean, we're getting a pretty nice performance improvement, but maybe copying 6'000'000 objects would negate the performance improvements.
Reading the source some more, the first branch would return a temporary and RVO there is guaranteed. Considering that the size of the vector "blueprints" is constant (3 pointers), it might work better than I expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if no RVO is involved the result should be moved with no performance hit, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. It will definitely be possible to move the vector, instead of copying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_unique
is just syntax sugar, no? In comparison, make_shared
can IIRC avoid a memory allocation compared to the naive initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_unique
is just syntax sugar, no? In comparison,make_shared
can IIRC avoid a memory allocation compared to the naive initialization.
I think it is, yes. But it's shorter (doesn't repeat the type) and makes things feel more atomic (whereas unique_ptr<T>(new T())
allows you to be tempted to split up the new
and the construction of unique_ptr
, which could result in leaks on exceptions, etc).
But it's C++14, so nevermind all that :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I was the one bringing up RVO and whether it works...
; GCC
.L32:
mov rax, QWORD PTR [rsp]
mov QWORD PTR [r13+0], rax
mov rax, QWORD PTR [rsp+8]
mov QWORD PTR [r13+8], rax
mov rax, QWORD PTR [rsp+16]
mov QWORD PTR [r13+16], rax
.L24:
mov rax, r13
In other words, after populating the vector, we move 3 words, from the top of the stack into R13
and return a little later. This is what a move looks like. At O3 it will be further optimized to 128bit move + 64bit move, instead of 3 x 64bit moves.
; Clang
mov rax, r14
You guessed it. RVO works here.
; MSVC
$LN3@to_vector:
mov rax, QWORD PTR vec$1[rsp]
mov QWORD PTR [rsi], rax
mov QWORD PTR [rsi+8], rcx
mov rax, QWORD PTR vec$1[rsp+16]
mov QWORD PTR [rsi+16], rax
$LN36@to_vector:
mov rax, rsi
This is the exact same thing as GCC, just with different registers. Except that I couldn't get MSVC to use XMM0 and perform a 128bit move.
The branch with the temporary gets optimized to a memcpy
and RVO is used. The assembly is just beautiful in this case.
so this could be slower if the move-constructor is costly
True, but we're talking about a std::vector
-like container. You're not moving the T
s. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whooooa, great investigation, @bstaletic! :-D
True, but we're talking about a
std::vector
-like container. You're not moving theT
s. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?
Probably not; I'm just thinking of some custom thing, but then it might be the user's own problem if they can't implement a decent move constructor.
On the other hand, I'm wondering that if it doesn't matter, why not just keep it like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, I'm wondering that if it doesn't matter, why not just keep it like this?
Because readability and maintainability. It's easier to reason about the code without needless indirections.
This looks like a reasonable change to me. @YannickJadoul, feel free to merge once you are happy with it. (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.) |
I just checked, and there's no test that tests the branch with |
Was just getting back to looking at this, I think a test for this case exists in
I do not think it is possible to test this branch using the |
What exactly do you mean? Isn't it a Maybe I misunderstand what you actually mean, but numpy arrays are not copied:
|
Sorry for being unclear! My understanding is that Python standard library I can add an additional test that makes it more clear what is happening, but my question was whether or not there is something in the standard library that would also allow testing this branch with non-numpy buffers. |
OK, yes, good point. Sorry, I had missed those tests, but yes, numpy arrays ought to be row-major, so
Alright, I misunderstood that question! This seems like it might work, though?
(I didn't immediately know, either, but |
Thanks for the hint! I implemented some basic tests for step > 1 using numpy arrays and memoryview. |
92464b2
to
1fcc435
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, sorry, completely forgot about this one, @marc-chiesa!
Looks good to me. Let's merge once everything is green!
Merged! Thanks again, @marc-chiesa! |
Added a check in the initialization of a bound vector from a buffer type to perform fast initialization if the buffer is a simple contiguous array.