Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified Vector STL bind initialization from a buffer type with optimization for simple arrays #2298

Merged
merged 4 commits into from
Aug 13, 2020
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 11 additions & 5 deletions include/pybind11/stl_bind.h
Original file line number Diff line number Diff line change
Expand Up @@ -397,14 +397,20 @@ vector_buffer(Class_& cl) {
if (!detail::compare_buffer_info<T>::compare(info) || (ssize_t) sizeof(T) != info.itemsize)
throw type_error("Format mismatch (Python: " + info.format + " C++: " + format_descriptor<T>::format() + ")");

auto vec = std::unique_ptr<Vector>(new Vector());
vec->reserve((size_t) info.shape[0]);
T *p = static_cast<T*>(info.ptr);
ssize_t step = info.strides[0] / static_cast<ssize_t>(sizeof(T));
T *end = p + info.shape[0] * step;
for (; p != end; p += step)
vec->push_back(*p);
return vec.release();
if (step == 1) {
auto vec = std::unique_ptr<Vector>(new Vector(p, end));
return vec.release();
}
else {
auto vec = std::unique_ptr<Vector>(new Vector());
vec->reserve((size_t) info.shape[0]);
for (; p != end; p += step)
vec->push_back(*p);
return vec.release();
}
Copy link
Contributor

@sizmailov sizmailov Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: can we get rid of std::unqiue_ptr? I'm not aware of any performance penalties of return-by-value alternative which looks much cleaner.

Suggested change
if (step == 1) {
auto vec = std::unique_ptr<Vector>(new Vector(p, end));
return vec.release();
}
else {
auto vec = std::unique_ptr<Vector>(new Vector());
vec->reserve((size_t) info.shape[0]);
for (; p != end; p += step)
vec->push_back(*p);
return vec.release();
}
if (step == 1) {
return Vector(p, end);
}
else {
Vector vec;
vec.reserve((size_t) info.shape[0]);
for (; p != end; p += step)
vec.push_back(*p);
return vec;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any issues with doing that, performance looks the same to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely looks cleaner, but there's a possibility that only one of the two branches will have return value optimization applied. In theory, it should be possible to apply RVO to both branches, but I don't think compilers do that (yet). Can we check what are the compilers actually doing?

I mean, we're getting a pretty nice performance improvement, but maybe copying 6'000'000 objects would negate the performance improvements.

Reading the source some more, the first branch would return a temporary and RVO there is guaranteed. Considering that the size of the vector "blueprints" is constant (3 pointers), it might work better than I expect.

Copy link
Contributor

@sizmailov sizmailov Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if no RVO is involved the result should be moved with no performance hit, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It will definitely be possible to move the vector, instead of copying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_unique is just syntax sugar, no? In comparison, make_shared can IIRC avoid a memory allocation compared to the naive initialization.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_unique is just syntax sugar, no? In comparison, make_shared can IIRC avoid a memory allocation compared to the naive initialization.

I think it is, yes. But it's shorter (doesn't repeat the type) and makes things feel more atomic (whereas unique_ptr<T>(new T()) allows you to be tempted to split up the new and the construction of unique_ptr, which could result in leaks on exceptions, etc).

But it's C++14, so nevermind all that :-D

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I was the one bringing up RVO and whether it works...

; GCC
.L32:
        mov     rax, QWORD PTR [rsp]
        mov     QWORD PTR [r13+0], rax
        mov     rax, QWORD PTR [rsp+8]
        mov     QWORD PTR [r13+8], rax
        mov     rax, QWORD PTR [rsp+16]
        mov     QWORD PTR [r13+16], rax
.L24:
        mov     rax, r13

In other words, after populating the vector, we move 3 words, from the top of the stack into R13 and return a little later. This is what a move looks like. At O3 it will be further optimized to 128bit move + 64bit move, instead of 3 x 64bit moves.

; Clang
        mov     rax, r14

You guessed it. RVO works here.

; MSVC
$LN3@to_vector:
        mov     rax, QWORD PTR vec$1[rsp]
        mov     QWORD PTR [rsi], rax
        mov     QWORD PTR [rsi+8], rcx
        mov     rax, QWORD PTR vec$1[rsp+16]
        mov     QWORD PTR [rsi+16], rax
$LN36@to_vector:
        mov     rax, rsi

This is the exact same thing as GCC, just with different registers. Except that I couldn't get MSVC to use XMM0 and perform a 128bit move.

 

The branch with the temporary gets optimized to a memcpy and RVO is used. The assembly is just beautiful in this case.

 

@YannickJadoul

so this could be slower if the move-constructor is costly

True, but we're talking about a std::vector-like container. You're not moving the Ts. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whooooa, great investigation, @bstaletic! :-D

True, but we're talking about a std::vector-like container. You're not moving the Ts. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?

Probably not; I'm just thinking of some custom thing, but then it might be the user's own problem if they can't implement a decent move constructor.
On the other hand, I'm wondering that if it doesn't matter, why not just keep it like this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, I'm wondering that if it doesn't matter, why not just keep it like this?

Because readability and maintainability. It's easier to reason about the code without needless indirections.

}));

return;
Expand Down