
Modified Vector STL bind initialization from a buffer type with optimization for simple arrays #2298

Merged (4 commits into pybind:master) Aug 13, 2020

Conversation

@marc-chiesa (Contributor)

Added a check in the initialization of a bound vector from a buffer type to perform fast initialization if the buffer is a simple contiguous array.

@YannickJadoul (Collaborator) commented Jul 14, 2020

Do you have any numbers to show that this makes things actually faster? I wonder how much a smart compiler could already optimize this case.

@marc-chiesa (Contributor, Author)

I ran it with a large (600k) Python array of bytes (unsigned char) and observed a speedup in the conversion of approximately 10x on my machine when compiled with Apple clang 11.

@bstaletic (Collaborator)

Can you post the code, so we can reproduce the benchmarks?

@YannickJadoul (Collaborator)

> I ran it with a large (600k) Python array of bytes (unsigned char) and observed a speedup in the conversion of approximately 10x on my machine when compiled with Apple clang 11.

That does sound interesting.

@marc-chiesa (Contributor, Author) commented Jul 14, 2020

Just using the test files:

import timeit
import array
from pybind11_tests import stl_binders as m

print(timeit.timeit("m.VectorUChar(array.array('B', [25]) * 600000)",
                    number=1000, globals=globals()))

@sizmailov (Contributor)

On my machine the change is not that dramatic; it's "only" a factor of 3.6, which is still quite good (Ubuntu + gcc-8.3).

Comment on lines 403 to 413
if (step == 1) {
    auto vec = std::unique_ptr<Vector>(new Vector(p, end));
    return vec.release();
}
else {
    auto vec = std::unique_ptr<Vector>(new Vector());
    vec->reserve((size_t) info.shape[0]);
    for (; p != end; p += step)
        vec->push_back(*p);
    return vec.release();
}
@sizmailov (Contributor) commented Jul 14, 2020

Out of curiosity: can we get rid of std::unique_ptr? I'm not aware of any performance penalty of the return-by-value alternative, which looks much cleaner.

Suggested change (from):

if (step == 1) {
    auto vec = std::unique_ptr<Vector>(new Vector(p, end));
    return vec.release();
}
else {
    auto vec = std::unique_ptr<Vector>(new Vector());
    vec->reserve((size_t) info.shape[0]);
    for (; p != end; p += step)
        vec->push_back(*p);
    return vec.release();
}

(to):

if (step == 1) {
    return Vector(p, end);
}
else {
    Vector vec;
    vec.reserve((size_t) info.shape[0]);
    for (; p != end; p += step)
        vec.push_back(*p);
    return vec;
}

@marc-chiesa (Contributor, Author)

I don't see any issues with doing that, performance looks the same to me.

@bstaletic (Collaborator)

It definitely looks cleaner, but there's a possibility that only one of the two branches will have return value optimization applied. In theory, it should be possible to apply RVO to both branches, but I don't think compilers do that (yet). Can we check what the compilers are actually doing?

I mean, we're getting a pretty nice performance improvement, but maybe copying 6'000'000 objects would negate it.

Reading the source some more: the first branch returns a temporary, and RVO there is guaranteed. Considering that the size of the vector "blueprint" is constant (3 pointers), it might work better than I expect.

@sizmailov (Contributor) commented Jul 14, 2020

Even if no RVO is involved the result should be moved with no performance hit, right?

Collaborator

You're right. It will definitely be possible to move the vector, instead of copying.

Member

make_unique is just syntax sugar, no? In comparison, make_shared can IIRC avoid a memory allocation compared to the naive initialization.

Collaborator

> make_unique is just syntax sugar, no? In comparison, make_shared can IIRC avoid a memory allocation compared to the naive initialization.

I think it is, yes. But it's shorter (it doesn't repeat the type) and makes things feel more atomic (whereas unique_ptr<T>(new T()) tempts you to split up the new and the construction of the unique_ptr, which could result in leaks on exceptions, etc.).

But it's C++14, so nevermind all that :-D

@bstaletic (Collaborator)

Since I was the one bringing up RVO and whether it works...

; GCC
.L32:
        mov     rax, QWORD PTR [rsp]
        mov     QWORD PTR [r13+0], rax
        mov     rax, QWORD PTR [rsp+8]
        mov     QWORD PTR [r13+8], rax
        mov     rax, QWORD PTR [rsp+16]
        mov     QWORD PTR [r13+16], rax
.L24:
        mov     rax, r13

In other words, after populating the vector, we move three words from the top of the stack into the memory r13 points to, and return a little later. This is what a move looks like. At -O3 it gets further optimized into a 128-bit move plus a 64-bit move, instead of three 64-bit moves.

; Clang
        mov     rax, r14

You guessed it. RVO works here.

; MSVC
$LN3@to_vector:
        mov     rax, QWORD PTR vec$1[rsp]
        mov     QWORD PTR [rsi], rax
        mov     QWORD PTR [rsi+8], rcx
        mov     rax, QWORD PTR vec$1[rsp+16]
        mov     QWORD PTR [rsi+16], rax
$LN36@to_vector:
        mov     rax, rsi

This is the exact same thing as GCC, just with different registers. Except that I couldn't get MSVC to use XMM0 and perform a 128bit move.

 

The branch with the temporary gets optimized to a memcpy and RVO is used. The assembly is just beautiful in this case.

 

@YannickJadoul:

> so this could be slower if the move-constructor is costly

True, but we're talking about a std::vector-like container. You're not moving the Ts. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?

@YannickJadoul (Collaborator)

Whooooa, great investigation, @bstaletic! :-D

> True, but we're talking about a std::vector-like container. You're not moving the Ts. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?

Probably not; I'm just thinking of some custom thing, but then it might be the user's own problem if they can't implement a decent move constructor.
On the other hand, I'm wondering: if it doesn't matter, why not just keep it like this?

Collaborator

> On the other hand, I'm wondering: if it doesn't matter, why not just keep it like this?

Because of readability and maintainability: it's easier to reason about the code without needless indirections.

@wjakob (Member) commented Jul 15, 2020

This looks like a reasonable change to me. @YannickJadoul, feel free to merge once you are happy with it. (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.)

@YannickJadoul (Collaborator)

> (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.)

I just checked, and there's no test that tests the branch with step != 1. Would you mind adding one to test_vector_buffer that passes a view with stride > 1, @marc-chiesa (i.e., [::2] or so, I suppose)?

@marc-chiesa (Contributor, Author)

> (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.)

> I just checked, and there's no test that tests the branch with step != 1. Would you mind adding one to test_vector_buffer that passes a view with stride > 1, @marc-chiesa (i.e., [::2] or so, I suppose)?

Was just getting back to looking at this. I think a test for this case already exists in test_vector_buffer_numpy; at least, the following code tests a step != 1:

    v = m.VectorInt(a[:, 1])
    assert len(v) == 3
    assert v[2] == 10

I do not think it is possible to test this branch using the array module, because the slice/stride operator apparently returns a copy of the elements from the original array rather than a view of the original memory with a modified stride property, as is done in numpy. Is there a standard library collection that provides support for this?

@YannickJadoul (Collaborator)

> I do not think it is possible to test this branch using the array module because the slice/stride operator apparently returns a copy of the elements from the original array rather than a view of the original memory

What exactly do you mean? Isn't it an stl_bind-bound vector constructed from a buffer? So if you can construct m.VectorInt from a numpy array where you put array[::2], it should have stride != 1, no?

Maybe I misunderstand what you actually mean, but numpy arrays are not copied:

$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> x = np.arange(10)
>>> y = x[::2]
>>> y
array([0, 2, 4, 6, 8])
>>> y.base
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> y.base is x
True
>>> y[-1] = 42
>>> x
array([ 0,  1,  2,  3,  4,  5,  6,  7, 42,  9])
>>> 

@marc-chiesa (Contributor, Author) commented Jul 23, 2020

> I do not think it is possible to test this branch using the array module because the slice/stride operator apparently returns a copy of the elements from the original array rather than a view of the original memory

> What exactly do you mean? Isn't it an stl_bind-bound vector constructed from a buffer? So if you can construct m.VectorInt from a numpy array where you put array[::2], it should have stride != 1, no?
>
> Maybe I misunderstand what you actually mean, but numpy arrays are not copied:

Sorry for being unclear! My understanding is that the Python standard library array is copied when slicing with a stride, but you are correct that numpy arrays are not. So the block of code that I referenced should already invoke the step != 1 branch in the numpy-specific test test_vector_buffer_numpy. It should have a step of 4 ints to get to the next element in the resulting vector.

I can add an additional test that makes it more clear what is happening, but my question was whether or not there is something in the standard library that would also allow testing this branch with non-numpy buffers.
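As a side check, the stride arithmetic above is easy to verify directly (a standalone sketch assuming numpy; the array contents are arbitrary):

```python
import numpy as np

# 3 rows x 4 columns of int32, row-major (numpy's default layout)
a = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11]], dtype=np.int32)

col = a[:, 1]                  # the kind of view used in test_vector_buffer_numpy
assert a.strides == (16, 4)    # 16 bytes per row, 4 bytes per element
assert col.strides == (16,)    # step of 4 ints between consecutive elements
assert col.tolist() == [1, 5, 9]
assert col.base is a           # a view over the same memory, not a copy
```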

@YannickJadoul (Collaborator)

> So the block of code that I referenced should already invoke the step != 1 branch in the numpy-specific test test_vector_buffer_numpy.

OK, yes, good point. Sorry, I had missed those tests, but yes, numpy arrays ought to be row-major, so [:, 1] should have a stride != 1! Still, it wouldn't hurt to add a test like a[::2, 0], I suppose?

> but my question was whether or not there is something in the standard library that would also allow testing this branch with non-numpy buffers.

Alright, I misunderstood that question! This seems like it might work, though?

>>> x = bytearray(range(10))
>>> y = memoryview(x)
>>> y.contiguous
True
>>> z = y[::2]
>>> z.contiguous
False
>>> 

(I didn't immediately know, either, but memoryview is already used in the test_vector_buffer test, which reminded me.)
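Turning that interpreter session into assertions (standard library only; a sketch of what a non-numpy test along these lines could check):

```python
import array

buf = array.array('B', range(10))
view = memoryview(buf)[::2]    # strided view over the same memory

assert memoryview(buf).contiguous
assert not view.contiguous     # step 2 makes the export non-contiguous
assert view.strides == (2,)    # 2 bytes between exported elements

buf[4] = 99                    # writes are visible through the view
assert view[2] == 99

copied = buf[::2]              # slicing array.array itself returns a copy
copied[0] = 77
assert buf[0] == 0             # the original buffer is untouched
```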

@marc-chiesa (Contributor, Author)

Thanks for the hint! I implemented some basic tests for step > 1 using numpy arrays and memoryview.

@marc-chiesa force-pushed the issue-2102 branch 2 times, most recently from 92464b2 to 1fcc435 (August 4, 2020, 03:59)
@YannickJadoul (Collaborator) left a comment

Whoops, sorry, completely forgot about this one, @marc-chiesa!

Looks good to me. Let's merge once everything is green!

@YannickJadoul merged commit 830adda into pybind:master on Aug 13, 2020
@YannickJadoul (Collaborator)

Merged! Thanks again, @marc-chiesa!
