
Modified Vector STL bind initialization from a buffer type with optimization for simple arrays #2298

Merged (4 commits into pybind:master) Aug 13, 2020

Conversation

@marc-chiesa (Contributor)

Added a check in the initialization of a bound vector from a buffer type to perform fast initialization if the buffer is a simple contiguous array.

@YannickJadoul (Collaborator) commented Jul 14, 2020

Do you have any numbers to show that this makes things actually faster? I wonder how much a smart compiler could already optimize this case.

@marc-chiesa (Contributor, Author)

I ran it with a large (600k) Python array of bytes (unsigned char) and observed a speedup in the conversion of approximately 10x on my machine when compiled with Apple clang 11.

@bstaletic (Collaborator)

Can you post the code, so we can reproduce the benchmarks?

@YannickJadoul (Collaborator)

> I ran it with a large (600k) Python array of bytes (unsigned char) and observed a speedup in the conversion of approximately 10x on my machine when compiled with Apple clang 11.

That does sound interesting.

@marc-chiesa (Contributor, Author) commented Jul 14, 2020

Just using the test files:

import timeit
import array
from pybind11_tests import stl_binders as m

print(timeit.timeit("m.VectorUChar(array.array('B', [25]) * 600000)",
                    number=1000, globals=globals()))

@sizmailov (Contributor)

On my machine the change is not that dramatic; it's "only" a factor of 3.6, which is still quite good (Ubuntu + gcc-8.3).

Comment on lines 403 to 413
if (step == 1) {
    auto vec = std::unique_ptr<Vector>(new Vector(p, end));
    return vec.release();
}
else {
    auto vec = std::unique_ptr<Vector>(new Vector());
    vec->reserve((size_t) info.shape[0]);
    for (; p != end; p += step)
        vec->push_back(*p);
    return vec.release();
}
@sizmailov (Contributor) commented Jul 14, 2020

Out of curiosity: can we get rid of std::unique_ptr? I'm not aware of any performance penalty of the return-by-value alternative, which looks much cleaner.

Suggested change (from):

if (step == 1) {
    auto vec = std::unique_ptr<Vector>(new Vector(p, end));
    return vec.release();
}
else {
    auto vec = std::unique_ptr<Vector>(new Vector());
    vec->reserve((size_t) info.shape[0]);
    for (; p != end; p += step)
        vec->push_back(*p);
    return vec.release();
}

(to):

if (step == 1) {
    return Vector(p, end);
}
else {
    Vector vec;
    vec.reserve((size_t) info.shape[0]);
    for (; p != end; p += step)
        vec.push_back(*p);
    return vec;
}

@marc-chiesa (Contributor, Author)

I don't see any issues with doing that, performance looks the same to me.

@bstaletic (Collaborator)

It definitely looks cleaner, but there's a possibility that only one of the two branches will have return value optimization applied. In theory, it should be possible to apply RVO to both branches, but I don't think compilers do that (yet). Can we check what the compilers are actually doing?

I mean, we're getting a pretty nice performance improvement, but maybe copying 6'000'000 objects would negate it.

Reading the source some more: the first branch returns a temporary, and RVO there is guaranteed. Considering that the size of the vector "blueprint" is constant (3 pointers), it might work better than I expect.

@sizmailov (Contributor) commented Jul 14, 2020

Even if no RVO is involved the result should be moved with no performance hit, right?

Collaborator

You're right. It will definitely be possible to move the vector, instead of copying.

Member

make_unique is just syntax sugar, no? In comparison, make_shared can IIRC avoid a memory allocation compared to the naive initialization.

Collaborator

> make_unique is just syntax sugar, no? In comparison, make_shared can IIRC avoid a memory allocation compared to the naive initialization.

I think it is, yes. But it's shorter (it doesn't repeat the type) and makes things feel more atomic (whereas unique_ptr<T>(new T()) tempts you to split up the new and the construction of the unique_ptr, which could result in leaks on exceptions, etc.).

But it's C++14, so nevermind all that :-D

@bstaletic (Collaborator)

Since I was the one bringing up RVO and whether it works...

; GCC
.L32:
        mov     rax, QWORD PTR [rsp]
        mov     QWORD PTR [r13+0], rax
        mov     rax, QWORD PTR [rsp+8]
        mov     QWORD PTR [r13+8], rax
        mov     rax, QWORD PTR [rsp+16]
        mov     QWORD PTR [r13+16], rax
.L24:
        mov     rax, r13

In other words, after populating the vector, we move three words from the top of the stack into the memory r13 points to, and return a little later. This is what a move looks like. At -O3 it gets further optimized into a 128-bit move plus a 64-bit move, instead of three 64-bit moves.

; Clang
        mov     rax, r14

You guessed it. RVO works here.

; MSVC
$LN3@to_vector:
        mov     rax, QWORD PTR vec$1[rsp]
        mov     QWORD PTR [rsi], rax
        mov     QWORD PTR [rsi+8], rcx
        mov     rax, QWORD PTR vec$1[rsp+16]
        mov     QWORD PTR [rsi+16], rax
$LN36@to_vector:
        mov     rax, rsi

This is the exact same thing as GCC, just with different registers. Except that I couldn't get MSVC to use XMM0 and perform a 128bit move.

 

The branch with the temporary gets optimized to a memcpy and RVO is used. The assembly is just beautiful in this case.

 

@YannickJadoul:

> so this could be slower if the move-constructor is costly

True, but we're talking about a std::vector-like container. You're not moving the Ts. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?

@YannickJadoul (Collaborator)

Whooooa, great investigation, @bstaletic! :-D

> True, but we're talking about a std::vector-like container. You're not moving the Ts. You're moving the container "blueprints". Is there a reasonable "vector-like" container with expensive move constructor?

Probably not; I'm just thinking of some custom thing, but then it might be the user's own problem if they can't implement a decent move constructor.
On the other hand, I'm wondering: if it doesn't matter, why not just keep it like this?

Collaborator

> On the other hand, I'm wondering: if it doesn't matter, why not just keep it like this?

Because of readability and maintainability: it's easier to reason about the code without needless indirections.

@wjakob (Member) commented Jul 15, 2020

This looks like a reasonable change to me. @YannickJadoul, feel free to merge once you are happy with it. (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.)

@YannickJadoul (Collaborator)

> (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.)

I just checked, and there's no test that tests the branch with step != 1. Would you mind adding one to test_vector_buffer that passes a view with stride > 1, @marc-chiesa (i.e., [::2] or so, I suppose)?

@marc-chiesa (Contributor, Author)

> (Maybe having some tests for this would also be good, though we probably already initialize a few of these simple layout arrays in the test suite.)

> I just checked, and there's no test that tests the branch with step != 1. Would you mind adding one to test_vector_buffer that passes a view with stride > 1, @marc-chiesa (i.e., [::2] or so, I suppose)?

Was just getting back to looking at this. I think a test for this case already exists in test_vector_buffer_numpy; at least, the following code tests a step != 1:

    v = m.VectorInt(a[:, 1])
    assert len(v) == 3
    assert v[2] == 10

I do not think it is possible to test this branch using the array module, because the slice/stride operator apparently returns a copy of the elements from the original array rather than a view of the original memory with a modified stride property, as is done in numpy. Is there a standard library collection that provides support for this?

@YannickJadoul (Collaborator)

> I do not think it is possible to test this branch using the array module because the slice/stride operator apparently returns a copy of the elements from the original array rather than a view of the original memory

What exactly do you mean? Isn't it an stl_bind-bound vector constructed from a buffer? So if you can construct m.VectorInt from a numpy array where you put array[::2], it should have stride != 1, no?

Maybe I misunderstand what you actually mean, but numpy arrays are not copied:

$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> x = np.arange(10)
>>> y = x[::2]
>>> y
array([0, 2, 4, 6, 8])
>>> y.base
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> y.base is x
True
>>> y[-1] = 42
>>> x
array([ 0,  1,  2,  3,  4,  5,  6,  7, 42,  9])
>>> 

@marc-chiesa (Contributor, Author) commented Jul 23, 2020

> I do not think it is possible to test this branch using the array module because the slice/stride operator apparently returns a copy of the elements from the original array rather than a view of the original memory

> What exactly do you mean? Isn't it an stl_bind-bound vector constructed from a buffer? So if you can construct m.VectorInt from a numpy array where you put array[::2], it should have stride != 1, no?
>
> Maybe I misunderstand what you actually mean, but numpy arrays are not copied:

Sorry for being unclear! My understanding is that the Python standard library array is copied when slicing with a stride, but you are correct that numpy arrays are not. So the block of code that I referenced should already invoke the step != 1 branch in the numpy-specific test test_vector_buffer_numpy. It should have a step of 4 ints to get to the next element in the resulting vector.

I can add an additional test that makes it more clear what is happening, but my question was whether or not there is something in the standard library that would also allow testing this branch with non-numpy buffers.
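As a side check, the stride arithmetic above is easy to verify directly (a standalone sketch assuming numpy; the array contents are arbitrary):

```python
import numpy as np

# 3 rows x 4 columns of int32, row-major (numpy's default layout)
a = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11]], dtype=np.int32)

col = a[:, 1]                  # the kind of view used in test_vector_buffer_numpy
assert a.strides == (16, 4)    # 16 bytes per row, 4 bytes per element
assert col.strides == (16,)    # step of 4 ints between consecutive elements
assert col.tolist() == [1, 5, 9]
assert col.base is a           # a view over the same memory, not a copy
```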

@YannickJadoul (Collaborator)

> So the block of code that I referenced should already invoke the step != 1 branch in the numpy-specific test test_vector_buffer_numpy.

OK, yes, good point. Sorry, I had missed those tests, but yes, numpy arrays ought to be row-major, so [:, 1] should have a stride != 1! Still, it wouldn't hurt to add a test like a[::2, 0], I suppose?

> but my question was whether or not there is something in the standard library that would also allow testing this branch with non-numpy buffers.

Alright, I misunderstood that question! This seems like it might work, though?

>>> x = bytearray(range(10))
>>> y = memoryview(x)
>>> y.contiguous
True
>>> z = y[::2]
>>> z.contiguous
False
>>> 

(I didn't immediately know, either, but memoryview is already used in the test_vector_buffer test, which reminded me.)
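Turning that interpreter session into assertions (standard library only; a sketch of what a non-numpy test along these lines could check):

```python
import array

buf = array.array('B', range(10))
view = memoryview(buf)[::2]    # strided view over the same memory

assert memoryview(buf).contiguous
assert not view.contiguous     # step 2 makes the export non-contiguous
assert view.strides == (2,)    # 2 bytes between exported elements

buf[4] = 99                    # writes are visible through the view
assert view[2] == 99

copied = buf[::2]              # slicing array.array itself returns a copy
copied[0] = 77
assert buf[0] == 0             # the original buffer is untouched
```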

@marc-chiesa (Contributor, Author)

Thanks for the hint! I implemented some basic tests for step > 1 using numpy arrays and memoryview.

@marc-chiesa force-pushed the issue-2102 branch 2 times, most recently from 92464b2 to 1fcc435 (August 4, 2020, 03:59)
@YannickJadoul (Collaborator) left a comment

Whoops, sorry, completely forgot about this one, @marc-chiesa!

Looks good to me. Let's merge once everything is green!

@YannickJadoul merged commit 830adda into pybind:master on Aug 13, 2020
@YannickJadoul (Collaborator)

Merged! Thanks again, @marc-chiesa!
