
Change usage of std::string to movable object #144

Open · olzhabay opened this issue Oct 5, 2018 · 10 comments

Comments

@olzhabay (Member) commented Oct 5, 2018

The std::string constructor does not allow moving from a raw buffer; it always copies all of the data. The only way around this is to pass around a pointer to a std::string, and that is only done in a few places in the project.
This leads to a lot of unnecessary copying of data that can be as large as a block (64 MB).

Proposal: use a custom class that moves the data instead of copying it. It would also track the number of owners and destroy the buffer once owners <= 0:

class Slice {
  char* ptr;
  uint64_t size;
  std::atomic<int> owners;
};

Another option is to use std::shared_ptr<std::string>.

Solving this should give a noticeable performance improvement.
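
For illustration, here is a minimal sketch (assuming C++11 or later; this is not existing project code) of what such a class could look like. Instead of a hand-rolled atomic owner counter it delegates the counting to std::shared_ptr's control block, which makes it close in spirit to the std::shared_ptr<std::string> alternative mentioned above:

#include <cstdint>
#include <cstring>
#include <memory>

class Slice {
 public:
  Slice() = default;

  // The raw bytes are copied exactly once here; afterwards the buffer is shared.
  Slice(const char* data, uint64_t n)
      : ptr_(new char[n], std::default_delete<char[]>()), size_(n) {
    std::memcpy(ptr_.get(), data, n);
  }

  // Copying a Slice only bumps the shared owner count (no 64 MB memcpy);
  // moving transfers ownership without touching the counter.
  Slice(const Slice&) = default;
  Slice(Slice&&) noexcept = default;
  Slice& operator=(const Slice&) = default;
  Slice& operator=(Slice&&) noexcept = default;

  const char* data() const { return ptr_.get(); }
  uint64_t size() const { return size_; }

 private:
  std::shared_ptr<char> ptr_;  // owner count lives in the shared_ptr control block
  uint64_t size_ = 0;
};

With something along these lines, a 64 MB block is copied once on construction and never again: copies share the buffer and moves are pointer swaps.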

@vicentebolea (Member)

Thanks for pointing this out, @olzhabay.

I totally get your point. However, in many sections of the code where an expensive string assignment was done, I forced the move with std::move on the std::string. My understanding was that this moves the internal data, but I do not know the internals of std::move. What do you think?

@olzhabay (Member, Author) commented Oct 8, 2018

It does not move when the std::string goes out of its stack scope. Also, std::string generates too much assembly code. Examples:
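
For reference, a minimal sketch (not the examples originally linked here) of one well-known case where std::move on a std::string silently falls back to a full copy, because the source is const and the move constructor cannot bind:

#include <string>
#include <utility>

std::string make_block(const std::string& payload) {
  const std::string block = payload + payload;  // imagine a 64 MB block here
  return std::move(block);  // const rvalue: the copy constructor runs and copies every byte
}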

@olzhabay (Member, Author) commented Oct 8, 2018

What if we make the data wrapper class serializable, like this, for now? Later on, raw block sending could be implemented.
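
As a rough, non-authoritative sketch of what that could look like with Boost.Serialization (the member names and the split save/load are assumptions, not existing project code), so the wrapper plugs into the binary archives used in the benchmarks further down:

#include <cstdint>
#include <boost/serialization/access.hpp>
#include <boost/serialization/binary_object.hpp>
#include <boost/serialization/split_member.hpp>

class Slice {
  friend class boost::serialization::access;

  char* ptr = nullptr;
  uint64_t size = 0;

  template <class Archive>
  void save(Archive& ar, const unsigned int /*version*/) const {
    ar & size;
    ar & boost::serialization::make_binary_object(ptr, size);  // dump the raw bytes
  }

  template <class Archive>
  void load(Archive& ar, const unsigned int /*version*/) {
    ar & size;
    delete[] ptr;
    ptr = new char[size];
    ar & boost::serialization::make_binary_object(ptr, size);  // read the raw bytes back
  }

  BOOST_SERIALIZATION_SPLIT_MEMBER()
};

A later raw block sending path could then bypass the archive entirely and write ptr/size straight to the wire.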

@vicentebolea (Member)

I have reviewed the proposed slice class. It basically wraps a string (a length plus a char array) together with some directives to allow fast concurrent access.

I also understand the rationale behind the idea: you find std::string inefficient for our project, and a slice gives us a custom way to tweak and optimize block content handling.

std::move in C++ strings

Still, we can totally use std::move on strings in these two scenarios:

First one

std::string a = "hello", b;
b = std::move(a);

Second one

std::string fun() {
  std::string a = "100";
  return std::move(a);  // std::move is useless here; it even inhibits NRVO, so plain `return a;` is better
}

std::string b = fun();

Summary

Using a slice for the block abstraction instead of a string gives us:

  • Concurrent access to its content. YAGNI: we use one thread and the proactor pattern 👎
  • A class tailored to our block case (large sizes). Good point 👍

Ditching std::string in favor of slices will cost us:

  • One more class to maintain and debug. KISS
  • We lose the extensive support and optimization for std::string (in Boost and the stdlib). Think about alignment and other things. Important

Verdict

I do not want to turn down good ideas and work. I think you have a bigger refactor in mind than just a custom slice class, and the slice class makes much more sense in that context. As I see it right now, this is premature optimization: we would be solving a problem that our architecture does not have (yet).

If it is part of a bigger optimization and you are thinking of changing the architecture to be multi-threaded, it will come in super handy, and I will happily accept slices along with that change of architecture.

For now I am reluctant to ditch std::string.

If you think I am mistaken, please point it out and let's keep the conversation going. If you still can't convince me, do it with facts and benchmarks :)

signed-off: vicente.bolea@gmail.com

@olzhabay (Member, Author)

@vicentebolea got it. Proceeding with tests and benchmarks to see the actual performance benefits.

@olzhabay (Member, Author) commented Oct 12, 2018

Performance benchmark results

Tested two cases:

  • creation of an object
  • move of an object

Each case was run with sizes of [8, 64, 512, 1 KB, 8 KB, 64 MB]; a sketch of the benchmark shape follows this list.
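
The benchmark code for these two cases is not shown in this comment; the sketch below is only a guess at the likely shape, using Google Benchmark (which matches the output format underneath). TYPE and the constructor signatures are assumptions:

#include <benchmark/benchmark.h>
#include <string>
#include <utility>

using TYPE = std::string;  // swapped for the slice class in the second run

static void BM_creation(benchmark::State& state) {
  std::string src(state.range(0), 'x');
  for (auto _ : state) {
    TYPE t(src.data(), src.size());      // construct from a raw buffer each iteration
    benchmark::DoNotOptimize(t);
  }
  state.SetBytesProcessed(state.iterations() * state.range(0));
}

static void BM_move(benchmark::State& state) {
  for (auto _ : state) {
    state.PauseTiming();
    TYPE a(state.range(0), 'x');         // build the source outside the timed region
    state.ResumeTiming();
    TYPE b = std::move(a);               // only the move itself is measured
    benchmark::DoNotOptimize(b);
  }
  state.SetBytesProcessed(state.iterations() * state.range(0));
}

BENCHMARK(BM_creation)->Arg(8)->Arg(64)->Arg(512)->Arg(1 << 10)->Arg(8 << 10)->Arg(64 << 20);
BENCHMARK(BM_move)->Arg(8)->Arg(64)->Arg(512)->Arg(1 << 10)->Arg(8 << 10)->Arg(64 << 20);
BENCHMARK_MAIN();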

std::string performance

----------------------------------------------------------------------------
Benchmark                     Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------
BM_creation/8                34 ns         34 ns   20388586 bytes_per_second=222.224M/s
BM_creation/64               61 ns         61 ns   11668637 bytes_per_second=1008.09M/s
BM_creation/512             113 ns        113 ns    6118007 bytes_per_second=4.20737G/s
BM_creation/1024            142 ns        142 ns    4906628 bytes_per_second=6.72681G/s
BM_creation/8192            362 ns        362 ns    1933918 bytes_per_second=21.091G/s
BM_creation/67108864   20001119 ns   19945407 ns         33 bytes_per_second=3.13355G/s
BM_move/8                     1 ns          1 ns  626554280 bytes_per_second=6.68698G/s
BM_move/64                    1 ns          1 ns  626517599 bytes_per_second=53.4915G/s
BM_move/512                   1 ns          1 ns  627175796 bytes_per_second=427.968G/s
BM_move/1024                  1 ns          1 ns  627080058 bytes_per_second=855.979G/s
BM_move/8192                  1 ns          1 ns  626729370 bytes_per_second=6.68706T/s
BM_move/67108864              1 ns          1 ns  627153977 bytes_per_second=53.4822P/s

slice performance

----------------------------------------------------------------------------
Benchmark                     Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------
BM_creation/8                31 ns         31 ns   22385795 bytes_per_second=245M/s
BM_creation/64               31 ns         31 ns   22250523 bytes_per_second=1.90128G/s
BM_creation/512              44 ns         44 ns   16038834 bytes_per_second=10.8969G/s
BM_creation/1024             48 ns         48 ns   14534327 bytes_per_second=19.8019G/s
BM_creation/8192            139 ns        139 ns    4993379 bytes_per_second=54.8901G/s
BM_creation/67108864    6466286 ns    6466127 ns         89 bytes_per_second=9.66575G/s
BM_move/8                     0 ns          0 ns 1000000000 bytes_per_second=31.5797P/s
BM_move/64                    0 ns          0 ns 1000000000 bytes_per_second=260.75P/s
BM_move/512                   0 ns          0 ns 1000000000 bytes_per_second=2.0278E/s
BM_move/1024                  0 ns          0 ns 1000000000 bytes_per_second=4.05561E/s
BM_move/8192                  0 ns          0 ns 1000000000 bytes_per_second=32.5937E/s
BM_move/67108864              0 ns          0 ns 1000000000 bytes_per_second=72.9697Z/s

@vicentebolea (Member) commented Oct 12, 2018 via email

@vicentebolea (Member) commented Oct 12, 2018 via email

@olzhabay (Member, Author)

Update on performance benchmark results

Here is another comparison of slice and std::string.
This time we compare their serialization and deserialization performance.
Sample benchmark code:

// TYPE is set at compile time to either std::string or the slice class.
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <benchmark/benchmark.h>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>

static void BM_serialization(benchmark::State& state) {
  char* data = new char[state.range(0)];
  std::srand(0);
  int ascii = (std::rand() % 255) + 1;
  std::memset(data, ascii, state.range(0));
  TYPE str(data, state.range(0));
  for (auto _ : state) {
    {
      // Scoped so the archive and file are flushed and closed before reading back.
      std::ofstream ofs("stream");
      boost::archive::binary_oarchive oa(ofs);
      oa << str;
    }
    TYPE new_str;
    {
      std::ifstream ifs("stream");
      boost::archive::binary_iarchive ia(ifs);
      ia >> new_str;
    }
    assert(str == new_str);
  }
  delete[] data;
}
BENCHMARK(BM_serialization)->Arg(8)->Arg(64)->Arg(512)->Arg(1 << 10)->Arg(8 << 10)->Arg(64 << 20);

Results for std::string

---------------------------------------------------------------------------------
Benchmark                          Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------
BM_serialization/8            257080 ns      96024 ns       7136
BM_serialization/64           272027 ns     102924 ns      10258
BM_serialization/512          259539 ns      90391 ns      10055
BM_serialization/1024         261867 ns     103443 ns       6890
BM_serialization/8192         312548 ns     124600 ns       6665
BM_serialization/67108864 1091854852 ns   92877561 ns         11

Results for slice

---------------------------------------------------------------------------------
Benchmark                          Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------
BM_serialization/8            265745 ns      94510 ns       7766
BM_serialization/64           295327 ns     107273 ns       8124
BM_serialization/512          297243 ns     106744 ns       7802
BM_serialization/1024         260830 ns      98153 ns       9907
BM_serialization/8192         260027 ns      97212 ns       9469
BM_serialization/67108864     300611 ns      87311 ns       7906

We can see a significant performance improvement for 64 MB data serialization/deserialization.

@vicentebolea (Member)

@olzhabay, you are doing a very fine job benchmarking this one. It looks like the serialization exec time for std::string is linearly proportional to its size.

We can see a significant performance improvement for 64 MB data serialization/deserialization.

For slice it takes only a bit more exec time going from 8 KB to 64 MB; that is too much of a difference, and the CPU time even drops a bit for an input roughly 10,000 times bigger. I believe it is faster, but not this much.

How about moving this one to a pull request so that we can check the code together?
