
Range insertion speeds up generate_primes #123

Closed
cgd8d opened this issue Jun 25, 2022 · 6 comments

@cgd8d

cgd8d commented Jun 25, 2022

I noticed a roughly 10% speedup when building vectors of primes if I use range insertion rather than copying one prime at a time. (Full disclosure: I only timed it on one system, one compiler, and one set of parameters.) I did this with a modified version of store_primes from the current master branch:

template <typename T>
void store_primes_modified(uint64_t start,
                           uint64_t stop,
                           T& primes)
{
  if (start > 0)
    start--;
  if (~stop == 0)
    stop--;

  if (start < stop)
  {
    std::size_t size = primes.size() + primesieve::prime_count_approx(start, stop);
    primes.reserve(size);

    primesieve::iterator it(start, stop);
    while (true)
    {
      it.i_++;
      if (it.i_ >= it.size_)
        it.generate_next_primes();

      // Insert the whole buffered range of primes <= stop at once
      // instead of pushing back one prime at a time.
      uint64_t* it_begin = it.primes_ + it.i_;
      uint64_t* it_last = it.primes_ + it.size_;
      uint64_t* it_end = std::lower_bound(it_begin, it_last, stop);
      primes.insert(primes.end(), it_begin, it_end);
      it.i_ += std::distance(it_begin, it_end) - 1;
      if (it_end < it_last)
        break;
    }
  }
}

This would speed up generate_primes, and something similar would presumably speed up generate_n_primes. I suspect it's not the best that could be done -- in particular, if the iterator could be forced to put its primes directly into the target vector, the copy could be eliminated entirely -- but since this is a fairly small change I wanted to go ahead and share it.

@kimwalisch
Owner

You are right, this should theoretically be faster: the current code does a size check for each inserted prime, while your range-insert code only does a size check for each inserted range of primes. Funnily enough, I tried exactly the same code a few weeks ago. I then integrated that new libprimesieve version into my primecount project and ran the primecount benchmarks, but I was not able to measure any speedup; primecount mostly uses vector<uint32_t> primes. With what vector type did you test? (Since primesieve::iterator uses uint64_t under the hood, I guess that vector<uint64_t> primes could benefit from your modified algorithm.)

There is another issue: primesieve::iterator uses uint64_t under the hood. If we range-insert its uint64_t primes buffer into e.g. a vector<uint32_t> primes, this could potentially cause a compiler warning. In my tests it did not, but do you know whether we can rely on this (are there any guarantees in the C++ standard)? Because if your new code caused a compiler warning inside std::vector::insert() (e.g. in a future version of GCC), it would be impossible for us to fix that warning without removing the std::vector::insert() call.

@cgd8d
Author

cgd8d commented Jun 25, 2022

I tested with vector<uint64_t>, and it's true that the element type could make a difference.

As to whether the standard guarantees that the conversion is handled correctly -- I tried to dig in here. What I got from the standard (which I'm hopefully reading correctly) is that the handling of insert is determined by the construct function of the allocator used for the destination vector, which for the default allocator is equivalent to repeated calls to placement-new. So inserting from vector<uint64_t> into vector<uint32_t> would be equivalent to `::new((void*)p) uint32_t(val)` with `val` a uint64_t, which works and shouldn't warn. At least, that's my reading of it.

But other cases might not work. If you want to support destination vectors with custom allocators, then my impression is that a caller could be using std::vector with an allocator whose construct does not implement the type conversion, and the code would fail. Also, these functions are written to work with any container that is API-compatible with std::vector, so using insert instead of push_back changes the requirements for API compatibility. So it does seem there is a compatibility cost of some kind.

@kimwalisch
Owner

I have checked GCC's STL vector::insert() implementation, and it seems to me that in the end an implicit conversion happens from the _InputIterator's value type to vector::value_type, which will definitely cause a compiler warning on some compilers (e.g. MSVC with /W4) if the _InputIterator's value type is wider than vector::value_type.

template<typename _Tp, typename _Alloc>
template<typename _InputIterator>
void vector<_Tp, _Alloc>::_M_range_insert(iterator __pos, _InputIterator __first, _InputIterator __last, std::input_iterator_tag)
{
    if (__pos == end())
    {
        // This does the actual insert for our use case
        for (; __first != __last; ++__first)
            insert(end(), *__first);
    }
    // ...
}

template<typename _Tp, typename _Alloc>
typename vector<_Tp, _Alloc>::iterator
vector<_Tp, _Alloc>::
// The value_type differs from the _InputIterator's value type, hence it
// seems to me that an implicit conversion happens here from the
// _InputIterator's value type (uint64_t) to the value_type of the
// std::vector, which could be e.g. uint32_t. This could cause a compiler
// warning depending on the compiler & compiler flags.
insert(const_iterator __position, const value_type& __x)
{
    // ...
}

If we want to guarantee that our code cannot cause any warnings, then we can only use your vector::insert() algorithm for vector<uint64_t>. Hence we would keep the old algorithm as the default and specialize for vector<uint64_t> using e.g. std::enable_if. (Unfortunately std::vector lacks a resize_uninitialized() method, which would allow us to implement your faster algorithm very easily.)

The question is now whether the added complexity is worth the effort. Can you please post your benchmark test program? I want to run it on my latest Intel Alder Lake CPU and check for myself how much faster it is...

@kimwalisch
Owner

kimwalisch commented Jun 26, 2022

Another option that just came to my mind would be to disable implicit conversion warnings using #pragma for the 3 major C++ compilers (GCC, Clang & MSVC) when calling vector::insert. This would allow us to have optimal code gen for all vector types and also simplify the implementation, since we only need to implement one algorithm without any std::enable_if specialization.

E.g. for MSVC we could use:

// Disable warning C4244: conversion from X to Y, possible loss of data
#pragma warning(push)
#pragma warning(disable : 4244)

vect.insert(vect.end(), arr, arr + size);

#pragma warning(pop)

The latest versions of GCC & Clang do not warn even with -Wall -Wextra -pedantic -Wconversion, so there is no need to disable any warnings for these compilers (if a future GCC/Clang version issues a warning, we will suppress it then).

@kimwalisch
Owner

I have merged your feature request!

Would you mind rerunning your benchmark using the latest libprimesieve?

@cgd8d
Author

cgd8d commented Jun 27, 2022

Thanks! That looks good to me and gave the expected speedup in my benchmark. For reference, the benchmark I used was the following (compiled with Clang 12 with -O3 -march=native, run on an Azure cloud machine with a Xeon Platinum 8171M CPU under Ubuntu 20.04):

#include <primesieve.hpp>
#include <cstdint>
#include <iostream>
#include <vector>
#include <numeric>

int main(int argc, char** argv)
{
  uint64_t num_iter = (1 << 7);
  uint64_t step_size = (1 << 30);
  std::vector<uint64_t> vec;
  uint64_t sum = 0;

  for(uint64_t i = 0; i < num_iter; i++)
  {
    vec.clear();
    primesieve::generate_primes(
      i*step_size,
      (i+1)*step_size,
      &vec);
    sum = std::accumulate(vec.begin(), vec.end(), sum);
  }

  std::cout << "Sum of primes <= " << num_iter*step_size << " = " << sum << std::endl;

  // Note that since sum is a 64-bit variable the result
  // will be incorrect (due to integer overflow) if
  // limit > 10^10. However we do allow limits > 10^10
  // since this is useful for benchmarking.
  if (num_iter*step_size > 10000000000ull)
    std::cerr << "Warning: sum is likely incorrect due to 64-bit integer overflow!" << std::endl;

  return 0;
}
