question about performance of at(...) #210

Closed
samuelpmish opened this issue Oct 4, 2023 · 4 comments

Comments

@samuelpmish

Hi, I'm trying out this library for a project where I have a performance bottleneck on a piece of code that uses a std::unordered_map to renumber some indices:

template < typename Map >
void renumber(const std::vector< uint64_t > & vertex_ids, 
              std::vector< std::array< uint64_t, 4 > > elements,
              int num_threads) {

  bool supports_parallel_insertion = 
    !std::is_same< Map, std::unordered_map<uint64_t, uint64_t> >::value;

  Map new_ids;
  std::atomic< uint64_t > new_id = 0;

  threadpool pool((supports_parallel_insertion) ? num_threads : 1);

  // insert vertex ids into a map, and give them a new compact index
  pool.parallel_for(vertex_ids.size(), [&](uint64_t i){
    auto id = new_id++;
    new_ids[vertex_ids[i]] = id;
  });

  // update connectivity lists to use the new indices
  pool.parallel_for(elements.size(), [&](uint64_t i){
    auto & elem = elements[i];
    elem[0] = new_ids.at(elem[0]);
    elem[1] = new_ids.at(elem[1]);
    elem[2] = new_ids.at(elem[2]);
    elem[3] = new_ids.at(elem[3]);
  });
}

Of course std::unordered_map doesn't allow parallel insertion, so while every other part of the code benefits from 32 threads, this step becomes a bottleneck. When I time these two operations, I get (first timing is insertion, second is the index update):

std::unordered_map, (single threaded insertion, single threaded update): 1609.12ms 6296.07ms
std::unordered_map, (single threaded insertion, multithreaded update): 1815.36ms 608.119ms

So, the insertion can't benefit from multiple threads, but the update step can (~10x speedup).

When I try using

template < int n >
using pmap = phmap::parallel_node_hash_map< 
                uint64_t, 
                uint64_t,
                std::hash<uint64_t>,
                std::equal_to<uint64_t>, 
                std::allocator<std::pair<const uint64_t, uint64_t>>, 
                n, 
                std::mutex >;

as a drop-in replacement for std::unordered_map<uint64_t, uint64_t>, I get better insertion times:

pmap<4>, 1 thread: 709.521ms 13461.7ms
pmap<4>, 32 threads: 617.853ms 5174.52ms
pmap<6>, 1 thread: 1216.25ms 13500.4ms
pmap<6>, 32 threads: 442.394ms 2610.72ms
pmap<9>, 32 threads: 194.461ms 1679.37ms

but considerably worse access times in the update step.

Is this expected, or am I perhaps using the wrong kind of container (or configuring them with the wrong parameters)?

The complete source file for this benchmark is available here.

Thank you

@greg7mdp
Owner

greg7mdp commented Oct 7, 2023

Hi @samuelpmish , thank you for using phmap.

I have a fix for you. You'll be happy with the speed:

  1. use phmap::parallel_flat_hash_map, it is faster.
  2. create another type alias (I called it pmap_nullmutex) which has phmap::NullMutex as the mutex type
  3. call the renumber function with this extra type, as in:
    renumber< pmap<4>, pmap_nullmutex<4> >(vertex_ids, elements, 1);
  4. before the update step, swap the mutex-protected map into the NullMutex one, then run the update through the mutex-free map (locking is unnecessary there, since the map is only read), as in:
  stopwatch.start();
  Map_nomutex new_ids_nc;
  new_ids_nc.swap(new_ids);
  pool.parallel_for(elements.size(), [&](uint64_t i){
    auto & elem = elements[i];
    elem[0] = new_ids_nc.at(elem[0]);
    elem[1] = new_ids_nc.at(elem[1]);
    elem[2] = new_ids_nc.at(elem[2]);
    elem[3] = new_ids_nc.at(elem[3]);
  });
  stopwatch.stop();

Here are the results I got:

generating dataset .. done
std::unordered_map, 1 thread: 1073.95ms 4651.66ms
std::unordered_map, 32 threads (single threaded insertion): 1253.99ms 464.183ms
pmap4, 1 thread: 649.7ms 2574.57ms
pmap4, 32 threads: 353.275ms 217.594ms
pmap6, 1 thread: 365.098ms 2615.59ms
pmap6, 32 threads: 204.003ms 219.373ms

@greg7mdp
Owner

greg7mdp commented Oct 7, 2023

Here is your updated benchmark

PS: would you mind if I add it as an example in phmap?
PS2: your benchmark feels very familiar. I used to develop a graphics library for finite element analysis at Altair Engineering :-).

@samuelpmish
Author

thanks for the update, that was what I was missing!

> would you mind if I add it as an example in phmap?

sounds good to me

> your benchmark feels very familiar. I used to develop a graphics library for finite element analysis at Altair Engineering :-)

you've figured me out, it's taken from a finite element project. Thanks again for the help!

@greg7mdp
Owner

greg7mdp commented Oct 9, 2023

Oh, one more thing: your benchmark should be a bit faster if you reserve space in the map up front:

    new_ids.reserve(vertex_ids.size() * 110 / 100);
