merging maps #234

gf777 · 2024-02-12T19:50:01Z

Hi @greg7mdp,
First of all thank you very much for providing this hash map, it's extremely efficient!
I would like to know if there is a way to do the following with the current functions.
Imagine you have to merge multiple hash tables that you generated with parallel-hashmap. The way I am currently doing this is that I load two tables at a time, then go through each key of the smaller table and insert it into the larger one. Now, with the code provided here https://greg7mdp.github.io/parallel-hashmap/ one can easily see how this can be multithreaded by having each thread checking a different hash however each thread still has to go through all keys (even though skipping many of them). It seems to me that it would be much more efficient if the threads could have access directly to the submaps and could merge them concurrently. Each thread would immediately know which hashes to check (only those of the assigned hashmap). It would be efficient also memory-wise because once the merge of one submap is done, the submap could even be deleted. Is that reasonable/doable?
Thank you in advance

greg7mdp · 2024-02-13T00:24:45Z

Hi @gf777 ,
Thanks for the kind workds, and also for using phmap.

You have the right idea, and indeed for your purpose this would work very well. The two tables to merge would have to have the same N template parameter (which is 4 by default), so that the two maps to merge have the same number of submaps, and the values are inserted into the submaps with the same index.

So say you are using N=4 and therefore you have 16 submaps, you could create 16 threads and have each thread merge the two matching submaps (with the same index). To access the submaps, you can use the get_inner function:

auto& inner = map1.get_inner(0);   // to retrieve the submap at index 0
auto& submap1 = inner.set_;        // can be a set or a map, depending on the type of map1

gf777 · 2024-02-14T17:34:40Z

Thanks for the prompt feedback @greg7mdp
Dumb question then, how do I achieve this:

    auto& inner = map1->get_inner(subMapIndex);   // to retrieve the submap at given index
    auto& submap1 = inner.set_;        // can be a set or a map, depending on the type of map1
    inner = map2->get_inner(subMapIndex);
    auto& submap2 = inner.set_;
    
    for (auto pair : submap1) { // for each element in map1, find it in map2 and increase its value
        
        DBGkmer &dbgkmerMap = submap2[pair.first]; // insert or find this kmer in the hash table

Error:
src//kreeq.cpp:944:38: error: type 'EmbeddedSet' (aka 'raw_hash_set<phmap::priv::FlatHashMapPolicy<unsigned long long, DBGkmer>, phmap::Hash<uint64_t>, phmap::EqualTo, std::allocator<std::pair<const unsigned long long, DBGkmer>>>') does not provide a subscript operator

I tried a number of things with .insert and .find but to no avail..
Thank you for the help!

gf777 · 2024-02-14T20:01:21Z

    auto& inner = map1->get_inner(subMapIndex);   // to retrieve the submap at given index
    auto& submap1 = inner.set_;        // can be a set or a map, depending on the type of map1
    auto& inner2 = map2->get_inner(subMapIndex);
    auto& submap2 = inner2.set_;
    
    for (auto pair : submap1) { // for each element in map1, find it in map2 and increase its value
        
        auto got = submap2.find(pair.first); // insert or find this kmer in the hash table
        if (got == submap2.end()){
            submap2.insert(pair);
        }else{

            DBGkmer& dbgkmerMap = got->second;

Apparently this works (verbose though)

gf777 · 2024-02-14T20:49:27Z

One more question: if I now want to increase the number of maps in the template, say when I declare it here:
std::vector<phmap::parallel_flat_hash_map<uint64_t, VALUE>*> maps; // all hash maps where VALUES are stored
Is there an easy way of declaring it ?

greg7mdp · 2024-02-15T01:51:44Z

Hum, did you find the answer to your question? I was looking in Kreeq and I see that you figured out how to declare parallelMap with 256 submaps (using N=8), and also your code for merging the submaps looks good.

Let me know if you have any other question, I'm happy to help.

gf777 · 2024-02-15T15:21:32Z

This seems to work like a charm, thanks! I would go as far as saying that it should be a core function XD
While I have your attention I could pick your brain on how to best estimate map size in constant time, currently I have this from another project:

uint64_t mapSize(parallelMap& m) {
    
    return (m.size() * (sizeof(DBGkmer) + sizeof(void*)) + // data list
     m.bucket_count() * (sizeof(void*) + sizeof(uint64_t))) // bucket index
    * 1.3; // estimated allocation overheads
    
}

I noticed you have some calculations in the readme as well. Thoughts?

greg7mdp · 2024-02-15T18:57:12Z

Correct size is:

uint64_t mapSize(parallelMap& m) {
   return m.capacity() * (sizeof(parallelMap::value_type) + 1) + sizeof(parallelMap);
}

Map is in essence an array of values + an array of bytes (hence the + 1). I think the above is pretty accurate.

gf777 · 2024-02-16T02:48:53Z

terrific! It seems to be working well
Thank you!

gf777 closed this as completed Feb 16, 2024

gf777 mentioned this issue May 13, 2024

issue under mac? #239

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merging maps #234

merging maps #234

gf777 commented Feb 12, 2024

greg7mdp commented Feb 13, 2024 •

edited

Loading

gf777 commented Feb 14, 2024 •

edited

Loading

gf777 commented Feb 14, 2024

gf777 commented Feb 14, 2024

greg7mdp commented Feb 15, 2024

gf777 commented Feb 15, 2024

greg7mdp commented Feb 15, 2024 •

edited

Loading

gf777 commented Feb 16, 2024

merging maps #234

merging maps #234

Comments

gf777 commented Feb 12, 2024

greg7mdp commented Feb 13, 2024 • edited Loading

gf777 commented Feb 14, 2024 • edited Loading

gf777 commented Feb 14, 2024

gf777 commented Feb 14, 2024

greg7mdp commented Feb 15, 2024

gf777 commented Feb 15, 2024

greg7mdp commented Feb 15, 2024 • edited Loading

gf777 commented Feb 16, 2024

greg7mdp commented Feb 13, 2024 •

edited

Loading

gf777 commented Feb 14, 2024 •

edited

Loading

greg7mdp commented Feb 15, 2024 •

edited

Loading