Ability to reset the inner sub map #238
Thanks for the great explanation @marioroy. Can you do the reset inside the lambda, like this?

for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(x.first), x.second);
        set = map_str_int_type::EmbeddedSet(); // <------------------------------
    });
}
Thanks. Per your suggestion (removing …):

for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(x.first), x.second);
        set = map_str_int_type::EmbeddedSet();
    });
}

The …
Sorry, my bad. After the loop, use with_submap_m instead:

map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
    set = map_str_int_type::EmbeddedSet();
});
Thank you, @greg7mdp. Your suggestion works.

for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(x.first), x.second);
    });
    // The sub map is no longer needed. Reset the set to reclaim memory.
    map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
        set = map_str_int_type::EmbeddedSet();
    });
}

Notice "map to vector" completing in 2 seconds. This was taking 3 seconds previously.
Hey @marioroy, I just tried your …. Which parameters did you use for generating the input files?
Wow! Do not mind; llil4map.cc is final. Thank you for your help, @greg7mdp: (1) some time ago, you found an off-by-one error in chunking; (2) at the time, you helped with map to vector, allowing it to run in parallel; (3) and now, you helped ensure minimum memory consumption. Blessings and grace.

That's it, just as written in the comments for generating the input files.
I found my script for making the 92 input files + shuffled (requires 2.8 GB). On my machine (where llil4map.cc resides):

#!/bin/bash
# Script for generating input files for llil4map.cc.
mkdir -p /data/input
cp gen-llil.pl shuffle.pl /data/input
cd /data/input
# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >1; mv 1 big$n
done &
# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >2; mv 2 big$n
done &
# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >3; mv 3 big$n
done &
# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >4; mv 4 big$n
done &
wait

The default mode in llil4map.cc is long strings, with the fixed-length define left commented out:

// #define MAX_STR_LEN_L (size_t) 12

My PerlMonks friend eyepopslikeamosquito helped me learn C++. I enjoy parallel processing. So naturally, I tried: (1) chunked IO in get properties using …
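As an aside, a minimal sketch of what chunked IO in "get properties" can look like with OpenMP. The chunk size, buffer handling, and the process_chunk() helper are illustrative assumptions, not the actual llil4map.cc implementation (compile with -fopenmp):

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

static void process_chunk(const char* buf, std::size_t len) {
    // parse "word<TAB>count" lines here and update the concurrent map
    (void)buf; (void)len;
}

int main(int argc, char** argv) {
    std::vector<std::string> files(argv + 1, argv + argc);
    constexpr std::size_t CHUNK = 65536;

    #pragma omp parallel
    {
        std::vector<char> buf(CHUNK + 4096);   // room to finish the last line
        #pragma omp for schedule(dynamic) nowait
        for (long f = 0; f < (long)files.size(); ++f) {
            std::FILE* fh = std::fopen(files[f].c_str(), "rb");
            if (!fh) continue;
            std::size_t len;
            while ((len = std::fread(buf.data(), 1, CHUNK, fh)) > 0) {
                // read up to the next newline so a word is never split across chunks
                int c;
                while (len < buf.size() && buf[len - 1] != '\n' &&
                       (c = std::fgetc(fh)) != EOF)
                    buf[len++] = static_cast<char>(c);
                process_chunk(buf.data(), len);
            }
            std::fclose(fh);
        }
    }
}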
Thanks, I'll add it as an example then; it is a great example!
Added! Looks like you did an amazing job learning C++! I saw a small perf. issue in …. Here is my updated version.
Thanks. I learned a lot also, reading …. I tried your updated version and added a line to reset the set, which is especially beneficial for long strings. Without it, the "map to vector" takes 3 seconds for long strings and peaks near 18.4 GB.

for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(const_cast<str_type&>(x.first)), x.second);
        // reset the set (no longer needed) to reclaim memory early
        set = map_str_int_type::EmbeddedSet();
    });
}

I respect your decision if you prefer leaving the line commented out, long strings mode (line 119):

// #define MAX_STR_LEN_L (size_t) 12

Or default to fixed-length by uncommenting the line:

#define MAX_STR_LEN_L (size_t) 12

Here too, I respect your decision, disabled or enabled (line 46). I enabled it locally to validate parallel sort. Disabled takes 58.0 seconds for "vector stable sort" (long strings mode), or 21.5 seconds (fixed-length mode):

#define USE_BOOST_PARALLEL_SORT 1

Testing was done on a 32 core (64 logical threads) AMD Ryzen Threadripper 3970X machine with DDR4 memory at 3600 MHz.
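For reference, a hedged sketch of what a parallel-sort toggle in this spirit can look like. The element type, comparator, and thread count are illustrative placeholders, not the actual llil4map.cc code; parallel_stable_sort is one of Boost.Sort's documented parallel algorithms.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

#define USE_BOOST_PARALLEL_SORT 1
#if USE_BOOST_PARALLEL_SORT
#include <boost/sort/sort.hpp>   // parallel_stable_sort, block_indirect_sort, ...
#endif

using prop_t = std::pair<std::string, long long>;   // key, count

void sort_props(std::vector<prop_t>& v, unsigned nthreads) {
    // LLiL output order: count descending, then key ascending
    auto cmp = [](const prop_t& a, const prop_t& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    };
#if USE_BOOST_PARALLEL_SORT
    boost::sort::parallel_stable_sort(v.begin(), v.end(), cmp, nthreads);
#else
    std::stable_sort(v.begin(), v.end(), cmp);
#endif
}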
Thanks @marioroy, I reverted the defaults to the way you had them, and also added the …
That is great performance for the AMD 7950X CPU. The "write stdout" taking 3.2 seconds is likely due to the L3 cache size difference between the two CPUs. I am providing the following output because the 7950X CPU is amazing; in other words, my system is unable to reach a 2x faster total time despite having 64 logical cores. Building with …. It is impressive that "get properties" for the 7950X CPU completes in 12.3 seconds. Wow!

g++: …

clang++: …

Last year, I tried custom memory allocation. I was trying to reduce peak memory consumption, to no avail, because it greatly impacted "write stdout". Thanks for the reset suggestion. Memory consumption no longer peaks high.
It is weird that I don't have the same result for …
No, not weird at all; it is because of the random generator script. It is normal for the unique count to differ per machine or whenever refreshing the input files. If you run ….

Do you want to add the tmp dir to .gitignore?
Sure, will do, even though in my case I generate the files in my build directory. I tried with …
Oh, I wrongly assumed the location.
That is amazing. I checked and you have the same defaults. Mindfulness for users, and why is defaulting to fixed-length important? So the application can run with less memory consumption, with no surprises on machines with 8 to 16 GB. The reset was the missing piece.

#define USE_BOOST_PARALLEL_SORT 1
#define MAX_STR_LEN_L (size_t) 12

Folks may run with a partial list, not all 92 files. For example, just …
I have a question about the data files. They contain words followed by a number. Am I right to assume this number is always positive? Also, I added a couple of minor simplifications in the code; see this …
Yes, it is certain: the number is always positive.
It's mind boggling. We saw that …. The application runs in parallel at each level: (1) input, (2) map to vector, (3) sorting, and (4) output. Crazy fast. The coolness factor is the …. Randomness is beautiful: with the many sub maps, it's less likely for threads to be using the same sub map/mutex in a given time window or fraction of a second. If they do, no problem. The …
I had no idea about …. Thank you, @greg7mdp.
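For readers following along, a hedged sketch of the sub map scheme described above, using phmap's documented template parameters and lazy_emplace_l. The key, value, and lock types here are illustrative, not the exact llil4map.cc types (the thread mentions a spinlock is used for the per-submap mutex):

#include <mutex>
#include <string>
#include <utility>
#include <parallel_hashmap/phmap.h>

using map_str_int_type = phmap::parallel_flat_hash_map<
    std::string, long long,
    phmap::priv::hash_default_hash<std::string>,
    phmap::priv::hash_default_eq<std::string>,
    std::allocator<std::pair<const std::string, long long>>,
    6,             // 2^6 = 64 sub maps
    std::mutex>;   // per-submap lock

// Each thread can call this concurrently; only the owning sub map is locked.
inline void add_word(map_str_int_type& m, const std::string& k, long long n) {
    m.lazy_emplace_l(
        k,
        [&](map_str_int_type::value_type& p) { p.second += n; },          // key already present
        [&](const map_str_int_type::constructor& ctor) { ctor(k, n); });  // insert new key
}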
I'll play with it some more later. It is an interesting piece of code for sure!
Hey @marioroy, just wondering, is your solution the fastest one that exists for this problem?
Hi, @greg7mdp. I am not aware of anything faster. The llil4emh …. The ….

There is the llil4vec implementation. This appends to a vector, with no maps, so it may consume a lot of memory, particularly with repeating keys. It may run faster up to some number of cores, but does not scale well beyond that. If daring to run ….

llil4map peaks at less than 8 GB for 200 million keys, fixed-length=12. I created a backup plan, in case the number of unique keys were to exceed memory capacity :-) Run with …
Wow, thanks for all the information! You seem to have spent a lot of time on this. Is this something you are doing for work, or just for fun?
For fun, actually. I wondered if it was possible to solve the challenge. A Perl Monk, "eyepopslikeamosquito", tried the vector approach, stating that memory is cheap. My solutions took some time (I was new to C++). I chose the path that preferred low memory consumption. Your phmap C++ library made this possible (emhash as well). OpenMP made the parallelism possible. Chunked IO and a spinlock mutex enable scaling to many CPU cores. The reset capability ensures low memory consumption during map to vector; I had no idea there was a way to do so. Thank you for your tip. Boost's …
There is one more problem to resolve: long strings, or variable length keys. The std::string (std::basic_string) container is not memory efficient, unfortunately. In long strings mode, …. For comparison to …
I made a memory efficient version, llil4map2, using a vector of pointers to the phmap key-value pairs. That meant having to construct the phmap outside the block, with no reset. This results in sorting taking longer, possibly due to CPU cache misses. However, memory consumption is significantly less.
I removed the limit on the number of threads, for better sorting and output performance. This is helpful if data in RAM is scattered and not cache friendly.
Give me a couple of days; I'll provide a solution for that (a fixed size string which also supports long strings and is as memory efficient as possible).
Hi @marioroy, I added a new type, string_cnt. I also used a set instead of a map, storing the count with the key. Some other small changes and cleanup, I hope you don't mind; see this …
Hi @greg7mdp, Wow! I gave it a spin on this machine. This peaks at 6.9 GB. The lowest memory consumption thus far for supporting long strings.
At your convenience, please make one tiny change to allow parallel sort to run on all CPU threads, i.e. no limit. Replace …
I will study …
Hi @marioroy, I just made that update to remove the 32 thread limit.
I reached out to eyepopslikeamosquito to let him know. He triggered my curiosity in another thread to make an attempt at the original Long List is Long challenge. This became a great opportunity for learning OpenMP using C++. Thank you @greg7mdp, for the …
Blessings and grace, @greg7mdp. Thank you for the extra clarity. The …
Just earlier, I tried (for fun) putting …. What if …? Note: for long strings, I really mean variable length strings. Thanks,
If you want the same struct to work for both long and short strings, as I think we do, it has to contain at least a pointer (8 bytes) plus a count. If the count were never greater than 8, we could store it in the 3 lower bits of the pointer (which are always 0), so the pointer + count would fit in 8 bytes; we could then have either a 7 byte immediate string + count, or a pointer + count. If the count can be greater than 8, then we need more than 8 bytes, so 16 bytes is the minimum our struct will occupy (for alignment), unless we are even more tricky and move the pointers to an aligned address before dereferencing them. But honestly, I don't see how you can save more than a few bytes, and then you also reduce the size of the immediate strings that are supported.
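A small illustration of the size arithmetic above, on a typical 64-bit target (the struct and field names are made up for the example, not the actual string_cnt layout):

#include <cstdint>

// pointer + count: the 8-byte pointer alignment pads the struct to 16 bytes
struct ptr_and_count {
    char*    str;   // 8 bytes
    uint32_t cnt;   // 4 bytes, followed by 4 bytes of tail padding
};
static_assert(sizeof(ptr_and_count) == 16, "padded to pointer alignment");

// the same 16 bytes can instead hold a short immediate string plus the count,
// which is the immediate-string case described above
struct buf_and_count {
    char     buf[12];  // up to 11 chars + terminating null
    uint32_t cnt;
};
static_assert(sizeof(buf_and_count) == 16, "no padding needed");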
Please, disregard. So... "Instead of extra[3], store the flag (1 bit) indicating whether using a pointer in the …"
No problem; it is not obvious to understand if C++ is not your main language, for sure.
I see it clearly, right here :)

string_cnt(string_cnt&& o) noexcept {
if (o.extra[3]) {
str = o.str;
o.str = nullptr;
extra[3] = 1;
} else {
std::strcpy((char *)(&str), o.get());
extra[3] = 0;
}
cnt = o.cnt;
}
This is a move constructor, so it is stealing the data from o, the moved-from object. If …, I could add some comments.
I think I have an idea to make …
I understand your …
My favorite line: …. And the importance of …
Your beautification of the include directives put me to work :-) For consistency reasons, I thought to beautify the include directives as well. I created a thread at Perl Monks to let folks know: https://perlmonks.org/?node_id=11158711

I brought back llil4umap (the unordered_map variant). Your tip for releasing memory immediately resolved the long cleanup delay in that one. So, the umap version is nice to have for comparing the standard container vs. the powerful phmap library.
@greg7mdp Your solution is superb. Many thanks, I learned a lot. The ….

Blessings and grace,
@greg7mdp There is extra global cleanup time exiting the …. It dawned on me to do a test involving mixed lengths. I created a new set of input files (long*):

# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
perl gen-llil.pl long$n 200 12 1
perl shuffle.pl long$n >1; mv 1 long$n
done &
# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
perl gen-llil.pl long$n 200 3 1
perl shuffle.pl long$n >2; mv 2 long$n
done &
# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
perl gen-llil.pl long$n 200 12 1
perl shuffle.pl long$n >3; mv 3 long$n
done &
# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
perl gen-llil.pl long$n 200 3 1
perl shuffle.pl long$n >4; mv 4 long$n
done &
wait

For this round of testing, the UNIX time is captured for detecting any extra global/gc cleanup time. That is the case for ….
llil4map_greg: …

llil4map_long: …
WARNING: Do not run llil4map_long (peaks 32 GB); or process longa* longa* longa* (the "a" suffix) instead.

llil4map_fix16: …
Interesting! What happens is that when you use std::string, short strings are stored inline in the string object itself (the small string optimization, SSO), so no heap allocation occurs. So in your test, with the mix of 6 character strings and 15 character strings, you didn't allocate memory for the strings whether you commented … or not. However, using …. Probably if you tried with a mix of 6/11 characters, or 6/17 characters, the difference would be less. The aim with …
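A quick standalone check of the SSO threshold on a given standard library (not part of llil4map.cc; typical limits are 15 characters for libstdc++ and 22 for libc++):

#include <iostream>
#include <string>

int main() {
    std::string empty;
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << '\n';
    std::cout << "SSO capacity        = " << empty.capacity() << '\n';

    std::string a(15, 'x');   // fits in the SSO buffer on libstdc++ (no allocation)
    std::string b(16, 'x');   // exceeds it (heap allocation)
    std::cout << "15 chars -> capacity " << a.capacity()
              << ", 16 chars -> capacity " << b.capacity() << '\n';
}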
@greg7mdp Thank you for the clarity about std::string SSO. I wanted the testing to be an apples-to-apples comparison.

I will re-create with length 19 (to exceed the SSO limit), by passing alternating …. std::string does the same thing: a long exit time due to global cleanup.

llil4map_greg: …

llil4map_long: …

string_cnt shines, consuming less memory.
std::string and string_cnt, once past the SSO limit, result in significantly longer exit time due to global cleanup. If only ….

@greg7mdp Akin to the { sha1sum, sha224sum, sha256sum, sha384sum, sha512sum } naming, one can build llil4map supporting fixed-length modes { llil4map_12, llil4map_16, llil4map_20, llil4map_24, llil4map_30 }, plus llil4map_dyn (using string_cnt). The Chuma LLiL challenge is solved. :-)
It is indeed possible. I have previously modified the version of …
PS: I got …
The new update is wonderful. I like the fact that the private method ….

int_type count = fast_atoll64(found + 1);
std::string_view sv(beg_ptr, found - beg_ptr);
// Use lazy_emplace to modify the set while the mutex is locked.
set_ret.lazy_emplace_l(
sv,
[&](string_cnt_set_t::value_type& p) {
p.cnt += count; // called only when key was already present
},
[&](const string_cnt_set_t::constructor& ctor) {
ctor(std::move(sv), count); // construct value_type in place when key not present
}
);
Nice work; get_properties does run faster (set using strlen(s) + strcpy vs. set using s.size() + memcpy()). Before and after results: …
I re-compiled ….

void set(std::string_view s) {
static_assert(buffsz == N);
static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
// static_assert(sizeof(string_cnt_t) == N + extra_sz); // complained for string_cnt_t<20>

The moment has come :-) Let's try fixed-length 20, i.e. …
I bumped …
Wow! The improved …. At last, the LLiL challenge is solved. Thank you, @greg7mdp. The ….

P.S. Not fair :-) I too want to use …
That's great, thanks for all your work. And for fixing the static_assert in ….

You are welcome to use …
So much learning. I sped up get_properties by storing the length of the key with the count i.e. (count << 8) + (klen & 0xff).
void set(const char *s, uint8_t len) {
set(std::string_view{s, len});
}
std::size_t hash() const {
auto s = get();
std::string_view sv {s, cnt & 0xff};
return std::hash<std::string_view>()(sv);
}

I ran with 16 CPU cores. Before and after: …

get_properties is now as fast as my llil4map version (without string_cnt).
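For clarity, the packing above can be summarized as follows (the helper names are illustrative, not the gist's code):

#include <cstdint>

using int_type = uint64_t;

inline int_type pack(int_type count, uint8_t klen) { return (count << 8) + (klen & 0xff); }
inline uint8_t  key_len(int_type cnt)              { return cnt & 0xff; }
inline int_type count_of(int_type cnt)             { return cnt >> 8; }

// When a key is already present, the increment must preserve the length byte:
inline void incr(int_type& cnt, int_type by)       { cnt += (by << 8); }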
Very nice speedup! Congrats.
I created a gist (two files) containing ….

std::size_t hash() const {
auto s = getstr();
std::string_view sv {s, cnt & 0xff};
return std::hash<std::string_view>()(sv);
}

https://gist.github.com/marioroy/693d952b578792bf090fe20c2aaccad5

Edit: The gist contains three files; added …
Sorting an array runs faster for fixed length keys. That is the case for ….

union {
struct {
char * str;
char extra[extra_sz];
uint_t cnt;
} s_;
std::array<char, sizeof(s_) - sizeof(uint_t)> d_;
};
bool operator<(const string_cnt_t& o) const {
if (s_.extra[mark_idx] || o.s_.extra[mark_idx])
return std::strcmp(getstr(), o.getstr()) < 0;
else
return d_ < o.d_;
}

Before and after results: …
Temporarily, I decreased the number of threads to 8 for sorting. Look at sorting go. :-)
Let's try ….

Geez! All the union changes, just to improve sorting for fixed-length keys.
Hmm, this comparison is not necessarily correct if you have garbage after the terminating null character of the contained string, because the ….

If you …
Hi @greg7mdp, you are correct. There could be garbage; fixed by clearing in set. Thank you!

std::memset(f_.data(), 0, N);
std::memcpy(f_.data(), s.data(), std::min(len, N - 1));

Fixing the garbage issue allowed me to improve operator== as well:

bool operator==(const string_cnt_t& o) const {
if (s_.extra[mark_idx] || o.s_.extra[mark_idx])
return std::strcmp(getstr(), o.getstr()) == 0;
else
return std::memcmp(getdata(), o.getdata(), N) == 0;
}

Freeing memory is now handled with multiple threads, in ….

// Free dynamically allocated memory in parallel.
#pragma omp parallel for schedule(dynamic, 300000) num_threads(nthds)
for (size_t i = 0; i < propvec.size(); i++)
    propvec[i].free();

I added a fix-length only version, ….

// #include "string_cnt.h"   // original + key length stored with cnt
// #include "string_cnt_f.h" // fix-length only version
#include "string_cnt_u.h"    // union std::array

Thank you for …
Hi @marioroy, congrats on all the great work. I've been very busy at work, so I haven't had time to work on this or look at your latest changes, but hopefully I will soon. All the best!
Greetings, @greg7mdp. I made a new gist containing my final phmap and vector demonstrations. The gist contains a README and LLiL utilities. There are three C++ phmap and vector demonstrations using string_cnt_t.
I learned more C++ from your …
Hi @greg7mdp, Running …. Processing ….

int_type count = fast_atoll64(found + 1);
string_cnt s{(const char *)beg_ptr, (size_t)(found - beg_ptr), count};
// Use lazy_emplace to modify the set while the mutex is locked.
set_ret.lazy_emplace_l(
s,
[&](string_cnt_set_t::value_type& p) {
p.incr(count); // called only when key was already present
},
[&](const string_cnt_set_t::constructor& ctor) {
ctor(std::move(s)); // construct value_type in place when key not present
}
);
Hi @marioroy, thanks for the updates. On my side I've been busy and have not worked on …
I made an attempt at dynamic buffer allocation. Get properties decreased to 10.6s on a realtime Linux kernel, now similar to running on a non-realtime kernel.

https://gist.github.com/marioroy/d02881b96b20fa1adde4388b3e216163

The dynamic allocation logic is thread-safe, i.e. the first argument is an optional thread id for thread-safety, plus rolling back the dynamic memory position if the long string was already present.

int_type count = fast_atoll64(found + 1);
string_cnt s{tid, beg_ptr, (size_t)(found - beg_ptr), count};
// Use lazy_emplace to modify the set while the mutex is locked.
set_ret.lazy_emplace_l(
s,
[&](string_cnt_set_t::value_type& p) {
p.incr(count); // called only when key was already present
s.mem_rollback(tid);
},
[&](const string_cnt_set_t::constructor& ctor) {
// construct value_type in place when key not present
// long strings will be stored in dynamic memory storage
ctor(std::move(s));
}
);

Notice the lower memory consumption. And there too, get properties completes faster than before.
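A hedged sketch of the per-thread bump allocation with rollback described above (the names and structure are illustrative, not the gist's code; capacity growth and overflow checks are omitted for brevity):

#include <cstddef>
#include <cstring>
#include <vector>

struct arena_t {
    std::vector<char> buf;
    std::size_t pos = 0;
    explicit arena_t(std::size_t cap) : buf(cap) {}

    // Bump-allocate a copy of the string. Each thread owns its own arena
    // (indexed by thread id), so no locking is needed.
    const char* alloc_str(const char* s, std::size_t len) {
        char* p = buf.data() + pos;
        std::memcpy(p, s, len);
        p[len] = '\0';
        pos += len + 1;
        return p;
    }

    // Roll back the last allocation when the key turned out to be present already.
    void rollback(std::size_t len) { pos -= len + 1; }
};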
Thank you, @greg7mdp. Your …
Hi, @greg7mdp

I tested variable length keys. It turns out the llil4map.cc demonstration consumes the most memory. The reason is the inability to clear or reset the inner sub map. Without this ability, memory consumption peaks at 18.4 GB; otherwise, 15.1 GB with a reset_inner capability. Here, we get ready to insert the key,value pairs into a vector.

I added the reset_inner method to phmap.h, locally on my machine. Great news! This change enables llil4map to reach llil4hmap performance when processing variable length keys. Previously, the llil4map demonstration ran slower. Being able to clear/reset a sub map early is beneficial when the sub map is no longer needed. In the C++ snippet above, parallel threads process each sub map individually, "map to vector" below. I prefer for the map memory consumption to decrease while the vector memory increases, and not wait until the end to clear the map. No issues for fixed-length keys.

llil4map.cc: The sub maps are managed by the phmap library.

llil4hmap.cc: The sub maps are managed at the application level. Notice the similar times for "get properties" and "map to vector". The key hash_value is computed once only and stored with the key. I was curious at the time and left it this way. Now, I understand why the two phmap demonstrations did not perform similarly before. You may like to know how far behind emhash …. The llil4map is 0.9 seconds apart with the reset_inner capability. Previously, it was 2.2 seconds apart in total time. Please, no worries about phmap vs. emhash. Lacking the ability to clear/reset the sub map was the reason for being greater than 2 seconds apart (for variable length keys).

llil4emh.cc: Here too, the sub maps are managed at the application level, including the key hash_value stored with the key.