feat: Parallelised folding in Gemini #550
Conversation
I'm still digging into the details, but I have to admit I'm finding the code quite hard to read. I think you said that you originally tried simply parallelizing each round of folding but that the gain was not significant. It seems like your current approach is to split portions of the d-many folding rounds across individual threads. Is this a significant improvement over the first approach? I ask because, as far as I can tell, the first approach would literally be a 1-line change. If the difference is not significant, perhaps the added complexity is not worth it.
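For reference, the "1-line change" alternative discussed above would look roughly like this: parallelize each folding round internally with a single OpenMP pragma, keeping the round-by-round structure sequential. This is a hypothetical sketch, not the PR's code; `Fr` is a stand-in for the field element type, and the names `fold_round`, `A_l`, `u_l` are illustrative, following the Aₗ₊₁(X) = (1-uₗ)⋅even(Aₗ)(X) + uₗ⋅odd(Aₗ)(X) relation quoted in the diff:

```cpp
#include <cstddef>
#include <vector>

using Fr = double; // stand-in for the field element type

// One round of Gemini-style folding: A_next[j] = (1-u_l)*A_l[2j] + u_l*A_l[2j+1].
// The single added pragma splits the round's iterations across threads while
// keeping the rounds themselves sequential. Compiled without -fopenmp, the
// pragma is ignored and the loop runs serially.
void fold_round(const std::vector<Fr>& A_l, std::vector<Fr>& A_next, Fr u_l)
{
    const size_t n_next = A_next.size(); // half the size of A_l
#pragma omp parallel for // the hypothetical "1-line change"
    for (size_t j = 0; j < n_next; ++j) {
        A_next[j] = (Fr(1) - u_l) * A_l[2 * j] + u_l * A_l[2 * j + 1];
    }
}
```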
Very nice. Most of my comments are just points for general discussion around multithreading.
const size_t num_variables = mle_opening_point.size(); // m

const size_t num_threads = get_num_cpus_pow2();
constexpr size_t efficient_operations_per_thread = 64; // A guess of the number of operations for which there
We can always play with this later, but I'm just curious whether this is motivated by some experimentation or if it's really just a guess.
It's a guess. I think it would be helpful to actually check what `parallel_for` is useful for. I'll add an issue to concretely evaluate the size of computation for which it is efficient.
Issue: #553
// A_l_fold = Aₗ₊₁(X) = (1-uₗ)⋅even(Aₗ)(X) + uₗ⋅odd(Aₗ)(X)
auto A_l_fold = fold_polynomials[l + offset_to_folded].data();

parallel_for(num_used_threads, [&](size_t i) {
As far as I can tell, `parallel_for` will determine the number of threads to use via OpenMP and essentially just ignores the input `num_used_threads`. Maybe this is fine, but I wonder if we should update `parallel_for` to call `omp_set_num_threads()` so that the number of threads passed in is actually enforced.
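A sketch of that suggestion (hypothetical; the real `parallel_for` lives elsewhere in barretenberg and may be structured differently): pin the thread count with `omp_set_num_threads()` before entering the parallel region, degrading to a serial loop when OpenMP is absent.

```cpp
#include <cstddef>
#include <functional>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical variant of parallel_for that enforces the requested thread
// count instead of letting the OpenMP runtime pick its own.
void parallel_for_enforced(size_t num_iterations, const std::function<void(size_t)>& func)
{
#ifdef _OPENMP
    omp_set_num_threads(static_cast<int>(num_iterations)); // enforce the cap
#endif
#pragma omp parallel for
    for (size_t i = 0; i < num_iterations; ++i) {
        func(i);
    }
}
```

One caveat with this design: `omp_set_num_threads()` changes runtime-wide state, so a `num_threads(...)` clause on the pragma itself would be the more contained way to achieve the same cap.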
I suppose since `parallel_for` defines a loop based on the number of threads it gets as input, `num_used_threads` will indeed be the maximum number of threads used for multithreading; there is just no guarantee that it won't be fewer.
Yes, that's why I get the number of available threads first and then compute `num_used_threads` based on that.
size_t num_used_threads = std::min(n_l / efficient_operations_per_thread, num_threads);
num_used_threads = num_used_threads ? num_used_threads : 1;
size_t chunk_size = n_l / num_used_threads;
size_t last_chunk_size = (n_l % chunk_size) ? (n_l % num_used_threads) : chunk_size;
I think this robustness is nice to have, but I wonder if in practice we can always assume (or otherwise enforce) that `num_threads` is a power of 2 and thus divides `n_l`.
Lots of 6-core laptops out there)
Description
The folding operation in Gemini now uses parallelism. I also split gemini.hpp into gemini.hpp and gemini.cpp for faster compilation.
Checklist:
- `@brief` comments describing the intended functionality have been added.
- The necessary `include` directives have been added.