feat: Parallelised folding in Gemini #550

Merged (3 commits) on Jun 22, 2023

Conversation

Rumata888 (Contributor)

Description

The folding operation in Gemini now uses parallelism. I also split gemini.hpp into gemini.cpp and gemini.hpp for faster compilation.

Checklist:

  • I have reviewed my diff in GitHub, line by line.
  • Every change is related to the PR description.
  • The branch has been merged with/rebased against the head of its merge target.
  • There are no unexpected formatting changes, superfluous debug logs, or commented-out code.
  • There are no circuit changes, OR a cryptographer has been assigned for review.
  • New functions, classes, etc. have been documented according to the doxygen comment format. Classes and structs must have @brief describing the intended functionality.
  • If existing code has been modified, such documentation has been added or updated.
  • No superfluous include directives have been added.
  • I have linked to any issue(s) it resolves.
  • I'm happy for the PR to be merged at the reviewer's next convenience.

@ledwards2225 (Collaborator)

I'm still digging into the details but I have to admit I'm finding the code quite hard to read. I think you said that you originally tried simply parallelizing each round of folding but that the gain was not significant. It seems like your current approach is to split portions of the d-many folding rounds across individual threads. Is this a significant improvement over the first approach? I ask because, as far as I can tell, the first approach would literally be a one-line change. If the difference is not significant, perhaps the added complexity is not worth it.
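
(For illustration only: the "one-line change" referred to here could look roughly like the sketch below, where OpenMP splits the per-coefficient loop of a single folding round across threads. `Fr` stands in for the field type, and the names `A_l`, `A_l_fold`, `u_l`, `n_l` mirror the snippets quoted later in this thread; the exact loop shape is an assumption, not the code in this PR.)

```cpp
#include <cstddef>

// Sketch: parallelise one folding round in place, computing
//   Aₗ₊₁[j] = Aₗ[2j] + uₗ·(Aₗ[2j+1] − Aₗ[2j])   for j in [0, nₗ/2),
// which equals (1 − uₗ)·even(Aₗ)[j] + uₗ·odd(Aₗ)[j].
template <typename Fr>
void fold_round_parallel(const Fr* A_l, Fr* A_l_fold, const Fr& u_l, size_t n_l)
{
// The "one line": let OpenMP distribute the per-coefficient loop across threads.
#pragma omp parallel for
    for (size_t j = 0; j < (n_l >> 1); ++j) {
        A_l_fold[j] = A_l[2 * j] + u_l * (A_l[2 * j + 1] - A_l[2 * j]);
    }
}
```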

@ledwards2225 (Collaborator) left a comment:

Very nice. Most of my comments are just points for general discussion around multithreading.

const size_t num_variables = mle_opening_point.size(); // m

const size_t num_threads = get_num_cpus_pow2();
constexpr size_t efficient_operations_per_thread = 64; // A guess of the number of operations per thread below which there is no benefit from parallelising
ledwards2225 (Collaborator):

We can always play with this later, but I'm just curious whether this is motivated by some experimentation or if it's really just a guess.

Rumata888 (Contributor, Author):

It's a guess. I think it would be helpful to actually check what parallel_for is useful for. I'll add an issue to concretely evaluate the size of computation for which it is efficient.
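
(A rough idea of how such an evaluation could be done, as a standalone micro-benchmark: time one folding round serially and with OpenMP for growing sizes and look for the crossover point. `double` stands in for field elements; the sizes, the use of OpenMP, and everything else here are assumptions, not the benchmark tracked in the issue below.)

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    using clock = std::chrono::steady_clock;
    const double u = 0.5; // stand-in for the folding challenge uₗ
    for (size_t n = 1 << 6; n <= (1 << 20); n <<= 2) {
        std::vector<double> a(n, 1.0);
        std::vector<double> out(n / 2);

        // Serial fold of n coefficients down to n/2.
        auto t0 = clock::now();
        for (size_t j = 0; j < n / 2; ++j) {
            out[j] = a[2 * j] + u * (a[2 * j + 1] - a[2 * j]);
        }
        auto serial_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - t0).count();

        // Same fold with the loop split across threads (compile with -fopenmp).
        auto t1 = clock::now();
#pragma omp parallel for
        for (size_t j = 0; j < n / 2; ++j) {
            out[j] = a[2 * j] + u * (a[2 * j + 1] - a[2 * j]);
        }
        auto parallel_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - t1).count();

        // Print a result value as well so the loops are not optimised away.
        std::printf("n = %zu: serial %lld ns, parallel %lld ns (out[0] = %f)\n",
                    n, (long long)serial_ns, (long long)parallel_ns, out[0]);
    }
    return 0;
}
```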

Rumata888 (Contributor, Author):

Issue: #553

// A_l_fold = Aₗ₊₁(X) = (1-uₗ)⋅even(Aₗ)(X) + uₗ⋅odd(Aₗ)(X)
auto A_l_fold = fold_polynomials[l + offset_to_folded].data();

parallel_for(num_used_threads, [&](size_t i) {
ledwards2225 (Collaborator):

As far as I can tell, parallel_for will determine the number of threads to use via OpenMP and essentially just ignores the input num_used_threads. Maybe this is fine, but I wonder if we should update parallel_for to call omp_set_num_threads() so the number of threads that is passed in is actually enforced.
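
(One possible shape for that is sketched below: instead of omp_set_num_threads(), which changes the default for subsequent parallel regions, a num_threads clause can cap just this one region. The wrapper name and signature are made up for illustration and are not the existing parallel_for.)

```cpp
#include <cstddef>
#include <functional>

// Hypothetical parallel_for variant that requests the given number of threads for
// this region via OpenMP's num_threads clause (the runtime may still provide fewer).
void parallel_for_capped(size_t num_iterations, size_t num_threads, const std::function<void(size_t)>& func)
{
#pragma omp parallel for num_threads(static_cast<int>(num_threads))
    for (size_t i = 0; i < num_iterations; ++i) {
        func(i);
    }
}
```

(In a call like the one quoted above, both arguments would presumably be num_used_threads, since each iteration handles one chunk.)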

ledwards2225 (Collaborator):

I suppose that since parallel_for defines a loop based on the number of threads it gets as input, num_used_threads will indeed be the maximum number of threads used for multithreading; there is just no guarantee that it won't be fewer.

Rumata888 (Contributor, Author):

Yes, that's why I get the number of available threads first and then compute num_used_threads based on that.

size_t num_used_threads = std::min(n_l / efficient_operations_per_thread, num_threads);
num_used_threads = num_used_threads ? num_used_threads : 1;
size_t chunk_size = n_l / num_used_threads;
size_t last_chunk_size = (n_l % chunk_size) ? (n_l % num_used_threads) : chunk_size;
ledwards2225 (Collaborator):

I think this robustness is nice to have but I wonder if in practice we can always assume (otherwise, enforce) that num_threads is a power of 2 and thus divides n_l.

Rumata888 (Contributor, Author):

Lots of 6-core laptops out there :)
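
(To make the non-power-of-2 case concrete, here is a sketch of the chunking arithmetic using the simpler "last chunk absorbs the remainder" convention. Variable names follow the snippet above, but the function, struct, and the last_chunk_size expression are illustrative and differ from the exact expression in the diff.)

```cpp
#include <algorithm>
#include <cstddef>

struct Chunking {
    size_t num_used_threads;
    size_t chunk_size;      // coefficients handled by each of the first threads
    size_t last_chunk_size; // coefficients handled by the final thread
};

// Split n_l coefficients across at most num_threads threads, requiring at least
// min_ops_per_thread operations per thread before spinning one up.
Chunking compute_chunking(size_t n_l, size_t num_threads, size_t min_ops_per_thread)
{
    size_t num_used_threads = std::min(n_l / min_ops_per_thread, num_threads);
    num_used_threads = num_used_threads ? num_used_threads : 1;
    size_t chunk_size = n_l / num_used_threads;
    // With e.g. n_l = 1024 and 6 threads: chunk_size = 170 and the last thread
    // takes 170 + 4 = 174, so every coefficient is covered exactly once.
    size_t last_chunk_size = chunk_size + (n_l % num_used_threads);
    return { num_used_threads, chunk_size, last_chunk_size };
}
```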
