Investigate performance of parallelization #5

Closed
adlerjohn opened this issue Jun 17, 2020 · 10 comments · Fixed by #116

Comments

@adlerjohn
Member

Computing the erasure coding of each row and column, along with computing the namespace Merkle trees, can be done in parallel. Investigate the potential performance gains of doing so.
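To make the idea concrete, here is a minimal sketch of how each row's erasure coding could be pushed onto its own goroutine. This is not the actual rsmt2d API; the Codec interface and the share layout are stand-ins, and columns could be handled the same way once the row pass is done.

```go
package parallel

import "sync"

// Codec is a stand-in for an erasure codec that returns parity shares for one row.
type Codec interface {
	Encode(data [][]byte) ([][]byte, error)
}

// encodeRowsParallel extends every row with parity shares, one goroutine per row.
func encodeRowsParallel(codec Codec, rows [][][]byte) ([][][]byte, error) {
	extended := make([][][]byte, len(rows))
	errs := make([]error, len(rows))

	var wg sync.WaitGroup
	for i, row := range rows {
		wg.Add(1)
		go func(i int, row [][]byte) {
			defer wg.Done()
			parity, err := codec.Encode(row)
			if err != nil {
				errs[i] = err
				return
			}
			extended[i] = append(row, parity...)
		}(i, row)
	}
	wg.Wait()

	// Surface the first encoding error, if any.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return extended, nil
}
```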

@evan-forbes
Member

evan-forbes commented Nov 17, 2020

I made a quick and dirty implementation that can be found here, and the benchmarks + a more detailed write-up can be found here

On an 8-core CPU, this implementation is roughly 2-4x faster, depending on the data square size.
[figure: performance benchmarks]

I think this is a decent starting point, but judging by the trace, it looks like there is plenty of room for improvement.

I should have a chance to give the nmt library a similar treatment sometime in the next few days.

@evan-forbes
Member

I ran some more benchmarks for the nmt generation portion of hashing the data availability header, and fortunately (unfortunately?) everything went as one might expect. This task is even more parallelizable than generating the erasure data, but this implementation doesn't get the full benefit of such a parallelizable load.
[figure: performance benchmarks]
~6x performance gains on an 8-core CPU for a max-width data square, with very little overhead introduced.

Summary: These unoptimized implementations show that there are easy options for performance gains should that be required.
Write-up w/ code and traces: https://github.com/evan-forbes/rsmt2dbench
Implementation for multi-threaded nmt generation: https://github.com/evan-forbes/lazyledger-core/tree/evan/multithreaded-rsmt
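The general shape of multi-threaded nmt generation is a bounded worker pool over rows. The sketch below is illustrative only: rowRoot stands in for building a namespaced Merkle tree over one row and taking its root, and is not the actual nmt package API.

```go
package parallel

import (
	"runtime"
	"sync"
)

// rowRootsParallel computes one root per extended row using a fixed number of workers.
func rowRootsParallel(rows [][][]byte, rowRoot func([][]byte) []byte) [][]byte {
	roots := make([][]byte, len(rows))
	jobs := make(chan int)

	var wg sync.WaitGroup
	workers := runtime.NumCPU() // one worker per core for a CPU-bound load
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				roots[i] = rowRoot(rows[i])
			}
		}()
	}

	// Feed row indices to the workers, then wait for all roots.
	for i := range rows {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return roots
}
```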

Thanks to @adlerjohn for the fun idea
Shoutout to @musalbas and @liamsi for writing easy to read code

As for further investigation:

  • it might be useful to run pprof on the current implementations and see if there is any low-hanging fruit
  • see how fast the current implementations allow for creating proofs

@musalbas
Member

Thanks for this @evan-forbes, this looks like great work.

@Wondertan
Member

@evan-forbes, I didn't see this before. Nice graphs! Curious, how did you generate those?

@Wondertan
Member

A few ideas towards parallelization (it is getting annoying to wait for tests to compute the square already); a rough sketch of the worker-pool idea follows this list:

  • Parallelization should be enabled by default
  • With runtime.NumCPU() workers/routines
    • There is no point in having more than that for a CPU-heavy load.
  • The number of workers should be optionally configurable
  • We should spawn workers globally for any number of squares
    • If ten squares are getting re/computed simultaneously, they won't spawn their own workers but will rely on the global ones.
    • Instead of having some singleton square processor, we can rely on global vars and init spawning workers for us. If designed carefully, there won't be any issues with the global state.
    • The main advantage of this approach is that we won't need to modify the API.
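A minimal sketch of the global-workers idea above, assuming a package-level pool sized to runtime.NumCPU() and started from init so the public API stays unchanged; all names here are hypothetical:

```go
package parallel

import (
	"runtime"
	"sync"
)

var (
	jobQueue    chan func()
	setupOnce   sync.Once
	workerCount = runtime.NumCPU() // could be made configurable before first use
)

// init starts the shared workers once for the whole process, so callers never
// manage goroutines themselves.
func init() {
	setupOnce.Do(func() {
		jobQueue = make(chan func(), workerCount)
		for i := 0; i < workerCount; i++ {
			go func() {
				for job := range jobQueue {
					job()
				}
			}()
		}
	})
}

// Submit schedules fn on the shared pool and returns a function that waits for
// it to finish, so several squares being recomputed at once share the same
// fixed set of routines instead of spawning their own.
func Submit(fn func()) (wait func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	jobQueue <- func() {
		defer wg.Done()
		fn()
	}
	return wg.Wait
}
```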

@adlerjohn
Member Author

Additional thoughts regarding parallelization: whether rsmt2d erasure coding/NMT root computing should be done in parallel largely depends on how parallelized the user of the library is. For example, if only a single block is verified at a time, then indeed there will be gains if rsmt2d is parallelized. But if multiple blocks are verified in parallel, then the CPU will already be saturated.

That being said, there's always the case that once you've completed IBD and are at the tip, you'll be verifying one block at a time, and in such cases rsmt2d being parallelized is a requirement to saturate CPU cores.

@evan-forbes
Member

Curious, how did you generate those?

just Google sheets iirc

@Wondertan
Member

Users can potentially rely on Go's scheduler to distribute the load during sync, e.g., 10 blocks are syncing and the reconstruction for each runs in its own routine. This is the simplest approach, but it has its downsides:

  • The user has no control over how and in what order processing happens
    • Non-deterministic ordering of block processing; the scheduler picks routines pseudo-randomly.
    • Passing complexities to the user that can be solved more efficiently in the lib itself
  • Inability to parallelize within one block, as you mentioned
  • Performance implications
    • Allocating a new routine per block instead of having a constant number of routines.
    • More context switching for the scheduler

Additional thoughts regarding parallelization: whether rsmt2d erasure coding/NMT root computing should be done in parallel largely depends on how parallelized the user of the library is. For example, if only a single block is verified at a time, then indeed there will be gains if rsmt2d is parallelized. But if multiple blocks are verified in parallel, then the CPU will already be saturated.

The best long-term approach I would propose is to have the rsmt2d lib also do some form of concurrency and parallelization at the axis level: write some small global engine that processes axes and is not even aware of the square notion. It could even implement the Codec interface, but I have to think about it more.
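Very roughly, such an engine might look like the sketch below: it only ever sees individual axes and a codec function, never the square. All types and names are hypothetical, and this is not a claim about the actual rsmt2d internals.

```go
package parallel

// axisJob carries one axis (a row or column's shares) plus a channel for the result.
type axisJob struct {
	shares [][]byte
	result chan [][]byte
}

// AxisEngine fans axes out to a fixed set of workers; it never sees whole squares.
type AxisEngine struct {
	jobs  chan axisJob
	codec func([][]byte) [][]byte // stand-in for the actual erasure codec
}

// NewAxisEngine starts the given number of workers, each draining the shared job channel.
func NewAxisEngine(workers int, codec func([][]byte) [][]byte) *AxisEngine {
	e := &AxisEngine{jobs: make(chan axisJob), codec: codec}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range e.jobs {
				job.result <- e.codec(job.shares)
			}
		}()
	}
	return e
}

// Process submits one axis and blocks until it has been encoded.
func (e *AxisEngine) Process(shares [][]byte) [][]byte {
	result := make(chan [][]byte, 1)
	e.jobs <- axisJob{shares: shares, result: result}
	return <-result
}
```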

@Wondertan
Member

I was going through the code for another reason and realized that it wouldn't be possible to do this on the Codec level, unfortunately. Codec is a strictly synchronous interface, and the only way I see right now to parallelize repair by axis is to change the solveCrossword method.

@evan-forbes
Member

we can probably change this issue to be about actually implementing parallelization, with the above feedback from @adlerjohn and @Wondertan, instead of only investigating it.

@evan-forbes evan-forbes linked a pull request Sep 15, 2022 that will close this issue
@musalbas musalbas moved this to Done in Celestia Node Sep 20, 2022