Muhash parallel reduce -- optimize U3072 mul when LHS = one #581

michaelsutton · 2024-10-13T14:19:54Z

This PR tightens the performance of parallel muhash reduction by adding a special case inside the inner U3072::mul. The results suggest that using multiple threads now scales well, as opposed to before this change, where much of the gain was offset by the increased number of inner mul ops.

The optimization short-circuits the multiplication when the LHS is one: the RHS (other) is simply assigned to self. This case is especially frequent during the parallel reduce operation, where the identity (one) is used as the LHS at the start of each sub-computation.
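To illustrate the idea, here is a minimal sketch and not the actual code from this PR. `Elem` is a hypothetical toy stand-in for U3072 (a `u64` modulo a small prime instead of a 3072-bit field element), and `product_seq` / `product_par` are made-up helper names; scoped std threads stand in for the real parallel reduce. Each chunk starts from the identity, so the first `mul` of every sub-computation hits the short-circuit.

```rust
use std::thread;

// Hypothetical toy modulus; the real U3072 works modulo a 3072-bit prime.
const PRIME: u64 = 1_000_000_007;

// Toy stand-in for U3072 (assumption for illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Elem(u64);

impl Elem {
    const ONE: Elem = Elem(1);

    // Always store the value reduced modulo PRIME, so that plain
    // assignment in the short-circuit yields a canonical result.
    fn new(v: u64) -> Self {
        Elem(v % PRIME)
    }

    fn is_one(&self) -> bool {
        self.0 == 1
    }

    // self = self * other (mod PRIME)
    fn mul(&mut self, other: &Elem) {
        // Short-circuit: 1 * other == other, so skip the expensive multiply.
        if self.is_one() {
            *self = *other;
            return;
        }
        self.0 = ((self.0 as u128 * other.0 as u128) % PRIME as u128) as u64;
    }
}

// Sequential fold starting from the identity.
fn product_seq(items: &[Elem]) -> Elem {
    let mut acc = Elem::ONE;
    for item in items {
        acc.mul(item);
    }
    acc
}

// Chunked reduce: every chunk is folded on its own thread starting from
// Elem::ONE, then the partial products are combined sequentially.
fn product_par(items: &[Elem], chunks: usize) -> Elem {
    let chunks = chunks.max(1);
    let chunk_len = ((items.len() + chunks - 1) / chunks).max(1);
    let partials: Vec<Elem> = thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || product_seq(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    product_seq(&partials)
}
```

In this toy version the saved multiply is cheap, but with a 3072-bit element one skipped multiplication per sub-computation adds up as the number of chunks grows. Calling `product_par(&items, 8)` should return the same element as `product_seq(&items)`, since multiplication modulo a prime is associative and commutative.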

Benchmarks before 09d6679:

Benchmarking muhash txs/muhash seq: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.0s, enable flat sampling, or reduce sample count to 50.
muhash txs/muhash seq   time:   [1.7592 ms 1.7634 ms 1.7676 ms]
                        change: [+0.5061% +0.8974% +1.2930%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
muhash txs/muhash par 8 time:   [478.03 µs 479.56 µs 480.95 µs]
                        change: [+0.3087% +0.7933% +1.2579%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
muhash txs/muhash par 16
                        time:   [394.97 µs 397.13 µs 399.10 µs]
                        change: [-0.9620% -0.0286% +0.8966%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
muhash txs/muhash par 32
                        time:   [476.49 µs 486.11 µs 497.24 µs]
                        change: [-3.0751% -0.5378% +2.0900%] (p = 0.69 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

and after:

Benchmarking muhash txs/muhash seq: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
muhash txs/muhash seq   time:   [1.7494 ms 1.7541 ms 1.7584 ms]
                        change: [-1.5057% -1.1099% -0.7332%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
muhash txs/muhash par 8 time:   [334.88 µs 335.69 µs 336.46 µs]
                        change: [-29.835% -29.499% -29.127%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
muhash txs/muhash par 16
                        time:   [287.39 µs 288.74 µs 290.02 µs]
                        change: [-27.467% -26.816% -26.179%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe
muhash txs/muhash par 32
                        time:   [358.20 µs 361.12 µs 364.06 µs]
                        change: [-25.664% -24.294% -22.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
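
As a rough way to read these numbers, the speedup of each parallel configuration over the sequential run can be computed from the median times reported above. The helper below is a hypothetical sketch (the function name and the hard-coded constants are only for illustration; the values are copied from the criterion output in this description).

```rust
// Speedup of a parallel run relative to the sequential run,
// both given in microseconds.
fn speedup(seq_us: f64, par_us: f64) -> f64 {
    seq_us / par_us
}

// Median times copied from the benchmark output above (in microseconds).
const SEQ_BEFORE_US: f64 = 1763.4;
const SEQ_AFTER_US: f64 = 1754.1;
// (label, median before, median after)
const PAR_RUNS: [(&str, f64, f64); 3] = [
    ("par 8", 479.56, 335.69),
    ("par 16", 397.13, 288.74),
    ("par 32", 486.11, 361.12),
];

// Returns (label, speedup before, speedup after) for each parallel run.
fn speedup_table() -> Vec<(&'static str, f64, f64)> {
    PAR_RUNS
        .iter()
        .map(|&(label, before, after)| {
            (label, speedup(SEQ_BEFORE_US, before), speedup(SEQ_AFTER_US, after))
        })
        .collect()
}
```

For `par 8` this gives roughly 3.7x over sequential before the change and roughly 5.2x after; for `par 16`, roughly 4.4x before and 6.1x after.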

coderofstuff

Ran the benchmarks and it looks like a solid ~30% improvement

elichai

Looks good :)


michaelsutton added 5 commits October 12, 2024 20:18

semantic: add from ext methods

e129041

muhash from txs benchmark

30cd5f9

optimization: in u3072 mul test if lhs is one

09d6679

extract parallelism_in_power_steps

cba8922

comment

b5793ce

michaelsutton requested review from coderofstuff and elichai October 13, 2024 14:20

coderofstuff approved these changes Oct 13, 2024


elichai approved these changes Oct 13, 2024


michaelsutton merged commit 0df2de5 into kaspanet:master Oct 13, 2024
6 checks passed

michaelsutton deleted the muhash-new-opt branch October 13, 2024 17:54

0xA001113 mentioned this pull request Oct 24, 2024

[RK] Batch upstream merge, part 2 spectre-project/rusty-spectre#20

Merged
