
feat: add ETL to TxLookup stage #6655

Merged 43 commits into feat/static-files on Feb 22, 2024

Conversation

@joshieDo (Collaborator) commented Feb 18, 2024

built on top of #6654

Changes

Adds ETL to the TxLookup stage, since inserting hashes (as opposed to appending them) is very, very expensive.

Benches

maxperf holesky @ block 925555:

  • PR_6654 without ETL: 47s-50s
  • PR_6654 with ETL: 13s
  • main: 56-59s

maxperf mainnet @ tip:

  • ETL @ tip: 22min
  • main @ 23rd Nov 2023: 5h

@joshieDo added the C-enhancement (New feature or request), A-staged-sync (Related to staged sync (pipelines and stages)), and A-static-files (Related to static files) labels on Feb 18, 2024
@joshieDo changed the base branch from feat/static-files to joshie/concurrent-range-fetch on February 18, 2024 15:28
joshieDo and others added 4 commits February 18, 2024 15:30
Co-authored-by: Matthias Seitz <matthias.seitz@outlook.de>
Co-authored-by: Matthias Seitz <matthias.seitz@outlook.de>
crates/stages/src/stages/tx_lookup.rs (review thread resolved)
Comment on lines +134 to +141
    txhash_cursor.append(
        RawKey::<TxHash>::from_vec(hash),
        RawValue::<TxNumber>::from_vec(number),
    )?;
} else {
    txhash_cursor.insert(
        RawKey::<TxHash>::from_vec(hash),
        RawValue::<TxNumber>::from_vec(number),

Member:

Nice. I wonder how much the usage of RawKey can impact perf; are we missing out on any compression due to its usage?

Comment on lines +125 to +127
let total_hashes = hash_collector.len();
let interval = (total_hashes / 10).max(1);
for (index, hash_to_number) in hash_collector.iter()?.enumerate() {

Member:

The ETL API is remarkably clean: loop over the range inserting into the collector, then at the end of the range iterate over the collector and write to the DB. We should document this general pattern.
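
A minimal, self-contained sketch of that collect-then-write flow, for whenever this gets documented. The `EtlCollector` type, its methods, and the `BTreeMap` standing in for the database table are illustrative assumptions; the collector in this PR is constructed with a temp directory and a byte budget and presumably spills sorted buffers to disk, while this sketch keeps everything in memory to show only the control flow.

```rust
use std::collections::BTreeMap;

/// Hypothetical collector: accumulate unsorted pairs cheaply, then hand them
/// back in key order so the consumer can write sequentially.
struct EtlCollector {
    buffer: Vec<(Vec<u8>, u64)>,
}

impl EtlCollector {
    fn new() -> Self {
        Self { buffer: Vec::new() }
    }

    /// Extract phase: just push, no ordering work yet.
    fn insert(&mut self, key: Vec<u8>, value: u64) {
        self.buffer.push((key, value));
    }

    fn len(&self) -> usize {
        self.buffer.len()
    }

    /// Load phase: yield entries sorted by key.
    fn iter_sorted(mut self) -> impl Iterator<Item = (Vec<u8>, u64)> {
        self.buffer.sort_unstable_by(|a, b| a.0.cmp(&b.0));
        self.buffer.into_iter()
    }
}

fn main() {
    // Stand-in for a table that is cheap to append to in key order.
    let mut table: BTreeMap<Vec<u8>, u64> = BTreeMap::new();
    let mut collector = EtlCollector::new();

    // 1) Walk the source data in whatever order it arrives (tx number order
    //    here), inserting into the collector instead of the table.
    for tx_number in 0u64..1_000 {
        let tx_hash = tx_number.to_be_bytes().to_vec(); // placeholder "hash"
        collector.insert(tx_hash, tx_number);
    }

    // 2) At the end of the range, drain the collector in sorted key order and
    //    write to the table in one sequential pass.
    let total = collector.len();
    for (index, (hash, number)) in collector.iter_sorted().enumerate() {
        table.insert(hash, number);
        if index % (total / 10).max(1) == 0 {
            println!("progress: {:.2}%", (index as f64 / total as f64) * 100.0);
        }
    }
}
```

The sorting cost is paid once, up front, so the final database write is a single ordered pass instead of many random-order inserts.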

info!(target: "sync::stages::transaction_lookup", ?append_only, progress = %format!("{:.2}%", (index as f64 / total_hashes as f64) * 100.0), "Writing transaction hash index");
}

if append_only {

Member:

does this do anything meaningful anymore?

@joshieDo (Collaborator, Author) Feb 19, 2024:

Yes, the first run will be append_only; subsequent runs won't be, although they will still benefit from ETL when dealing with large sets.

crates/stages/src/stages/tx_lookup.rs (outdated review thread, resolved)
crates/stages/src/stages/tx_lookup.rs (review thread resolved)
Co-authored-by: Alexey Shekhirin <a.shekhirin@gmail.com>
Base automatically changed from joshie/concurrent-range-fetch to feat/static-files February 20, 2024 18:47
Comment on lines +100 to +101
let mut hash_collector: Collector<TxHash, TxNumber> =
    Collector::new(Arc::new(TempDir::new()?), 500 * (1024 * 1024));

Member:

any reason to make buffer capacity configurable here?

@joshieDo (Collaborator, Author):

Related to #6696, but it should make things slightly faster with bigger chunk sizes.
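
A hypothetical sketch of what a configurable buffer capacity could look like; the `EtlConfig` name and `file_size` field are assumptions for illustration, not existing reth config.

```rust
struct EtlConfig {
    /// Maximum bytes buffered in memory before the collector spills to disk.
    file_size: usize,
}

impl Default for EtlConfig {
    fn default() -> Self {
        // Matches the value hard-coded in this PR: 500 MiB.
        Self { file_size: 500 * 1024 * 1024 }
    }
}

// The stage would then build its collector along the lines of:
//     Collector::new(Arc::new(TempDir::new()?), etl_config.file_size)
```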

crates/stages/src/stages/tx_lookup.rs Show resolved Hide resolved
@@ -99,7 +100,7 @@ where

     fn flush(&mut self) {
         self.buffer_size_bytes = 0;
-        self.buffer.sort_unstable_by(|a, b| a.0.cmp(&b.0));
+        self.buffer.par_sort_unstable_by(|a, b| a.0.cmp(&b.0));

@mattsse (Collaborator) Feb 22, 2024:

how large do we expect this to be?

We might not benefit from par_sort here if the buffer is not large; if small buffers are possible, we could do a length check and only use par_sort when the buffer is large.
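
A minimal sketch of that length check, assuming rayon's `par_sort_unstable_by`; the `PAR_SORT_THRESHOLD` value and the standalone `flush` shape are assumptions for illustration, not the PR's code.

```rust
use rayon::slice::ParallelSliceMut;

/// Below this many entries a parallel sort is unlikely to beat the
/// single-threaded one because of thread coordination overhead.
const PAR_SORT_THRESHOLD: usize = 10_000;

fn flush(buffer: &mut Vec<(Vec<u8>, Vec<u8>)>) {
    if buffer.len() >= PAR_SORT_THRESHOLD {
        // Large buffer: spread the comparison work across rayon's thread pool.
        buffer.par_sort_unstable_by(|a, b| a.0.cmp(&b.0));
    } else {
        // Small buffer: the plain unstable sort has less overhead.
        buffer.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    }
}
```

Whether the check is worth it depends on how small the buffers can actually get, which is the question above.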

@joshieDo (Collaborator, Author):

Given it's only used in the pipeline, I'd say pretty large: the buffer maxes out at 100MB for headers and 500MB for txlookup.

     }
 }

 impl TransactionLookupStage {
     /// Create new instance of [TransactionLookupStage].
-    pub fn new(commit_threshold: u64, prune_mode: Option<PruneMode>) -> Self {
-        Self { commit_threshold, prune_mode }
+    pub fn new(chunk_size: u64, prune_mode: Option<PruneMode>) -> Self {

Collaborator:

The config and documentation need to be changed to reflect this.

@joshieDo merged commit aeaabfb into feat/static-files on Feb 22, 2024 (23 of 25 checks passed)
@joshieDo deleted the joshie/txlookup-etl branch on February 22, 2024 18:30