Horizon Lite: Improve the performance and functionality of the batch-based indexer. #4566

sreuland · 2022-08-31T15:48:02Z

Context

The existing map-reduce batch job for index creation needs several improvements:

poor performance: the reduce job is slow when the target/source index is remote, for example S3 (jobs never complete; they run indefinitely, churning slowly through the account/tx merging routines).
low visibility on performance: there are no metrics or logs that expose I/O rates.
lack of flexibility: the reduce job operates on all modules, even if the map job specified only one module.

Suggestions

In the tx index merge routine, first query the 'source' index (which holds the map job output) for the tx/ folder, and skip iterating over all 255 tx prefixes if the map job output has no tx/ folder. (This happens when the map job was configured to not include transactions in its MODULES.)
We can change the entire map/reduce flow to use a shared persistent volume across all workers, then upload the volume to the remote store once at the end:
- have all map jobs write to a single on-disk volume or source of storage,
- the reduce jobs merge them together to the same on-disk source,
- final step uploads/syncs that disk to remote target index.
For account index merging, pre-download all of the map jobs' account summary files from the 'source' index and load them into a map of job_id:accountid -> true/false. The worker -> account -> read-all-map-jobs-for-account loop can then check for account presence first and avoid repeated network round trips to the remote 'source' index that would return an empty response anyway.
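
A minimal sketch of the presence checks described in the first and last suggestions, assuming the map job summaries have already been downloaded and parsed. The jobSummary type and the buildPresence, jobsToRead and anyTx helpers are hypothetical names for illustration, not part of the existing batch code:

```go
package main

import "fmt"

// jobSummary is a hypothetical in-memory view of one map job's output:
// whether it wrote a tx/ folder and which accounts appear in its summary file.
type jobSummary struct {
	hasTx    bool
	accounts map[string]bool
}

// accountKey builds the job_id:accountid key used by the presence map.
func accountKey(jobID int, account string) string {
	return fmt.Sprintf("%d:%s", jobID, account)
}

// buildPresence flattens all pre-downloaded summaries into one lookup map.
func buildPresence(summaries map[int]jobSummary) map[string]bool {
	presence := make(map[string]bool)
	for jobID, s := range summaries {
		for account := range s.accounts {
			presence[accountKey(jobID, account)] = true
		}
	}
	return presence
}

// jobsToRead returns only the job IDs whose output contains the account,
// so the merge loop can skip remote reads that would come back empty.
func jobsToRead(presence map[string]bool, jobCount int, account string) []int {
	var jobs []int
	for jobID := 0; jobID < jobCount; jobID++ {
		if presence[accountKey(jobID, account)] {
			jobs = append(jobs, jobID)
		}
	}
	return jobs
}

// anyTx reports whether at least one map job produced a tx/ folder.
// If none did, the tx merge can skip iterating the tx prefixes entirely.
func anyTx(summaries map[int]jobSummary) bool {
	for _, s := range summaries {
		if s.hasTx {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder account IDs; real summaries would come from the 'source' index.
	summaries := map[int]jobSummary{
		0: {hasTx: true, accounts: map[string]bool{"GABC": true}},
		1: {hasTx: false, accounts: map[string]bool{"GXYZ": true}},
	}
	presence := buildPresence(summaries)
	fmt.Println(jobsToRead(presence, len(summaries), "GABC"))
	fmt.Println(anyTx(summaries))
}
```

The lookup costs one download of the summary files up front, in exchange for skipping a remote read for every job/account pair that is known to be absent.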

Acceptance Criteria

It's entirely possible that this task can/should be broken down into many sub-tasks based on the above suggestions, but the general criteria for completion should be:

Add more output on metrics, such as upload times, for both the map and reduce jobs.
The reduce job does not do unnecessary work if the map job did not apply all modules, per the first suggestion above.
The performance of the reduce batch job is significantly improved, per all three suggestions.

2opremio · 2022-09-01T16:07:17Z

We should also consider using something other than S3, since we may not end up using S3 in production (for cost reasons).

sreuland · 2022-09-01T17:17:37Z

@Shaptic @2opremio, I re-worded the acceptance criteria per the scrum feedback to make this ticket's scope S3-agnostic and more about optimizing regardless of the 'target' index's interface (S3, file, others).

sreuland added feature request Ingestion Lite labels Aug 31, 2022

sreuland mentioned this issue Aug 31, 2022

exp/lighthorizon: Create pubnet indices for the MVP endpoints. #4475

Closed

sreuland moved this to Next Sprint Proposal in Platform Scrum Aug 31, 2022

sreuland added this to Platform Scrum Aug 31, 2022

sreuland added the horizonlight-scrum label Aug 31, 2022

Shaptic mentioned this issue Aug 31, 2022

Horizon Lite: Generate per-ledger indices for accounts #4567

Closed

3 tasks

Shaptic changed the title ~~exp/lighthorizon/cmd/batch: performance optimizations for reduce~~ Horizon Lite: Optimize the performance of the indexer reduce job. Aug 31, 2022

Shaptic changed the title ~~Horizon Lite: Optimize the performance of the indexer reduce job.~~ Horizon Lite: Improve the performance and functionality of the batch-based indexer. Aug 31, 2022

Shaptic removed the feature request label Aug 31, 2022

Shaptic mentioned this issue Sep 1, 2022

Horizon Lite - MVP Epic #4317

Closed

7 tasks

jcx120 moved this from Next Sprint Proposal to Backlog in Platform Scrum Sep 1, 2022

jcx120 mentioned this issue Sep 1, 2022

Horizon Lite - Productionization / Optimization Epic #4571

Closed

64 tasks

mollykarcher added the parked label May 5, 2023

mollykarcher closed this as completed May 5, 2023

github-project-automation bot moved this from Backlog to Done in Platform Scrum May 5, 2023
