-
Notifications
You must be signed in to change notification settings - Fork 6
feat: custom indexer with multi index support #455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
…ndex channels Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
Signed-off-by: William Hankins <william@sundae.fi>
lowhung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand the flow...
- When a new index is added, it spawns an actor per index, each with its own channel
- ChainSync messages come in on the subscription and fan out to all actors
- The actors process independently, then return a result with updated cursor entries
- Main loop collects responses + persists cursors + handles failures
| let raw = self.cursor.get("cursor")?; | ||
| async fn load(&self) -> Result<HashMap<String, CursorEntry>> { | ||
| let mut out = HashMap::new(); | ||
| let iter = self.partition.prefix("cursor/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] We reference this prefix "cursor/" in a few places, should we keep it as a const in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added "cursor/" as a const in 7b5d429.
|
|
||
| self.cursor.insert("cursor", raw)?; | ||
| for (name, point) in tips { | ||
| let key = format!("cursor/{name}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] A method in the cursor store to format the key and return it / get the prefix could be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added key_for, name_from_key, and prefix_iter helper methods in 7b5d429.
| } | ||
|
|
||
| #[derive(Debug)] | ||
| pub struct CursorSaveError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Could leverage thiserror here
| pub struct CursorSaveError { | |
| #[derive(Debug, thiserror::Error)] | |
| #[error("Failed to save cursor tips for: {failed:?}")] | |
| pub struct CursorSaveError { | |
| pub failed: Vec<String>, | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this suggestion! Switched to using thiserror in 7b5d429.
| halted: false, | ||
| }); | ||
|
|
||
| if force_restart || entry.halted { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍🏻
| new_tips.insert(name.clone(), entry.clone()); | ||
| change_sync_point(entry.tip, run_context.clone(), &sync_topic.to_string()).await?; | ||
| } | ||
| Ok(IndexResult::FatalResetError { entry, reason }) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear flow here 👍🏻
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn rollback_fails_then_reset_succeeds_clears_halt_and_updates_tip() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear tests 🔥 nicely done
Signed-off-by: William Hankins <william@sundae.fi>
Yes, that's the flow 😄 One extra detail: the index actors themselves handle halting on error and attempt their own reset on rollback failure. The main loop/manager just forwards events to the actors, persists tip/halting state into the cursor store, and removes any actors from the runtime that couldn't recover. |
| if force_restart || entry.halted { | ||
| index.reset(&default_start).await?; | ||
| entry.tip = default_start.clone(); | ||
| entry.halted = false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by this block. If an index started on slot 50000 and then halted with an error on slot 60000, why are we resetting back to 50000? That implies we successfully processed a few thousand blocks, and I don't think we need to undo that work.
In general, we should avoid resetting indexes if the caller didn't explicitly ask. In production, it will almost certainly cause downtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we reset to default_start when an index stops in a halted state is that we can't rely on receiving a corrective rollback. If a rollback fails or a block fails to decode, ChainSync won't send a second rollback message on the next run which leaves the index in a corrupted state with no further way to recover. Given that limitation, resetting on startup is the only reliable way to bring the index back to a healthy state.
| // If the rollback failed, attempt to reset the index | ||
| Err(_) => match wrapper.index.reset(&wrapper.default_start).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned about resetting if a rollback fails; it could fail for arbitrary reasons, including for transient issues like a DB failure.
But I think in practice it'll be fine to do this:
- rollbacks are relatively rare compared to roll forwards
- most transient failures which happen in a rollback would probably happen in a rollforward, sooner rather than later
- resets will almost certainly result in downtime in production, but the system will eventually recover by itself
So if it's possible to recover from a rollback without triggering this reset, that'd be ideal. But we can live without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason the rollback failure path triggers a reset is that we don't get a second rollback message if the first one fails. Without that, the index is stuck in a corrupted state with no further corrective signal coming from ChainSync. A reset is the only guaranteed way to bring it back to a healthy state. If you think a retry inside the module would meaningfully reduce resets from transient failures I can add that.
|
|
||
| let mut entry = cursors.get(&name).cloned().unwrap_or(CursorEntry { | ||
| tip: default_start.clone(), | ||
| halted: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little thing, can we default this to true? That way, we call reset every time an index is created, which means callers can add (re)initialization logic in the reset method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to default to halted: true in 100d2d4.
|
|
||
| async fn reset(&mut self, start: &Point) -> Result<Point>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we return a point from this? I'm not sure what it makes sense to return besides the start it was passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here was to give implementations the option to adjust their replay point on reset. For example, an index that only needs the last N blocks could implement reset to resume from a later point than the one provided. If we don't expect any index to diverge from the provided start, then returning a point isn't needed. I can switch it to () unless we want to support this kind of behavior.
Signed-off-by: William Hankins <william@sundae.fi>
|
Merging. @SupernaviX and I discussed a refactor to improve halt recovery on subsequent runs which is outlined in #461. This will be implemented in a follow up PR after our milestones have completed as these are not required changes. |
Description
This PR extends the
custom_indexermodule to support multiple independent indexes within a single indexer instance. Each index is registered usingadd_index, which creates a dedicated channel for receiving transaction and rollback events. All messages include a oneshot response channel, allowing the manager task to update cursor state and remove corrupted indexes.The error handling model is explicit and prevents corrupted index state from persisting across runs:
halted = trueflag.senders). On next startup, the persisted halt flag causes a fresh reset.Related Issue(s)
Completes #380
How was this tested?
index_actor(apply tx, rollback, reset paths).Checklist
Impact / Side effects
Adds safe recovery behavior and multi-index support to the custom indexer.
Reviewer notes / Areas to focus
Rollback failure behavior and the halt/persist/reset flow