
indexer-alt: separate updates for consistent sequential pipelines #20482

Closed

wants to merge 3 commits

Conversation

@amnn (Member) commented Dec 2, 2024

Description

Use object changes from transaction effects to figure out which changes to consistent tables correspond to new rows and which changes correspond to updates. This means we can avoid using INSERT ... ON CONFLICT DO UPDATE which requires postgres to try an insert, detect constraints, and then go for the update, which should hopefully improve performance. On the other hand, it means that these pipelines will not work at all if they are started at an arbitrary point in time (because the UPDATE-s will fail).
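To make the shape of the change concrete, here is a minimal sketch of the split write path, assuming a `StoredCoinBalance` model and a `values` builder method on the new DSL (both are assumptions; only `update_from` itself appears in this PR's diff):

```rust
use diesel::prelude::*;

// Hedged sketch, not the actual indexer code: new rows take a plain INSERT
// (no ON CONFLICT clause for postgres to evaluate), while existing rows take
// the bulk UPDATE ... FROM (VALUES ...) provided by diesel-update-from.
fn commit(
    conn: &mut PgConnection,
    inserts: &[StoredCoinBalance],
    updates: &[StoredCoinBalance],
) -> QueryResult<()> {
    diesel::insert_into(sum_coin_balances::table)
        .values(inserts)
        .execute(conn)?;

    update_from(sum_coin_balances::table)
        .values(updates) // `.values(...)` on the update builder is an assumed API
        .execute(conn)?;

    Ok(())
}
```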

The largest part of this change was adding support for bulk updates to diesel (see diesel-rs/diesel#2879). This requires opting in to breaking changes by exposing diesel's internals. To limit the fallout of that, this support has been added in its own crate.

Finally, as part of this change, I ran into a flag that can be set on model types, #[diesel(treat_none_as_default_value = ...)], which defaults to true. Setting this to false on models that contain optional values should improve statistics collection and may improve performance through prepared statement caching.

Test plan

Unit tests for update_from query generation, and E2E tests for running updates on a DB with the new DSL:

sui$ cargo nextest run -p diesel-update-from

Run the indexer before and after the change, dump the resulting tables and make sure the results are the same:

sui$ cargo run -p sui-indexer-alt -- generate-config > /tmp/indexer.toml
sui$ cargo run -p sui-indexer-alt -- indexer            \
  --remote-store-url https://checkpoints.mainnet.sui.io/ \
  --last-checkpoint 50000 --config /tmp/indexer.toml    \
  --pipeline sum_obj_types --pipeline sum_coin_balances

sui$ psql postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt
sui_indexer_alt=# COPY
    (SELECT object_id, object_version, owner_kind, owner_id FROM sum_obj_types ORDER BY object_id)
TO
    '/tmp/objs.csv'
WITH
    DELIMITER ',' CSV HEADER;
sui_indexer_alt=# COPY
    (SELECT object_id, object_version, owner_id, coin_balance FROM sum_coin_balances ORDER BY object_id)
TO
    '/tmp/coins.csv'
WITH
    DELIMITER ',' CSV HEADER;

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

@amnn self-assigned this Dec 2, 2024

@gegaowp (Contributor) left a comment:

Looks good overall; now I can relate more to the pain caused by diesel.

@@ -59,38 +60,49 @@ impl Processor for SumCoinBalances {
}
}

// Deleted and wrapped coins
// Do a first pass to add updates without their associated contents into the `values`
// mapping, based on the transaction's object changes.
for change in tx.effects.object_changes() {
Contributor:

TransactionEffectsAPI has these methods:

    fn created(&self) -> Vec<(ObjectRef, Owner)>;
    fn mutated(&self) -> Vec<(ObjectRef, Owner)>;
    fn unwrapped(&self) -> Vec<(ObjectRef, Owner)>;
    fn deleted(&self) -> Vec<ObjectRef>;
    fn unwrapped_then_deleted(&self) -> Vec<ObjectRef>;
    fn wrapped(&self) -> Vec<ObjectRef>;

it seems cleaner to use that instead of classifying object changes again?

@amnn (Member Author):

Maybe it's just me, but I don't find that cleaner. For me, object_changes is more uniform, because it separates out the detail of whether the object's ID was created/deleted from whether the object was added or removed from the global store, which means you can focus on just one part.

For example, in this case, we only care about visibility in the global store, and we are able to use the presence/absence of input/output versions to enumerate every possible case. As far as I can tell, if I wanted to do the same thing with the EffectsV1 compatibility APIs:

  • I would need to merge together the cases that I cared about (unwrapped and created go together, wrapped and deleted go together, mutated goes on its own),
  • I don't get the compiler's help to tell me I've handled every case,
  • the code around the part that figures out the update kind would need to be duplicated, and
  • when these compatibility APIs get called on V2 effects (the majority on chain these days), they will scan over all object changes every time, filtering changes into these buckets, only for us to merge those buckets back together.

Maybe I'm missing something though -- @gegaowp, if you see a clearer path, can you elaborate?
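To illustrate what "enumerate every possible case" looks like here, a minimal sketch of the classification, assuming an object change exposes optional input and output versions (parameter and variant names are assumptions, not the PR's actual code):

```rust
/// Hedged sketch of the classification discussed above; names are assumptions.
enum UpdateKind {
    /// No input version, an output version: created or unwrapped -> INSERT.
    Insert,
    /// Both versions present: mutated -> UPDATE.
    Update,
    /// An input version, no output version: deleted or wrapped -> DELETE.
    Delete,
}

fn classify(input_version: Option<u64>, output_version: Option<u64>) -> Option<UpdateKind> {
    // An exhaustive match: the compiler checks that every combination of
    // presence/absence is handled, which is the point made above.
    match (input_version, output_version) {
        (None, Some(_)) => Some(UpdateKind::Insert),
        (Some(_), Some(_)) => Some(UpdateKind::Update),
        (Some(_), None) => Some(UpdateKind::Delete),
        // Never visible in the global store: nothing to write.
        (None, None) => None,
    }
}
```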

@gegaowp (Contributor) commented Dec 5, 2024:

The main reason I think the compatibility API would help is that, with object changes, the matching logic to derive UpdateKind is duplicated across files. If we use the compatibility API, we can union those buckets to get insert / update / delete; if so, is the UpdateKind enum still necessary?

> I don't get the compiler's help to tell me I've handled every case

this is a good point.

@@ -14,6 +14,7 @@ use crate::schema::{

#[derive(Insertable, Debug, Clone, FieldCount)]
#[diesel(table_name = kv_objects, primary_key(object_id, object_version))]
#[diesel(treat_none_as_default_value = false)]
Contributor:

👍

},
]);

assert_display_snapshot!(debug_query::<Pg, _>(&query), @r###"UPDATE "objects" SET "version" = excluded."version", "kind" = excluded."kind", "owner" = excluded."owner", "type_" = excluded."type_" FROM (VALUES ($1, $2, $3, $4, $5), ($6, $7, $8, $9, $10)) AS excluded ("object_id", "version", "kind", "owner", "type_") WHERE ("objects"."object_id" = excluded."object_id") -- binds: [[1, 2, 3], 1, 2, None, Some("type"), [4, 5, 6], 2, 3, Some([7, 8, 9]), None]"###);
Contributor:

👍

@lxfind (Contributor) commented Dec 4, 2024

Shall we treat this as an experiment first?
It would break the ability to benchmark an arbitrary range of checkpoint data, and it introduces non-trivial complexity.
So I think we should merge it only if we are sure it's valuable and worth the cost.

.execute(conn),
)),
Either::Left(Either::Right(update)) => Either::Left(Either::Right(
update_from(sum_obj_types::table)
Contributor:

It seems quite a big effort to introduce a new diesel-update-from crate just so we can use it here and in the equivalent place in sum_coin_balances. Could we just use a raw query here instead and avoid adding that crate? The code in that crate seems to do a lot just to make diesel happy. I don't fully understand the impl blocks and I fear it might be hard to maintain.

@amnn (Member Author):

Yeah, the complexity in diesel-update-from is not great. The rationale for isolating it in its own crate is that it is unlikely to change, and we should treat it as if it were part of diesel, so maintenance should be minimal. But I would be lying if I said there would be none (the fact that it requires opting in to breaking changes alone is a sign that there may be some).

To me, this is the argument for not using diesel, though. If it does not support all the features we need, and we do not feel able to add and maintain those features ourselves, it is going to bias us towards doing things in a sub-optimal way to fit into the feature set that diesel offers conveniently. We know we cannot afford that, and this is not the first time it has bitten us:

  • Our inserts and updates bind every value rather than UNNEST-ing arrays of values.
  • We can't write comparisons over tuples, which would help postgres plan those queries more efficiently.
  • We can't use CTEs.
  • ...I am sure there are other things that I can't remember, and this is not to mention that the errors diesel produces when we make a mistake are horrific.

On each of these occasions, we either put up with the inefficiency or drop into raw queries, where we don't get support for detecting schema changes, type marshaling, etc. The story would be similar here -- it is possible to use raw queries, but it defeats the point of the abstraction if we lose its safety features exactly when the query gets more complex.
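As an example of that trade-off, the UNNEST shape from the first bullet above is only reachable today through a raw query; a hedged sketch with illustrative table and column names, which loses diesel's schema checking in the process:

```rust
use diesel::{
    sql_query,
    sql_types::{Array, BigInt, Bytea},
    PgConnection, QueryResult, RunQueryDsl,
};

// Hedged sketch: each column binds a single array, keeping the bind count
// constant regardless of how many rows are written -- but because this is a
// raw query, diesel no longer checks it against the schema for us.
fn bulk_insert(
    conn: &mut PgConnection,
    ids: Vec<Vec<u8>>,
    versions: Vec<i64>,
) -> QueryResult<usize> {
    sql_query(
        "INSERT INTO sum_obj_types (object_id, object_version) \
         SELECT * FROM UNNEST($1::BYTEA[], $2::BIGINT[])",
    )
    .bind::<Array<Bytea>, _>(ids)
    .bind::<Array<BigInt>, _>(versions)
    .execute(conn)
}
```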

@amnn (Member Author):

Remembered another one -- diesel's query types don't implement Clone, so we needed to bend over backwards in GraphQL to do things like "explain a query first, and then run it", or "print out the query".

};

match values.entry(object_id) {
Entry::Occupied(entry) => {
ensure!(entry.get().object_version > object_version);
Contributor:

Hmm, how does this version of the code work at all? Wouldn't we always hit this case (because we inserted into values on line 83) and error out as soon as we have an object touched by multiple transactions in a checkpoint?

@amnn (Member Author):

In the previous version of the code, the first loop handled wrapped and deleted objects, and this loop handles modified and created objects, so they would never touch the same object ID (if the transaction deleted/wrapped the object, it's not going to be in the list of output objects we're iterating over here).

}
}

let update_chunks = updates.chunks(UPDATE_CHUNK_ROWS).map(Either::Left);
let insert_chunks = inserts
.chunks(UPDATE_CHUNK_ROWS)
Contributor:

I figured we may want to use a separate constant for insert chunk size?

@amnn (Member Author):

I don't think so -- because the chunk size is determined by the number of binds, which is the same in both cases.
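A sketch of that reasoning, assuming the chunk constant is derived from postgres's bind limit (the u16 cap on binds per statement is a wire-protocol fact; that UPDATE_CHUNK_ROWS is computed this way is an assumption for illustration):

```rust
// Postgres's extended query protocol caps a single statement at u16::MAX
// (65535) bind parameters. Both the insert and the update bind every field of
// every row, so the row budget depends only on the per-row field count, not
// on which statement is being built -- hence one chunk constant for both.
const MAX_BINDS: usize = u16::MAX as usize;

const fn max_chunk_rows(fields_per_row: usize) -> usize {
    MAX_BINDS / fields_per_row
}
```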

@@ -42,10 +42,10 @@ impl Handler for WalCoinBalances {
object_id: value.object_id.to_vec(),
object_version: value.object_version as i64,

owner_id: value.update.as_ref().map(|o| o.owner_id.clone()),
owner_id: value.value.as_ref().map(|o| o.owner_id.clone()),
Contributor:

`value.value` is not super readable. Could we change it to something like `update.new_value` here and in the other `wal-` pipeline?

@amnn (Member Author):

Sure, we can change that.


#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum UpdateKind {
/// Object was created or unwrapped at this version.
Contributor:

I like the comments here mapping object change terms to db update operations.

@amnn (Member Author) left a comment:

Yeah, happy to run this as an experiment first, but this has also renewed my energy around trying an alternative to diesel, because it feels like it has held us back too many times.

As an aside, while the indexer does run from arbitrary checkpoints today, it's not correct to do that, both in terms of the results that are written to the table and the performance of those writes (I believe the fact that those writes are treated as inserts rather than updates makes a big difference to performance).


amnn added 3 commits December 15, 2024 15:48
## Description

Although postgres supports bulk-updating rows using `VALUES`, diesel
does not natively support it. This change adds support for this. It is
in its own crate so that we can limit the fallout of depending on the
diesel breaking changes feature (which we need to depend on to access
the types that diesel converts collections of model types into, ready to
be inserted into the DB).

## Test plan

Unit tests for the query generation, and E2E tests for running updates
on a DB with this new DSL:

```
sui$ cargo nextest run -p diesel-update-from
```
## Description

Use object changes from transaction effects to figure out which changes
to consistent tables correspond to new rows and which changes correspond
to updates.

This means we can avoid using `INSERT ... ON CONFLICT DO UPDATE` which
requires postgres to try an insert, detect constraints, and then go for
the update, which should hopefully improve performance.

On the other hand, it means that these pipelines will not work at all if
they are started at an arbitrary point in time (because the `UPDATE`-s
will fail).

## Test plan

Run the indexer before and after the change, dump the resulting tables
and make sure the results are the same:

```
sui$ cargo run -p sui-indexer-alt -- generate-config > /tmp/indexer.toml
sui$ cargo run -p sui-indexer-alt -- indexer            \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --last-checkpoint 50000 --config /tmp/indexer.toml    \
  --pipeline sum_obj_types --pipeline sum_coin_balances

sui$ psql postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt
sui_indexer_alt=# COPY
    (SELECT object_id, object_version, owner_kind, owner_id FROM sum_obj_types ORDER BY object_id)
TO
    '/tmp/objs.csv'
WITH
    DELIMITER ',' CSV HEADER;
sui_indexer_alt=# COPY
    (SELECT object_id, object_version, owner_id, coin_balance FROM sum_coin_balances ORDER BY object_id)
TO
    '/tmp/coins.csv'
WITH
    DELIMITER ',' CSV HEADER;
```
## Description

Add `#[diesel(treat_none_as_default_value = false)]` to model types that
include optional fields. This affects how those fields are written out
to SQL when they contain `None`. Previously (and by default), those
fields would be represented by the `DEFAULT` keyword, and after this
change, they will be represented by a parameter binding, which will be
bound to `NULL`.

This is semantically identical in our case, because we don't set default
values, but it also results in less variety in prepared statements
(because regardless of the content of fields, they will now all be
represented by a binding), which will improve grouping of statistics
per-statement, and could also improve performance, if those prepared
statements can be cached and re-used.
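To make the difference visible, a hedged sketch using diesel's `debug_query` (the same tool the snapshot test earlier in this PR uses), with an illustrative table and model rather than the real schema:

```rust
use diesel::{debug_query, insert_into, pg::Pg, prelude::*};

diesel::table! {
    kv_objects (object_id) {
        object_id -> Bytea,
        serialized_object -> Nullable<Bytea>,
    }
}

// Illustrative model, not the real schema. With the flag at its default
// (true), a `None` field renders as the DEFAULT keyword, so the statement
// text varies with which fields are `None`; with `false`, it renders as a
// bind placeholder bound to NULL, giving one statement shape for every row.
#[derive(Insertable)]
#[diesel(table_name = kv_objects)]
#[diesel(treat_none_as_default_value = false)]
struct StoredObject {
    object_id: Vec<u8>,
    serialized_object: Option<Vec<u8>>,
}

fn main() {
    let row = StoredObject { object_id: vec![1], serialized_object: None };
    let query = insert_into(kv_objects::table).values(&row);
    // Roughly: INSERT INTO "kv_objects" ... VALUES ($1, $2), with $2 bound to
    // NULL. With the flag at its default, the same row would render as
    // ... VALUES ($1, DEFAULT) instead.
    println!("{}", debug_query::<Pg, _>(&query));
}
```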

## Test plan

Re-run indexer on first 100000 checkpoints.
@amnn (Member Author) commented Dec 17, 2024

Experiments showed that even with this change we weren't matching the performance of the new obj_info and coin_balance_buckets pipelines, even before we landed further improvements to them. Given we are fairly confident that those pipelines will serve us well in production, I think it's safe for us to wind this experiment down and avoid taking on the extra complexity in our codebase!

@amnn closed this Dec 17, 2024
amnn added a commit that referenced this pull request Dec 20, 2024
## Description

Pick out this change from the bigger #20482. `DEFAULT` is not
something we use, and treating `None` as `DEFAULT` means we have
to prepare different statements for each insert/update based on
where the `None`s appear.

By treating `None` as `NULL`, we get to re-use prepared statements more,
and the query sampler will do a better job grouping similar
inserts/updates.

## Test plan

Run indexer on coin balances and object info, locally, for the first
10,000 checkpoints.
@amnn mentioned this pull request Dec 20, 2024
amnn added a commit that referenced this pull request Dec 30, 2024
amnn added a commit that referenced this pull request Dec 30, 2024