
v1.18: scheduler forward packets (backport of #898) #1663

Closed
wants to merge 9 commits into v1.18 from mergify/bp/v1.18/pr-898

Conversation


@mergify mergify bot commented Jun 10, 2024

Problem

  • swqos was built with the assumption that transactions are forwarded by validator nodes
  • without swqos being used, forwarding was in the process of being deprecated, since the behavior (without the swqos stuff) was useless
  • a major push in past weeks has led to operators adopting swqos forwarding side-deals
  • the new scheduler does not support forwarding, for the above reasons
  • the lack of support would force operators to choose between swqos side-deals and the new scheduler, which is less than ideal under the current circumstances

Summary of Changes

  • Add support for forwarding transactions in the new scheduler

Fixes #


This is an automatic backport of pull request #898 done by [Mergify](https://mergify.com).

@mergify mergify bot requested a review from a team as a code owner June 10, 2024 06:52
@mergify mergify bot added the conflicts label Jun 10, 2024

mergify bot commented Jun 10, 2024

Cherry-pick of fb35f19 has failed:

On branch mergify/bp/v1.18/pr-898
Your branch is up to date with 'origin/v1.18'.

You are currently cherry-picking commit fb35f1912e.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   core/src/banking_stage.rs
	modified:   core/src/banking_stage/forward_worker.rs
	modified:   core/src/banking_stage/scheduler_messages.rs

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   core/src/banking_stage/transaction_scheduler/prio_graph_scheduler.rs
	both modified:   core/src/banking_stage/transaction_scheduler/scheduler_controller.rs
	deleted by us:   core/src/banking_stage/transaction_scheduler/scheduler_metrics.rs
	both modified:   core/src/banking_stage/transaction_scheduler/transaction_state.rs
	both modified:   core/src/banking_stage/transaction_scheduler/transaction_state_container.rs
	both modified:   core/src/validator.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

apfitzge added 6 commits June 11, 2024 08:32
(cherry picked from commit fb35f19)

# Conflicts:
#	core/src/banking_stage/transaction_scheduler/prio_graph_scheduler.rs
#	core/src/banking_stage/transaction_scheduler/scheduler_controller.rs
#	core/src/banking_stage/transaction_scheduler/scheduler_metrics.rs
#	core/src/banking_stage/transaction_scheduler/transaction_state.rs
#	core/src/banking_stage/transaction_scheduler/transaction_state_container.rs
#	core/src/validator.rs
@apfitzge apfitzge force-pushed the mergify/bp/v1.18/pr-898 branch from 7978888 to 98f9e9b June 11, 2024 00:33
@apfitzge apfitzge requested a review from tao-stones June 11, 2024 07:42
@@ -181,8 +196,112 @@ impl SchedulerController {
}
}

/// Forward packets to the next leader.
fn forward_packets(&mut self, hold: bool) {


is this copypasta from the old scheduler?


No, this is all new. The old scheduler didn't have the same container structure, or the same checks we do here (actually checking if the tx can pay fees), or the timeout.
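
For illustration, a minimal sketch of the kind of forwarding pass described above: pop by priority, check that the transaction can pay fees, forward, and stop on a timeout. All of the type and method names here (Container, Forwarder, can_pay_fees, mark_forwarded, FORWARD_TIMEOUT) are hypothetical stand-ins, not the PR's actual API.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the scheduler's container and forwarder;
// none of these are the PR's actual types or method names.
struct TxId(u64);
struct Container;
struct Forwarder;

impl Container {
    fn pop(&mut self) -> Option<TxId> { None }
    fn can_pay_fees(&self, _id: &TxId) -> bool { true }
    fn mark_forwarded(&mut self, _id: &TxId) {}
    fn remove_by_id(&mut self, _id: &TxId) {}
}

impl Forwarder {
    fn try_forward(&mut self, _id: &TxId) {}
}

/// Sketch of a forwarding pass: pop ids by priority, drop transactions that
/// cannot pay fees, forward the rest, and either hold them (marked as
/// forwarded) or drop them, bailing out after a timeout.
fn forward_packets(container: &mut Container, forwarder: &mut Forwarder, hold: bool) {
    const FORWARD_TIMEOUT: Duration = Duration::from_millis(100);
    let start = Instant::now();

    loop {
        if start.elapsed() > FORWARD_TIMEOUT {
            break;
        }
        let Some(id) = container.pop() else { break };
        if !container.can_pay_fees(&id) {
            // Fee-payer check failed: drop instead of forwarding.
            container.remove_by_id(&id);
            continue;
        }
        forwarder.try_forward(&id);
        if hold {
            // Keep it buffered for our own upcoming leader slots.
            container.mark_forwarded(&id);
        } else {
            container.remove_by_id(&id);
        }
    }
}
```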

should_forward: bool,
},
/// Only used during transition.
Transitioning,


what does "transitioning" mean? is this like "preparing"?


It's an internal "hack" for transitioning from one state to another that allows us to take ownership of the fields without having to fill in dummy information in the other state.
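
This reads like the usual mem::replace idiom for moving fields out of an enum held behind &mut. A rough sketch with made-up variants and fields (not the PR's actual state machine):

```rust
enum SchedulerState {
    /// Buffering packets while not leader (fields are illustrative).
    Forwarding {
        buffered: Vec<u64>,
        should_forward: bool,
    },
    /// Actively scheduling while leader.
    Scheduling { buffered: Vec<u64> },
    /// Only used during transition: lets us take ownership of the previous
    /// variant's fields without inventing dummy values for the next one.
    Transitioning,
}

impl SchedulerState {
    fn become_leader(&mut self) {
        // Swap in the placeholder so the old state is owned by value here.
        match std::mem::replace(self, SchedulerState::Transitioning) {
            SchedulerState::Forwarding { buffered, .. } => {
                *self = SchedulerState::Scheduling { buffered };
            }
            other => *self = other, // already scheduling (or mid-transition)
        }
    }
}
```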

Comment on lines +215 to +217
let mut filter_array = [true; CHUNK_SIZE];
let mut ids = Vec::with_capacity(CHUNK_SIZE);
let mut txs = Vec::with_capacity(CHUNK_SIZE);


looks like we can do the same allocation optimization here


We can with ids, but not with txs, since txs holds references; even if you clear the vec, the borrow checker isn't aware that the references are gone.
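
A toy illustration of the point, assuming a plain Vec as the container (names made up): the ids buffer owns its contents and can be reused with clear(), while a buffer of borrowed txs has to be rebuilt every chunk.

```rust
/// Toy version of the chunked loop: `ids` is reused, `txs` is rebuilt.
fn drain_in_chunks(container: &mut Vec<String>) {
    const CHUNK_SIZE: usize = 4;
    // `ids` owns plain indices, so one allocation can be reused: clear() and
    // refill it for every chunk.
    let mut ids: Vec<usize> = Vec::with_capacity(CHUNK_SIZE);

    while !container.is_empty() {
        ids.clear();
        ids.extend(0..CHUNK_SIZE.min(container.len()));

        // `txs` holds references into `container`, so it must be rebuilt each
        // iteration. If it were hoisted out of the loop and only cleared, the
        // borrow checker would still consider `container` borrowed across
        // iterations (it cannot see that the references are gone) and the
        // mutable call below would not compile.
        let txs: Vec<&String> = ids.iter().map(|&i| &container[i]).collect();
        for tx in &txs {
            let _forwarded_len = tx.len(); // stand-in for "forward this tx"
        }

        // Mutation is allowed here because `txs` is no longer used.
        container.drain(..ids.len());
    }
}
```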

Comment on lines +228 to +229
let transaction = self.container.get_transaction_ttl(&id.id).unwrap();
txs.push(&transaction.transaction);


can we not do this in the loop above where we fill ids?


No. Popping ids is a mutating change on the container; this is getting a reference to something in the container.
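
In other words, the two phases cannot be interleaved: popping needs a mutable borrow of the container, while fetching the transaction only needs a shared one, so references can only be collected once the popping is done. A toy version with hypothetical types:

```rust
use std::collections::{BTreeMap, BinaryHeap};

// Made-up container: a priority queue of ids plus an id -> transaction map.
struct ToyContainer {
    priority_queue: BinaryHeap<u64>,
    transactions: BTreeMap<u64, String>,
}

impl ToyContainer {
    /// Mutating: removes the id from the priority queue.
    fn pop(&mut self) -> Option<u64> {
        self.priority_queue.pop()
    }
    /// Non-mutating: borrows the stored transaction.
    fn get(&self, id: u64) -> Option<&String> {
        self.transactions.get(&id)
    }
}

fn collect_chunk(container: &mut ToyContainer, chunk: usize) -> Vec<u64> {
    // Phase 1: pop ids, which needs exclusive access to the container.
    let mut ids = Vec::with_capacity(chunk);
    while ids.len() < chunk {
        match container.pop() {
            Some(id) => ids.push(id),
            None => break,
        }
    }
    // Phase 2: only now is it safe to take shared borrows into the container.
    let txs: Vec<&String> = ids.iter().filter_map(|&id| container.get(id)).collect();
    let _ = txs.len(); // forward `txs` here
    ids
}
```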

Comment on lines +284 to +286
while let Some(id) = self.container.pop() {
self.container.remove_by_id(&id.id);
}


seems like we could stand to implement a clear() on TransactionStateContainer and call .clear() on each of its members there


It'd be similar, but not what we want. Clearing would remove transactions that are "in-flight".

Example:

We are leader and schedule a transaction for processing.
We stop being leader, but the processing thread hasn't yet sent the finished message back to the scheduler thread.
If the scheduler clears the container, then we remove the id from the map, and when the scheduler gets the finished message it's for an unknown id (since we cleared it).
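
A hedged sketch of that distinction, with made-up types and fields: the end-of-leader drain pops only what is still queued, while a blanket clear() would also forget ids that are in flight with worker threads.

```rust
use std::collections::{BinaryHeap, HashMap};

// Made-up types: a queue of ids awaiting scheduling plus the full id map,
// which also covers transactions currently handed to worker threads.
struct ToyStateContainer {
    queue: BinaryHeap<u64>,
    state: HashMap<u64, String>,
}

impl ToyStateContainer {
    /// What the end-of-leader drain does: forget only the still-queued ids.
    fn drain_queued(&mut self) {
        while let Some(id) = self.queue.pop() {
            self.state.remove(&id);
        }
        // In-flight ids stay in `state`, so a later "finished" message from a
        // worker still resolves to a known id.
    }

    /// What a blanket clear() would do: in-flight ids become unknown, and the
    /// finished message that arrives later has nowhere to land.
    fn clear_everything(&mut self) {
        self.queue.clear();
        self.state.clear();
    }
}
```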

@diman-io

Hey there!

I kindly ask you to add two changes:

1. Add an argument to control this behavior:

  • Do not forward at all
  • Forward only what remains after the validator was the leader
  • Forward everything (as it is now)

The first and third options can be a conscious choice for some validators, so that is their responsibility (a rough sketch of such a switch follows below).

Constant forwarding can be abused. A low-staked node (but with enough stake to be considered staked) can send copies of transactions to high-staked nodes that are not the leader at the moment. With the current implementation, all of them will forward these transactions to the current leader on its behalf.

I ask that the second option be made the default behavior. Otherwise, the behavior will either be harmful, or not very useful in terms of forwarding as it was originally conceived (so that the user does not have to send their transaction multiple times).

2. Remove non-staked node filtering
I don't see the point in filtering out transactions coming from non-staked nodes while forwarding. If these transactions have a higher priority fee than those from staked nodes, why should they be filtered? This is not beneficial for the validator's economics, and overall it doesn't make sense for the priority fee. So, I ask to remove this censorship.
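
For reference, the kind of switch requested in point 1 could look roughly like the following. This is a hypothetical sketch, not an existing validator flag; all names are made up.

```rust
/// Hypothetical setting, not an existing validator flag: how the banking
/// stage should handle packets it did not process itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ForwardingPolicy {
    /// Never forward.
    None,
    /// Forward only what remains after our own leader slots
    /// (the default being requested here).
    AfterLeaderSlotsOnly,
    /// Forward everything, as this PR currently does.
    All,
}

fn parse_forwarding_policy(arg: &str) -> Result<ForwardingPolicy, String> {
    match arg {
        "none" => Ok(ForwardingPolicy::None),
        "after-leader-slots" => Ok(ForwardingPolicy::AfterLeaderSlotsOnly),
        "all" => Ok(ForwardingPolicy::All),
        other => Err(format!("unknown forwarding policy: {other}")),
    }
}
```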

@apfitzge


On 1, the intent here is a quick solution. The scheduler is not the right place for it, imo, but we needed something because of the quick adoption of swqos, and forwarding was historically done in block production. We wanted something simple that was backportable; for us this ended up being effectively the same behavior as the previous block-production method.

The first and third options: this can be a conscious choice for some validators. So, it's their responsibility.

Any longer-term solution we come up with for this, imo, should be opt-in and more configurable. But as I mentioned, many of us don't think pairing this swqos support with the scheduler is a good long-term design, so I'm not sure this configurability will come as is.

On 2, from my view non-staked node forwarding is incredibly harmful and was contributing to congestion a while back. Non-staked forwarding leads to sending to all nodes, even those that are not the leader, because they will forward to the leader... this is essentially the network DOSing itself.
It was also mostly useless when examined previously; prior to swqos adoption, >95% of forwarded packets received were duplicates. RPCs, at least then, were sending to the next N leaders. So by the time txs get forwarded after the current leader's slots, they were already received by the next leader directly.

@diman-io

So by the time txs get forwarded after the current leader's slots, they were already received by the next leader directly.

I absolutely agree with this. I observe the same thing.

But then the only argument for forwarding remains that it is somehow related to swqos. And here I don't understand. Swqos was presented to validators as a way to allow friendly unstaked RPC nodes to connect to a specific validator node. And this node would be able to forward transactions to the current leader.
Why should forwarding be present in this scheme? In my understanding, an unstaked RPC node has a paid side deal with the validator. And the validator is already responsible for delivering transactions to the leader. With the current forwarding implementation (in this PR), the validator taking the fee from the RPC will use other validators to deliver transactions to the leader (sending copies and thereby creating spam for the leader). Also, in this scheme, the source (IP) of the transaction changes for the leader.

@diman-io

diman-io commented Jun 18, 2024

In the swqos implementation, the size of the stake matters. Thus, a validator with a 20K stake can, by forwarding, make their transaction appear to the leader as if it has a priority of 200K or 2M, for example.
Therefore, forwarding even breaks the logic of swqos.

UPD. I mean priority in delivery, of course, because the size of the stake guarantees that your connection won't be dropped.

@apfitzge

But then the only argument for forwarding remains that it is somehow related to swqos.

Right. Without this sudden swqos adoption we would not have re-added forwarding.

Why should forwarding be present in this scheme?

I'm not sure I understand the question here?

The flow looks like this; how would you expect the PartnerValidator to get the tx to the actual leader?

graph LR;
user --> RPC --> PartnerValidator -->|forwarding| leader;

And the validator is already responsible for delivering transactions to the leader.

It's not, though; afaik we will only forward for swqos side-deals. Any txs from validators forwarding to us will not be forwarded by us.

the validator taking the fee from the RPC will use other validators to deliver transactions to the leader

The validator sends to the current leader (with some time offset to account for avg transit time, iirc); it doesn't go through some other validators.

@diman-io

diman-io commented Jun 18, 2024

Oh, I see where the misunderstanding is now.

The thing is, bad actors are using tools like https://github.com/helius-labs/atlas-txn-sender or https://github.com/marinade-finance/mtransaction. In short, these are proxies with a QUIC client and a validator key. The spam comes from such tools (I suspect they are either installed on the validator's machine, additionally proxy traffic through the validator's IP, or run a gossip client on the same machine as these proxies, because many validators filter traffic by IPs from gossip).

For example, I permanently receive 2-3K TPS when I'm not the leader.

These tools send their transactions to the regular TPU (not FWD) port of a validator that is not the leader. So these will be transactions from a staked node, and they weren't forwarded. As a result, the validator will forward this spam.

That's essentially the problem.

@apfitzge

apfitzge commented Jun 18, 2024

For example, I permanently receive 2-3K TPS when I'm not the leader.

I also see this on some of our canary nodes running master.

These tools will send their transactions to the regular TPU (not FWD) port of the validator, which is not the leader.

Interestingly, I do not see this. I see a bunch of packets coming in on the TPU port, but the top IPs spamming me are all unstaked.

At least that's what's showing in the output from banking-trace-tool packet-count

edit: re-ran with a custom filter; I do see some people sending staked traffic on the TPU, so it's possible, but it's pretty minimal: on the order of 6 TPS vs the ~800 TPS average spam on the TPU over a long period.

@diman-io

Why do you think it works through forwarding on your own validator? I just tried sending a transaction through my own validator, and it went through successfully, even though forwarding is disabled.

If I understand correctly, the RPC service sends the transaction via https://github.com/solana-labs/solana/blob/v1.18/send-transaction-service/src/send_transaction_service.rs, where the CurrentLeaderInfo is used, which is updated every second. So the RPC sends the transaction to the TPU of the current leader.

Where am I wrong?
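
Roughly, the pattern being described there is a leader lookup cached behind a ~1 second refresh. The sketch below is only an illustration of that caching pattern with made-up names; the real logic lives in send_transaction_service.rs.

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Simplified stand-in for the cached leader info described above.
struct CachedLeaderInfo {
    last_refresh: Option<Instant>,
    leader_tpu: Option<SocketAddr>,
    refresh_interval: Duration,
}

impl CachedLeaderInfo {
    fn new() -> Self {
        Self {
            last_refresh: None,
            leader_tpu: None,
            refresh_interval: Duration::from_secs(1),
        }
    }

    /// Return the cached leader TPU address, refreshing it when stale.
    fn leader_tpu(
        &mut self,
        lookup_current_leader: impl Fn() -> Option<SocketAddr>,
    ) -> Option<SocketAddr> {
        let stale = self
            .last_refresh
            .map_or(true, |t| t.elapsed() >= self.refresh_interval);
        if stale {
            self.leader_tpu = lookup_current_leader();
            self.last_refresh = Some(Instant::now());
        }
        self.leader_tpu
    }
}
```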

@diman-io

Ahh, I see https://solana.com/developers/guides/advanced/stake-weighted-qos#configuring-the-rpc-node

Friendly RPC sends directly to our TPU.

In this case, I think that "do not forward" by default will be the best solution for now, and those who have friendly RPC providers will need to override that default.
Since they are already configuring for swqos, it will just be one more action for them, and the rest of the cluster will be protected from participating in this kind of spam.

@JulI0-concerto

Ahh, I see https://solana.com/developers/guides/advanced/stake-weighted-qos#configuring-the-rpc-node

Friendly RPC sends directly to our TPU.


The wording in this document is incorrect. The term "leader" should be replaced with "peered validator".

@willhickey

Closing this as stale. If you think this is relevant please re-open and let @anza-xyz/backport-reviewers know why it is important

@willhickey willhickey closed this Sep 4, 2024
@mergify mergify bot deleted the mergify/bp/v1.18/pr-898 branch September 4, 2024 16:23
@ferric-sol

@apfitzge can we get this reopened once #2285 lands too, please?
