Add get_delegators_for_validator #3219
Conversation
The time it takes for this endpoint to turn the response into JSON should be benchmarked here, as it is potentially more costly than the actual RocksDB access.
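A minimal sketch of what such a measurement could look like, isolating serialization time from storage access; the map of dummy delegator entries and the address strings are illustrative stand-ins for `Address<N>`, not the PR's types:

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    // Build a dummy map approximating ~17,000 delegator -> microcredits
    // entries; plain strings stand in for `Address<N>` here.
    let delegators: HashMap<String, u64> = (0..17_000)
        .map(|i| (format!("aleo1dummyaddress{i:050}"), 10_000_000_u64))
        .collect();

    // Time only the JSON serialization, separate from any storage access.
    let start = Instant::now();
    let json = serde_json::to_string(&delegators).expect("serialization failed");
    println!("serialized {} bytes in {:?}", json.len(), start.elapsed());
}
```

For a rigorous number, the same body could be wrapped in a criterion benchmark rather than a one-shot `Instant` timing.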
This is awaiting a benchmark. We will perform this soon.
What's the status on this PR? I see failing CIs across the board and no benchmarks.
A 6-hour network test was run against this PR.
Network Configuration
10 Validators: m5.8xlarge (32 vCPU, 128 GB RAM)
30 Clients: m5.4xlarge (16 vCPU, 64 GB RAM)
17,000 delegators bonded to a single validator
Tests
The following tests were run in serial.
Test 1: 1 request cannon sending 100 requests at 1 request per second to a single client
Test 2: 30 request cannons each attempting to send 100 blocking HTTP requests at 1 request per second against a single client
Test 3: 50 request cannons each attempting to send 100 blocking HTTP requests at 1 request per second against the load balancer, which distributed traffic equally among all client nodes
Test 1 Results:
REQUEST STATISTICS:
requests: 100
request-cannons: 1
avg request payload size: 1.4 MB
TIME:
min: 4228 ms
max: 6597 ms
mean: 5460 ms
std dev: 1234 ms
Number of Errors: 0
Number of Requests: 11
NOTES:
This test caused the node to fall slightly behind its peers in sync rate during the test.
Test 2 Results
REQUEST STATISTICS:
requests: 3000 (via 30 cannons)
request-cannons: 30
avg request payload size: 1.4 MB
TIME:
request roundtrip min: 44483 ms
request roundtrip max: 109171 ms
request roundtrip mean: 52374 ms
request roundtrip std dev: 19126 ms
Number of Requests: 100
Number of Errors: 0
NOTES:
This test caused the node to stop syncing during the requests.
It also caused response times on all other endpoints to stretch into the range of `50000 ms`.
After the requests stopped, the node continued to sync.
Test 3 Results
REQUEST STATISTICS:
requests: 5000
request-cannons: 50
avg response payload size: 1.4 MB
TIME:
min: 4109 ms
max: 25658 ms
mean: 13242.85 ms
std dev: 3711.81 ms
NOTES: This test caused the block sync speed of several client nodes to fall during the test. Block sync speed recovered after the test concluded. Sustained heavy load on this endpoint would likely slow client performance significantly.
Conclusion
When many delegators are bonded to a validator, this endpoint can evidently cause nodes to fall out of sync and block other tasks. This isn't entirely surprising, given that querying RocksDB for this information is a compute-heavy task and the task is spawned onto the main tokio task queue.
This endpoint needs to be throttled heavily.
Recommendations
- Spawn this task onto the blocking threadpool via spawn_blocking! (Addressed by commit 1c74880)
- This endpoint should return an error if the block height is too far behind; sending outdated committee data isn't a good idea (Addressed by commit 1c74880)
- Limit this endpoint to serving a maximum of 5 concurrent invocations of this task (not per IP; 5 invocations total) (see the sketch below)
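A minimal sketch combining the first and third recommendations, assuming a tokio-based handler; the names `permits` and `get_delegators_sketch` are hypothetical, and the real handler returns `Result<ErasedJson, RestError>` rather than strings:

```rust
use std::sync::OnceLock;
use tokio::sync::Semaphore;

/// Global cap of 5 concurrent invocations (total, not per IP), per the
/// recommendation above. `permits` is a hypothetical helper, not snarkOS code.
fn permits() -> &'static Semaphore {
    static PERMITS: OnceLock<Semaphore> = OnceLock::new();
    PERMITS.get_or_init(|| Semaphore::new(5))
}

/// Hypothetical handler body illustrating the shape of the fix.
async fn get_delegators_sketch() -> Result<String, String> {
    // Reject immediately when all permits are taken instead of queueing,
    // so load cannot pile up behind the endpoint.
    let _permit = permits()
        .try_acquire()
        .map_err(|_| "too many concurrent delegator requests".to_string())?;

    // Move the expensive RocksDB scan off the async runtime and onto the
    // blocking threadpool.
    let delegators: Vec<(String, u64)> = tokio::task::spawn_blocking(|| {
        // ... the ledger scan for the validator's delegators goes here ...
        vec![("aleo1exampledelegator".to_string(), 10_000_000)]
    })
    .await
    .map_err(|err| err.to_string())?;

    // Serialize to JSON; the permit is released when `_permit` drops.
    serde_json::to_string(&delegators).map_err(|err| err.to_string())
}
```

Using `try_acquire` rather than `acquire` means excess requests fail fast instead of queueing behind the semaphore, which matches the goal of shedding load rather than deferring it.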
Thanks for pushing on this. Looks like a valid DoS vector if someone exposes their endpoints. I think it's likely other REST endpoints (e.g. …) are similarly affected. Or perhaps our recommendation should be for clients with a public endpoint to set a very low general rate limit. Infura doesn't let you query geth directly either.
Another test was run on the latest changes.
It seems these changes address the previous issues by not blocking the main thread and by denying traffic when the node falls behind in sync.
As is, the rate-limiting defaults should prevent DoS-ing.
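As a route-level illustration of such limiting (not necessarily how snarkOS wires it), tower's `ConcurrencyLimitLayer` can cap in-flight requests on just this endpoint. This sketch assumes axum 0.7-style routing; the path and handler name are hypothetical:

```rust
use axum::{routing::get, Router};
use tower::limit::ConcurrencyLimitLayer;

// Hypothetical wiring: cap this one heavy endpoint at 5 in-flight requests.
// The path and handler name are illustrative, not snarkOS's actual routes.
fn delegator_routes() -> Router {
    Router::new().route(
        "/testnet3/delegators/:validator",
        get(get_delegators_handler).layer(ConcurrencyLimitLayer::new(5)),
    )
}

async fn get_delegators_handler() -> &'static str {
    // Placeholder body; the real handler performs the ledger query.
    "[]"
}
```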
Details of the tests below.
Test 1
REQUEST STATISTICS:
requests: 100
request-cannons: 1
avg request payload size: 1.6 MB
TIME:
min: 4462 ms
max: 5760 ms
mean: 5007 ms
std dev: 448 ms
NOTES: No degradation in block sync time was observed
Test 2
REQUEST STATISTICS:
requests: 3000
request-cannons: 30
avg request payload size: 1.6 MB
TIME:
min: 6389 ms
max: 30107 ms
mean: 17809.54 ms
std dev: 3377 ms
NOTES: The client under attack experienced lag in its block sync rate. However, after experiencing the lag it denied further traffic and was then able to continue syncing. This lag-and-catch-up cycle continued for as long as the load did. After the load relented, the node caught back up.
Test 3
REQUEST STATISTICS:
requests: 5000
request-cannons: 30
avg request payload size: 1.6 MB
TIME:
min: 12481 ms
max: 50167 ms
mean: 34967 ms
std dev: 3249 ms
NOTES: Clients stayed at tip during the entire test
Path(validator): Path<Address<N>>,
) -> Result<ErasedJson, RestError> {
    // Do not process the request if the node is too far behind to avoid sending outdated data.
    if rest.routing.num_blocks_behind() > SYNC_LENIENCY {
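For context, the rejection branch and the rest of the handler presumably continue along these lines; the `RestError` construction, the error text, and the exact ledger method are assumptions based on the PR description, not the PR's verbatim code:

```rust
        // Assumption: RestError wraps a message string; the exact constructor
        // in snarkOS may differ.
        return Err(RestError("the node is too far behind to serve delegator data".to_string()));
    }

    // Per the recommendations above, the expensive ledger scan runs on the
    // blocking threadpool so it cannot starve the async runtime.
    let delegators = tokio::task::spawn_blocking(move || {
        // Hypothetical call; the sister snarkVM PR adds the underlying query.
        rest.ledger.get_delegators_for_validator(&validator)
    })
    .await
    .map_err(|err| RestError(err.to_string()))?;

    Ok(ErasedJson::pretty(delegators))
}
```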
Nit: can you update the doc comment on SYNC_LENIENCY?
What would you like it to say?
Don't have a specific preference, but for example:
/// The maximum number of blocks the client can be behind its latest peer before it skips
/// processing incoming transactions, solutions and GET delegator requests.
Or:
/// The maximum number of blocks the client can be behind its latest peer before it skips
/// processing incoming transactions, solutions and heavy REST requests.
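Applied to the constant itself, the second suggestion would read roughly as follows; the value shown is illustrative, not the actual snarkOS constant:

```rust
/// The maximum number of blocks the client can be behind its latest peer before it skips
/// processing incoming transactions, solutions and heavy REST requests.
const SYNC_LENIENCY: u32 = 10; // illustrative value only
```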
Motivation
Adds an endpoint to get all delegators for a particular validator.
Test Plan
See the sister PR: https://github.com/AleoHQ/snarkVM/pull/2434