Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject RPC requests if memory limit is exceeded #1738

Merged
merged 7 commits into from
Jul 7, 2023

Conversation

Tristan-Wilson
Copy link
Member

If the --node.resource-mgmt.mem-limit-percent option is used then if system memory usage exceds that limit, reject new RPC requests with a HTTP 429 error.

If the option is used, then Nitro attempts on startup to discover what method is available to check the system memory usage and limit. Currently Cgroups V1 is the only supported method, and if it is not detected then an error is logged once and the limit will not be enforced.

Testing done

Set system to use Cgroups V1

My Linux development system was using Cgroups V2 so I had to set the following kernel parameter and reboot to revert to V1:

$ cat /proc/cmdline
BOOT_IMAGE=... systemd.unified_cgroup_hierarchy=0

Update nitro-testnode docker conf

Add a memory limit to the docker container, and the new --node.resource-mgmt.mem-limit-percent option with a 50% limit.

$ git diff
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 0c693f4..a9eac2c 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -152,6 +152,7 @@ services:
   sequencer:
     pid: host # allow debugging
     image: nitro-node-dev-testnode
+    mem_limit: "5G"
     ports:
       - "127.0.0.1:8547:8547"
       - "127.0.0.1:8548:8548"
@@ -159,7 +160,7 @@ services:
     volumes:
       - "seqdata:/home/user/.arbitrum/local/nitro"
       - "config:/config"
-    command: --conf.file /config/sequencer_config.json --node.feed.output.enable --node.feed.output.port 9642  --http.api net,web3,eth,txpool,debug --node.seq-coordinator.my-url  ws://sequencer:8548 --graphql.enable --graphql.vhosts * --graphql.corsdomain *
+    command: --conf.file /config/sequencer_config.json --node.feed.output.enable --node.feed.output.port 9642  --http.api net,web3,eth,txpool,debug --node.seq-coordinator.my-url  ws://sequencer:8548 --graphql.enable --graphql.vhosts * --graphql.corsdomain * --node.resource-mgmt.mem-limit-percent 50
     depends_on:
       - geth
       - redis

Start test node

./test-node.bash --dev --init

Send request, confirm it is not blocked

$ curl -H "Content-Type: application/json" -X POST --data {"jsonrpc":"2.0","method":"eth_getBlockByNumber","id":67,"params":["latest",true]}' http://localhost:8547                                          
{"jsonrpc":"2.0","id":67,"result":{"baseFeePerGas":...

Use up more than half the memory on the sequencer's container

$ docker exec -it 339966f66541 /bin/bash
user@339966f66541:~$ </dev/zero head -c 3G    | tail

Send request, confirm that it is blocked

$ curl -v -H "Content-Type: application/json" -X POST --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","id":67,"params":["latest",true]}' http://localhost:8547
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:8547...
* Connected to localhost (127.0.0.1) port 8547 (#0)
> POST / HTTP/1.1
> Host: localhost:8547
> User-Agent: curl/7.79.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 82
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 429 Too Many Requests
< Content-Type: text/plain; charset=utf-8
< Vary: Origin
< X-Content-Type-Options: nosniff
< Date: Thu, 06 Jul 2023 02:35:25 GMT
< Content-Length: 18
< 
Too many requests  

Release the memory and confirm further requests are not blocked

If the --node.resource-mgmt.mem-limit-percent option is used then if
system memory usage exceds that limit, reject new RPC requests with a
HTTP 429 error.

If the option is used, then Nitro attempts on startup to discover what
method is available to check the system memory usage and limit.
Currently Cgroups V1 is the only supported method, and if it is not
detected then an error is logged once and the limit will not be
enforced.
@cla-bot cla-bot bot added the s Automatically added by the CLA bot if the creator of a PR is registered as having signed the CLA. label Jul 6, 2023
@PlasmaPower PlasmaPower requested review from anodar and ganeshvanahalli and removed request for PlasmaPower and tsahee July 6, 2023 17:11
}

func (s *resourceManagementHttpServer) ServeHTTP(w http.ResponseWriter, req *http.Request) {
exceeded, err := s.c.isLimitExceeded()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to cache this value if this is happening for every request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning for not caching it was that we want to be as responsive as possible to changes in memory utilization. My intuition is that re-reading from /sys/fs and doing some text parsing would add minimal overhead. I'll add in a latency metric for the limit check to confirm this.


"github.com/ethereum/go-ethereum/log"
"github.com/ethereum/go-ethereum/node"
flag "github.com/spf13/pflag"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For whatever reason we had been aliasing it elsewhere in the code and I had just been copying it. I'll remove it here.

(Thanks for the in-depth review! I think with your help we can move towards writing more idiomatic go.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because pflag is a drop-in replacement for flag, it's standard practice to alias it. They recommend it: https://github.com/spf13/pflag#usage

Copy link
Contributor

@anodar anodar Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that would make sense to remove it here and we can clean up the rest as we touch those files.
(Glad to hear it's helpful!)

@PlasmaPower I think they just say if you alias it than all code will continue to function. It doesn't really say that you should. May have made sense if we used standard flag library and we were replacing it with this but that doesn't seem to be the case here.

// Copyright 2023, Offchain Labs, Inc.
// For license information, see https://github.com/nitro/blob/master/LICENSE

package arbnode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a separate package, arbnode package is already blown up too much with things that should be in separate packages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I moved it to arbnode/resourcemanager

}
}

type ResourceManagementConfig struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you address the above comment about separate package (e.g. name this as package as resourcemanager) this struct and everything below that contains "ResourceManagement" can just drop that substring since at call sites it will be clear enough e.g. resourcemanager.Config, resourcemanager.newHttpServer, etc...

Short concise names are encouraged in go.

https://google.github.io/styleguide/go/guide#concision
https://google.github.io/styleguide/go/guide#naming

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to arbnode/resourcemanager

return usage-inactive >= ((limit * c.memoryLimitPercent) / 100), nil
}

func (c cgroupsV1MemoryLimitChecker) getIntFromFile(fileName string) (int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop get prefix, same with getInactive.

https://google.github.io/styleguide/go/decisions#getters

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it have a prefix like read instead since it is reading from a file? I understand that accessors shouldn't have a prefix, but is this an accessor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, read prefix would certainly be a good fit here!

Comment on lines 137 to 138
_, err = fmt.Fscanf(file, "%d", &limit)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inline

if _, err := ...; err != nil {
}

Makes it a bit more readable and limits the scope of err variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

String() string
}

func newLimitChecker(conf *ResourceManagementConfig) limitChecker {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return concrete types, not interfaces.

https://google.github.io/styleguide/go/decisions#interfaces: "Accept Interfaces, Return Concrete Types"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea with this function is that it's a factory method that auto-discovers the type of limitChecker to create. Currently there are only 2 (cgroupsv1 and trivial). Is there a better way to structure this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do that logic (that decides the type of limitChecker) at call site itself (newLimitChecker) in InitResourceManagement. It's the only call site and this method is private so you won't have to duplicate this logic somewhere else.

return usage-inactive >= ((limit * c.memoryLimitPercent) / 100), nil
}

func (c cgroupsV1MemoryLimitChecker) getIntFromFile(fileName string) (int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is a receiver method on cgroupsV1MemoryLimitChecker rather than just general helper method for reading an integer from a file ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason, fixed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems to just check isLimitExceeded() function which is boolean function and there's high chance even if other things are wrong this may succeed as outcome is just true/false.

Please consider testing other methods as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only function not covered is ServeHTTP, it is fairly simple and covered by the manual tests I ran.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant it would be nice to have tests for inactive(), readIntFromFile() even though indirectly they are covered by isLimitExceeded() function.

}
}

func isSupported(c limitChecker) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't seem to be called from anywhere. Please drop this and introduce it when you have usecase (callsites) for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called from newLimitChecker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I overlooked that part.

(Not sure how did I search and not find it 🤦 )

if !exceeded {
Fail(t, "Expected over limit")
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop empty line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@anodar anodar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for addressing the comments.

String() string
}

func newLimitChecker(conf *ResourceManagementConfig) limitChecker {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do that logic (that decides the type of limitChecker) at call site itself (newLimitChecker) in InitResourceManagement. It's the only call site and this method is private so you won't have to duplicate this logic somewhere else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant it would be nice to have tests for inactive(), readIntFromFile() even though indirectly they are covered by isLimitExceeded() function.

}
}

func isSupported(c limitChecker) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I overlooked that part.

(Not sure how did I search and not find it 🤦 )

return err
}
_, err = fmt.Fprintf(limitFile, "%d\n", limit)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can inline this and few more below as well.

@Tristan-Wilson Tristan-Wilson merged commit ae00aa6 into master Jul 7, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
s Automatically added by the CLA bot if the creator of a PR is registered as having signed the CLA.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants