
Memory usage when using RPKI #497

Closed
rubenskuhl opened this issue Apr 27, 2021 · 7 comments · Fixed by #516
Assignees: mxsasha
Labels: release blocker (blocks the next release)

Comments

@rubenskuhl

Is your feature request related to a problem? Please describe.
Memory usage for RPKI ROA Import is higher than similar solutions

Describe the solution you'd like
For a future version to run on 16GB RAM machines; it currently requires 32 GB, mostly because of memory usage during ROA Import

Describe alternatives you've considered
We considered turning RPKI off, but ended up managing to increase memory on the machine

Additional context
A single 4 GB RAM machine can, nowadays, run both Krill and Routinator. Also, since only a fraction of routes are signed today, even the current memory requirements might not be enough once the entire DFZ is signed.

@fischerdouglas

From what I can see, IRRd has some specific jobs that take a lot of computational effort to complete.

But the basics of the software run very well without too many resources.

RPKI validation is one of the jobs that causes computational spikes. I believe full imports are another big resource consumer.

Considering the design of IRRd, I guess it wouldn't be very complex to allow some jobs to be executed in batch on another node (VM/container).

This would especially help to slice up the monolith: run the basics of the software on a high-performance compute layer (smaller and more expensive), and the seasonal jobs on a less powerful compute layer (cheaper, allowing bigger machines).

Just as an example: running the RPKI validation on AWS Spot instances (or the equivalent in other cloud computing environments).

@mxsasha
Collaborator

mxsasha commented May 28, 2021

I haven't put much work into optimising this so far - most of the focus in performance improvement was for queries and particularly certain queries. I will look into the possibilities :)

Technically it's a fairly independent process, so it could be separated. However, it would add overhead to run it on a separate cloud environment, so I'm not sure this is the most practical approach for now.

Investigating this is a release blocker - what will end up in 4.2 depends on the findings.

@mxsasha self-assigned this May 28, 2021
@mxsasha added the release blocker label May 28, 2021
@mxsasha
Collaborator

mxsasha commented Jun 20, 2021

I did some more digging into this. RPKI importing is a two-phase process (sketched below):

  1. Removing the current ROAs from the database, then reading the JSON file into both the database and a trie in memory. "ROAs" here means both the roa_object table and the pseudo-objects with source RPKI. Phase 1 runs in a single transaction.
  2. Reading all current route(6) objects, checking their validation status against the trie in memory, determining which need a state change, and processing that change.
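
A minimal sketch of this two-phase flow, for illustration only: it uses sqlite3 and a plain dict in place of IRRd's real schema and in-memory trie, and all table, column, and function names here are made up, not IRRd's actual ones.

```python
# Illustrative sketch of the two-phase ROA import described above; the schema,
# JSON handling, and dict index are simplified stand-ins, not IRRd code.
import json
import sqlite3

def phase1_reload_roas(conn: sqlite3.Connection, roa_json: str) -> dict:
    """Phase 1: replace all ROAs in one transaction, build an in-memory index."""
    index = {}  # prefix -> [(asn, max_length)]; IRRd uses a trie instead
    with conn:  # a single transaction, as described above
        conn.execute("DELETE FROM roa_object")
        # Assumes a validator JSON export of the form {"roas": [...]}; some
        # validators export the ASN as a string like "AS64511".
        for roa in json.loads(roa_json)["roas"]:
            conn.execute(
                "INSERT INTO roa_object (prefix, asn, max_length) VALUES (?, ?, ?)",
                (roa["prefix"], roa["asn"], roa["maxLength"]),
            )
            index.setdefault(roa["prefix"], []).append((roa["asn"], roa["maxLength"]))
    return index

def phase2_revalidate(conn: sqlite3.Connection, index: dict) -> None:
    """Phase 2: re-check each route(6) object, persist only actual state changes."""
    for pk, prefix, origin, old_status in conn.execute(
        "SELECT pk, prefix, origin_asn, rpki_status FROM route_objects"
    ):
        # Exact-prefix lookup for brevity; real validation walks a trie to find
        # all covering ROAs and checks maxLength too (see the trie sketch below).
        covering = index.get(prefix, [])
        if not covering:
            new_status = "not_found"
        elif any(asn == origin for asn, _ in covering):
            new_status = "valid"
        else:
            new_status = "invalid"
        if new_status != old_status:
            conn.execute(
                "UPDATE route_objects SET rpki_status = ? WHERE pk = ?",
                (new_status, pk),
            )
```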

On a test server (with a not great CPU, so may run slower than in other setups), I kept a close eye on the process, and found:

  • During phase 1, memory was around 700-1000MB. This fluctuated up and down a bit. Phase 1 lasted 6 minutes on this test instance.
  • As phase 2 started, initial memory use was consistent at 930MB for a minute or two.
  • Over the next and final few minutes of phase 2, memory usage increased to 7 GB.
  • This run did not result in any validation status changes.

Thoughts:

  • The initial 930MB use of phase 2 is probably mostly the trie of all ROAs (a sketch of such a trie follows below). We need this as the fastest validation option. I already had unrelated ideas to improve the trie, but this isn't the big win in memory usage. This will probably scale linearly with an increasing number of ROAs.
  • The stable memory usage early in phase 2, followed by a sharp increase, may be due to time spent waiting for the database to retrieve all route(6) objects.
  • The huge 7GB memory use is almost certainly due to the retrieval of route objects. This will probably scale linearly with an increase in route objects in the IRR.
  • This memory usage was not caused by the process of changing the validation status. It is possible that that step has a memory impact of its own that I did not measure, but that is unlikely, considering the small size of the data.
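
For illustration, here is a minimal binary prefix trie of the kind referred to above. This is a sketch, not IRRd's actual implementation; a production version would keep separate tries for IPv4 and IPv6. Each node is small and fixed-size, which is why memory grows roughly linearly with the number of ROAs.

```python
# Sketch of a binary prefix trie for RFC 6811 origin validation; illustrative
# only. Real deployments keep separate tries for IPv4 and IPv6.
import ipaddress

class TrieNode:
    __slots__ = ("children", "roas")
    def __init__(self):
        self.children = [None, None]  # branch on the next prefix bit
        self.roas = []                # (asn, max_length) pairs for this prefix

def _bits(network):
    """Yield the bits of a network's prefix, most significant first."""
    addr = int(network.network_address)
    for i in range(network.prefixlen):
        yield (addr >> (network.max_prefixlen - 1 - i)) & 1

def insert(root: TrieNode, prefix: str, asn: int, max_length: int) -> None:
    node = root
    for bit in _bits(ipaddress.ip_network(prefix)):
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.roas.append((asn, max_length))

def validate(root: TrieNode, prefix: str, origin: int) -> str:
    """Walk the route's bits, collecting ROAs on covering (less specific) prefixes."""
    net = ipaddress.ip_network(prefix)
    covering = list(root.roas)
    node = root
    for bit in _bits(net):
        node = node.children[bit]
        if node is None:
            break
        covering.extend(node.roas)
    if not covering:
        return "not_found"
    if any(asn == origin and net.prefixlen <= max_len for asn, max_len in covering):
        return "valid"
    return "invalid"
```

For example, after insert(root, "192.0.2.0/24", 64511, 24), validate(root, "192.0.2.0/24", 64511) returns "valid", while validate(root, "192.0.2.0/25", 64511) returns "invalid", because the /25 exceeds the ROA's maxLength.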

Path forward:

  • See if we can iterate more efficiently over the query results.
  • Restrict the query data to the bare minimum. For example, we currently retrieve the object text because it is needed when sending notifications about RPKI invalid objects. However, we retrieve it for every single object, which is a huge amount of data that we rarely need. It would be more efficient to query it afterwards, only for the relevant objects. (A sketch of the first two items follows after this list.)
  • Check for other causes of retention of route objects in memory during validation.
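
As a sketch of the first two items, assuming a SQLAlchemy/PostgreSQL stack; the table, column names, and DSN here are illustrative, not IRRd's actual schema:

```python
# Hedged sketch: stream route(6) objects with a server-side cursor and select
# only the columns validation needs, omitting the large object text column.
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("postgresql+psycopg2://irrd@localhost/irrd")  # illustrative DSN
metadata = MetaData()
routes = Table(
    "route_objects", metadata,  # hypothetical table name
    Column("pk", Integer, primary_key=True),
    Column("prefix", String),
    Column("origin_asn", Integer),
    Column("rpki_status", String),
    Column("object_text", String),  # large; deliberately NOT selected below
)

stmt = select(routes.c.pk, routes.c.prefix, routes.c.origin_asn, routes.c.rpki_status)

with engine.connect() as conn:
    # stream_results=True makes psycopg2 use a server-side cursor, so rows are
    # fetched in batches instead of the whole result set being held in memory.
    result = conn.execution_options(stream_results=True).execute(stmt)
    for row in result:
        ...  # validate row.prefix / row.origin_asn against the in-memory trie
```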

This will likely result in significant improvements.

I also kept an eye on other memory usage. Also noteworthy is the preloader process, which peaked at 1.8GB. However, this only lasted 10-15 seconds. It may also be worth looking into.
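
For reference, one way to take measurements like these is to sample the process's resident set size from the outside. This is a sketch assuming the third-party psutil package; IRRd does not ship such a script.

```python
# Sample a process's RSS once a second and track the peak.
# Usage: python watch_rss.py <pid of the IRRd worker to observe>
import sys
import time

import psutil  # third-party dependency, assumed installed

def watch_rss(pid: int, interval: float = 1.0) -> None:
    proc = psutil.Process(pid)
    peak = 0
    while proc.is_running():
        rss = proc.memory_info().rss
        peak = max(peak, rss)
        print(f"rss={rss / 2**20:.0f} MB  peak={peak / 2**20:.0f} MB")
        time.sleep(interval)

if __name__ == "__main__":
    watch_rss(int(sys.argv[1]))
```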

@mxsasha
Collaborator

mxsasha commented Jun 20, 2021

Restrict the query data to the bare minimum. For example, we currently retrieve the object text because it is needed when sending notifications about RPKI invalid objects. However, we retrieve it for every single object, which is a huge amount of data that we rarely need. It would be more efficient to query it afterwards, only for the relevant objects.

This on its own cuts RPKI memory use down to 3 GB, so big improvements are viable. (This was only a quick test, and as implemented it would break email notifications.)
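
A sketch of how notifications could be kept working under that change, continuing the illustrative names from the earlier sketches (none of this is IRRd's actual code): validate from the minimal-column stream first, then fetch object text in a second, targeted query only for the objects that actually became invalid.

```python
# Second pass: fetch the large object_text column only for objects whose
# status changed to invalid, instead of for every single route(6) object.
newly_invalid_pks = []
for row in result:  # the streamed, minimal-column result from the earlier sketch
    new_status = validate(trie_root, row.prefix, row.origin_asn)  # trie sketch above
    if new_status == "invalid" and row.rpki_status != "invalid":
        newly_invalid_pks.append(row.pk)

if newly_invalid_pks:
    text_stmt = select(routes.c.pk, routes.c.object_text).where(
        routes.c.pk.in_(newly_invalid_pks)
    )
    with engine.connect() as conn:
        for pk, object_text in conn.execute(text_stmt):
            ...  # attach object_text to the RPKI-invalid notification email
```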

On a more general note, I do think it should already be possible to run IRRd in 16 GB, with a low number of HTTP and whois workers. It's tight, especially during initial imports of large amounts of data, so you might need to add the sources a few at a time, but it can be done. In general, IRRd favours speed over memory efficiency, but I agree that the current RPKI memory use is excessive and clearly not needed.

@fischerdouglas

Hello @mxsasha
Thanks for this deeper analysis...

But I will insist a bit on the idea of breaking up the IRRd monolith.

How hard would it be to define resource pools, running the external queries (whois, e-mail, HTTP) on a small and stable resource pool, and all of that "reprocessing" on a resource pool that could (or not) be auto-scaled and destroyed after the peak demand?

@job
Member

job commented Jun 21, 2021

Very hard

@fischerdouglas

Just to clarify...
When I mention the possibility of "auto-scalable" and "self-destroying" pools, IRRd is not expected to deal with that itself. That would be handled by other layers of compute node provisioning.

What would be expected is that IRRd points the "please do this for me" jobs at different resource pools, based on the type of job.

And it is always possible for multiple resource pools to run on the same node, which would ensure that IRRd still runs correctly in the environments used today.
