
Memory usage when using RPKI #497

Closed
rubenskuhl opened this issue Apr 27, 2021 · 7 comments · Fixed by #516
Assignees: mxsasha
Labels: release blocker (blocks the next release)

Comments

@rubenskuhl

Is your feature request related to a problem? Please describe.
Memory usage for RPKI ROA Import is higher than similar solutions

Describe the solution you'd like
For a future version to run on 16GB RAM machines; it currently requires 32 GB, mostly because of memory usage during ROA Import

Describe alternatives you've considered
We considered turning RPKI off, but ended up managing to increase memory on the machine

Additional context
A single 4 GB RAM machine can, nowadays, run both Krill and Routinator. Also, since only a fraction of routes are signed today, even the current memory requirements might not be enough once the entire DFZ is signed.

@fischerdouglas

From what I can see, IRRd has some specific jobs that take a lot of computational effort to complete.

But the basics of the software run very well without too many resources.

RPKI validation is one of the jobs that causes computational spikes. I believe full imports are another big resource consumer.

Considering the design of IRRd, I guess it wouldn't be very complex to allow some jobs to be executed in batch on another node (VM/container).

This would especially help to slice up the monolith: run the basics of the software on a high-performance compute layer (smaller and more expensive), and the seasonal jobs on a less powerful compute layer (cheaper, allowing bigger machines).

Just as an example: running the RPKI validation on AWS Spot instances (or the equivalent in other cloud computing environments).

@mxsasha
Collaborator

mxsasha commented May 28, 2021

I haven't put much work into optimising this so far - most of the focus in performance improvement was for queries and particularly certain queries. I will look into the possibilities :)

Technically it's a fairly independent process, so it could be separated. However, it would add overhead to run it on a separate cloud environment, so I'm not sure this is the most practical approach for now.

Investigating this is a release blocker - what will end up in 4.2 depends on the findings.

@mxsasha self-assigned this May 28, 2021
@mxsasha added the release blocker label May 28, 2021
@mxsasha
Collaborator

mxsasha commented Jun 20, 2021

I did some more digging into this. RPKI importing is a two-phase process (sketched below):

  1. Removing the current ROAs from the database, then reading the JSON file into both the database and a trie in memory. "ROAs" here means both the roa_object table and the pseudo-objects with source RPKI. Phase 1 runs in a single transaction.
  2. Reading all current route(6) objects, checking their validation status against the trie in memory, determining which need a state change, and processing that change.
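
A minimal sketch of this two-phase flow, for illustration only: it uses sqlite3 and a plain dict in place of IRRd's real schema and in-memory trie, and all table, column, and function names here are made up, not IRRd's actual ones.

```python
# Illustrative sketch of the two-phase ROA import described above; the schema,
# JSON handling, and dict index are simplified stand-ins, not IRRd code.
import json
import sqlite3

def phase1_reload_roas(conn: sqlite3.Connection, roa_json: str) -> dict:
    """Phase 1: replace all ROAs in one transaction, build an in-memory index."""
    index = {}  # prefix -> [(asn, max_length)]; IRRd uses a trie instead
    with conn:  # a single transaction, as described above
        conn.execute("DELETE FROM roa_object")
        # Assumes a validator JSON export of the form {"roas": [...]}; some
        # validators export the ASN as a string like "AS64511".
        for roa in json.loads(roa_json)["roas"]:
            conn.execute(
                "INSERT INTO roa_object (prefix, asn, max_length) VALUES (?, ?, ?)",
                (roa["prefix"], roa["asn"], roa["maxLength"]),
            )
            index.setdefault(roa["prefix"], []).append((roa["asn"], roa["maxLength"]))
    return index

def phase2_revalidate(conn: sqlite3.Connection, index: dict) -> None:
    """Phase 2: re-check each route(6) object, persist only actual state changes."""
    for pk, prefix, origin, old_status in conn.execute(
        "SELECT pk, prefix, origin_asn, rpki_status FROM route_objects"
    ):
        # Exact-prefix lookup for brevity; real validation walks a trie to find
        # all covering ROAs and checks maxLength too (see the trie sketch below).
        covering = index.get(prefix, [])
        if not covering:
            new_status = "not_found"
        elif any(asn == origin for asn, _ in covering):
            new_status = "valid"
        else:
            new_status = "invalid"
        if new_status != old_status:
            conn.execute(
                "UPDATE route_objects SET rpki_status = ? WHERE pk = ?",
                (new_status, pk),
            )
```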

On a test server (with a not great CPU, so may run slower than in other setups), I kept a close eye on the process, and found:

  • During phase 1, memory was around 700-1000MB. This fluctuated up and down a bit. Phase 1 lasted 6 minutes on this test instance.
  • As phase 2 started, initial memory use was consistent at 930MB for a minute or two.
  • Over the next and final few minutes of phase 2, memory usage increased to 7 GB.
  • This run did not result in any validation status changes.

Thoughts:

  • The initial 930MB use of phase 2 is probably mostly the trie of all ROAs (a sketch of such a trie follows below). We need this as the fastest validation option. I already had unrelated ideas to improve the trie, but this isn't the big win in memory usage. This will probably scale linearly with an increasing number of ROAs.
  • The stable memory usage early in phase 2, followed by a sharp increase, may be due to time spent waiting for the database to retrieve all route(6) objects.
  • The huge 7GB memory use is almost certainly due to the retrieval of route objects. This will probably scale linearly with an increase in route objects in the IRR.
  • This memory usage was not caused by the process of changing the validation status. It is possible that that step has a memory impact of its own that I did not measure, but that is unlikely, considering the small size of the data.
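
For illustration, here is a minimal binary prefix trie of the kind referred to above. This is a sketch, not IRRd's actual implementation; a production version would keep separate tries for IPv4 and IPv6. Each node is small and fixed-size, which is why memory grows roughly linearly with the number of ROAs.

```python
# Sketch of a binary prefix trie for RFC 6811 origin validation; illustrative
# only. Real deployments keep separate tries for IPv4 and IPv6.
import ipaddress

class TrieNode:
    __slots__ = ("children", "roas")
    def __init__(self):
        self.children = [None, None]  # branch on the next prefix bit
        self.roas = []                # (asn, max_length) pairs for this prefix

def _bits(network):
    """Yield the bits of a network's prefix, most significant first."""
    addr = int(network.network_address)
    for i in range(network.prefixlen):
        yield (addr >> (network.max_prefixlen - 1 - i)) & 1

def insert(root: TrieNode, prefix: str, asn: int, max_length: int) -> None:
    node = root
    for bit in _bits(ipaddress.ip_network(prefix)):
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.roas.append((asn, max_length))

def validate(root: TrieNode, prefix: str, origin: int) -> str:
    """Walk the route's bits, collecting ROAs on covering (less specific) prefixes."""
    net = ipaddress.ip_network(prefix)
    covering = list(root.roas)
    node = root
    for bit in _bits(net):
        node = node.children[bit]
        if node is None:
            break
        covering.extend(node.roas)
    if not covering:
        return "not_found"
    if any(asn == origin and net.prefixlen <= max_len for asn, max_len in covering):
        return "valid"
    return "invalid"
```

For example, after insert(root, "192.0.2.0/24", 64511, 24), validate(root, "192.0.2.0/24", 64511) returns "valid", while validate(root, "192.0.2.0/25", 64511) returns "invalid", because the /25 exceeds the ROA's maxLength.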

Path forward:

  • See if we can iterate more efficiently over the query results.
  • Restrict the query data to the bare minimum. For example, we currently retrieve the object text because it is needed when sending notifications about RPKI invalid objects. However, we retrieve it for every single object, which is a huge amount of data that we rarely need. It would be more efficient to query it afterwards, only for the relevant objects. (A sketch of the first two items follows after this list.)
  • Check for other causes of retention of route objects in memory during validation.
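
As a sketch of the first two items, assuming a SQLAlchemy/PostgreSQL stack; the table, column names, and DSN here are illustrative, not IRRd's actual schema:

```python
# Hedged sketch: stream route(6) objects with a server-side cursor and select
# only the columns validation needs, omitting the large object text column.
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("postgresql+psycopg2://irrd@localhost/irrd")  # illustrative DSN
metadata = MetaData()
routes = Table(
    "route_objects", metadata,  # hypothetical table name
    Column("pk", Integer, primary_key=True),
    Column("prefix", String),
    Column("origin_asn", Integer),
    Column("rpki_status", String),
    Column("object_text", String),  # large; deliberately NOT selected below
)

stmt = select(routes.c.pk, routes.c.prefix, routes.c.origin_asn, routes.c.rpki_status)

with engine.connect() as conn:
    # stream_results=True makes psycopg2 use a server-side cursor, so rows are
    # fetched in batches instead of the whole result set being held in memory.
    result = conn.execution_options(stream_results=True).execute(stmt)
    for row in result:
        ...  # validate row.prefix / row.origin_asn against the in-memory trie
```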

This will likely result in significant improvements.

I also kept an eye on other memory usage. Also noteworthy is the preloader process, which peaked at 1.8GB. However, this only lasted 10-15 seconds. It may also be worth looking into.
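
For reference, one way to take measurements like these is to sample the process's resident set size from the outside. This is a sketch assuming the third-party psutil package; IRRd does not ship such a script.

```python
# Sample a process's RSS once a second and track the peak.
# Usage: python watch_rss.py <pid of the IRRd worker to observe>
import sys
import time

import psutil  # third-party dependency, assumed installed

def watch_rss(pid: int, interval: float = 1.0) -> None:
    proc = psutil.Process(pid)
    peak = 0
    while proc.is_running():
        rss = proc.memory_info().rss
        peak = max(peak, rss)
        print(f"rss={rss / 2**20:.0f} MB  peak={peak / 2**20:.0f} MB")
        time.sleep(interval)

if __name__ == "__main__":
    watch_rss(int(sys.argv[1]))
```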

@mxsasha
Collaborator

mxsasha commented Jun 20, 2021

Restrict the query data to the bare minimum. For example, we currently retrieve the object text because it is needed when sending notifications about RPKI invalid objects. However, we retrieve it for every single object, which is a huge amount of data that we rarely need. It would be more efficient to query it afterwards, only for the relevant objects.

This on its own cuts RPKI memory use down to 3 GB, so big improvements are viable. (This was only a quick test, and as implemented it would break email notifications.)
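
A sketch of how notifications could be kept working under that change, continuing the illustrative names from the earlier sketches (none of this is IRRd's actual code): validate from the minimal-column stream first, then fetch object text in a second, targeted query only for the objects that actually became invalid.

```python
# Second pass: fetch the large object_text column only for objects whose
# status changed to invalid, instead of for every single route(6) object.
newly_invalid_pks = []
for row in result:  # the streamed, minimal-column result from the earlier sketch
    new_status = validate(trie_root, row.prefix, row.origin_asn)  # trie sketch above
    if new_status == "invalid" and row.rpki_status != "invalid":
        newly_invalid_pks.append(row.pk)

if newly_invalid_pks:
    text_stmt = select(routes.c.pk, routes.c.object_text).where(
        routes.c.pk.in_(newly_invalid_pks)
    )
    with engine.connect() as conn:
        for pk, object_text in conn.execute(text_stmt):
            ...  # attach object_text to the RPKI-invalid notification email
```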

On a more general note, I do think it should already be possible to run IRRd in 16 GB, with a low number of HTTP and whois workers. It's tight, especially during initial imports of large amounts of data, so you might need to add the sources a few at a time, but it can be done. In general, IRRd favours speed over memory efficiency, but I agree that the current RPKI memory use is excessive and clearly not needed.

@fischerdouglas

Hello @mxsasha
Thanks for this deeper analysis...

But I will insist a bit on the idea of breaking up the IRRd monolith.

How hard would it be to define resource pools, running the external queries (whois, e-mail, HTTP) on a small and stable resource pool, and all of that "reprocessing" on a resource pool that could (or not) be auto-scaled and destroyed after the peak demand?

@job
Member

job commented Jun 21, 2021

Very hard

@fischerdouglas

Just to clarify...
When I mention the possibility of "auto-scalable" and "self-destroying" pools, IRRd is not expected to deal with that itself. That would be handled by other layers of compute node provisioning.

What would be expected is that IRRd points the "please do this for me" jobs at different resource pools, based on the type of job.

And it is always possible for multiple resource pools to run on the same node, which would ensure that IRRd still runs correctly in the environments used today.
