
validation failure scenarios and its impact on RTR #273

Open
lukastribus opened this issue Sep 16, 2020 · 5 comments

@lukastribus

Hello,

I'm currently evaluating RPKI RPs and have a few questions.

I'm concerned about bugs, misconfigurations, or other issues (in all RP/RTR setups) that would cause obsolete VRPs on the production routers, because I believe this is the worst case in RPKI ROV deployments.

I worry about:

  • crash bugs in the validation code
  • hangs during RPKI validation (even in rsync), that block the entire validation
  • memory allocation failures (failed malloc)
  • Linux OOM-killer (probably killing the process with the largest amount of memory usage)
  • admin mistakes

and how those impact the RTR service:

The best-case scenario in my mind is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR endpoint and stop using it (failing over to other RTR servers, if available).

According to the wiki:

The RPKI-RTR server is a separate daemon that allows routers to connect using the RPKI-RTR protocol. It's set up as a separate instance because not everyone needs to run it, but more importantly, if you do need to run it, a separate daemon allows one to run more than one instance for redundancy (it keeps state even when the validator is down).

So it seems the expected behavior is to keep the RTR server online as long as possible; is that correct? How would we avoid serving obsolete VRPs to production routers in this case?

I'm also thinking about monitoring (other than parsing logs):

I'd say the api/validation-runs/latest-successful API provides all of that information, so we can query it and feed an external monitoring system with the result.
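
For example, a minimal polling sketch (assumptions: a local validator on port 8080, and a "completedAt" ISO 8601 field in the JSON; the endpoint path is the one named above, everything else is illustrative):

#!/usr/bin/env python3
"""Poll the validator API and report the age of the last successful run.

Sketch only: the endpoint path is the one mentioned above; the JSON
field name ("completedAt") is an illustrative assumption.
"""
import json
import sys
import urllib.request
from datetime import datetime, timezone

VALIDATOR = "http://localhost:8080"   # assumption: local validator instance
MAX_AGE = 3600                        # seconds before we consider the data stale

url = VALIDATOR + "/api/validation-runs/latest-successful"
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

# Assumption: the payload carries an ISO 8601 completion timestamp.
completed = datetime.fromisoformat(data["completedAt"].replace("Z", "+00:00"))
age = (datetime.now(timezone.utc) - completed).total_seconds()

print("last successful validation run finished %.0fs ago" % age)
sys.exit(0 if age < MAX_AGE else 1)   # non-zero exit -> feed your monitoring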

@lolepezy
Contributor

Hi Lukas,

crash bugs in the validation code

There are no such bugs that we are aware of at the moment. In principle, it is of course possible to drive the validator into, say, an OutOfMemoryException by intentionally introducing some very big objects or by crafting the data in some other way. But even if that happens, the process is usually restarted by systemd. One could, in principle, then get into some sort of restart loop, but we've never seen this in practice. The usual remedy is to bump the memory settings and see if it's able to proceed.

hangs during RPKI validation (even in rsync), that block the entire validation

That is unlikely: external communications have timeout logic and also happen asynchronously from the validation itself, so it is pretty hard for this to happen. One could come up with some sort of slowloris attack, but that would just leave some repositories not updating, without stopping the whole validation process.

memory allocation failures (failed malloc)
Linux OOM-killer (probably killing the process with the largest amount of memory usage)

As I said, it will be restarted by systemd (unless the user changes the setup, of course).

admin mistakes

Could you elaborate on this? It's pretty hard to say in general.

So it seems the expected behavior is to keep the RTR server online as long as possible; is that correct? How would we avoid serving obsolete VRPs to production routers in this case?

It's a good question. I believe there's no timeout there, so I can imagine that in some extreme cases the rtr-server can keep a very outdated cache. I think it would make more sense to have good monitoring of the validator and make a decision about the VRP cache being up to date based on that monitoring, rather than the rtr-server having a timeout. But it could be a good extra safety handrail. The only thing one can do at the moment about the rtr-server is to grep for something like "validator http://validator-url:8080/api/objects/validated not ready yet, will retry later" in the rtr-server log; if that message keeps appearing for too long, it's definitely an issue. But again, you would most likely notice the problem much earlier by monitoring the validator itself.
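
A minimal sketch of that log check (the log file path is an assumption to adjust; the pattern is the message quoted above):

#!/usr/bin/env python3
"""Alert if the rtr-server keeps logging that the validator is not ready.

Sketch only: the log file location is an assumption; the pattern is the
message quoted in the comment above.
"""
import sys

LOG_FILE = "/var/log/rpki-rtr-server.log"   # assumption: adjust to your setup
PATTERN = "not ready yet, will retry later"
TAIL_LINES = 200                            # how much recent log to inspect

with open(LOG_FILE, errors="replace") as f:
    recent = f.readlines()[-TAIL_LINES:]

hits = sum(PATTERN in line for line in recent)
if hits:
    print("rtr-server still waiting on the validator (%d recent hits)" % hits)
    sys.exit(1)                             # non-zero exit -> raise an alert
print("rtr-server is getting data from the validator")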

for periodic successful validation runs (thinking about how to trigger an ALIVE signal to something like healthchecks.io)

There is a healthcheck endpoint for the validator:
https://rpki-validator.ripe.net/api/healthcheck

Also, there is a quite useful API call for all the background processes running within the validator:
https://rpki-validator.ripe.net/api/healthcheck/backgrounds
We created it for ourselves in the past while fixing some "it's stuck and not doing anything" bugs, so it can certainly be useful for users as well, especially the "lastFinished" field.

It also produces quite a lot of statistics as Prometheus metrics, so you can monitor validation time, download time, etc.:
https://rpki-validator.ripe.net/actuator/prometheus
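
Tying that together with the healthchecks.io idea above, a minimal sketch (the check UUID is a placeholder, and treating an HTTP 200 from /api/healthcheck as healthy is an assumption):

#!/usr/bin/env python3
"""Forward the validator healthcheck to healthchecks.io as an ALIVE signal.

Sketch only: run it from cron every few minutes; the check UUID is a
placeholder, and "HTTP 200 means healthy" is an assumption.
"""
import urllib.request

HEALTH_URL = "http://localhost:8080/api/healthcheck"   # assumption: local validator
PING_URL = "https://hc-ping.com/your-check-uuid"       # placeholder check UUID

try:
    with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
        healthy = resp.status == 200
except OSError:
    healthy = False

if healthy:
    # Ping only on success; a missed ping makes healthchecks.io raise the alarm.
    urllib.request.urlopen(PING_URL, timeout=10)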

@lukastribus
Author

Ok, thanks.

While I agree it's extremely important to monitor the validator's health via Prometheus or other methods, we all know only too well that this is not what happens in real life in a large number of cases.

That is why I'm looking for a fail-safe approach. To be fail-safe we need to go beyond just providing health information (considering that there is also an RTR server still serving data).

And by the way, the wiki articles currently do not mention monitoring at all, so we also need to be realistic about which failure scenarios people actually think of. If we don't warn people that monitoring is really important, the assumption very often is: if it fails, it fails in a safe way, because no VRPs on the router just means everything is UNKNOWN. That is how ROV was designed. But it doesn't apply to the case where the RTR server keeps serving obsolete VRPs, and that is what makes this dangerous: the underlying assumption that a validator failure is safe.

While this NANOG post is actually about router bugs, the widespread assumptions among network operators are obvious:

we are at fault for not deploying
the validation service in a redundant setup and for failing at monitoring
the service. But we did so because we thought it not to be too important,
because a failed validation service should simply lead to no validation

This assumption keeps people from thinking about proper, actual monitoring (and, for everybody reading this: pinging the validator's IP is not that).

crash bugs in the validation code

There are no such bugs that we are aware of at the moment.

Yes, I'm talking about issues we don't know about. I'm not even talking about specific attacks, I'm just talking about normal bugs or operational issues.

hangs during RPKI validation (even in rsync), that block the entire validation

That is unlikely: external communications have timeout logic and also happen asynchronously from the validation itself, so it is pretty hard for this to happen. One could come up with some sort of slowloris attack, but that would just leave some repositories not updating, without stopping the whole validation process.

There was just an important delay bug: networks expected convergence within 60 minutes while it actually took more than 4 hours (I also don't know whether at hour 4 the operator upgraded the RPKI Validator release, or whether the old release actually caught up for real).

Let's talk about monitoring then: of the deployments you know about, what percentage of people reached out to you regarding #264 in 3.1-2020.08.06.14.39? I guess that is approximately the percentage of users who properly monitor the validator, and I assume that number is discouraging. How could it be otherwise? Nobody ever told operators to be careful about validation failures, or that RTR servers will keep distributing stale VRPs.

admin mistakes

Could you elaborate on this? It's pretty hard to say in general.

For example, the administrator disabling the validator but not the RTR server, by mistake. This could be a fat-finger thing, an honest mistake. The point is: how easy is it to cause the VRPs on the production routers to go stale? Is a single systemctl stop xxx command, shutting down the validator, enough to cause the VRPs to go stale on production routers?

What if the admin wrongly assumes the validator and the RTR server are one process/service, so the validator is shut down instead of the RTR server, with the honest intention of disconnecting the RTR sessions with the routers?

It's obvious that the admin can always deliberately sabotage the setup. I'm not talking about that, I'm talking about mistakes.

It's a good question. I believe there's no timeout there, so I can imagine that in some extreme cases the rtr-server can keep a very outdated cache.

Yes, that's what I'm worried about.

I think it would make more sense to have good monitoring of the validator and make a decision about the VRP cache being up to date based on that monitoring, rather than the rtr-server having a timeout. But it could be a good extra safety handrail.

Yes, in my mind we need to be fail-safe. And we also need to think about networks that do not have senior system engineers available in different time zones across the planet. Sometimes the teams are small and the "Linux guy" just went on vacation for 3 weeks, after applying the "fix" on a VM at 16:55 on a Friday.

for periodic successful validation runs (thinking about how to trigger an ALIVE signal to something like healthchecks.io)

There is a healthcheck endpoint for the validator:
https://rpki-validator.ripe.net/api/healthcheck

Also, there is a quite useful API call for all the background processes running within the validator:
https://rpki-validator.ripe.net/api/healthcheck/backgrounds
We created it for ourselves in the past while fixing some "it's stuck and not doing anything" bugs, so it can certainly be useful for users as well, especially the "lastFinished" field.

It also produces quite a lot of statistics as Prometheus metrics, so you can monitor validation time, download time, etc.:
https://rpki-validator.ripe.net/actuator/prometheus

Yes, this will come in very handy to monitor the health of the validator, thank you.

lukas

@ties
Contributor

ties commented Oct 28, 2020

Hi,

A tiny follow-up, for documentation and clarification:

Yes, in my mind we need to be fail-safe. And we also need to think about networks that do not have senior system engineers available in different time zones across the planet. Sometimes the teams are small and the "Linux guy" just went on vacation for 3 weeks, after applying the "fix" on a VM at 16:55 on a Friday.

I agree that the situation where a router keeps receiving stale VRPs (and newly created routes become INVALID) is dangerous. My feeling is that in this situation detection is the most important thing (together with documented operational procedures).

I would monitor the liveness of the rtr-server by checking that its serial is increasing. You can pick this up from the Prometheus endpoint:

# HELP rtrserver_validated_objects_serial Serial of the cache
# TYPE rtrserver_validated_objects_serial gauge
rtrserver_validated_objects_serial 2463.0

Looking at my historic data, changes(rtrserver_validated_objects_serial[1h]) < 2 would be a suitable alert expression.
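
A minimal sketch of evaluating that expression against a Prometheus server's HTTP API (the Prometheus address is an assumption; the expression is the one above):

#!/usr/bin/env python3
"""Evaluate the rtr-server serial alert via the Prometheus query API.

Sketch only: the Prometheus address is an assumption; the expression is
the one suggested above.
"""
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"   # assumption: your Prometheus server
QUERY = "changes(rtrserver_validated_objects_serial[1h]) < 2"

url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

# PromQL comparisons filter the vector: a non-empty result means the serial
# changed fewer than 2 times in the last hour, i.e. the alert should fire.
if result:
    print("ALERT: rtr-server serial is not increasing")
    sys.exit(1)
print("OK: rtr-server serial is increasing")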

@lukastribus
Author

Yes, I'm currently working on exactly that, just not based on an HTTP endpoint but via RTR (rtrdump-based) instead, so it works for every RTR server.
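
A rough sketch of that approach (the rtrdump flags -connect and -file follow the gortr README as I recall it, so treat them as assumptions, and pick an interval well above the normal validation cycle):

#!/usr/bin/env python3
"""Detect a stale RTR cache by comparing successive rtrdump snapshots.

Sketch only: the rtrdump invocation and the idea that an unchanged dump
over a long interval indicates staleness are both assumptions.
"""
import hashlib
import subprocess
import sys
import time

RTR_SERVER = "127.0.0.1:8282"   # assumption: your rtr-server address
INTERVAL = 3600                 # seconds between the two snapshots

def snapshot(path):
    subprocess.run(["rtrdump", "-connect", RTR_SERVER, "-file", path], check=True)
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

before = snapshot("/tmp/rtr-before.json")
time.sleep(INTERVAL)
after = snapshot("/tmp/rtr-after.json")

if before == after:
    print("ALERT: VRP set unchanged over the whole interval; cache may be stale")
    sys.exit(1)
print("OK: VRP set changed")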

@ties
Contributor

ties commented Oct 28, 2020

Yes, I'm currently working on exactly that, just not based on an HTTP endpoint but via RTR (rtrdump-based) instead, so it works for every RTR server.

That would be great. I agree it is best to do this at the RTR level and that rtrdump or rpki-rtr-client would be good candidates for this added functionality.

I looked at gortr and do not see an identical metric there; however, as is often the case with Prometheus, the metric names just differ. Based on

# HELP rpki_change Last change.
# TYPE rpki_change gauge
rpki_change{path="https://rpki.cloudflare.com/rpki.json"} 1.603909789e+09
# HELP rpki_refresh Last successful request for the given URL.
# TYPE rpki_refresh gauge
rpki_refresh{path="https://rpki.cloudflare.com/rpki.json"} 1.603909789e+09

you can have a similar alert there (time() - rpki_change > 3600 should behave very similarly).
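
If you don't run a Prometheus server in front of it, a minimal sketch of the same check by scraping the metrics directly (the metrics address is an assumption, matching however gortr's -metrics.addr is configured; the metric name comes from the sample above):

#!/usr/bin/env python3
"""Check gortr's rpki_change metric for staleness without a Prometheus server.

Sketch only: the metrics address is an assumption; the metric name and the
3600s threshold come from the discussion above.
"""
import sys
import time
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"   # assumption: gortr metrics address
MAX_AGE = 3600                                  # same threshold as the PromQL above

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    lines = resp.read().decode().splitlines()

# Take the first rpki_change sample (gortr exposes one per source URL);
# the value is a Unix timestamp of the last change.
sample = next(line for line in lines if line.startswith("rpki_change"))
last_change = float(sample.rsplit(" ", 1)[1])

age = time.time() - last_change
if age > MAX_AGE:
    print("ALERT: VRP set last changed %.0fs ago" % age)
    sys.exit(1)
print("OK: VRP set changed %.0fs ago" % age)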
