validation failure scenarios and their impact on RTR #273
Hi Lukas,
There are no such bugs that we are aware of at the moment. In principle, of course, it is possible to drive the validator into, say, an OutOfMemoryException by intentionally introducing some very big objects or crafting the data in some other way. But even if that happens, the validator is usually restarted by systemd. In that case one can, in principle, end up in some sort of restart loop, but we have never seen this in practice. The usual remedy is to bump the memory settings and see if it is able to proceed.
That is unlikely: external communications have timeout logic and also happen asynchronously from the validation itself, so it is pretty hard for this to happen. One could come up with some sort of slow-loris attack, but it would just keep some repositories from updating, not stop the whole validation process.
As I said, it will be restarted by systemd (unless the user changes the setup, of course).
Could you elaborate on this? It's pretty hard to say in general.
It's a good question. I believe there is no timeout there, so I can imagine that in some extreme cases the rtr-server can keep a very outdated cache. I think it would make more sense to have good monitoring of the validator and decide whether the VRP cache is up to date based on that monitoring, rather than having a timeout in the rtr-server. But it could be a good extra safety handrail. The only thing one can do at the moment about the rtr-server is to grep for something like "validator http://validator-url:8080/api/objects/validated not ready yet, will retry later" in the rtr-server log; if that keeps appearing for too long, it's definitely an issue. But again, you would more likely notice the problem much earlier by monitoring the validator itself.
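For what it's worth, here is a minimal sketch of automating that grep (the log path, timestamp format, and thresholds below are assumptions; adjust them to your own setup):

```python
#!/usr/bin/env python3
"""Rough automation of the grep described above: scan the rtr-server log for
the "not ready yet, will retry later" message and warn if it keeps appearing.
The log path, timestamp format, and thresholds are assumptions."""
import re
import sys
from datetime import datetime, timedelta
from pathlib import Path

LOG_FILE = Path("/var/log/rtr-server.log")          # assumed location
PATTERN = "not ready yet, will retry later"
WINDOW = timedelta(minutes=30)                       # look-back window
MAX_RETRIES_IN_WINDOW = 10                           # alert threshold
# Assumed log line prefix like "2021-03-01 12:34:56 ..." -- adjust the regex.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def main() -> int:
    cutoff = datetime.now() - WINDOW
    recent = 0
    for line in LOG_FILE.read_text(errors="replace").splitlines():
        if PATTERN not in line:
            continue
        m = TS_RE.match(line)
        if m and datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S") >= cutoff:
            recent += 1
    if recent >= MAX_RETRIES_IN_WINDOW:
        print(f"ALERT: {recent} 'not ready yet' messages in the last {WINDOW}")
        return 2
    print(f"OK: {recent} 'not ready yet' messages in the last {WINDOW}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```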
There is health-check output for the validator. There is also a quite useful API call covering all the background processes running within the validator. It also produces quite a lot of statistics as Prometheus metrics, so you can monitor validation time, download time, etc.
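As a rough illustration of turning those metrics into an alert, a minimal sketch (the metrics URL and metric name are hypothetical placeholders; check your validator's actual /metrics output and substitute the real names):

```python
#!/usr/bin/env python3
"""Sketch: scrape the validator's Prometheus metrics and alert when the last
successful validation run looks too old. The metrics URL, the metric name, and
the assumption that its value is a Unix timestamp are all placeholders."""
import time
import urllib.request

METRICS_URL = "http://validator-url:8080/metrics"            # assumed path
METRIC_NAME = "rpki_validator_last_successful_validation"    # hypothetical name
MAX_AGE_SECONDS = 3600                                       # alert threshold

def find_metric(body: str, name: str) -> float:
    # Prometheus text format: "<name>{optional labels} <value>"
    for line in body.splitlines():
        if line.startswith((name + " ", name + "{")):
            return float(line.rsplit(" ", 1)[-1])
    raise SystemExit(f"metric {name} not found -- adjust METRIC_NAME")

def main() -> None:
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    age = time.time() - find_metric(body, METRIC_NAME)
    if age > MAX_AGE_SECONDS:
        raise SystemExit(f"ALERT: last successful validation was {age:.0f}s ago")
    print(f"OK: last successful validation {age:.0f}s ago")

if __name__ == "__main__":
    main()
```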
Ok, thanks. While I agree it's extremely important to monitor the validator's health via Prometheus or other methods, we all know just too well that this is not what happens in real life in a large number of cases. That is why I'm looking for a fail-safe approach. To be fail-safe we need to go beyond just providing health information (considering that this is also an RTR server still serving data). And by the way, the wiki articles currently do not mention monitoring at all, so we also need to be realistic about which failure scenarios people actually think of. If we don't warn people that monitoring is really important, the assumption very often is: if it fails, it fails in a safe way, because no VRPs on the router just means everything is UNKNOWN. Because that is how ROV was designed. But that doesn't apply to the case where the RTR server keeps serving obsolete VRPs, and that is what makes this dangerous: the underlying assumption that a validator failure is safe. While this NANOG post is actually about router bugs, the widespread assumptions among network operators are obvious:
This assumption makes people not think about proper, actual monitoring (which, for everybody reading this: pinging the validator's IP is not).
Yes, I'm talking about issues we don't know about. I'm not even talking about specific attacks, I'm just talking about normal bugs or operational issues.
There was just an important delay bug: networks were expecting convergence within 60 minutes while in reality it had not converged even after 4 hours (and I don't know whether at hour 4 the operator upgraded the RPKI validator release, or whether the old release actually caught up for real). Let's talk about monitoring then: what's the percentage of people that reached out to you regarding #264 in
For example, the administrator disabling the validator but not the RTR server, by mistake. This could be a fat-finger thing, an honest mistake. The point is: how easy is it to cause the VRPs on the production routers to go stale? Is a single ...? What if the admin wrongly assumes the validator and RTR server are one process/service, so the validator is shut down instead of the RTR server, with the honest intention of disconnecting the RTR sessions with the routers? It's obvious that the admin can always deliberately sabotage the setup. I'm not talking about that; I'm talking about mistakes.
Yes, that's what I'm worried about.
Yes, in my mind we need to be fail-safe. And we also need to think about networks that do not have senior system engineers available in different timezones across the planet. Sometimes the teams are small and the "linux guy" just went on vacation for 3 weeks, after applying the "fix" on a VM at 16:55 on a Friday.
Yes, this will come in very handy to monitor the health of the validator, thank you.

lukas
Hi, A tiny follow-up, for documentation and clarification:
I agree that the situation where a router keeps receiving stale VRPs (and newly created routes become INVALID) is dangerous. My feeling is that in this situation detection is most important (together with documented operational procedures). I would monitor the liveness of the rtr server by checking that its serial is increasing. You can pick this up from the Prometheus endpoint:
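For illustration, a minimal sketch of such a check (the metrics URL and the serial metric name are assumptions; substitute whatever your rtr server actually exposes; state is kept in a small file so the check can run from cron):

```python
#!/usr/bin/env python3
"""Sketch: alert if the rtr server's serial stops increasing.

The metrics URL and the metric name are assumptions -- substitute whatever your
rtr server actually exposes on its /metrics endpoint. The last seen serial and
the time it last changed are kept in a small JSON state file, so the script can
be run periodically from cron or a monitoring agent."""
import json
import time
import urllib.request
from pathlib import Path

METRICS_URL = "http://rtr-server:9099/metrics"    # assumed address
SERIAL_METRIC = "rtr_serial"                      # hypothetical metric name
STATE_FILE = Path("/var/tmp/rtr_serial_state.json")
MAX_STALE_SECONDS = 3600                          # alert threshold

def current_serial() -> int:
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # Prometheus text format: "<name>{optional labels} <value>"
    for line in body.splitlines():
        if line.startswith((SERIAL_METRIC + " ", SERIAL_METRIC + "{")):
            return int(float(line.rsplit(" ", 1)[-1]))
    raise SystemExit(f"metric {SERIAL_METRIC} not found -- adjust SERIAL_METRIC")

def main() -> None:
    now = time.time()
    serial = current_serial()
    state = {"serial": serial, "changed_at": now}
    if STATE_FILE.exists():
        previous = json.loads(STATE_FILE.read_text())
        if previous["serial"] == serial:
            # Serial unchanged since the last run: keep the older timestamp.
            state["changed_at"] = previous["changed_at"]
    STATE_FILE.write_text(json.dumps(state))
    stale_for = now - state["changed_at"]
    if stale_for > MAX_STALE_SECONDS:
        raise SystemExit(f"ALERT: serial {serial} unchanged for {stale_for:.0f}s")
    print(f"OK: serial {serial}, last change seen {stale_for:.0f}s ago")

if __name__ == "__main__":
    main()
```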
Looking at my historic data
Yes, I'm currently working on exactly that, just not based on an HTTP endpoint but via RTR (rtrdump-based) instead, so it works for every RTR server.
That would be great. I agree it is best to do this at the RTR level and that rtrdump or rpki-rtr-client would be good candidates for this added functionality. I looked at gortr and do not see an identical endpoint there; however, as is often the case with Prometheus, the metric names differ. Based on
you can have a similar alert (...).
Hello,
I'm currently evaluating RPKI RPs and have a few questions.
I'm concerned about bugs, misconfigurations or other issues (in all RP/RTR setups) that will cause obsolete VRPs on the production routers, because I believe this is the worst case in RPKI ROV deployments.
I worry about:
and how those impact the RTR service:
The best-case scenario in my mind is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR end point and stop using it (failing over to other RTR servers, if available).
According to the wiki:
So it seems the expected behavior is to keep the RTR server online as long as possible, is that correct? How would we avoid serving obsolete VRPs to production routers in this case?
I'm also thinking about monitoring (other than parsing logs):
I'd say the api/validation-runs/latest-successful API provides all that information, so we can query it and then feed an external monitoring system with it.
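To make that concrete, a minimal sketch of such a query (the host and port are assumptions, and since the shape of the JSON response isn't shown here the script just dumps it; map the fields you care about onto your monitoring system):

```python
#!/usr/bin/env python3
"""Sketch: pull the latest successful validation run from the validator's API.

The endpoint path comes from the text above; the host/port are assumptions,
and the response field names are not reproduced here, so the payload is simply
dumped for the monitoring system to pick apart."""
import json
import urllib.request

API_URL = "http://validator-url:8080/api/validation-runs/latest-successful"

def main() -> None:
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # Map whatever the real payload contains (e.g. a completion timestamp)
    # onto your monitoring system's check format; here we just print it.
    print(json.dumps(data, indent=2))

if __name__ == "__main__":
    main()
```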