Linux Kernel Safety First Principles #36
Changes from all commits
a900f91
a385648
594d183
418f6e0
11feb9a
4fe29e2
c007638
I would remove "at unit-level", since the concept of a unit usually applies to newly developed SW, not to the qualification of pre-existing SW.
"Unit" is meant to represent the granularity chosen for the qualification, whatever that might be.
I'm trying to say SEooC without using the terminology, but that is the intended meaning.
As long as the meaning is preserved, I'm open to reformulating it.
@reiterative I'm almost OK with this, but I do not like that "only", because it seems to imply that something can be done on components to prevent the cascading of interference.
In theory yes, you could add redundancy and whatnot, but in practice I don't think anyone will be able to make components sufficiently hardened.
If that is what you are after, then let's spell it out. I want to paint an accurate picture of the problem without closing avenues that are open, but I do not want to present a rocky mountain path as if it were a boulevard either :-D
What do you mean here by "generic solution"? Do you mean "currently there is no design solution to protect writable safety-relevant memory"?
"Currently" is implied but, again, until something is proven to work, it's an absolute statement.
It can be rephrased, if it helps with readability, without altering the meaning.
My point is that the "Very complex systematic corruption" cannot be detected through solutions that are applicable to, let's say, the memory map of a safety-relevant process, which might comprise tens or hundreds of MB of memory. You can certainly try to use ad-hoc verification, e.g. performing functional testing on that process, but it needs to be tailored to each individual process. There is no generic solution, only a generic methodology, to be tailored.
I think we can make a stronger (but narrower) statement here:
@reiterative: "neither prevented nor detected" is the right formulation, I think.
@igor-stoppa I think your statement is not correct. For example, CONFIG_STACKPROTECTOR_STRONG and KASAN are generic kernel solutions that detect corruption of writable memory. So I would stick with Paul's rewording.
@paolonig They are partial solutions. Placing canaries on the stack doesn't catch all possible corruptions.
KASAN alters the system configuration and, in any case, is not deployed.
Feel free to provide evidence countering my arguments; they are intentionally made in a very clean-cut way, to allow easy refutation by evidence, if any can be found.
Yes, they are partial generic kernel-level solutions, I agree. However, an integrator or a SW vendor may use such partial solutions in combination with other measures to support a non-interference claim.
So IMO the fact is that the monolithic nature of the kernel does not allow, today, interference prevention (like the one an SMMU can give you for a user space process), and we can all agree on this.
What I cannot agree with are principles that also imply or deny specific ways to support a certain claim, since the specific ways depend on the specification of the integration context and on other project aspects (depending on the engineering workforce, I may decide to use some verification strategies rather than others).
I would remove "simple", as it is too debatable IMO.
simple -> direct ?
I disagree on this point. For instance even if an external safety monitor is required to meet the FDTI claim, it is still possible to tune the system and stress test it to claim that the unavailability risk due to temporal interference is acceptable
Feel free to estimate in a verifiable way said probability, then.
We can review the soundness of the estimation.
You can do that only if you define what good-enough RT performance is. BTW, this paper and the toolset mentioned within it are a good starting point.
I'm not talking about RT performance, at least not on the time scale commonly intended.
I'm talking about overheads of minutes (or more) on basic operations.
Can you give an example of a basic operation and also explain what is preventing it from running?
Imagine a set of processes implementing machine vision using tens of GB of RAM.
Try to monitor that in real time for corruption caused by the kernel ...
But here the topic is availability, not safety qualification....
Safety requirements cover both FFI and availability, for higher ASIL. No?
Most of the use cases I worked on did not require the OS to be safe operational, i.e. panicking the OS would bring the system to an acceptable safe state (with a very upset driver, hence the need for availability, which however is not related to safety).
Ok, then just keep the system off :-)
Joking aside, I think that in order to agree on such principles we first need to define the integration context.
As I mentioned, all the customers I worked with usually accept having an availability test campaign that is separate from safety (since safety is guaranteed by the external monitor in case the FDTI deadline is missed).
These principles are not meant to be decided democratically by what any one customer might think.
I understand "the customer is always right" approach, but that doesn't belong to this document.
Anyone is free to take the risks they prefer, small or large.
This document is not about that. It's about separating assessments that can be done with very high confidence, if not certainty, from others that are based on subjective analysis.
For example, can any of your customers prove that they have determined the coverage of that stress testing?
Performing the same or a similar operation billions of times doesn't say anything about other scenarios that might arise.
It can be done for sure in simple scenarios, but in more complex ones it cannot.
Again, I'm trying to make it easy to prove me wrong, by making fairly clean-cut statements.
Just add evidence ...
The above points are correct, and this is why a QM baseline is always required to support availability, but the same baseline applies to NSR components... so what is the goal here? Wouldn't it be easier to say "a QMS baseline is required to meet the product's availability requirements"?
The point is that a QM component is by definition less safe than an ASIL one.
Even in the presence of perfect detection of QM->ASIL interference, the more QM components can interfere with the ASIL one, the lower the MTBF.
A QM baseline might not be enough to meet availability requirements.
Sorry, but you are wrong here. A detected interference is not a failure of the system (for fail-safe systems), and hence it does not contribute to the final MTBF target. For safe operational systems the story is different, as detecting the interference is not a good-enough mechanism (but then you need to disambiguate this clearly, as availability is usually considered separate from safety).
A detected interference tells you that an interference happened, but it doesn't tell you that the system is still safe.
It tells you that something is generating interference. And unless you have an ad hoc way of doing more than that, the only solution (in a non-redundant system) is to put it into safe mode. In practice that constitutes a functional failure, and it lowers the MTBF.
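As a rough illustration of the MTBF argument, the sketch below assumes that every detected interference forces the safe state and is therefore counted as a functional failure (the position taken in this comment, which is contested elsewhere in the thread), and that all events are independent with constant rates. The function name and all the numbers are hypothetical, chosen only to show the trend.

```python
def mtbf_hours(rates_per_hour):
    """MTBF of a system that stops on the first event.

    Assumes independent events with constant rates (exponential model),
    so the rates simply add up and MTBF = 1 / sum(rates).
    """
    total = sum(rates_per_hour)
    return float("inf") if total == 0 else 1.0 / total


# Hypothetical figures, for illustration only.
ASIL_FAILURE_RATE = 1e-6      # intrinsic failures of the ASIL component, per hour
QM_INTERFERENCE_RATE = 1e-5   # detected interference per QM component, per hour

for n_qm in (0, 1, 10):
    rates = [ASIL_FAILURE_RATE] + [QM_INTERFERENCE_RATE] * n_qm
    print(f"{n_qm:2d} interfering QM components -> MTBF {mtbf_hours(rates):.0f} h")
```

Under these assumptions each additional QM source of interference adds its rate to the total, so the MTBF of the intended functionality can only go down, however good the detection is. Whether a transition to the safe state should be counted as a failure is exactly the point under discussion.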
I'm happy to reword it for improved clarity, but the meaning is what I wrote above.
As above, it all depends on your system and on your integration context.
If for the integrator it is safety critical to avoid interference, since detection alone would still lead to a hazard, then you're right.
If instead detecting interference triggers a safe state that relies on a backup system (for example, pulling over the car), then you're not right.
So I think we should better clarify the assumptions on the integration context...
The backup system is out of the scope of this document.
You can very well consider it as confirmation of my point.
This is talking about the availability of the intended functionality of the "product", but let's agree that it is a car.
A pull over is a failure. It's not a safety failure, certainly. But it IS a failure of the product.
We can add a clarification about what MTBF means in this context (what I just wrote), if it can help.
rather than "external" I would say "independent"
can you name an independent internal component?
The combination of a Runtime Verification Monitor and an external WDT can be used as an independent monitor, for example.
the external WD is not internal :-D
Anyway, the RVM doesn't provide that functionality out of the box; you have to feed it the models.
Yes of course, you need to design your specific RVM. It was just an example to justify independent as a better word compared to external....
Ehh. Perhaps it's splitting hairs, but it is not independent. It uses library code from the kernel itself.
Here I would say "Make sure that the interference modes are avoided or detected either at the source or at the recipient side with a sufficient level of rigor".
No, it's intentionally excluding the source because the whole point is that it's impossible to claim completeness about sources.
Again, if you disagree, feel free to prove me wrong and enumerate all the possible sources.
CONFIG_STACKPROTECTOR_STRONG is a mechanism that can detect interference both at the source (the stack frame of the function where interference is generated) as well as at the recipient (the stack frame of a function where the interference trespasses the canaries).
KASAN is a mechanism that detects interference at the source.
The stack protector adds a canary in certain places. Even if canaries are added to all the similar places, interference that doesn't kill a canary doesn't get noticed.
Again, feel free to prove me wrong (maybe I should find an acronym for this, considering how many times I've written it).
As for KASAN, just to give an example, it can detect out-of-bounds errors. But if there is an error that doesn't systematically cross boundaries, it will not necessarily be detected.
That is just an easy example, off the top of my head ...
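To make the canary argument concrete, here is a toy model in plain Python. It is not kernel code and does not reflect how CONFIG_STACKPROTECTOR_STRONG or KASAN are implemented; the `Frame` class, its layout and the canary value are all hypothetical, invented only to show the difference between a linear overflow and a write that lands directly on the data.

```python
CANARY = 0xA5  # hypothetical canary value


class Frame:
    """Toy memory layout: [buffer | canary | safety-relevant data]."""

    def __init__(self, buf_len, data):
        self.buf_len = buf_len
        self.mem = bytearray(buf_len) + bytes([CANARY]) + bytes(data)

    def write(self, offset, payload):
        """Unchecked write relative to the start of the buffer (a buggy copy)."""
        self.mem[offset:offset + len(payload)] = payload

    def canary_intact(self):
        return self.mem[self.buf_len] == CANARY

    def data(self):
        return bytes(self.mem[self.buf_len + 1:])


# Case 1: linear overflow of the buffer. It runs through the canary.
overflow = Frame(4, b"\x01\x02")
overflow.write(0, b"\xff" * 6)

# Case 2: stray write that lands directly on the data, skipping the canary.
wild = Frame(4, b"\x01\x02")
wild.write(5, b"\x00")

print("overflow: canary intact =", overflow.canary_intact(), "data =", overflow.data())
print("wild:     canary intact =", wild.canary_intact(), "data =", wild.data())
```

In this model the overflow is noticed because it destroys the canary, while the stray write corrupts the data and leaves the canary untouched, which is the kind of interference described above as going unnoticed.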
IMO point 6 is about "Make sure that mechanisms used to prevent or detect interference are properly verified. In the case of HW mechanisms, periodic or on-demand diagnostic tests may be required".
This is not about HW, or at least not necessarily.
It's about using QM (or lower) SW while performing E2E verification that it's still functioning correctly.
But then, rather than E2E verification, it seems to be an actual ASIL decomposition (and in this case we need to make sure that an appropriate safety analysis is done on the QM element, so that the ASIL monitors cover all the dangerous failure modes)...
If one end can make assumptions about the other end, and validate them for correctness, isn't it E2E?
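One possible reading of "E2E" in this exchange is the common pattern where the sender adds a sequence counter and a checksum, the data crosses an untrusted (QM) path, and the receiver validates both. The sketch below is only an illustration of that pattern under stated assumptions: the frame layout, the `protect` and `check` names and the use of CRC32 are hypothetical choices, not something taken from the document under review.

```python
import struct
import zlib


def protect(counter, payload):
    """Sender side: prepend a 32-bit sequence counter, append a CRC32 over both."""
    body = struct.pack(">I", counter) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def check(frame, expected_counter):
    """Receiver side: return the payload if CRC and counter match, else None."""
    if len(frame) < 8:
        return None
    body = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    if zlib.crc32(body) != crc:
        return None  # corrupted somewhere along the QM path
    (counter,) = struct.unpack(">I", body[:4])
    if counter != expected_counter:
        return None  # stale, repeated or lost frame
    return body[4:]


frame = protect(7, b"brake=0")

# Simulate the QM path flipping one bit of the payload.
corrupted = bytearray(frame)
corrupted[5] ^= 0x01

print(check(frame, 7))             # intact frame, expected counter
print(check(bytes(corrupted), 7))  # bit flip
print(check(frame, 8))             # receiver expected the next frame
```

The receiver here validates its assumptions about the sender (counter progression, checksum) without trusting anything in between. It says nothing about whether the sender computed the right payload in the first place, which is where the ASIL decomposition question raised above comes in.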