Linux Kernel Safety First Principles #36

Status: Open. Wants to merge 7 commits into `main`.

File: `Contributions/Linux_Kernel_Safety_First_Principles.md` (141 additions, 0 deletions)
# **Linux Kernel Safety - First Principles**

## Index

[Terms and Abbreviations](#terms-and-abbreviations)

[References](#references)

[Disclaimer](#disclaimer)

[Purpose of the document](#purpose-of-the-document)

[Structure of the document](#structure-of-the-document)

[Philosophy of the document](#philosophy-of-the-document)

[First Principles - Integrity](#first-principles---integrity)

[First Principles - Availability](#first-principles---availability)

[Considerations - Toward the safety goals](#considerations---toward-the-safety-goals)

[License: CC BY-SA 4.0](#license-cc-by-sa-40)

## **Terms and Abbreviations**

See the matching section in *Interference Scenarios for an ARM64 Linux System*.

## **References**

1. ***Interference Scenarios for an ARM64 Linux System***
2. ***Safety Requirements for a Generic Linux System***
3. ***CheckList for Safety Claims on a Generic Linux System***
4. ***Using Linux in a Safe System***
5. ***Linux Memory Management Essentials***
6. License: [CC BY-SA 4.0 Deed | Attribution-ShareAlike 4.0 International | Creative Commons](https://creativecommons.org/licenses/by-sa/4.0/)


## **Disclaimer**
This is not intended to be a complete list of facts; rather, it attempts
to summarise the most relevant ones from the documents listed in the
References section.

## **Purpose of the document**
Summarise the concepts from the documents in the References section.
Justifications for each statement can be found throughout those documents.

## **Structure of the document**
The next section illustrates the philosophy behind the approach used here to
deal with safety requirements in a system using Linux.
The following two sections about "First Principles" address two aspects of safety:

- integrity: reliably detecting interference with components that have safety requirements
- availability: sustained integrity of the safety-relevant functionality over time

The last section lists considerations that can help with reaching the safety goals.


## **Philosophy of the Document**
* A system subject to safety goals might employ the Linux Kernel.
* Safety goals are specific to the overall system and the related use cases.
* Similarly, one must expect that the solution will be tailored to both system and goals.
* However, it is possible to describe the Linux Kernel from a safety point of view.
* Not all of the issues listed will necessarily be relevant to every system/application.
* A Safety Analysis should, however, clearly state which issues are relevant, which are not, and why.


## **First Principles - Integrity**
This section provides touchstone concepts about interference by QM components with
safety-qualified ones.

1. The vanilla Linux Kernel alone is not sufficient to support safety goals.
2. There are no internal kernel barriers/protections against interference with variable data, including safety-relevant data.
3. Any component within the kernel context can generate (cascaded) interference.
4. Internal interference can also cascade through components that have been qualified for safety at unit level.
> **Contributor:** I would remove "at unit-level", as the concept of a unit usually applies to newly developed SW, not to qualifying pre-existing SW.
>
> **Author:** "Unit" is meant to represent the granularity chosen for the qualification, whatever that might be. I'm trying to say SEooC without using the terminology, but that is the intended meaning. As long as the meaning is preserved, I'm open to re-formulating it.
>
> **Collaborator:** Suggested change:
> > 4. Internal interference can also cascade through components that have only been safety-qualified at a component level.
>
> **Author:** @reiterative I'm almost OK with this, but I do not like that "only", because it seems to imply that something can be done on components to prevent the cascading of interference. In theory, yes, you could add redundancy and whatnot, but in practice I don't think anyone will be able to make components sufficiently hardened. If that is what you are after, then let's spell it out. I want to paint an accurate picture of the problem, without closing avenues that are open, but I do not want to present a rocky mountain path as if it were a boulevard either :-D

5. The Linux Kernel is able to interfere with both itself and any part of user-space processes (through the linear map); see the first sketch at the end of this section.
6. No generic solution exists for very complex systematic corruption of any writable memory.
> **Contributor:** What do you mean here by "generic solution"? Do you mean "currently there is no design solution to protect writable safety-relevant memory"?
>
> **Author:** "Currently" is implied but, again, until something is proven to work, it stands as an absolute statement. It can be rephrased, if that helps with readability, without altering the meaning. My point is that "very complex systematic corruption" cannot be detected through solutions that are applicable to, let's say, the memory map of a safety-relevant process, which might comprise tens or hundreds of MB of memory. You can certainly try to use ad-hoc verification, e.g. performing functional testing on that process, but it needs to be tailored to each individual process. No generic solution; only a generic methodology, to be tailored.
>
> **Collaborator:** I think we can make a stronger (but narrower) statement here. Suggested change:
> > 6. Complex systematic corruption of writable memory cannot be prevented by a generic, kernel-level solution.
>
> **Author:** @reiterative: "neither prevented nor detected" is the right formulation, I think.
>
> **Contributor:** @igor-stoppa I think your statement is not correct. For example, CONFIG_STACKPROTECTOR_STRONG or KASAN are generic kernel solutions that provide detection of corruption of writable memory. So I would stick with Paul's rewording.
>
> **Author:** @paolonig They are partial solutions. Placing canaries on the stack doesn't catch all the possible corruptions. KASAN alters the system configuration, and anyway is not deployed. Feel free to provide evidence countering my arguments; they are intentionally made in a very clean-cut way, to allow easy confutation by evidence, if any can be found.
>
> **Contributor:** Yes, they are partial generic kernel-level solutions, I agree. However, an integrator or a SW vendor may use such partial solutions in combination with other measures to support a non-interference claim. So IMO the fact is that the monolithic nature of the Kernel does not allow, today, interference prevention (like the one you can have via the SMMU for a user-space process), and we can all agree on this. What I cannot agree with is principles that also imply or deny specific ways to support a certain claim, since the specific ways depend on the specification of the integration context and of other project aspects (depending on the engineering workforce, I may decide to use some verification strategies rather than others...).

7. Stress testing alone is not sufficient for supporting safety claims.
8. Where safety claims involving the Linux Kernel are supported by testing, both positive, requirements-based testing and focused, analysis-led negative testing are required.
9. Security features based on randomisation decrease the repeatability of testing (e.g. structure layout randomisation); see the second sketch at the end of this section.
10. If safety claims relating to one component are based on support by another, then the supporting component must have the same or better level of safety integrity (e.g. a safety-qualified watchdog).
11. Any other process could potentially interfere with a safety-qualified process, indirectly, through the kernel (e.g. a memory management bug could trigger interference between such processes).
12. Using cgroups/containers/SELinux removes only certain simple types of user-space-induced interference.
> **Contributor:** I would remove "simple", as it is too debatable IMO.
>
> **Author:** simple -> direct?

13. In a mixed-criticality scenario, all unqualified code represents a potential risk of interference, which grows with the frequency of execution; this includes code (e.g. cgroups/containers, LSM/SELinux, etc.) that may be intended to manage aspects of that risk.
14. HW enhancements are not a catch-all solution (e.g. ECC memory doesn't prevent interference from non-safety-qualified SW).
15. When there are many possible sources of interference in a system, we can only reliably model and detect it in the receiving context.
16. The precise timing of allocation (e.g. during system or process startup) and retention of dynamically managed shared resources (e.g. process memory pages) is difficult to predict in a Linux-based system.
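
To make principles 2, 3 and 5 concrete, here is a minimal, hypothetical kernel-module sketch (the variable names, the adjacency of the two variables, and the bug itself are invented for illustration): any QM code running in kernel context can silently overwrite safety-relevant data, because the whole kernel shares one writable address space.

```c
/*
 * Hypothetical illustration of principles 2, 3 and 5: in the shared
 * kernel address space, a bug in any component can corrupt any
 * writable data. Do not load anything like this on a real system.
 */
#include <linux/module.h>
#include <linux/kernel.h>

/* Stand-in for safety-relevant state owned by a qualified component. */
static int safety_relevant_limit = 100;

/* Stand-in for a buffer owned by an unrelated QM driver. */
static int qm_driver_buf[4];

static int __init interference_demo_init(void)
{
	int i;

	/*
	 * Classic off-by-one bug in QM code: the write at i == 4 lands
	 * on whatever the linker placed next to the buffer. No kernel
	 * barrier detects or prevents it at runtime (principle 2).
	 */
	for (i = 0; i <= 4; i++)	/* BUG: should be i < 4 */
		qm_driver_buf[i] = 0;

	pr_info("limit after QM bug: %d\n", safety_relevant_limit);
	return 0;
}

static void __exit interference_demo_exit(void)
{
}

module_init(interference_demo_init);
module_exit(interference_demo_exit);
MODULE_LICENSE("GPL");
```

Whether the stray write actually hits `safety_relevant_limit` depends on the layout chosen by the compiler and linker; that unpredictability is precisely why the document argues that interference can only be reliably modelled at the recipient (principle 15).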

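Principle 9 can also be illustrated with the kernel's structure-layout randomisation. The `__randomize_layout` attribute and the randstruct plugin are real kernel infrastructure (`CONFIG_RANDSTRUCT` on recent kernels); the struct below is a hypothetical example.

```c
/*
 * With randstruct enabled, the compiler plugin shuffles the fields of
 * structs marked __randomize_layout at build time, so field offsets
 * differ from one kernel build to the next.
 */
#include <linux/types.h>
#include <linux/compiler.h>

struct sensor_state {
	u64 timestamp;
	u32 reading;	/* safety-relevant value */
	u32 status;
} __randomize_layout;

/*
 * Consequence for testing: a negative test calibrated against
 * offsetof(struct sensor_state, reading) on one build may hit
 * `status` or padding on the next build, so the test evidence is
 * not reproducible across builds, as principle 9 states.
 */
```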

## **First Principles - Availability**
This section provides touchstone concepts about (lack of) system availability in the presence of non-safety-qualified components.

1. Detecting interference alone doesn't help with controlling/managing availability.
2. The nature of very complex systematic interference makes estimating the probability of a failure unrealistic.
> **Contributor:** I disagree with this point. For instance, even if an external safety monitor is required to meet the FDTI claim, it is still possible to tune the system and stress test it to claim that the unavailability risk due to temporal interference is acceptable.
>
> **Author:** Feel free to estimate, in a verifiable way, said probability, then. We can review the soundness of the estimation.
>
> **Contributor:** You can do that only if you define what good-enough RT performance is. BTW, you can start from this paper and the toolset mentioned within it as a good starting point.
>
> **Author:** I'm not talking about RT performance, at least not on the time scale commonly intended. I'm talking about overheads of minutes (or more) on basic operations.
>
> **Contributor:** Can you give an example of a basic operation, and also explain what is preventing it from running?
>
> **Author:** Imagine a set of processes implementing machine vision using tens of GB of RAM. Try to monitor that in real time for corruption caused by the kernel...

3. Stress testing as a means of safety qualification might be realistic at most for very simple cases.
> **Contributor:** But here the topic is availability, not safety qualification...
>
> **Author:** Safety requirements cover both FFI and availability, for higher ASILs. No?
>
> **Contributor:** Most of the use cases I worked on did not require the OS to be safe-operational, i.e. panicking the OS would lead the system to an acceptable safe state (with a very upset driver, hence the need for availability, which however is not related to safety).
>
> **Author:** OK, then just keep the system off :-)
>
> **Contributor:** Jokes apart, I think that in order to agree on such principles we first need to define the integration context. As I mentioned, all the customers I worked with usually accept having an availability test campaign separate from safety (since safety is guaranteed by the external monitor in case of missing the FDTI deadline).
>
> **Author:** These principles are not meant to be democratically decided by what any one customer might think. I understand the "the customer is always right" approach, but that doesn't belong in this document. Anyone is free to take the risks they prefer, small or large. This document is not about that: it's about separating assessments that can be made with very high confidence, if not certainty, from others that are based on subjective analysis. For example, can any of your customers prove that they have determined the coverage of that stress testing? Performing the same or similar operation billions of times doesn't say anything about other scenarios that might arise. It can be done for sure in simple scenarios, but in more complex ones it cannot. Again, I'm trying to make it easy to prove me wrong, by making fairly clean-cut statements. Just add evidence...

4. Qualitatively, the likelihood of a failure grows with the number of invocations of non-safety-qualified components (a rough quantitative sketch follows the discussion below). For example:
    1. invocation of device drivers, especially if not related to safety operations
    2. syscalls
    3. memory allocation/release operations
    4. I/O, e.g. storage, network
    5. evaluation of Linux Security Module hooks
    6. evaluation of cgroups tests
    7. presence of non-safety-relevant software triggering many of the above points

> **Contributor:** The above points are correct, and this is why a QM baseline is always required to support availability; but the same baseline applies to NSR components... so what is the goal here? Wouldn't it be easier to say "a QMS baseline is required to meet the product's availability requirements"?
>
> **Author:** The point is that a QM component is by definition less safe than an ASIL one. Even in the presence of perfect detection of QM->ASIL interference, the more QM components can interfere with the ASIL one, the lower the MTBF. A QM baseline might not be enough to meet availability requirements.
>
> **Contributor:** Sorry, but you are wrong here. A detected interference is not a failure of the system (for fail-safe systems) and hence it does not contribute to the final MTBF target. For safe-operational systems the story is different, as detecting the interference is not a good-enough mechanism (but then you need to disambiguate this clearly, as usually availability is considered separate from safety).
>
> **Author:** A detected interference tells you that an interference happened, but it doesn't tell you that the system is still safe. It tells you that something is generating interference. And unless you have an ad-hoc way of doing more than that, the only solution (in a non-redundant system) is to put it into safe mode. That constitutes, in practice, a functional failure, and it lowers the MTBF.
>
> **Author:** I'm happy to reword it, for improved clarity, but the meaning is what I wrote above.
>
> **Contributor:** As above, it all depends on your system, on your integration context. If for the integrator it is safety-critical to avoid interference, since detection alone would still lead to a hazard, then you're right. If instead detecting interference triggers a safe state that relies on a backup system (for example, pulling over the car), then you're not right. So I think we should better clarify the assumptions about the integration context...
>
> **Author:** The backup system is out of the scope of this document; you can very well consider it a confirmation of my point. This is talking about availability of the intended functionality of the "product", but let's agree it to be a car. A pull-over is a failure. It's not a safety failure, certainly, but it IS a failure of the product. We can add a clarification about what MTBF means in this context (what I just wrote), if it can help.
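
A rough quantitative reading of principle 4 above (a back-of-envelope model, not taken from the referenced documents): if each invocation of a non-safety-qualified component were to cause interference independently with some small probability $p$, then over $n$ invocations the probability of at least one failure would be

$$P_{\text{fail}}(n) = 1 - (1 - p)^n \approx n\,p \qquad (n\,p \ll 1)$$

i.e. roughly linear growth with the invocation count. This is only an intuition pump: the independence and constant-$p$ assumptions are exactly what very complex systematic interference violates (principle 2), which is why the document treats realistic probability estimation as out of reach.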


## **Considerations - Toward the safety goals**
Considerations based on the two previous sections.

1. Devise negative testing simulating interference toward safety-relevant components (a sketch follows at the end of this section).
2. Rely only on external components/mechanisms that are already qualified at the same or better safety level.
> **Contributor:** Rather than "external" I would say "independent".
>
> **Author:** Can you name an independent internal component?
>
> **Contributor:** The combination of a Runtime Verification Monitor and an external WDT can be used as an independent monitor, for example.
>
> **Author:** The external WD is not internal :-D
>
> **Author:** Anyway, the RVM out of the box doesn't provide that functionality; you have to feed it the models.
>
> **Contributor:** Yes, of course, you need to design your specific RVM. It was just an example to justify "independent" as a better word compared to "external"...
>
> **Author:** Ehh. Perhaps it's splitting hairs, but it is not independent: it uses library code from the kernel itself.

3. Analyse the effects of interference based on the recipient (e.g. corrupted data is harder to notice); a sketch follows at the end of this section.
> **Contributor:** Here I would say "Make sure that the interference modes are avoided or detected, either at the source or at the recipient side, with a sufficient level of rigour".
>
> **Author:** No, it's intentionally excluding the source, because the whole point is that it's impossible to claim completeness about sources. Again, if you disagree, feel free to prove me wrong and enumerate all the possible sources.
>
> **Contributor:** CONFIG_STACKPROTECTOR_STRONG is a mechanism that can detect interference both at the source (the stack frame of the function where the interference is generated) and at the recipient (the stack frame of a function where the interference trespasses the canaries). KASAN is a mechanism that detects interference at the source.
>
> **Author:** The stack protector adds a canary in certain places; even adding canaries to all the similar places, interference that doesn't kill a canary doesn't get noticed. Again, feel free to prove me wrong (maybe I should find an acronym for this, considering how many times I've written it). Wrt KASAN, just to give an example: it can detect out-of-bounds errors, but an error that doesn't systematically cross boundaries will not necessarily be detected. That is just an easy example, off the top of my head...

4. When possible, leverage the system configuration (e.g. external safety island / external monitor); a sketch follows at the end of this section.
5. Pick an adequate granularity for verification (e.g. fine-grained component validation vs end-to-end).
6. In the case of end-to-end verification, choose a strategy: periodic test injection, or on-demand.
> **Contributor:** IMO point 6 is about "Make sure that mechanisms used to prevent or detect interference are properly verified. In the case of HW mechanisms, periodic or on-demand diagnostic tests may be required".
>
> **Author:** This is not about HW, not necessarily. It's about using QM (or lower) SW while performing E2E verification that it's still functioning correctly.
>
> **Contributor:** But then, rather than E2E verification, it seems an actual ASIL decomposition (and in this case we need to make sure that an appropriate safety analysis is done on the QM element, to ensure the ASIL monitors cover all the dangerous failure modes)...
>
> **Author:** If one end can make assumptions about the other end, and validate them for correctness, isn't it E2E?
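
Three sketches follow, one for each of the forward references above. First, consideration 1 (negative testing): a user-space harness can simulate interference by corrupting a byte inside a victim process from the outside, then check that the recipient-side detection fires. `process_vm_writev()` is a real Linux syscall; the victim PID, the target address, and the existence of an integrity monitor in the victim are hypothetical test fixtures.

```c
/*
 * Hypothetical negative-test harness: inject one corrupted byte into
 * a victim process, then the test expects the victim's own integrity
 * check (see next sketch) to report it. Requires ptrace permission
 * over the target (e.g. CAP_SYS_PTRACE).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

int inject_corruption(pid_t victim, void *remote_addr)
{
	unsigned char garbage = 0xA5;	/* simulated interference */
	struct iovec local  = { .iov_base = &garbage,    .iov_len = 1 };
	struct iovec remote = { .iov_base = remote_addr, .iov_len = 1 };

	if (process_vm_writev(victim, &local, 1, &remote, 1, 0) != 1) {
		perror("process_vm_writev");
		return -1;	/* the injection itself failed */
	}
	return 0;		/* corruption injected; await detection */
}
```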

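Second, consideration 3 (analyse interference at the recipient): a common recipient-side measure is to seal safety-relevant data with a checksum and re-verify it immediately before each use. The data structure and the reaction are hypothetical; `crc32()` is zlib's.

```c
/*
 * Hypothetical recipient-side integrity check: the consumer of
 * safety-relevant data re-verifies a seal before every use, instead
 * of trying to enumerate all possible sources of corruption.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

struct guarded_params {
	double   limits[8];	/* safety-relevant configuration */
	uint32_t crc;		/* seal, set after the last legitimate write */
};

static uint32_t seal(const struct guarded_params *p)
{
	return crc32(0L, (const unsigned char *)p->limits,
		     sizeof(p->limits));
}

void use_params(const struct guarded_params *p)
{
	if (seal(p) != p->crc) {
		/* Interference detected at the recipient: safe state. */
		fprintf(stderr, "integrity check failed\n");
		abort();
	}
	/* ... use p->limits ... */
}
```

Note the limitation discussed in the threads above: a checksum detects corruption of this particular region, but scaling such checks to the tens of GB of a large process is where the "no generic solution" principle bites.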

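Third, consideration 4 (leverage the system configuration): a classic pattern is a challenge-response protocol with an external, safety-qualified watchdog, so that forcing the safe state does not depend on Linux itself (cf. integrity principle 10). The transport functions are hypothetical prototypes; the pattern, not the code, is the point.

```c
/*
 * Hypothetical challenge-response kick of an external watchdog: the
 * watchdog stays quiet only while the monitored software proves both
 * liveness and computational integrity within the expected window.
 */
#include <stdint.h>

/* Transport to the external watchdog: hypothetical prototypes. */
uint32_t wdt_read_challenge(void);
void wdt_write_response(uint32_t response);

/* Known-answer computation exercising the safety-relevant path. */
static uint32_t compute_response(uint32_t challenge)
{
	return challenge ^ 0xDEADBEEF;	/* placeholder transform */
}

void watchdog_kick_cycle(void)
{
	uint32_t challenge = wdt_read_challenge();

	/*
	 * If the kernel stalls, or interference breaks the computation,
	 * the response is late or wrong, and the external watchdog
	 * forces the safe state independently of Linux.
	 */
	wdt_write_response(compute_response(challenge));
}
```

The same skeleton serves consideration 6: the challenge can be a test vector driven end-to-end through the safety function, injected periodically or on demand.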

## **License: CC BY-SA 4.0**

### **DEED**
### **Attribution-ShareAlike 4.0 International**

Full License text: <https://creativecommons.org/licenses/by-sa/4.0/>

**You are free to:**

* **Share** — copy and redistribute the material in any medium or format for any purpose, even commercially.

* **Adapt** — remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

**Under the following terms:**

* **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

* **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

* **No additional restrictions** — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

**Notices:**

You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.

No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.