-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
assert() in main-line code #520
Comments
I would contend that we expend too much effort trying to ensure that the user application isn't misbehaving. The reality is that we've published the API specifications and if a user violates those specifications (by bugs or otherwise) then it is IMO acceptable that the software crashes. I actually like the idea that we put this error-checking argument validation stuff in a macro which could be turned off for "normal" builds, maybe enabled only during early development. Basically similar to what |
I agree with your comments, Joe. I think that a lot of our error checking code is akin to asserts, so having them being formal assert statements rather than "if (ptr == NULL) { SendEvent(MSG); status=ERROR}" it would be normalized and simplified to an "assert_true(ptr != NULL, MSG, ERROR)" or somesuch. |
Yes - I would take it a step further though and basically abort (crash) if an assert fails, rather than returning early with a code, for many reasons:
|
I'm a bit concerned with an approach where the core software "crashes" based on bad parameters to an API or lacks checking of those parameters. That really is opposite to many of the design principals, processes, and requirements levied on us for developing certifiable software. Definitely report the error (get your early feedback), and I'd be fine with the option to abort/exit (could redefine the assert actions as needed), but I don't think forcing an exit/abort is a good idea. |
When built in "release" mode they are compiled out. They only have an effect when building in "debug" mode. This is the standard behavior of the standard assert.h and there should be a similar behavior of whatever is implemented here. The reality is there isn't much value in constantly expending CPU cycles to validate parameter inputs from app code that probably isn't checking/handling the error return code anyway. Again, if the app is buggy enough to pass in a NULL string, is it reasonable to think it will properly handle the During early debug, when these problems are most prevalent, the preference should be to catch it early and get a backtrace. During deployment/flight, this should never happen, because the code should be debugged already, so it can be compiled out. |
Removing input validation is in direct violation (as interpreted by GSFC) of SWE-134 part g and k from NPR 7150.2C as well as the GSFC software development and unit test standards required by SWE-061, SWE-062. Feel free to argue it with our software division chief engineer, but I will not accept software modification that reduce or eliminate the validation of parameter inputs for APIs or commands. This really is a non-starter, unacceptable design approach, etc. I'm OK with improving the design related to the checks, which is what I understood the origin of this issue to be. |
I have a couple of thoughts:
The assert based macros could certainly make our parameter validation and error handling more consistent, and it could enable the debug mode that makes catching these development errors easier. Is at least that step compatible with the standards and goals stated in this thread? |
I am thinking of taking this step-wise...First replacing if(CHECKFAILS) { status=ERR; SendEvent(MSG)} with assert(CHECK, ERR, MSG) standardization, where the behavior is the same as current (sets status, generates event) and then consider whether asserts should fail more aggressively. Right now a lot of our code (like our UT code) has a lot of very similar check logic that makes it verbose and difficult to read and impossible to change the error-handling logic globally. |
I think the point is to macro-ize the behavior associated with the parameter checking, so it can be can be handled consistently but also tuned to the appropriate mode of operation. If removal of a parameter check violates some coding standard, then use a macro implementation that preserves the check. This would be precisely why we would not use the C library "assert.h" for this but rather have a local definition so more control can be retained. (apologies if the statement regarding "similar behavior" to C library implied compiling it out, I really meant that it it can be tuned as to whether it creates an abort/backtrace or not depending on whether you are in a debug build or not) Note that parameter checking is hardly conclusive, particularly for pointers. The code usually checks for Integer ranges can be validated, which is why I usually try to prefer this for APIs rather than free-range pointers. Still - the fundamental issue remains - if some environmental degradation/radiation/etc has caused an app to pass a bad value to an API, why do we think the app is healthy enough to respond to and handle this error response in a sane manner? It is running on the same processor and shares the same memory. What are the chances that the degradation only affected one variable? What if it hit program space and not just the stack/data space item? Has there ever been any case where an app actually is able to "recover" itself from such a condition as part of its own runtime (i.e. by getting an parameter validation error from a library and cleaning itself up?) The whole premise seems very weak. A processor restart is required in order to bring the system back to operation, I just don't see a way around this. |
For consideration. From a general safety and robustness point of view, a user app, or command/telemetry interface should not be able to take the core (cFE, OSAL, or PSP) out or otherwise cause a reset. I would break interface parameter checking into broad three categories:
Notes:
|
I am assuming this is true of rad-hard systems. And I'm more inclined to assume SEU's are outside the scope of cFS to handle. As an aside, when I was at JPL we were discussing how the COTS CPU/memory/microcontrollers fared in rad testing, and they said they did well--they are thinking partly because of the small die sizes and low power, so a hit doesn't cause a cascade (ich bin ein software geek, so I know little of the physics). I'm sure they don't have CRC/EDAC beyond what COTS hardware already includes. 'Of course, this is a high-risk experiment but will be interesting to see how it does. Flights will be brief (~90s) but the captured data will sit in the copter for a Sol so it can charge enough to transmit the data to the rover. |
I completely agree that SEU's/environmental degradation should be outside the scope of CFS. Correction for memory errors has to exist at the hardware level within the memory controller so it can be synchronized with software access. Any spaceflight qualified hardware must have ECC/EDAC built into its memory subsystem. The only time CFS plays a part is for apps like a memory scrubber - which IMO the primary purpose of this would be to ensure that the memory is accessed enough to ensure that a correctable error doesn't further degrade into an non-correctable error (many ECC/EDAC systems only perform correction when a memory location is accessed, so having a background task slowly "read" all memory can be beneficial). But the memory scrubber wouldn't actually do the correction - it cannot, and it cannot synchronize itself with other threads that might be running. In short, if a degradation causes the error to be observed at the software level where CFE/apps are running, it's too late -- the system is too far degraded to recover at this point. We certainly cannot rely on the same software to fix itself. |
We need a path forward on API asserts. |
Discussed again, separate flag to be able to compile out or use (or action can be redefined) and not dependent on build type. The ASSERTs are only used for confirming code that should never be broken (null pointers, etc). Things that should never happen. Alternative - return error codes like we do now... and discussion was they are not exclusive, do both (error return and configurable ASSERT). Where to put re-definable mission ASSERT logic? Thinking of "module" concept where it is built into the framework. |
CCB 2020-08-05: See comment above for discussion notes and minutes |
We've now got |
Is your feature request related to a problem? Please describe.
In unit tests, we use assert statements.
A lot of our code performs a number of safety checks (pointer is not null, index is less than array size, etc.) I suggest we develop something similar to the UT assert code, and use as much as possible in the main-line code base in lieu of our current safety checks as it will improve readability and support tooling of code analysis.
Describe the solution you'd like
Assert library should generate events, return values, and will need to be able to unwind such things as semaphores.
Describe alternatives you've considered
Implementing this will introduce risk that we miss a safety check or otherwise mis-translate the check into an assert. Need to develop the library and slowly migrate current safety checks to the library.
Additional context
Add any other context about the feature request here.
Requester Info
Christopher.D.Knight@nasa.gov
The text was updated successfully, but these errors were encountered: