Vehicle crash due to estimator GPS failure #771
Without more context and the preflight log it is hard to say for sure, but you definitely have magnetic interference on the ground disturbing the estimated heading. I would first try to remove this. The second thing you can try is to update ECL to master, since there is now a yaw recovery estimator running in the background, specifically designed to handle failure cases like this.
Thanks @CarlOlsson for having a look and pointing out the magnetic interference. That is potentially an issue. However, EKF2 already has a feature that resets the heading to the current mag measurement once the vehicle reaches a certain altitude (1.5 m above ground): https://github.com/PX4/ecl/blob/dec1ed0aeb395a885b60b1d61cfb57bd932c7d34/EKF/control.cpp#L1451 Additionally, in the previous crash we flew for several minutes (with good position hold) and crashed only as we were flying back to land. For that crash, #766 would not have helped, as that is only used in the first 30 s of flight. So I'm not convinced that a) mag interference is to blame for the crashes, and b) the new failure recovery feature would have prevented the crashes.
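For context, here is a minimal sketch of the kind of altitude-gated heading reset being referred to. All names are placeholders; this is not the actual ecl code, just an illustration of the gate described above.

```cpp
// Placeholder names throughout; this is not the real ecl/EKF implementation.
struct YawResetState {
    bool done = false;
};

// Returns true when the heading should be re-aligned to the current
// magnetometer measurement: airborne, not yet reset, and high enough
// above ground that ground-level magnetic interference no longer matters.
bool shouldResetMagHeading(const YawResetState &state, bool in_air, float hagl_m)
{
    constexpr float kResetAltitudeM = 1.5f; // height above ground threshold
    return in_air && !state.done && (hagl_m > kResetAltitudeM);
}
```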
We just had another crash with this drone. I am not sure if it is related at all. This time the altitude estimate was the problem. The setpoint and estimated altitude were 4 m above home, but baro and GPS measurements, as well as onboard video, show that the vehicle descended back to the ground. Five seconds after hitting the ground, EKF2 says: https://logs.px4.io/plot_app?log=fa9c406a-e45f-4327-92fe-209db667d7bf
We are using a v4-based board (identical schematics, different board form factor).
Basically, what we're seeing is that whenever the estimator fails, the GPS/baro measurements would have been fine. So I would like to limit EKF2's ability to "mess things up" as much as possible.
In both crashes, I do see that the logged covariance values are strange. If I'm not mistaken, such covariance issues should not come up even if the measurements are bad; rather, they point to numerical issues in the estimator.
@ndepal if you can live without bias estimations, you can disable them in the EKF_AID_MASK. But we should really figure out what is happening here. If you can get us a log-from-boot with SDLOG_PROFILE having replay enabled, and showing the issue, it makes it a ton easier to debug. We've seen intermittent issues popping up from the delta-velocity bias estimation for a long time now, so I really do think this is a real problem and not something specific to your setup.
Thanks @jkflying
Looking at what happens when a covariance issue is detected, the columns and rows of the affected states are zeroed out. But it looks like the diagonal element is set to zero as well. Is that how it's supposed to be? Should it not be reset to some reasonably large value instead?
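For illustration, a minimal sketch of the difference being asked about: zeroing the rows and columns of an unhealthy state, but re-seeding its diagonal instead of leaving it at zero. The array size and variance value are arbitrary, not the actual ecl logic.

```cpp
#include <cstddef>

constexpr size_t kNumStates = 24;

// Zero the row and column of an unhealthy state, then re-seed its
// diagonal with a large variance so the filter can re-learn the state
// instead of treating it as perfectly known (variance == 0).
void resetStateCovariance(float P[kNumStates][kNumStates], size_t state,
                          float reset_variance)
{
    for (size_t i = 0; i < kNumStates; i++) {
        P[state][i] = 0.0f;   // zero the row (removes cross-correlations)
        P[i][state] = 0.0f;   // zero the column
    }
    P[state][state] = reset_variance; // e.g. a conservative, large value
}
```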
We're going to try that. How do I decide that I "can live without bias estimations"?
This logfile has extremely high magnetic interference
I meant before the logfile starts but after the vehicle was powered.
That is very interesting! I have never seen that before, I guess it is related to the very long (<20h, I guess static) preflight time.
A normal logfile with a low logging rate would be better than nothing; then at least we can see when and how the covariances collapse.
That also looks like a bug to me. FYI @priseborough
@ndepal I will have a look today in the afternoon. I think these cases are quite interesting because not many people do this kind of testing.
It is important to note that from 75704 on, the drone is skidding along the ground. The crazy parts of the magnetic field norm are after the impact. Here is a good log of the same vehicle in the same environment: We certainly do have a visible change in magnetic field between on-ground and in-air of about 0.1 G. I'm not sure if that is big enough to be worrying.
Very likely. We moved to v1.10.0 in December. Before that, we were based on v1.7.4 for about 2 years. With that version we regularly flew systems that had been turned on for weeks, even months. We had a few incidents that were related to the estimator, but we seem to be having a lot more problems since updating.
(Y)
As long as the vehicle is not moving around in the field with the disturbances, I think it should be fine. But we have seen problems where users boot up the vehicle, e.g. inside a car, and then move out and place it on the ground. Even if the disturbances are gone at takeoff, they have already confused the delta vel biases etc.
We are actually in exactly the same situation. We have not experienced any problems yet, but will make sure to do some testing with a long time between boot and takeoff.
I did the following experiment in HITL
Results of the experiment:
I did another experiment using HITL simulations. Log:
In the log from my previous comment, it can be seen that while the drone is landed, the z velocity and position covariances are very slowly decreasing, until the moment when they start decreasing drastically and quickly collapse to values that are almost zero. When this happens, nothing is being fused.
The z velocity variance slowly decreases. In the predict covariance step, after some time the predicted z velocity variance P[6] becomes negative, so fixCovarianceErrors constrains it to 0. Since 0 < KHP[6][6], measurements stop being fused, which causes the drone to drift away, and _fault_status reports a numerical error.
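For reference, a minimal sketch of the kind of check described above, where a measurement update is rejected if it would drive the corresponding diagonal of P negative. The indices and names are illustrative, not the exact ecl code.

```cpp
#include <cstddef>

constexpr size_t kNumStates = 24;

// Before applying the covariance update P -= K*H*P, verify that the
// diagonal stays non-negative. If P[i][i] has already collapsed towards
// zero, KHP[i][i] can exceed it; the update is then rejected, nothing is
// fused, and a numerical fault is flagged.
bool fusionIsHealthy(const float P[kNumStates][kNumStates],
                     const float KHP[kNumStates][kNumStates],
                     bool &fault_flag)
{
    for (size_t i = 0; i < kNumStates; i++) {
        if (P[i][i] < KHP[i][i]) {
            fault_flag = true;   // e.g. a _fault_status numerical-error bit
            return false;        // skip the measurement update entirely
        }
    }
    return true;
}
```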
The variance decreases, but I think it should converge to some value in this experiment. Instead it keeps decreasing to 0. There are functions that are used to deal with numerical problems such as having negative variances, but at that point it's too late already. The variances are bad the whole time they are smaller than the value they should have converged to.
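As a 1-D illustration of why the variance should settle at a positive value rather than run to zero (a scalar filter with per-step process noise variance q and measurement noise variance r; this is only an analogy for the full 24-state filter):

```latex
% Scalar Kalman filter: predict P^- = P + q, update P = P^- r / (P^- + r).
% At steady state P = (P + q) r / (P + q + r), i.e. P^2 + qP - qr = 0, so
\[
P_\infty = \frac{-q + \sqrt{q^2 + 4qr}}{2} \;\approx\; \sqrt{qr}
\quad \text{for } q \ll r .
\]
% With any non-zero process noise the variance converges to a strictly
% positive value; a variance that keeps shrinking towards zero points to a
% numerical problem rather than genuine information gain.
```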
Something else worth noting is a time slip present when the logging starts. Time slip is calculated as _integrated_time_us += imu_dt; this indicates that the ekf2_main update is running slow and/or the IMU driver is not calculating the delta time correctly or is dropping data. If this slip is occurring as a series of small 4 msec steps, then it would indicate the ekf2 thread not keeping up.
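For clarity, a minimal sketch of how such a time-slip figure can be computed from the quoted accumulator. The surrounding structure and variable names are illustrative, not the exact ekf2_main code.

```cpp
#include <cstdint>

// Accumulate the IMU-reported delta times and compare the total against
// wall-clock time since the filter started. A growing difference ("time
// slip") means IMU data is being dropped, delta times are wrong, or the
// ekf2 update thread is not keeping up.
struct TimeSlipMonitor {
    uint64_t start_time_us{0};
    uint64_t integrated_time_us{0};

    void onImuSample(uint64_t now_us, uint64_t imu_dt_us)
    {
        if (start_time_us == 0) {
            start_time_us = now_us;
        }
        integrated_time_us += imu_dt_us;
    }

    // Positive slip: less IMU time integrated than wall-clock time elapsed.
    int64_t timeSlipUs(uint64_t now_us) const
    {
        return static_cast<int64_t>(now_us - start_time_us) -
               static_cast<int64_t>(integrated_time_us);
    }
};
```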
@ndepal Can you please provide a git commit hash for the version of ecl that was running for this flight?
@dusan19 Can you please provide a link to a compatible PX4/Firmware version for replay of your logs and also a link to the version of ecl you are using? Both are required for us to replay your logs using the ekf2 replay facility, determine affected versions, and devise a fix.
@dusan19 I did a quick replay check on your log using upstream master and it has too many dropouts to be used for that purpose.
@priseborough Thank you for your input. We are based on PX4 v1.10.0, which uses ECL version https://github.com/PX4/ecl/tree/dec1ed0aeb395a885b60b1d61cfb57bd932c7d34. Our branch of ECL has two commits on top that implement getter functions for some internal data, but do not change the functionality of the code. This is valid for both mine and @dusan19's logs.
@priseborough I really like the idea of switching to #762 for code generation - it's possible that the more aggressive expression simplification solves some numerical issues by cancelling terms, so we might avoid this entire issue. Also, I don't see an issue with having a hard lower limit on the vehicle position accuracy of 1 mm - that's just adding some prior knowledge about our positioning systems - although I think it would be better to increase the process noise to achieve this rather than have a hard clamp. The process noise should actually be adjusted to something more like ~3 cm with a good RTK setup left for a long time. But a hard clamp as a safety mechanism is still good. @ndepal About the negative off-diagonals, I don't have intuition here, sorry. That sounds like a promising way of detecting numerical issues, though.
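To make the two options concrete, here is a hedged sketch of what they could look like. The state indices, parameter values, and the P layout are placeholders, not the actual ecl implementation.

```cpp
#include <algorithm>

// Option 1: encode the prior knowledge as process noise, adding a small
// amount of position variance every prediction step so that fusion can
// never drive it arbitrarily close to zero.
float positionProcessNoiseVar(float dt_s, float pos_noise /* tuning value, m/sqrt(s) */)
{
    return pos_noise * pos_noise * dt_s;
}

// Option 2: a hard clamp as a last-line safety net, e.g. never let the
// position variance drop below (1 mm)^2 = 1e-6 m^2.
void clampPositionVariances(float P[24][24])
{
    constexpr float kMinPosVar = 1e-6f; // (0.001 m)^2
    for (int i = 7; i <= 9; i++) {      // assumed position state indices
        P[i][i] = std::max(P[i][i], kMinPosVar);
    }
}
```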
I found that too (for the accel bias) in this log, but we got numerical errors reported in cases when we had accel bias estimation active (never this quickly after boot); unfortunately we don't have logs from boot for those.
I agree that dt changes the time at which the collapsing occurs; I observed that before as well, and the same happens when changing the heading measurement noise parameter. But when you say that your changes fix the problem, do you mean within the duration of this log? Can you be certain that the collapsing is not just shifted by a few minutes or a few hours? This log is way too short to claim that the problem is gone just because it didn't happen in the first few minutes after boot. In some cases we got the failures hours after boot; it seems to depend on the accel and mag measurements while the drone is static.
With the new Invensense IMU drivers in PX4/Firmware, the jitter in the integrated IMU data is drastically reduced. In ecl/EKF, if we schedule actually running the filter (ImuDownSampler resets) as a multiple of IMU samples (computed from the desired FILTER_UPDATE_PERIOD_MS), we can remove most of the jitter.
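For illustration, a rough sketch of aligning the filter update with whole IMU samples instead of a free-running timer. The names are made up; this is not the actual ImuDownSampler code.

```cpp
#include <algorithm>
#include <cmath>

// Choose the number of raw IMU samples per EKF update as the nearest whole
// multiple of the IMU period, so the downsampler resets on sample
// boundaries and the effective filter dt has almost no jitter.
struct DownsampleSchedule {
    int samples_per_update;   // how many IMU samples to accumulate
    float effective_dt_s;     // the dt the covariance prediction should use
};

DownsampleSchedule alignFilterUpdate(float imu_dt_s /* measured average */,
                                     float target_update_period_s /* e.g. 0.010 */)
{
    int n = static_cast<int>(std::round(target_update_period_s / imu_dt_s));
    n = std::max(n, 1);
    return DownsampleSchedule{n, n * imu_dt_s};
}
```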
If we could solve the problem with the python-generated equations that you referenced, or by using double math, isn't it better to use the actual dt as it is now rather than setting it to FILTER_UPDATE_PERIOD_S? Also, could we get away with an 8 ms update period?
@dusan19 do you have any comparison of the cost of that change on an STM32F4 (no double precision floating point hardware)? For reference, here's a crude benchmark to give you an idea of what it looks like on a pixracer (STM32F427). It's the trigonometry where you really need to be careful. Output from `nsh> tests microbench_math`:
INFO [tests] RUNNING TEST: time_single_precision_float
float add: 1000 events, 632us elapsed, 0.63us avg, min 0us max 1us 0.482us rms
float sub: 1000 events, 647us elapsed, 0.64us avg, min 1us max 1us 0.478us rms
float mul: 1000 events, 641us elapsed, 0.64us avg, min 0us max 1us 0.479us rms
float div: 1000 events, 701us elapsed, 0.70us avg, min 1us max 1us 0.458us rms
float sqrt: 1000 events, 740us elapsed, 0.74us avg, min 1us max 1us 0.438us rms
INFO [tests] TEST PASSED: time_single_precision_float
INFO [tests] RUNNING TEST: time_single_precision_float_trig
sinf(): 1000 events, 2111us elapsed, 2.11us avg, min 2us max 3us 0.314us rms
cosf(): 1000 events, 1987us elapsed, 1.99us avg, min 1us max 2us 0.113us rms
tanf(): 1000 events, 2769us elapsed, 2.77us avg, min 2us max 4us 0.503us rms
acosf(): 1000 events, 2711us elapsed, 2.71us avg, min 2us max 3us 0.453us rms
asinf(): 1000 events, 2739us elapsed, 2.74us avg, min 2us max 3us 0.439us rms
atan2f(): 1000 events, 2605us elapsed, 2.61us avg, min 2us max 3us 0.489us rms
INFO [tests] TEST PASSED: time_single_precision_float_trig
INFO [tests] RUNNING TEST: time_double_precision_float
double add: 1000 events, 1396us elapsed, 1.40us avg, min 1us max 2us 0.489us rms
double sub: 1000 events, 1428us elapsed, 1.43us avg, min 1us max 2us 0.495us rms
double mul: 1000 events, 1128us elapsed, 1.13us avg, min 1us max 2us 0.334us rms
double div: 1000 events, 4461us elapsed, 4.46us avg, min 4us max 5us 0.498us rms
double sqrt: 1000 events, 5589us elapsed, 5.59us avg, min 4us max 8us 1.421us rms
INFO [tests] TEST PASSED: time_double_precision_float
INFO [tests] RUNNING TEST: time_double_precision_float_trig
sin(): 1000 events, 16901us elapsed, 16.90us avg, min 16us max 17us 0.298us rms
cos(): 1000 events, 18307us elapsed, 18.31us avg, min 18us max 19us 0.461us rms
tan(): 1000 events, 25796us elapsed, 25.80us avg, min 23us max 33us 3.676us rms
acos(): 1000 events, 4380us elapsed, 4.38us avg, min 4us max 5us 0.485us rms
asin(): 1000 events, 4554us elapsed, 4.55us avg, min 4us max 5us 0.497us rms
atan2(): 1000 events, 26604us elapsed, 26.60us avg, min 26us max 27us 0.489us rms
sqrt(): 1000 events, 4774us elapsed, 4.77us avg, min 4us max 8us 1.226us rms
INFO [tests] TEST PASSED: time_double_precision_float_trig
Yeah, I realized it might be heavy, but I just wanted to check if conceptually that solves the problem, and to check whether the problem is in the prediction at all. I might be too picky, but I wish we had a longer log, a few hours long, to see the changes on a bigger time scale. I'll do the load comparisons now and post them.
Makes sense. I'm personally in favor of pursuing a combination of all of these (targeted use of double precision, moving to the new scipy derivation, making the FILTER_UPDATE_PERIOD configurable, AND setting that update so it's perfectly aligned with your actual IMU).
After discussing the options with other ecl/EKF developers, the preferred approach is to incorporate the time step and minimum variance changes for the upcoming 1.11 stable release, and to submit the changes to the autocode as a separate PR for the following 1.11.1 update to give sufficient time for 'soak' testing. The reason for this is that although the subtle numerical differences in the output compared to the matlab autocode are better in this scenario, this may not be universal, and more extensive testing is required to be confident about this.
I ran some CPU load analysis for the case when double math is used for computing nextP. On my pixracer the CPU load for the att_pos_ctrl work queue (where the EKF is running) increased by 15%, so that approach is not feasible (at least not with this implementation). Double addition takes ~6 times longer than float addition on the pixracer, and some CPU is also used on double-float conversions. Then I wanted to find where the numerical differences between float and double math are the biggest, in order to try to move only a part of the computation to double math, so the load is not as heavy on the CPU. The goal of the following analysis is to show which fields of the covariance matrix have to be computed using double math and for which float math is sufficient, i.e. which fields are obtained by adding values that differ by orders of magnitude. Inspired by #762 (the matrices shown in the PR description) I did the following. Next step: I set the whole nextP computation to double math. Next step: I did only the quaternion fields using double math, nextP[i][j] = (float)dnextP[i][j] for 0 <= i,j <= 3, and kept the rest using float math as it is normally done.
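For reference, a rough sketch of the partial-double approach described above, computing only the quaternion block of nextP in double precision and casting back. The surrounding structure is illustrative, not the actual generated covariance prediction code.

```cpp
#include <cstddef>

constexpr size_t kNumStates = 24;

// Copy the quaternion sub-block (states 0..3) of the predicted covariance
// from a double-precision computation back into the float matrix; all other
// entries keep using the existing single-precision expressions.
// dnextP would be filled by a double-precision version of the generated
// covariance prediction equations (not shown here).
void mergeQuaternionBlock(float nextP[kNumStates][kNumStates],
                          const double dnextP[kNumStates][kNumStates])
{
    for (size_t i = 0; i <= 3; i++) {
        for (size_t j = 0; j <= 3; j++) {
            nextP[i][j] = static_cast<float>(dnextP[i][j]);
        }
    }
}
```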
When I only impose a lower limit on P[1][1] and P[2][2] in the fix covariance function and make no other "fixes", there is no collapsing.
@dusan19 Yes, that could be worthwhile, but relating those values to the minimums required for operation is harder than for velocity and position states; e.g. 1E-6 equates to 1 mm/sec uncertainty on a velocity state. I've submitted a minimum fix #795 to get something into the upcoming 1.11.0 release, but your suggestion should be looked at for changes (including using the python autocode version) post 1.11.0.
@dusan19 Could you test the CPU load implications when running this on an STM32F7? With its double precision FPU you can take a different approach.
Unfortunately I only have a pixracer (STM32F4), but could share the code if someone else wants to run it on the STM32F7.
I managed to create a 2 h long log from boot where the estimator crashes every ~15 min. It might be useful for testing any of the fixes since it gives better insight into what's going on compared to the 5 min log. Unfortunately it's still based on 1.10, so master replay might not work.
I tested the fixes proposed in #795 with the long log from the previous comment. Original log replayed: Replayed log with the fixes from #795:
There are 2 changes in #795
Effectively, both changes together look the same as having only the variance limit change. Only dt changes: Only variance limit changes:
With these changes: https://github.com/dusan19/ecl/tree/cov_num_fixes_1_10
@dusan19 Have you tried using the python derivation? Would be curious how that affects it.
@RomanBapst I am replaying that at the moment; I will post the results in that PR.
It's worth mentioning that I ran the changes in https://github.com/dusan19/ecl/tree/cov_num_fixes_1_10
@dusan19 It would really help if you could install master on your hardware for the ground testing.
I understand. I could create such a setup where I could wait for the estimator crash while the drone is landed, but the problem is that I don't know how to reproduce it. It seems to depend on various initial conditions. Once it starts failing, it fails quite often, and a reboot doesn't fix it, which is good for debugging. Unfortunately I don't know how to bring it to that state. Would it be possible for you to create a master-replay-compatible log using the sensor measurements from the 1.10 log I sent? I guess the problem is that some uORB message definitions have changed in the meantime?
The messages have changed sufficiently that that is not possible without significant work. If you would like to ensure your problem is solved on 1.11, then I really do need that log.
As mentioned above, to fix the numerical stability issue, I added the minimal theoretical increment constraint on the diagonal elements and additionally switched the quaternion covariance prediction to use double precision arithmetic.
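One way to read the "minimal theoretical increment" on the diagonal (my interpretation, not necessarily the exact constraint used in the fix): since the propagated term F P F^T is positive semidefinite, each predicted diagonal element can never legitimately fall below the process noise added in that step, so that value can be enforced as a numerical floor.

```latex
\[
P^{+} = F P F^{\top} + Q, \qquad
\left(F P F^{\top}\right)_{ii} \ge 0
\;\;\Longrightarrow\;\;
P^{+}_{ii} \ge Q_{ii} .
\]
% In exact arithmetic the bound always holds, so clamping P^+_{ii} to at
% least Q_{ii} only ever corrects round-off error, never real information.
```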
Agreed
We just had another vehicle crash. EKF2 reported a GPS glitch and a delta velocity bias estimate issue.
The firmware is based on PX4 v1.10.0.
We are using the Ublox F9P GPS module in the vehicle and on the ground.
The vehicle took off in mission mode and shortly afterwards started drifting, until it crashed into a tree.
At the point of takeoff, the vehicle had been turned on for ~6 hours.
The symptoms are very similar to the crash mentioned in PX4/PX4-Autopilot#14312 (comment), but this time we do have a log, fortunately. This crash did not happen with the same vehicle.
Looking at the log, the `vehicle_gps_position` looks sane, whereas the `vehicle_global_position` and the local position do not. Therefore, I feel that the issue is not with the GPS measurements (i.e. they should not have been rejected) but rather with the estimator that falsely rejected the measurements. As this is not the first incident of this kind, I would really appreciate some assistance in looking into this and preventing it from happening in the future.
The flight log can be found here: https://logs.px4.io/plot_app?log=c231b7b3-ab35-4f48-be8c-0cb0a836cae4