
Vehicle crash due to estimator GPS failure #771

Closed
ndepal opened this issue Mar 9, 2020 · 85 comments

Comments

@ndepal

ndepal commented Mar 9, 2020

We just had another vehicle crash. EKF2 reported a GPS glitch and a delta velocity bias estimate issue.

The firmware is based on PX4 v1.10.0.
We are using the Ublox F9P GPS module in the vehicle and on the ground.

The vehicle took off in mission mode and shortly afterwards started drifting until it crashed into a tree.
At the point of takeoff, the vehicle had been turned on for ~6 hours.

The symptoms are very similar to the crash mentioned in PX4/PX4-Autopilot#14312 (comment), but this time we do have a log, fortunately. This crash did not happen with the same vehicle.

Looking at the log, the vehicle_gps_position looks sane, whereas the vehicle_global_position and the local position do not. Therefore, I feel that the issue is not with the GPS measurements (i.e. they should not have been rejected) but rather with the estimator, which falsely rejected them.

As this is not the first incident of this kind, I would really appreciate some assistance in looking into this and preventing it from happening in the future.

The flight log can be found here: https://logs.px4.io/plot_app?log=c231b7b3-ab35-4f48-be8c-0cb0a836cae4

@CarlOlsson
Contributor

Without more context and the preflight log it is hard to say for sure. But you definitely have magnetic interference on the ground disturbing the estimated heading.
image

I would first try to remove this. The second thing you can try is updating ECL to master, since there is now a yaw recovery estimator running in the background, specifically designed to handle failure cases like this:
#766

@LorenzMeier
Member

@ndepal Given your mag interference issues I will close this issue since this is not a PX4 performance issue at the core - it is a specific use case (lots of metal around the vehicle) that requires a new feature (as below):

#766

@ndepal
Author

ndepal commented Mar 9, 2020

Thanks @CarlOlsson for having a look and pointing out the magnetic interference. That is potentially an issue.

However, EKF2 already has a feature that resets the heading to the current mag measurement once the vehicle reaches a certain altitude (1.5 m above ground): https://github.com/PX4/ecl/blob/dec1ed0aeb395a885b60b1d61cfb57bd932c7d34/EKF/control.cpp#L1451
(this code looks quite different in master; the above links to what PX4 v1.10.0 uses)
This reset happens at time 23100 (4.5 s into the log), as seen from the estimator_status/control_mode_flags and from the fact that the heading innovations become small at that time. So, assuming that in air we have reasonable mag measurements (and I do believe that; we normally have good position hold performance), the on-ground mag issues should be a thing of the past at the time of the GPS rejection.
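For reference, here is a minimal sketch (illustrative names, not the actual ecl code) of the reset behaviour described above: once the vehicle is in the air, has climbed past roughly 1.5 m above ground and has not yet done an in-air heading reset, the heading is reset to the current magnetometer measurement.

struct HeadingResetConditions {
    bool in_air{false};                 // vehicle has taken off
    bool mag_yaw_reset_done{false};     // in-air heading reset already performed
    float height_above_ground_m{0.0f};  // terrain-relative height estimate
};

// Illustrative only: mirrors the behaviour described above, not ecl's implementation.
bool should_reset_heading_to_mag(const HeadingResetConditions &c)
{
    constexpr float kResetHeightM = 1.5f;  // threshold quoted in the discussion
    return c.in_air && !c.mag_yaw_reset_done && (c.height_above_ground_m > kResetHeightM);
}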

Additionally, in the previous crash we flew for several minutes (with good position hold) and crashed only as we were flying back to land. For that crash, #766 would not have helped, as that is only used in the first 30s of flight.

So I'm not convinced that a) mag interference is to be blamed for the crashes, and b) the new failure recovery feature would have prevented the crashes.

@ndepal ndepal reopened this Mar 9, 2020
@CarlOlsson
Contributor

So, assuming that in air we have reasonable mag measurements (and I do believe that; we normally have good position hold performance), the on-ground mag issues should be a thing of the past at the time of the GPS rejection.

Yes, but magnetic interference does not only affect the heading estimate. Again, without knowing more of the background or at least having the preflight logfile it is hard to say, but it looks like your delta velocity bias states had diverged before takeoff.

image

The reason for this can be the magnetic interference, or e.g. a high temperature gradient of the IMU (since the IMU is not temperature calibrated and it looks like you are using a standard v4 board).

This most likely leads to the bad velocity estimation after takeoff and to the innovation checks failing, which triggered the emergency landing:
image

@ndepal
Author

ndepal commented Mar 9, 2020

We just had another crash with this drone. Not sure if related at all. This time the altitude estimate was the problem. The setpoint and estimated altitude were 4 m above home, but baro and GPS measurements, as well as onboard video, show that the vehicle descended back to the ground. Five seconds after hitting the ground, EKF2 says [ecl/EKF] EKF baro hgt timeout - reset to baro.

https://logs.px4.io/plot_app?log=fa9c406a-e45f-4327-92fe-209db667d7bf

@ndepal
Author

ndepal commented Mar 9, 2020

or e.g. a high temperature gradient of the IMU (since the IMU is not temperature calibrated and it looks like you are using a standard v4 board).

We are using a v4 based board (identical schematics, different board form factor).
I don't see any drastic temperature changes in the log.

@ndepal
Author

ndepal commented Mar 9, 2020

Basically, what we're seeing is that whenever the estimator fails, the GPS/baro measurements would have been fine. So I would like to limit EKF2's ability to "mess things up" as much as possible.

@ndepal
Author

ndepal commented Mar 9, 2020

In both crashes, estimator_status/filter_fault_flags reports mag numerical errors (even before takeoff). The only place where this can happen without an accompanying print statement is https://github.com/PX4/ecl/blob/dec1ed0aeb395a885b60b1d61cfb57bd932c7d34/EKF/mag_fusion.cpp#L773
which points to an unhealthy covariance matrix.

I do see that the logged covariance values are strange in that log. E.g. estimator_status/covariance.06 completely collapses early on in the log of #771 (comment).

If I'm not mistaken, such covariance issues should not come up even if the measurements are bad. Rather, it points to numerical issues in the estimator.
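For context, a generic sanity check (illustrative, not taken from ecl) that makes "unhealthy covariance matrix" concrete: every diagonal term of a valid covariance matrix must be non-negative, and every off-diagonal term is bounded in magnitude by the geometric mean of the corresponding diagonal terms, so a violation of either is a cheap indicator of the kind of numerical corruption suspected here.

#include <cmath>

// Illustrative covariance health check, not ecl code: returns false if any
// diagonal element is negative or any off-diagonal element violates
// |P[i][j]| <= sqrt(P[i][i] * P[j][j]) (up to a small tolerance).
template <int N>
bool covariance_is_healthy(const float (&P)[N][N])
{
    for (int i = 0; i < N; i++) {
        if (P[i][i] < 0.0f) {
            return false;
        }
        for (int j = 0; j < i; j++) {
            const float bound = std::sqrt(P[i][i] * P[j][j]) * 1.000001f;
            if (std::fabs(P[i][j]) > bound || std::fabs(P[j][i]) > bound) {
                return false;
            }
        }
    }
    return true;
}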

@jkflying
Contributor

jkflying commented Mar 9, 2020

@ndepal if you can live without bias estimations, you can disable them in the EKF_AID_MASK. But we should really figure out what is happening here. If you can get us a log from boot, with the replay option enabled in SDLOG_PROFILE, that shows the issue, it makes it a ton easier to debug.

We've seen intermittent issues popping up from the delta-velocity bias estimation for a long time now, so I really do think this is a real problem and not something specific to your setup.

@ndepal
Author

ndepal commented Mar 9, 2020

Thanks @jkflying
Getting a log since boot could be tricky since it's not very reproducible (except for today...) and since our system is on for a long time that log would be huge. I'll see what we can do.

@ndepal
Author

ndepal commented Mar 9, 2020

Looking at what happens when a covariance issue is detected, the columns and rows of the affected states are zeroed out. But it looks like the diagonal element is set to zero as well.

https://github.com/PX4/ecl/blob/dec1ed0aeb395a885b60b1d61cfb57bd932c7d34/EKF/mag_fusion.cpp#L766

Is that how it's supposed to be? Should it not be reset to some reasonably large value?
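A minimal sketch of the alternative being suggested here (illustrative, not the ecl implementation): when a state's covariance is found to be unhealthy, zero its row and column to remove the correlations, but re-seed the diagonal with a reasonably large variance instead of leaving it at zero, so the state can recover by fusing new measurements.

// Illustrative sketch of the suggested handling, not ecl code.
template <int N>
void reset_unhealthy_state_covariance(float (&P)[N][N], int state, float reset_variance)
{
    for (int i = 0; i < N; i++) {
        P[state][i] = 0.0f;  // zero the row (drop correlations with other states)
        P[i][state] = 0.0f;  // zero the column
    }
    P[state][state] = reset_variance;  // conservative prior instead of 0
}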

@ndepal
Author

ndepal commented Mar 9, 2020

@ndepal if you can live without bias estimations, you can disable them in the EKF_AID_MASK.

We're going to try that. How do I decide that I "can live without bias estimations"?
I guess I'd prefer consistency over slightly improved performance (if/when biases are estimated correctly), but I'm worried about GPS/baro measurements being rejected because of a real IMU bias that I'm now no longer estimating.

@CarlOlsson
Contributor

We just had another crash with this drone. Not sure if related at all. This time the altitude estimate was the problem.

This logfile has extremely high magnetic interference
image

I don't see any drastic temperature changes in the log.

I meant before the logfile starts but after the vehicle was powered.

I do see that the logged covariance values are strange in that log. E.g. estimator_status/covariance.06 completely collapses early on in the log of #771 (comment).

That is very interesting! I have never seen that before, I guess it is related to the very long (<20h, I guess static) preflight time.

Getting a log since boot could be tricky since it's not very reproducible (except for today...) and since our system is on for a long time that log would be huge. I'll see what we can do.

A normal logfile with a low logging rate would be better than nothing; then at least we can see when and how the covariances collapse.

Looking at what happens when a covariance issue is detected, the columns and rows of the affected states are zeroed out. But it looks like the diagonal element is set to zero as well.

That also looks like a bug to me. FYI @priseborough

@RomanBapst
Contributor

@ndepal I will have a look today in the afternoon. I think these cases are quite interesting because not many people do this kind of testing.

@ndepal
Author

ndepal commented Mar 10, 2020

This logfile has extremely high magnetic interference

It is important to note that from 75704 on, the drone is skidding along the ground. The crazy parts of the magnetic field norm are after the impact.

log_2020-03-09-16-19-10

Here is a good log of the same vehicle in the same environment:

log_2020-03-09-02-36-36

We certainly do have a visible change in magnetic field between on-ground and in-air of about 0.1 G. I'm not sure if that is big enough to be worrying.

That is very interesting! I have never seen that before, I guess it is related to the very long (<20h, I guess static) preflight time.

Very likely. We moved to v1.10.0 in December. Before that, we were based on v1.7.4 for about 2 years. With that version we regularly flew systems that had been turned on for weeks, even months. We had a few incidents that were related to the estimator, but we seem to be having a lot more problems since updating.

@CarlOlsson
Contributor

It is important to note that from 75704 on, the drone is skidding along the ground. The crazy parts of the magnetic field norm are after the impact.

(Y)

We certainly do have a visible change in magnetic field between on-ground and in-air of about 0.1 G. I'm not sure if that is big enough to be worrying.

As long as the vehicle is not moving around in the field with the disturbances, I think it should be fine. But we have seen problems where users boot up the vehicle e.g. inside a car and then move it out and place it on the ground. Even if the disturbances are gone at takeoff, they have already confused the delta velocity biases etc.

Very likely. We moved to v1.10.0 in December. Before that, we were based on v1.7.4 for about 2 years.

We are actually in exactly the same situation. We have not experienced any problems yet, but we will make sure to do some testing with a long time between boot and takeoff.

@dusan19
Contributor

dusan19 commented Mar 10, 2020

I did the following experiment in HITL

  • I ran automated repeated missions:

    1. Takeoff and fly to a specified location
    2. Go back to the mission start location and land (precision landing)
    3. Wait 3 seconds
    4. Repeat 1)
  • I ran this until the drone got an estimator failure and performed an emergency landing

  • The flight controller is never rebooted

  • The flight controller was on for a long time, but the simulator was started just before this experiment.

  • I have the logs of all the flights

  • There are a few logged print statements added to indicate when the mission is done or updated and similar stuff.

Results of the experiment:

  1. I got the first estimator failure after less than 20 flights:
    https://review.px4.io/plot_app?log=75b8bf4a-495a-4621-abaf-2989e8cd57c6
    Critical navigation failure was reported, along with some innovation check failures. This is not necessarily the same problem as the one in this issue. There are no numerical problems reported and the covariances are not clearly weird.

  2. I started running repeated missions again after previous fallback landing. After 17 flights the estimator fails again:
    https://review.px4.io/plot_app?log=fe2e060b-9eed-4981-a7dd-6bf01b694a68
    Here, additionally, a GPS glitch was reported.

  3. The drone landed far away from home after the previous estimator failure, and had been landed there for another 20 minutes when it started repeatedly reporting estimator failures, numerical errors and height innovation failures (which looks like the same problem as in the second log reported in this issue: Vehicle crash due to estimator GPS failure #771 (comment))
    https://review.px4.io/plot_app?log=d35b2d34-02ae-4b0a-9e24-1c6932f1a551
    I don't have a log of this 20-minute period between the flights; I only started logging once I realized it was crashing. The log is of the drone landed and disarmed.

@dusan19
Contributor

dusan19 commented Mar 10, 2020

Did another experiment using HITL simulations:
I made a very big log from boot and did repeated missions, and twice between the missions left the drone landed for ~20 min.
Twice during flights I got critical navigation failure with innovation_check_flags = 1.
Interestingly, I got filter_fault_flags = 8 (the fusion of the magnetic heading has encountered a numerical error) while the drone had been landed for a longer time (~10 min); this error repeated frequently for more than a minute and finally stopped. During this time innovation_check_flags switched between 0, 4 and 5.

Log:
https://review.px4.io/plot_app?log=e2d7e8f0-d58a-43b0-b8a7-e25dcd124705
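For readers following the logs, a small sketch of how the flag values quoted in this thread decode. The bit positions come from the descriptions used in this issue (value 8 = bit 3 = mag heading fusion numerical error; bit 15 = bad delta velocity bias estimate, as referenced later in the thread); the variable names are illustrative, not copied from estimator_status.msg.

#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t filter_fault_flags = 8;  // example value seen while the drone was landed

    // Bit 3: the fusion of the magnetic heading has encountered a numerical error.
    const bool mag_heading_numerical_error = filter_fault_flags & (1u << 3);
    // Bit 15: bad delta velocity bias estimates.
    const bool bad_delta_velocity_bias = filter_fault_flags & (1u << 15);

    std::printf("mag heading numerical error: %d\n", mag_heading_numerical_error);
    std::printf("bad delta velocity bias:     %d\n", bad_delta_velocity_bias);
    return 0;
}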

@dusan19
Contributor

dusan19 commented Mar 11, 2020

In the log from my previous comment, it can be seen that while the drone is landed, z velocity and position covariances are very slowly decreasing, until the moment when they start decreasing drastically and quickly collapse to values that are almost zero. When this happens nothing is being fused.
This motivated me to make another HITL simulation log, where the drone is only landed the whole time:
The log is more than an hour long, and these numerical errors (estimator_status/filter_fault_flags = 8) happen multiple times in batches (first time after ~30min, which is not really that long of a runtime).
https://review.px4.io/plot_app?log=b39db461-2927-4bc1-9b0b-e5310ecac16c
The covariance collapsing looks like the same problem as @ndepal had in the logs he posted when he got numerical errors.

@dusan19
Contributor

dusan19 commented Mar 11, 2020

The z velocity variance slowly decreases. In the covariance prediction step, after some time the predicted z velocity variance P[6][6] becomes negative, so fixCovarianceErrors() constrains it to 0. Since P[6][6] = 0 < KHP[6][6], measurements stop being fused, which causes the drone to drift away, and _fault_status also reports a numerical error.
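A minimal sketch (illustrative, not the actual ecl code) of the gating logic described above: the measurement update computes P_new[i][i] = P[i][i] - KHP[i][i], so if a predicted variance has already collapsed below the corresponding KHP term, applying the update would make it negative; the fusion is therefore skipped and a numerical-error fault is raised, which is why nothing gets fused once the z velocity variance reaches zero.

#include <array>

// Illustrative sketch of the check described above, not ecl code: before applying
// P -= K*H*P, verify that no diagonal element would go negative; otherwise skip
// the fusion and flag a numerical error (analogous to _fault_status).
template <std::size_t N>
bool measurement_update_is_healthy(const std::array<float, N> &P_diag,
                                   const std::array<float, N> &KHP_diag,
                                   bool &numerical_error_flag)
{
    for (std::size_t i = 0; i < N; i++) {
        if (P_diag[i] < KHP_diag[i]) {    // update would make P[i][i] negative
            numerical_error_flag = true;  // report the fault
            return false;                 // measurement is not fused
        }
    }
    return true;
}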

@dusan19
Contributor

dusan19 commented Mar 11, 2020

The variance decreases, but I think it should converge to some value in this experiment. Instead it keeps decreasing towards 0. There are functions that deal with numerical problems such as negative variances, but at that point it is already too late: the variances are bad the whole time they are smaller than the value they should have converged to.

@priseborough
Collaborator

I think it's due to badly conditioned delta velocity bias estimates and covariances forming during the static period on the ground.

These are delta velocity bias states rescaled to equivalent m/s/s:
Screen Shot 2020-03-12 at 4 46 18 pm

These are the corresponding variances:
Screen Shot 2020-03-12 at 4 47 02 pm

Can you please confirm the git commit hash for the version of ecl you are using?

@priseborough
Collaborator

Something else worth noting is a time slip present when the logging starts, e.g.:

Screen Shot 2020-03-12 at 4 57 07 pm

Time slip is calculated as

_integrated_time_us += imu_dt;                                      // accumulate the IMU delta times processed by the EKF
_last_time_slip_us = (now - _start_time_us) - _integrated_time_us;  // wall-clock elapsed time minus integrated IMU time

This indicates that the ekf2_main update is running slow and/or the IMU driver is not calculating the delta time correctly or dropping data.

If this slip is occurring as a series of small 4 ms steps, then it would indicate that the ekf2 thread is not keeping up.

@priseborough
Collaborator

The filter_fault_flags bit at position 15 is going high, so the EKF was aware of the bad delta velocity bias estimates:

Screen Shot 2020-03-12 at 5 10 57 pm

@priseborough
Collaborator

@ndepal Can you please provide a git commit hash for the version of ecl that was running for this flight?

@priseborough
Collaborator

priseborough commented Mar 12, 2020

@dusan19 Can you please provide a link to a compatible PX4/Firmware version for replay of your logs and also a link to the version of ecl you are using? Both are required for us to replay your logs using the ekf2 replay facility, determine the affected versions and devise a fix.

@priseborough
Collaborator

@dusan19 I did a quick replay check on your log using upstream master and it has too many dropouts to be used for that purpose.

@ndepal
Author

ndepal commented Mar 12, 2020

@priseborough Thank you for your input.

We are based on PX4 v1.10.0, which uses ECL version https://github.com/PX4/ecl/tree/dec1ed0aeb395a885b60b1d61cfb57bd932c7d34. Our branch of ECL has two commits on top that implement getter functions for some internal data, but do not change the functionality of the code.

This is valid for both my logs and @dusan19's logs.

@jkflying
Contributor

jkflying commented Apr 8, 2020

@priseborough I really like the idea of switching to #762 for code generation - it's possible that the more aggressive expression simplification solves some numerical issues by cancelling terms, so we might avoid this entire issue.

Also I don't see an issue with having a hard lower limit on the vehicle position accuracy of 1 mm - that's just adding some prior knowledge about our positioning systems - although I think it would be better to increase the process noise to achieve this rather than have a hard clamp. The process noise should actually be adjusted to be something more like ~3 cm with a good RTK setup left for a long time. But a hard clamp as a safety mechanism is still good.

@ndepal About the negative off-diagonals, I don't have intuition here, sorry. This sounds like a promising way of detecting numerical issues though.

@dusan19
Contributor

dusan19 commented Apr 8, 2020

Replay of the https://logs.px4.io/plot_app?log=135821dc-7839-4eef-a6aa-ec875604ce8e log shows either turning bias estimation back on or applying a lower limit of 1E-8 to the velocity variances in the fixCovarianceErrors() function preventing the behaviour. I am investigating further.

I found that too (for the accel bias) in this log, but we got numerical errors reported in cases when we had accel bias estimation active (never this quickly after boot); unfortunately we don't have logs from boot of those.

@dusan19
Contributor

dusan19 commented Apr 8, 2020

Do all of your results use the variance increment lower limit that I wrote, or not? Just so I can bring my code to the same state as in your tests.

applying a lower limit of 1E-8 to the velocity variances in the fixCovarianceErrors() function preventing the behaviour

When I set a lower limit of 1e-8 on the velocity variances in fixCovarianceErrors(), I still get the same errors; here are the plots with and without my variance increment limit "fix":
vel1e8fix

velvar1e8nofix

@dusan19
Contributor

dusan19 commented Apr 8, 2020

I then reverted it back to 8 and printed the value of dt used in the covariance prediction. This showed that it is regularly dropping to 4.8 msec. I then hardcoded the value of dt used to FILTER_UPDATE_PERIOD_S, but that did not solve the issue, indicating that it is the average value that is more important.
Using FILTER_UPDATE_PERIOD_MS = 10 and hardcoding the dt used by the covariance prediction to
FILTER_UPDATE_PERIOD_S totally eliminated the behaviour.

This testing suggests that reductions in FILTER_UPDATE_PERIOD_MS and jitter in the dt used in the covariance prediction are contributors

I then changed FILTER_UPDATE_PERIOD_MS to 10 and the vertical velocity variance collapse is postponed to right at the end of the log, eg:

I agree that dt changes the time at which the collapsing occurs; I observed that before as well, and the same happens when changing the heading measurement noise parameter. But when you say that your changes fix the problem, do you mean within the duration of this log? Can you be certain that the collapsing is not just shifted by a few minutes or a few hours? This log is way too short to claim that the problem is gone just because it didn't happen in the first few minutes after boot. In some cases we got the failures hours after boot; it seems to be dependent on the accel and mag measurements while the drone is static.

@dusan19
Contributor

dusan19 commented Apr 8, 2020

Without changing anything else, only switching to using doubles in the intermediate computations for nextP, the results are good. So P is still float; doubles are only used in the huge equations for nextP and are then converted back to float when setting P = (float)nextP. The idea was to see how doubles handle these additions and multiplications of values that are several orders of magnitude apart. This is the result for the replayed log:
doublemath
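A minimal sketch of the experiment described above (illustrative; the real nextP expressions are the large auto-generated equations in the ecl covariance prediction, which are only represented here by a placeholder): P stays in single precision, the intermediate terms are accumulated in double precision, and the result is cast back to float at the end.

constexpr int kNumStates = 24;  // ecl EKF state dimension

// Illustrative sketch of "double math only for the nextP computation".
void predict_covariance_double_intermediate(float (&P)[kNumStates][kNumStates],
                                            const double (&Q_diag)[kNumStates],
                                            double dt)
{
    double dnextP[kNumStates][kNumStates];

    // Placeholder for the real auto-generated expressions: sums and products of
    // terms that can differ by several orders of magnitude, accumulated in double.
    for (int i = 0; i < kNumStates; i++) {
        for (int j = 0; j < kNumStates; j++) {
            dnextP[i][j] = static_cast<double>(P[i][j]);
        }
        dnextP[i][i] += Q_diag[i] * dt;  // process noise added on the diagonal
    }

    // Convert back to float only at the end: P = (float)nextP.
    for (int i = 0; i < kNumStates; i++) {
        for (int j = 0; j < kNumStates; j++) {
            P[i][j] = static_cast<float>(dnextP[i][j]);
        }
    }
}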

@dagar
Member

dagar commented Apr 8, 2020

This testing suggests that reductions in FILTER_UPDATE_PERIOD_MS and jitter in the dt used in the covariance prediction are contributors.

With the new Invensense IMU drivers in PX4/Firmware the jitter in the integrated IMU data is drastically reduced.

With ecl/EKF, if we schedule actually running the filter (ImuDownSampler resets) as a multiple of IMU samples (computed from the desired FILTER_UPDATE_PERIOD_MS), we can remove most of the jitter.
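A small sketch of that scheduling idea (the function and variable names are assumptions, not the actual ImuDownSampler interface): derive the number of raw IMU samples per EKF update from the desired FILTER_UPDATE_PERIOD_MS and the measured IMU sample interval, so the filter always integrates a whole number of samples and the covariance-prediction dt is nearly constant.

#include <algorithm>
#include <cmath>

// Illustrative sketch only; names are assumptions, not the real ecl/PX4 API.
struct EkfSchedule {
    int imu_samples_per_update;  // run the filter every N IMU samples
    float effective_dt_s;        // resulting, nearly jitter-free update period
};

EkfSchedule compute_ekf_schedule(float filter_update_period_ms, float imu_sample_interval_us)
{
    const float target_us = filter_update_period_ms * 1000.0f;
    const long n = std::max(1L, std::lround(target_us / imu_sample_interval_us));
    return {static_cast<int>(n), static_cast<float>(n) * imu_sample_interval_us * 1e-6f};
}

For example, with a 4 kHz IMU (250 us samples) and a 10 ms filter update period this gives 40 samples per update and an effective dt of exactly 10 ms.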

@dusan19
Contributor

dusan19 commented Apr 8, 2020

Increasing FILTER_UPDATE_PERIOD_MS to 10msec and changing the covariance prediction to use an average dt of FILTER_UPDATE_PERIOD_S

If we could solve the problem with the Python-generated equations that you referenced, or by using double math, isn't it better to use the actual dt as it is now rather than setting it to FILTER_UPDATE_PERIOD_S? Also, could we get away with an 8 ms update period?
I am not convinced that these changes actually solve the problem rather than just postponing it anyway.

@dagar
Member

dagar commented Apr 8, 2020

@dusan19 do you have any comparison of the cost of that change on stm32f4 (no double precision floating point hardware)? Output from top and the ekf2 perf counters might be sufficient.

For reference here's a crude benchmark to give you an idea of what it looks like on a pixracer (stm32f427). It's the trigonometry where you really need to be careful.
http://ci.px4.io:8080/blue/organizations/jenkins/PX4_misc%2FFirmware-hardware/detail/master/1661/pipeline/1000

nsh> tests microbench_math
INFO  [tests] RUNNING TEST: time_single_precision_float
float add: 1000 events, 632us elapsed, 0.63us avg, min 0us max 1us 0.482us rms
float sub: 1000 events, 647us elapsed, 0.64us avg, min 1us max 1us 0.478us rms
float mul: 1000 events, 641us elapsed, 0.64us avg, min 0us max 1us 0.479us rms
float div: 1000 events, 701us elapsed, 0.70us avg, min 1us max 1us 0.458us rms
float sqrt: 1000 events, 740us elapsed, 0.74us avg, min 1us max 1us 0.438us rms
INFO  [tests] TEST PASSED: time_single_precision_float

INFO  [tests] RUNNING TEST: time_single_precision_float_trig
sinf(): 1000 events, 2111us elapsed, 2.11us avg, min 2us max 3us 0.314us rms
cosf(): 1000 events, 1987us elapsed, 1.99us avg, min 1us max 2us 0.113us rms
tanf(): 1000 events, 2769us elapsed, 2.77us avg, min 2us max 4us 0.503us rms
acosf(): 1000 events, 2711us elapsed, 2.71us avg, min 2us max 3us 0.453us rms
asinf(): 1000 events, 2739us elapsed, 2.74us avg, min 2us max 3us 0.439us rms
atan2f(): 1000 events, 2605us elapsed, 2.61us avg, min 2us max 3us 0.489us rms
INFO  [tests] TEST PASSED: time_single_precision_float_trig

INFO  [tests] RUNNING TEST: time_double_precision_float
double add: 1000 events, 1396us elapsed, 1.40us avg, min 1us max 2us 0.489us rms
double sub: 1000 events, 1428us elapsed, 1.43us avg, min 1us max 2us 0.495us rms
double mul: 1000 events, 1128us elapsed, 1.13us avg, min 1us max 2us 0.334us rms
double div: 1000 events, 4461us elapsed, 4.46us avg, min 4us max 5us 0.498us rms
double sqrt: 1000 events, 5589us elapsed, 5.59us avg, min 4us max 8us 1.421us rms
INFO  [tests] TEST PASSED: time_double_precision_float

INFO  [tests] RUNNING TEST: time_double_precision_float_trig
sin(): 1000 events, 16901us elapsed, 16.90us avg, min 16us max 17us 0.298us rms
cos(): 1000 events, 18307us elapsed, 18.31us avg, min 18us max 19us 0.461us rms
tan(): 1000 events, 25796us elapsed, 25.80us avg, min 23us max 33us 3.676us rms
acos(): 1000 events, 4380us elapsed, 4.38us avg, min 4us max 5us 0.485us rms
asin(): 1000 events, 4554us elapsed, 4.55us avg, min 4us max 5us 0.497us rms
atan2(): 1000 events, 26604us elapsed, 26.60us avg, min 26us max 27us 0.489us rms
sqrt(): 1000 events, 4774us elapsed, 4.77us avg, min 4us max 8us 1.226us rms
INFO  [tests] TEST PASSED: time_double_precision_float_trig

@dusan19
Contributor

dusan19 commented Apr 8, 2020

Yeah, I realized it might be heavy, but I just wanted to check whether conceptually that solves the problem and whether the problem is in the prediction at all. I might be too picky, but I wish we had a longer log, a few hours long, to see the changes on a bigger time scale.

I'll do the load comparisons now and post them.
P.S. The trigonometry is linearized, so it's affecting a bunch of additions and multiplications, if that's what you meant.

@dagar
Member

dagar commented Apr 8, 2020

Yeah, I realized it might be heavy, but I just wanted to check whether conceptually that solves the problem and whether the problem is in the prediction at all. I might be too picky, but I wish we had a longer log, a few hours long, to see the changes on a bigger time scale.

Makes sense. I'm personally in favor of pursuing a combination of all of these (targeted use of double precision, moving to the new scipy derivation, making the FILTER_UPDATE_PERIOD configurable, AND setting that update so it's perfectly aligned with your actual IMU).

@priseborough
Collaborator

After discussing the options with other ecl/EKF developers, the preferred approach is to incorporate the time step and minimum variance changes for the upcoming 1.11 stable release, and to submit the changes to the autocode as a separate PR for the following 1.11.1 update to give sufficient time for 'soak' testing. The reason for this is that although the subtle numerical differences in the output compared to the MATLAB autocode are better in this scenario, this may not be universal and more extensive testing is required to be confident about this.

@dusan19
Contributor

dusan19 commented Apr 9, 2020

I ran some CPU load analysis for the case when double math is used for computing nextP. On my Pixracer the CPU load for the att_pos_ctrl work queue (where the EKF is running) increased by 15%, so that approach is not feasible (at least not with this implementation).

Double addition takes ~6 times longer than float addition on the Pixracer, and some CPU is also spent on double-float conversions.

Then I wanted to find where the numerical differences between float and double math are the biggest, in order to try to move only part of the computation to double math, so the load on the CPU is not as heavy.

The goal of the following analysis is to show which fields of the covariance matrix have to be computed using double math and for which float math is sufficient, i.e. which fields are obtained by adding values that differ by orders of magnitude.

Inspired by #762 (the matrices shown in the PR description), I did the following:
I kept the float nextP matrix and computed a double dnextP matrix. Then I computed how different, in percent, the effect on the covariance prediction is when using floats versus doubles:
prec_diff = 10000 * fabs(dnextP[i][j]-(double)nextP[i][j])/fabs(dnextP[i][j]-(double)P[i][j])
Considering double math as ground truth, this tells us the error of the covariance matrix increments relative to what should have been the true increment, presented in 1/100 of a percent for readability.
Result:
none_double
All fields are relatively small in terms of absolute precision, but we see that some fields have values that are not insignificantly small (0.03%, 0.44%, ...), considering that we make this much of an error every time we do a prediction and that it affects a lot of other fields.
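A sketch of how the prec_diff metric above can be computed for every element (illustrative; the 24-state dimension and the treatment of the double result as ground truth follow the discussion, and the zero-denominator guard is my addition):

#include <cmath>

constexpr int kNumStates = 24;

// prec_diff[i][j]: error of the float covariance increment relative to the
// double ("ground truth") increment, in 1/100 of a percent, as defined above.
void compute_prec_diff(const float (&P)[kNumStates][kNumStates],
                       const float (&nextP)[kNumStates][kNumStates],
                       const double (&dnextP)[kNumStates][kNumStates],
                       double (&prec_diff)[kNumStates][kNumStates])
{
    for (int i = 0; i < kNumStates; i++) {
        for (int j = 0; j < kNumStates; j++) {
            const double true_increment  = dnextP[i][j] - static_cast<double>(P[i][j]);
            const double increment_error = dnextP[i][j] - static_cast<double>(nextP[i][j]);
            prec_diff[i][j] = (true_increment != 0.0)
                              ? 10000.0 * std::fabs(increment_error) / std::fabs(true_increment)
                              : 0.0;  // guard against a zero true increment
        }
    }
}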

Next step: I set the whole nextP = (float)dnextP. This is the case which fixes the collapsing problem but is heavy on the CPU load. I did the same analysis, obtaining the values expressed in 1/100 of a percent. Result:
all_doubles
Ideally all values would be ~0 here since dnextP - nextP = 0, but they are not, and that is because (double)((float)dnextP) != dnextP. Having values here that are not insignificantly small shows that even though there is no collapsing of the variances now, using double math just in the intermediate computations and converting back to a float nextP is still not great (errors are lower in general, probably because some terms cancel out that didn't due to float precision, but some remain up to 0.22%). P should be double as well, because it needs to capture values that are created as a sum of values in ranges that differ by orders of magnitude. This would not be fixed by using the Python-generated equations, because it means that these different-order-of-magnitude terms do not cancel out for every field. But a double P would take a lot of memory as well.

Next step: I computed only the quaternion fields using double math: nextP[i][j] = (float)dnextP[i][j], 0 <= i,j <= 3, and kept the rest using float math as it is normally done.
Interestingly this was enough to remove the collapsing (in this log), and it is much cheaper than using double math on the whole matrix. These are the results of the same analysis:
quat_double
Some other fields are still high (P[4][5] = 204; this changes from 10 to ~200), so it is questionable whether the problem won't just occur at a later time.

@dusan19
Contributor

dusan19 commented Apr 10, 2020

When I only impose a lower limit on P[1][1] and P[2][2] in the fixCovarianceErrors() function and make no other "fixes", there is no collapsing.
Is it worth finding lower limits for all variances, as was proposed above for position and velocity? Lower limits on the quaternion variances would probably also be valuable, since everything is correlated with those states.
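A minimal sketch of what such diagonal lower limits could look like in a fixCovarianceErrors()-style function (illustrative; the 1e-6 floor on the velocity and position variances mirrors the change discussed for #795, while the quaternion floor value is a placeholder):

#include <algorithm>

constexpr int kNumStates = 24;

// Illustrative variance floors on the diagonal; state ordering follows the
// 24-state ecl EKF (0-3 quaternion, 4-6 velocity, 7-9 position).
void apply_variance_floors(float (&P)[kNumStates][kNumStates])
{
    constexpr float quat_var_min    = 1e-9f;  // placeholder value, states 0..3
    constexpr float vel_pos_var_min = 1e-6f;  // floor discussed for #795, states 4..9

    for (int i = 0; i <= 3; i++) {
        P[i][i] = std::max(P[i][i], quat_var_min);
    }
    for (int i = 4; i <= 9; i++) {
        P[i][i] = std::max(P[i][i], vel_pos_var_min);
    }
}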

@priseborough
Collaborator

priseborough commented Apr 10, 2020

@dusan19 Yes, that could be worthwhile, but relating those values to the required operational minimums is harder than for the velocity and position states, e.g. a variance of 1E-6 equates to a 1 mm/s uncertainty on a velocity state. I've submitted a minimal fix #795 to get something into the upcoming 1.11.0 release, but your suggestion should be looked at for changes (including using the Python autocode version) post 1.11.0.

@LorenzMeier
Member

@dusan19 Could you test the CPU load implications when running this on an STM32F7? With its double precision FPU you can take a different approach.

@dusan19
Contributor

dusan19 commented Apr 10, 2020

@dusan19 Could you test the CPU load implications when running this on an STM32F7? With its double precision FPU you can take a different approach.

Unfortunately I only have a pixracer (STM32F4), but could share the code if someone else wants to run it on the STM32F7.

@dusan19
Contributor

dusan19 commented Apr 18, 2020

I managed to create a 2 h long log from boot where the estimator crashes every ~15 min. It might be useful for testing any of the fixes, since it gives better insight into what's going on compared to the 5 min log. Unfortunately it's still based on 1.10, so master replay might not work:
https://drive.google.com/open?id=15nDj5B1Oa2yzofSc1EpotvtwqexrhO76

@dusan19
Contributor

dusan19 commented Apr 23, 2020

I tested the fixes proposed in #795 with the long log from the previous comment.
The results show that unfortunately the fixes from #795 do not solve the problem but only postpone it by a few tens of seconds, and the estimator spends a shorter amount of time in the failure state.

Original log replayed:

orig_replay

Replayed log with the fixes from #795

fixed_replay

@dusan19
Contributor

dusan19 commented Apr 23, 2020

There are 2 changes in #795

  • Use of an average dt = 10 ms
  • Lower limits of 1e-6 on the position and velocity variances

I wanted to see how much each of those changes independently affects the stability.
It turns out that the dt changes only slightly shift the time when the failure happens but otherwise do not visibly improve anything, while the variance limit changes make the duration of the failure shorter.

Effectively, both changes together look the same as having only the variance limit change.

Only dt changes:

only_dt_changes

Only variance limit changes:

only_limits

@dusan19
Contributor

dusan19 commented Apr 24, 2020

With these changes: https://github.com/dusan19/ecl/tree/cov_num_fixes_1_10
I get a log where the variances converge to a certain value and do not collapse.
I am running these changes on fmu-v4 and haven't noticed any problems yet. I had to increase the relevant work queue stack size.

cov_fixed1

@RomanBapst
Contributor

@dusan19 Have you tried using the python derivation? Would be curious how that affects it.

@dusan19
Contributor

dusan19 commented Apr 24, 2020

@RomanBapst I am replaying that at the moment, will post the results in that PR

@dusan19
Contributor

dusan19 commented Apr 24, 2020

It's worth mentioning that I ran the changes in https://github.com/dusan19/ecl/tree/cov_num_fixes_1_10
for 2 days without rebooting, in the same conditions in which the estimator was reproducibly crashing before the changes. I didn't log the variances, but I monitored the fault flag and never got the fault during this time.

@priseborough
Collaborator

@dusan19 It would really help if you could install master on your hardware for the ground testing.

@dusan19
Contributor

dusan19 commented Apr 26, 2020

I understand. I could create such a setup, where I wait for the estimator crash while the drone is landed, but the problem is that I don't know how to reproduce it. It seems to depend on various initial conditions. Once it starts failing, it fails quite often, and a reboot doesn't fix it, which is good for debugging. Unfortunately I don't know how to bring it to that state.

Would it be possible for you to create a master-replay-compatible log using the sensor measurements from the 1.10 log I sent? I guess the problem is that some uORB message definitions have changed in the meantime?

@priseborough
Collaborator

The messages have changed sufficiently that that is not possible without significant work. If you would like to ensure your problem is solved on 1.11, then I really do need that log.

@dusan19
Contributor

dusan19 commented Jul 22, 2021

As mentioned above, to fix the numerical stability issue, I added the minimal theoretical increment constraint on the diagonal elements, and additionally switched the quaternion covariance prediction to use double precision arithmetic.
Those fixes are not merged; however, when I instead applied the recent PR #1026, I got the same results, meaning the replayed log doesn't show numerical stability issues. I also tried it out on some more logs (8 h long) with no issues. Since this is merged now, I think we can close this issue, @ndepal?
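For completeness, one way to read the "minimal theoretical increment constraint" mentioned above (this is my interpretation of the description, not the patch that was merged): during a prediction step the process noise always adds to a diagonal term, so the predicted diagonal is clamped to be no smaller than the previous value plus that contribution, preventing float cancellation from shrinking it.

#include <algorithm>

constexpr int kNumStates = 24;

// Hedged interpretation of the described constraint, not the actual change:
// clamp each predicted diagonal so it reflects at least the process-noise
// growth of that state for this prediction step.
void constrain_predicted_diagonal(const float (&P)[kNumStates][kNumStates],
                                  float (&nextP)[kNumStates][kNumStates],
                                  const float (&process_noise_variance)[kNumStates])
{
    for (int i = 0; i < kNumStates; i++) {
        // process_noise_variance[i] is the per-step noise contribution (dt already applied)
        nextP[i][i] = std::max(nextP[i][i], P[i][i] + process_noise_variance[i]);
    }
}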

@ndepal
Author

ndepal commented Jul 22, 2021

Agreed

@ndepal ndepal closed this as completed Jul 22, 2021