Fix Precision in Unit Test #610
@seanlaw However, according to the calculation of
A potential solution: |
@seanlaw How can we handle the identical case without using something like ...? So, how about just handling identical cases? The following solution works for me BUT it is not efficient for massive data, say 10M data points. I think large-size data has a higher chance of hash collisions and thus increases the computing time (?)
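Something along these lines is what I have in mind (a simplified sketch with illustrative names, not actual STUMPY code):

```python
import numpy as np
from collections import defaultdict

def group_candidate_identical(T, m):
    # Bucket subsequence start indices by the hash of their raw bytes.
    # Distinct subsequences may collide into the same bucket, so any
    # bucket with more than one member still has to be verified with an
    # exact element-wise comparison afterwards.
    buckets = defaultdict(list)
    for i in range(len(T) - m + 1):
        buckets[hash(T[i : i + m].tobytes())].append(i)
    return [idx for idx in buckets.values() if len(idx) > 1]
```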
And, then we check if two subsequences are identical in
An alternative way is to let collisions happen to make the process faster, and check them later.
Then, we can correct the matrix profile (i.e. matrix profile / matrix profile left / matrix profile right) at the end:
Is it worth it? We are increasing time in the pre-processing and post-processing steps. (I believe both of these steps can be parallelized via |
@NimaSarajpoor I don't think ... I don't know if it's worth addressing or what the best option is. Setting a ... Alternatively, I wonder if we can check |
I was wondering if what you have in mind is to check each single
might (hugely?) increase the computing time if the data mostly shows periodic behavior (because we will have many identical subsequences). On the other hand, I think we also want to make sure the values in the final matrix profile are precise. I mean...if we say ... So, I think we can just re-calculate the distance at the end after checking |
I think the only thing we'd need to change is to add a line after L188 (Lines 183 to 188 in 576de37):
I think that this is the simplest and should be a negligible cost as there is an ... This should be simple, easy to maintain, easy to interpret, and if users don't like it then
I don't know. Unfortunately, I don't think we'll get much more "precise". It's a trade-off between speed and a slight loss in precision (but not much, actually). In the worst case, if you have a long sine wave (time series) OR when each subsequence has a perfect match, you'll end up computing all of the pairwise distances twice (once during |
As I think about it more, if we were to do this as a post-processing step, we would write a separate function to |
Correct. As you said, I think it is better and sufficient for such a rare case (you do not have to recalculate distances and you modify the values throughout the process). |
As a cross-reference to #655, this related test is also failing:
|
@seanlaw Here is the test function:
And, here is the error:
|
Hmm, so what do you think is the issue? I'm assuming that |
Yeah...I think so. I took a look at the ... According to my investigation, I think one problem is the imprecision in the calculation of the sliding standard deviation. One way to make it a little bit better is standardization of the whole time series. Another way that I think should work (I tested it to some extent) is to refine the Welford nanvar/nanstd. We can do it by calculating the rolling nanvar/nanstd for the subsequences, each scaled by its min-max range, and then scale back at the end.
Then, we use this to modify the welford as follows:
And finally, reverse everything back to normal at the end:
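A rough per-window sketch of the scaling idea (the real change would live inside the rolling Welford update; this only shows why scaling back at the end is valid):

```python
import numpy as np

def minmax_scaled_std(T, m):
    # For each window S, compute the std of S / r with r = max(S) - min(S),
    # then scale the result back, since std(S) == r * std(S / r).
    # Working near unit magnitude reduces floating-point error when
    # subsequence amplitudes differ by orders of magnitude.
    k = len(T) - m + 1
    sigma = np.empty(k)
    for i in range(k):
        S = T[i : i + m]
        r = S.max() - S.min()
        sigma[i] = 0.0 if r == 0.0 else r * np.std(S / r)
    return sigma
```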
For instance, I am comparing the proposed one with the existing version:
|
@seanlaw |
Yeah, I'm pretty confident that Welford is about as precise as we're going to get and it has the best speed to precision tradeoff. 😢 |
@seanlaw I think I found the issue. Note that in Line 188 in d57169f:
And, it seems that the high precision in ... Example:
Let's take a look at the std of this input:
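As a self-contained illustration of the effect (made-up data, not the exact input from this thread), compare a replace-one Welford-style rolling variance against the per-window two-pass std:

```python
import numpy as np

def welford_rolling_var(T, m):
    # Sliding-window variance via replace-one Welford-style updates;
    # rounding error accumulates as the window slides across the series.
    k = len(T) - m + 1
    var = np.empty(k)
    mean = T[:m].mean()
    M2 = ((T[:m] - mean) ** 2).sum()
    var[0] = M2 / m
    for i in range(1, k):
        x_old, x_new = T[i - 1], T[i + m - 1]
        delta = x_new - x_old
        new_mean = mean + delta / m
        M2 += delta * (x_new - new_mean + x_old - mean)
        mean = new_mean
        var[i] = M2 / m
    return var

np.random.seed(0)  # arbitrary seed, for illustration only
m = 10
T = np.random.rand(64)
T[10 : 10 + m] *= 1000.0  # one huge-amplitude subsequence...
T[40 : 40 + m] *= 0.001   # ...followed by a tiny-amplitude one

ref = np.array([np.std(T[i : i + m]) for i in range(len(T) - m + 1)])
comp = np.sqrt(welford_rolling_var(T, m))
# The absolute error picked up while sliding through the huge window
# dwarfs the tiny window's true variance, so the relative error explodes:
print(np.max(np.abs(ref - comp) / ref))
```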
And, if we compare the matrix profile ref_P (from ...) ... (The indices 2 and 12 are in fact the start indices of two identical subsequences, where one of them is scaled down by 0.001 (the subseq at index 2) and the other one is scaled up by 1000 (the subseq at index 12).) |
Ohh, wow! That's REALLY bad but also fun at the same time 😭 |
Yeah...I also tried many, many different ways to resolve this. I have not been successful so far. FYI:
If we scale up (down) by 10000 (0.0001), then:
... will be around 1000! (btw, we should also consider the impact of ...) The exact distance between |
What happens when you ...? Can you also print |
In case that matters: |
Can you tell me what seed and code you are using to generate |
I think this should give you the
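Roughly like this (my reconstruction from the description in this thread; the seed, length, and window size here are placeholders, not the original values):

```python
import numpy as np

np.random.seed(0)  # placeholder seed
m = 3
T = np.random.rand(20)
pattern = np.random.rand(m)
T[2 : 2 + m] = 0.001 * pattern     # scaled-down copy of the pattern
T[12 : 12 + m] = 1000.0 * pattern  # scaled-up copy of the same pattern
# After z-normalization the two subsequences are identical, so their
# matrix profile values should be exactly zero.
```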
|
Hmm, I wonder in
The reason why we use Welford is because of its speed and reasonable precision. As you identified, for the most part, what Welford produces is fine. However, for really small numbers, it's insufficient. And so maybe we add a correction where we use ... Any thoughts? |
So, after line #869 in ... (Lines 860 to 876 in 67b822c)
we can add something like this:
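For instance (a sketch; the function name and the threshold are illustrative):

```python
import numpy as np

def correct_small_sigma(T, m, sigma, threshold=1e-7):
    # Wherever the Welford-based sigma looks suspiciously small, recompute
    # that one window with NumPy's precise two-pass standard deviation.
    for i in np.flatnonzero(sigma < threshold):
        sigma[i] = np.std(T[i : i + m])
    return sigma
```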
This might not work for the multi-dimensional case but I hope you can get my point |
After I saw your suggestion, I tried it by calculating an upper bound for the std, ... Now, there is one more thing we might be interested in taking care of. See the test function below:
Here, I randomly scaled the values up or down. Btw, I just realized a cleaner way would have been to do:
Okay, let's get back to our main topic. This test function still fails. In fact, I used the exact mean and std, i.e. ... Here is the error:
This shows that the Pearson approach, i.e. using ... Side note: |
@seanlaw |
Yeah, I don't think we should move away from using welford because it is not only reasonably precise but it also doesn't use up all of our memory. With |
Alternatively, can we somehow avoid the inverse and still use |
Correct! I think the current version is already good and because of that we should just try to improve it without drastically sacrificing the speed.
Yeah...I think that might help if it is possible. I thought about it and even tried to re-arrange the equations in the code but that did not help. I will think about it again. |
Update
the test function |
Can you help me understand what that means and if there is anything we could do? |
So I did ... I am still investigating this. I just wanted to update you and seek your opinion. |
So, ... Is that the correct interpretation? |
Yes...exactly. According to my investigation, I realized the error (in that case) is coming from the error of ... Also: ... So, whenever we get a small value for
So, one idea can be:
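One possibility in this spirit (my sketch, not necessarily the elided idea above): since the z-normalized distance is derived from the Pearson correlation via d = sqrt(2m(1 - rho)), clamping rho at 1 keeps rounding from turning a perfect match into a spurious non-zero (or NaN) distance:

```python
import numpy as np

def pearson_to_dist(rho, m):
    # Clamp rho so rounding can never push it above 1.0, and clamp the
    # squared distance at zero before taking the square root.
    rho = np.minimum(rho, 1.0)
    return np.sqrt(np.maximum(2.0 * m * (1.0 - rho), 0.0))
```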
(I did a similar thing. I did ... However, that means if the input time series |
Interesting!
Wouldn't |
Is this always true? Is it possible to have the wrong index if the distance is wrong? If not, then maybe as a post-processing step we simply compute the
Frankly, if the user's data consists mostly of almost-flat subsequences, then it's probably a really, really poor dataset to begin with for matrix profiles in general 😢 |
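A rough sketch of that post-processing idea (assuming the reported indices are trustworthy; names are illustrative):

```python
import numpy as np

def recompute_P_from_indices(T, m, I):
    # Recompute each matrix profile value from its final nearest-neighbor
    # index using precise per-window z-normalization. This is O(n * m),
    # negligible next to the O(n^2) profile computation itself.
    k = len(T) - m + 1
    P = np.empty(k)
    for i in range(k):
        a = T[i : i + m]
        b = T[I[i] : I[i] + m]
        # NOTE: constant subsequences (std == 0) would need special
        # handling before dividing.
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        P[i] = np.linalg.norm(a - b)
    return P
```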
I tried my proposed solution (i.e. using
I don't think so. It seems we are using
Well, it was true at the time I was writing that comment :) If I remember correctly, during my investigation I faced a case where the indices of
Yeah...I meant if someone has |
Well, the reason we are passing |
Ah...right!
Correct. Well, maybe I should do this first, and then, if I realize that the cost is negligible, I can try to work on the imprecision issue. |
Sounds good! |
So, I have been working on this from time to time. There are a couple of things that I would like to share. (1) Maybe I am missing something but it seems there is a small inconsistency between the outputs of
And, the sliding standard deviation:
Also, note that the second element and the one-before-last element of ... So, I did:
And, it works. I tested ... (2) I also tried a more expensive approach to calculate Σ_T. In short, I did something like this:
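In spirit, something like the following condensed version (it materializes every window, so it costs more time and memory than a rolling update):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def precise_sliding_std(T, m):
    # Two-pass std over every window independently, so no rounding error
    # is carried along from one window to the next.
    return np.nanstd(sliding_window_view(T, m), axis=1)
```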
I didn't provide the check for ... Regarding the computing time, for ... What do you think? Update: |
Maybe I'm not seeing it but it looks like ... The source of the difference is when you allow ... So, there is no imprecision.
We currently handle this in |
It's been a while and I don't remember what happened but I believe that this approach may lead to "catastrophic cancellation" and even worse precision issues.
Exactly! This stuff is really nasty :( |
Thanks for the explanation! It makes sense. I can see it now!
Is it okay if I go ahead and try some changes, push them to the existing PR (#657), and see how it goes? (I usually test them on some important test functions. Then I can push the commits and then test it on all test functions. You can also see the code.) |
Can you use LaTeX and write out the exact formulas, or provide the full code? It looks like some things are missing. |
Sure |
Below, I am calculating the variance:
So, the variance is ... So, I can use ... Downside:
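The identity at play is presumably the classic one-pass formula:

$$
\sigma^2 \;=\; \frac{1}{m}\sum_{i=1}^{m} T_i^{2} \;-\; \left(\frac{1}{m}\sum_{i=1}^{m} T_i\right)^{2} \;=\; \overline{T^{2}} - \bar{T}^{2}
$$

When $\sigma \ll |\bar{T}|$, the two terms are nearly equal, so their difference loses most of its significant digits; this is the catastrophic cancellation mentioned above.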
I will push the changes to the PR #657. |
What happens when we try this test case with the MATLAB code? Do they experience any issues with precision? If so, then maybe we can email Eamonn about it and see if he has any suggestions. |
This unit test causes a failure: