Fix precision #668
Conversation
@seanlaw
|
@NimaSarajpoor Instead of running
I get:
If I do:
Then I get the following:
|
I got a different result:
Btw, I compared my local branch and the remote branch in origin (for both |
I have a few observations:
I generated this by adding the following three lines to
According to this, subsequence with |
I think we should. btw, I tried PLEASE IGNORE THE FOLLOWING COMMENT
from In my local repo, I got: UPDATE: |
From
This suggests that |
Just got that! You are right!
That is probably correct. I am going to do what you suggested... in the meantime, could you please remove
? Maybe numba performs differently on different platforms... Btw, the code in this branch (in my local repo) still fails for seed=26! Do you feel we are putting too much energy on this? I mean... how do you feel about it? :) |
|
I think they are the same. So, as you suggested, this matter is related to |
After commenting out
I do not feel this way at all! This shouldn't be happening and it would be good to track it down |
The good thing is that we've isolated it to the |
I think it passes
The operation got cancelled for the other versions of Python. |
And I got:
Now, without Numba, the values that correspond to |
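(As a side note on reproducing the "without Numba" comparison: one way to do it is to disable Numba's JIT before importing STUMPY, so that all @njit functions run as plain Python/NumPy. This is just a debugging sketch; the seed and window size below are arbitrary placeholders, not the exact values from the runs above.)

```python
import os

# NUMBA_DISABLE_JIT must be set before numba (and hence stumpy) is imported,
# so that every @njit function falls back to the interpreted NumPy code path.
os.environ["NUMBA_DISABLE_JIT"] = "1"

import numpy as np
import stumpy

np.random.seed(26)  # illustrative seed only
T = np.random.rand(64)
m = 3

# With the JIT disabled, these matrix profile values can be compared
# against the ones produced by the compiled (Numba) run.
mp = stumpy.stump(T, m)
print(mp[:, 0])
```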
@seanlaw |
Codecov Report
Base: 99.89% // Head: 99.89%
Additional details and impacted files
@@ Coverage Diff @@
## main #668 +/- ##
=======================================
Coverage 99.89% 99.89%
=======================================
Files 80 80
Lines 11453 11538 +85
=======================================
+ Hits 11441 11526 +85
Misses 12 12
☔ View full report at Codecov. |
Could you please let me know what is the
Explanation
So, maybe the |
When I do:
I get:
So, when I print their z-norm subsequences, I get:
You can see that they are somewhat close. The last (3rd) element is identical but the first two elements are only slightly different. |
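For reference, comparing the z-normalized subsequences of a pair looks roughly like the sketch below; the time series, window size, and indices are made-up placeholders rather than the values from the elided output above.

```python
import numpy as np

def z_norm(a):
    # Subtract the subsequence mean and divide by its standard deviation
    return (a - np.mean(a)) / np.std(a)

# Hypothetical input: the two subsequences differ only by a tiny amount
T = np.array([0.5, 0.5000000001, 1.0, 2.0, 0.5, 0.5, 1.0])
m = 3
i, j = 0, 4  # hypothetical pair of subsequence indices

a = z_norm(T[i : i + m])
b = z_norm(T[j : j + m])
print(a)
print(b)

# A tiny difference in the raw values leaves the z-normed subsequences
# "somewhat close" but not identical, so the z-normalized Euclidean
# distance ends up slightly above 0 instead of exactly 0.
print(np.linalg.norm(a - b))
```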
We just got an error caused by slight imprecision in an identical-subsequences case in an about-to-be-merged PR (see error). I was wondering if we should partially (?) solve this, as this error appears from time to time in the merging process. I have a new idea but have not implemented it yet. The concept goes as follows:
Do you think the |
Instead of
Regarding the timing, I don't know how time-consuming it would be; we should simply check with some different inputs. First:
then:
Finally, a mix of both:
Compare their performance with the check. |
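Something along these lines is one way to run that timing check; the three input types (purely random, constant, and a mix of both) are my guesses for the elided examples, and the "rolling" variant below is a plain cumulative-sum sliding std used only as a stand-in for the existing approach.

```python
import time
import numpy as np

def rolling_std(T, m):
    # "Rolling" sliding std via cumulative sums (fast, but can lose precision)
    cumsum = np.cumsum(np.insert(T, 0, 0.0))
    cumsum_sq = np.cumsum(np.insert(T * T, 0, 0.0))
    window_sum = cumsum[m:] - cumsum[:-m]
    window_sum_sq = cumsum_sq[m:] - cumsum_sq[:-m]
    var = window_sum_sq / m - (window_sum / m) ** 2
    return np.sqrt(np.maximum(var, 0.0))

def direct_std(T, m):
    # Direct per-subsequence std (slower, but matches np.std exactly)
    return np.array([np.std(T[i : i + m]) for i in range(T.shape[0] - m + 1)])

n, m = 100_000, 50
inputs = {
    "random": np.random.rand(n),
    "constant": np.full(n, 7.0),
    "mix": np.concatenate((np.random.rand(n // 2), np.full(n - n // 2, 7.0))),
}

for name, T in inputs.items():
    start = time.time()
    a = rolling_std(T, m)
    t_rolling = time.time() - start

    start = time.time()
    b = direct_std(T, m)
    t_direct = time.time() - start

    print(f"{name}: rolling={t_rolling:.3f}s direct={t_direct:.3f}s "
          f"max abs diff={np.max(np.abs(a - b)):.2e}")
```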
I can see your point. I need to investigate this further. Btw, if you take a look at the
Also, note that in this case, the
Also, I had a slight miscalculation in the notebook. I fixed it. I also added a section to understand the relationship between
I cannot guarantee that all of my calculations are flawless. I just wanted to share my thoughts with you. We might be able to use them at some point. |
@NimaSarajpoor For this |
Interesting, I wonder if this discussion might help. Additionally, this other conversation seems insightful as to how we should be thinking about these values. |
Btw @NimaSarajpoor, I'm starting to wonder whether it makes more sense to move these precision-related tests to |
@seanlaw
Note that the flat segments may not actually be flat! (I mean... their values might be very, very small.)
Note 1: The length of the x-axis is the number of pairs. So, if it is confusing, just pay attention to the y-axis, which shows the imprecision in the Pearson correlation for the pairs of subsequences.
Note 2: In the
Let's investigate the first pair. NOTE: this pair is in the
So, now I have three versions of
Now, I am interested to see the error
Note that the magnitude of the error at index 47 is around 0. At indices 48 and 49, the error of
Also, note that the errors of
As you can see, the
Note: There is no imprecision in sliding mean. I checked.
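(A quick way to back that up: compare a cumulative-sum sliding mean against the mean computed independently per window. This is only a sketch with an arbitrary random input.)

```python
import numpy as np

def rolling_mean(T, m):
    # Sliding mean via cumulative sums (the "rolling" style computation)
    cumsum = np.cumsum(np.insert(T, 0, 0.0))
    return (cumsum[m:] - cumsum[:-m]) / m

def direct_mean(T, m):
    # Mean computed independently for every length-m subsequence
    return np.array([np.mean(T[i : i + m]) for i in range(T.shape[0] - m + 1)])

np.random.seed(0)
T = np.random.rand(10_000)
m = 50

# The two should agree to within a few ULPs, consistent with the sliding
# mean not being the source of the loss of precision.
print(np.max(np.abs(rolling_mean(T, m) - direct_mean(T, m))))
```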
And, for
Note that
Maybe we should think of a way to consider applying
The
So, we can try:
I think applying it on
I took a quick look. I will read it more carefully to see if we can get some insight from it. Sorry for the long comment :) I have been working on this and I thought it would be useful to share with you the things that I found |
Thank you for working on this @NimaSarajpoor! I don't know if any of my suggestions will help but it feels like we are making slow but steady progress towards understanding what the root of our problem is. I definitely don't know the "right" answer :) |
Me neither! I just hope that we get to a better point in terms of understanding the nature of the problem.
I think that would be a good idea, because (1) we can move forward with resolving other issues, and (2) as you said, it puts all of these edge cases together in one place. |
Yes! Especially the section on adding/subtracting/multiplying/dividing. Note that I try to say "loss of precision" rather than "imprecision" as the former implies that precision is lost due to, say, combining numbers with different magnitudes (see numerical analysis). So, both numbers being combined can have high precision but precision is lost due to floating point issues. I don't know how I'd use "imprecision" but, to me, it feels like somebody failed to record the initial number up to enough significant digits and is, therefore, "imprecise". Maybe it's just me over-analyzing things and I could be completely "off" |
I think I get your point. In my head, "imprecision" might imply inaccuracy. So, "loss of precision" is better as it avoids creating such confusion. |
@NimaSarajpoor I've been thinking about this a little more. There are two reasons why I chose to use Welford to compute the stddev:
Or so I thought... Then, I realized that there is another option, which is:
It's nothing but simple parallelization via
And then we can test with:
Seems to be good up to 12 decimal places, but only tested on a handful of random time series. This may be the best way forward to ensure that we get the same precision as the
To be 100% clear, I am saying that we replace |
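My reading of the proposal, sketched out below: compute the mean/std of every subsequence directly, with one subsequence per prange iteration, instead of carrying rolling/Welford updates across windows. The function name and signature here are placeholders, not STUMPY's actual API.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_mean_std(T, m):
    # Each prange iteration computes the mean and std of one length-m
    # subsequence from scratch, so no error accumulates across windows.
    k = T.shape[0] - m + 1
    M_T = np.empty(k)
    sigma_T = np.empty(k)
    for i in prange(k):
        M_T[i] = np.mean(T[i : i + m])
        sigma_T[i] = np.std(T[i : i + m])
    return M_T, sigma_T

if __name__ == "__main__":
    np.random.seed(0)
    T = np.random.rand(100_000)
    m = 50
    M_T, sigma_T = parallel_mean_std(T, m)

    # Compare against np.std computed per subsequence (the reference values)
    ref = np.array([np.std(T[i : i + m]) for i in range(T.shape[0] - m + 1)])
    print(np.max(np.abs(sigma_T - ref)))
```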
This is a great idea! As you said, the computing time is mostly dominated by the matrix profile calculation, and sacrificing a little bit of time on the computation of |
@NimaSarajpoor Please pull the latest changes/commits into your branch |
@seanlaw
I used your approach but without |
I don't think I understand what you did then. Technically, |
Oh! Okay... I thought I had been saving some memory here :)
I didn't do anything new. I just do |
Two notes:
(1) Avoiding
(2) Since we try to avoid the "rolling approach" and calculate the std for each subsequence directly, the computing time increases linearly as we increase the window size.
The existing approach (i.e., the "rolling approach") takes about 2 sec, but the "parallelizing approach" takes about 40 sec! I measured the computing time for
I think this is still negligible compared to the computational load of the matrix profile. What do you think @seanlaw ? |
40 seconds?? Did you mean 4 seconds? Is the "parallelization" approach using |
I meant 40 (forty) seconds. This is for my code where I avoided using
I think my PC is slower than yours, but I think you should see the same jump if you run it on your end.
And, I think it makes sense. The time complexity is |
Sure, it's probably fine |
@NimaSarajpoor Does this still need more work? |
I think so. A few days ago, I tried to see if the test functions in
Note 1:
Note 2:
Now, if we enhance the precision for the aforementioned case, does that mean there will be no issue for the following case?
We can test it. Ideally, the distance should be exactly 0 between the two aforementioned subsequences. [In this case, one may treat |
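A sketch of the kind of test I have in mind for the identical-subsequences case; the structure and tolerance below are placeholders (not the existing STUMPY test functions), and ideally the asserted value would be exactly 0.0 once the loss of precision is handled.

```python
import numpy as np
import stumpy

def test_identical_subsequence_self_join():
    # Build a time series that contains two exact copies of the same subsequence
    np.random.seed(0)  # arbitrary seed; the failing seeds in this thread varied
    m = 8
    identical = np.random.rand(m)
    T = np.random.rand(64)
    T[10 : 10 + m] = identical
    T[40 : 40 + m] = identical

    mp = stumpy.stump(T, m)

    # Ideally the matrix profile value at each copy is exactly 0.0; a tiny
    # tolerance is used here until the loss of precision is fully resolved.
    np.testing.assert_allclose(mp[10, 0], 0.0, atol=1e-7)
    np.testing.assert_allclose(mp[40, 0], 0.0, atol=1e-7)
```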
What do you think is the best thing to do? |
I can see many code changes in this PR that should be reverted. Also, some part of the discussion is about using parallelized computation for the std instead of the Welford technique [which is already taken care of]. Thus, I think it is better to close this one and create a new one just for handling the loss of precision in cases with identical subsequences, and add a new test func to |
Sounds good. Let's do that! |
This PR is a replacement for PR #657, which tackles issue #610.
Regarding PR #657: there was a discrepancy between the files in the local repo and the remote repo, but I couldn't figure out why. That is why I am submitting a new PR.