-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix precision #668
Fix precision #668
Changes from all commits
eeb257a
60bc457
f343fac
0463654
9ceaf11
5ecaf36
f278a1d
2714e72
a5a9f17
031a173
21aacf6
9d55342
6eea65b
2862c52
36560d7
897b448
e98a0b8
4de61e5
09a9af2
f6989fa
353336a
1328d68
f3e9f23
49651ff
2adafb0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,8 +25,8 @@ def _compute_diagonal( | |
m, | ||
M_T, | ||
μ_Q, | ||
Σ_T_inverse, | ||
σ_Q_inverse, | ||
Σ_T, | ||
σ_Q, | ||
cov_a, | ||
cov_b, | ||
cov_c, | ||
|
@@ -66,11 +66,11 @@ def _compute_diagonal( | |
μ_Q : numpy.ndarray | ||
Mean of the query sequence, `Q`, relative to the current sliding window | ||
|
||
Σ_T_inverse : numpy.ndarray | ||
Inverse sliding standard deviation of time series, `T` | ||
Σ_T : numpy.ndarray | ||
Sliding standard deviation of time series, `T` | ||
|
||
σ_Q_inverse : numpy.ndarray | ||
Inverse standard deviation of the query sequence, `Q`, relative to the current | ||
σ_Q : numpy.ndarray | ||
Standard deviation of the query sequence, `Q`, relative to the current | ||
sliding window | ||
|
||
cov_a : numpy.ndarray | ||
|
@@ -182,13 +182,35 @@ def _compute_diagonal( | |
|
||
if T_B_subseq_isfinite[i + k] and T_A_subseq_isfinite[i]: | ||
# Neither subsequence contains NaNs | ||
if T_B_subseq_isconstant[i + k] or T_A_subseq_isconstant[i]: | ||
pearson = 0.5 | ||
else: | ||
pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i] | ||
|
||
if T_B_subseq_isconstant[i + k] and T_A_subseq_isconstant[i]: | ||
pearson = 1.0 | ||
elif T_B_subseq_isconstant[i + k] or T_A_subseq_isconstant[i]: | ||
pearson = 0.5 | ||
else: | ||
denom = Σ_T[i + k] * σ_Q[i] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This:
Should not be here. I think we want to move this higher up:
This would make things much cleaner and without repeating the
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yeah...right! Thanks for the feedback!
Yeah...I remember your explanation. But, I thought it is not possible to use inverse here. I was wrong! |
||
if denom < config.STUMPY_MIN_STD_AB: | ||
cov = ( | ||
np.dot( | ||
(T_B[i + k : i + k + m] - M_T[i + k]), | ||
(T_A[i : i + m] - μ_Q[i]), | ||
) | ||
* m_inverse | ||
) | ||
|
||
pearson = cov / denom | ||
if pearson > 1.0: | ||
pearson = 1.0 | ||
|
||
# if config.STUMPY_CORRELATION_THRESHOLD <= pearson < 1.0: | ||
# cov = ( | ||
# np.dot( | ||
# (T_B[i + k : i + k + m] - M_T[i + k]), | ||
# (T_A[i : i + m] - μ_Q[i]), | ||
# ) | ||
# * m_inverse | ||
# ) | ||
|
||
# pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i] | ||
|
||
if pearson > ρ[thread_idx, i, 0]: | ||
ρ[thread_idx, i, 0] = pearson | ||
|
@@ -225,8 +247,8 @@ def _stump( | |
m, | ||
M_T, | ||
μ_Q, | ||
Σ_T_inverse, | ||
σ_Q_inverse, | ||
Σ_T, | ||
σ_Q, | ||
M_T_m_1, | ||
μ_Q_m_1, | ||
T_A_subseq_isfinite, | ||
|
@@ -259,11 +281,11 @@ def _stump( | |
μ_Q : numpy.ndarray | ||
Mean of the query sequence, `Q`, relative to the current sliding window | ||
|
||
Σ_T_inverse : numpy.ndarray | ||
Inverse sliding standard deviation of time series, `T` | ||
Σ_T : numpy.ndarray | ||
Sliding standard deviation of time series, `T` | ||
|
||
σ_Q_inverse : numpy.ndarray | ||
Inverse standard deviation of the query sequence, `Q`, relative to the current | ||
σ_Q : numpy.ndarray | ||
Standard deviation of the query sequence, `Q`, relative to the current | ||
sliding window | ||
|
||
M_T_m_1 : numpy.ndarray | ||
|
@@ -384,8 +406,8 @@ def _stump( | |
m, | ||
M_T, | ||
μ_Q, | ||
Σ_T_inverse, | ||
σ_Q_inverse, | ||
Σ_T, | ||
σ_Q, | ||
cov_a, | ||
cov_b, | ||
cov_c, | ||
|
@@ -545,7 +567,7 @@ def stump(T_A, m, T_B=None, ignore_trivial=True, normalize=True, p=2.0): | |
( | ||
T_A, | ||
μ_Q, | ||
σ_Q_inverse, | ||
σ_Q, | ||
μ_Q_m_1, | ||
T_A_subseq_isfinite, | ||
T_A_subseq_isconstant, | ||
|
@@ -554,7 +576,7 @@ def stump(T_A, m, T_B=None, ignore_trivial=True, normalize=True, p=2.0): | |
( | ||
T_B, | ||
M_T, | ||
Σ_T_inverse, | ||
Σ_T, | ||
M_T_m_1, | ||
T_B_subseq_isfinite, | ||
T_B_subseq_isconstant, | ||
|
@@ -600,8 +622,8 @@ def stump(T_A, m, T_B=None, ignore_trivial=True, normalize=True, p=2.0): | |
m, | ||
M_T, | ||
μ_Q, | ||
Σ_T_inverse, | ||
σ_Q_inverse, | ||
Σ_T, | ||
σ_Q, | ||
M_T_m_1, | ||
μ_Q_m_1, | ||
T_A_subseq_isfinite, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I think to answer this question, we need to answer the following (philosophical!) question first:
Let's say my data is:
Note that the z-norm of
T[:8]
andT[-8:]
are the same. Now, of the two statements below, which one is correct?T[:8]
andT[-8:]
is zero because their z-norm are the same.T[:8]
andT[-8:]
isT[:8]
should be treated as a constant subsequence. And, we know the distance between a constant subsequence and a non-constant subsequence isIf we go with the latter statement, then$\sqrt{8}$ )
STUMPY_STDDEV_THRESHOLD = 1e-7
is fine (If you domp=stumpy.stump(T, m)
, then you will see thatmp[0, 0]
isIf we choose the former statement, then I think we should do:
STUMPY_STDDEV_THRESHOLD = 1e-100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going purely by my instincts, I think it should be the former (the distance should be zero because their z-norm are identical).
So, it looks like we only use
STUMPY_STDDEV_THRESHOLD
in three functions:stumpy/stumpy/core.py
Lines 923 to 935 in 9a2e2ec
AND
stumpy/stumpy/gpu_stump.py
Lines 140 to 155 in 9a2e2ec
AND
stumpy/stumpy/core.py
Line 1754 in 9a2e2ec
In the first two cases, it feels like we could/should replace the use of that threshold with
T_subseq_isconstant
instead? However, in the 3rd case, we are usingSTUMPY_STDDEV_THRESHOLD
to setT_subseq_isconstant
! Hmmm, this is nasty. 😢 I don't know what to do.i guess in all cases, we are trying to allow for some small leniency in defining a constant subsequence. Now I see your point as to why
STUMPY_STDDEV_THRESHOLD
should be smaller that1e-7
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fundamentally, since all of this relates to identifying a constant subsequence, is there a different/better way to identify a constant subsequence besides using the STDDEV? Initially, I thought it was a simple/cute approach but perhaps it is flawed?
Just to state it out loud, it appears that it is not possible/trivial to distinguish whether a subsequence is constant OR whether it simply has very small values to start with OR both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think we are struggling with this because we are trying to make assumptions for the user (i.e., what should be considered "constant"). The purest/strictest definition is when the STDDEV is equal to zero and I doubt that there is any argument to this, right? Then, there is everything else that is an approximation.
I am starting to think if we should add
T_A_subseq_isconstant=None
andT_B_subseq_isconstant=None
to our top level API which lets the user define the truth and we stop guessing. Any thoughts?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand your point of view. I think, however, this might make things a little more complicated. I mean...it would not be easy for a user to set
T_subseq_isconstant
. Right? Because, they probably need to calculate the rolling std and then use it to computeT_subseq_isconstant
. And, this may be used by only a few users. I know you are usually interested in solving the issues for general audience. But, if you think this is fine, then I think your proposed solution is good :)I just realized that we still have
config.STUMPY_STDDEV_THRESHOLD
. So, it should be good :) we can think of paramT_subseq_isconstant
for a little bit more advanced users.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You raise a valid point. So, first things first, does setting
T_A_subseq_isconstant = σ_Q < config.STUMPY_STDDEV_THRESHOLD
andT_B_subseq_isconstant = Σ_T < config.STUMPY_STDDEV_THRESHOLD
help with our constant subsequence precision issue?Then, separately, do we expose
T_A_subseq_isconstant
andT_B_subseq_isconstant
to the user? It might make sense to take a moment and dig through our past issues to see what issues users were having when dealing with constant subsequences or near constant subsequences.Again, I think the first part is more important while this second part is "nice to have" but, to your point, it's really for advanced users (because they'd need to know how to call
core.rolling_window
, which is not particularly hard but also not obvious)!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are (indirectly) doing it already:
stumpy/stump.py
: we useT_A_subseq_isconstant
which is one of the outputs returned bycore. preprocess_diagonal
in which we doT_subseq_isconstant = Σ_T < config.STUMPY_STDDEV_THRESHOLD
.core._calculate_squared_distance
, we do the same. For instance, let's see lines 923-924 incore.py
:which is equivalent to:
btw, if I remember correctly, setting$\sqrt{m}$ as their distance. (similar to the example I provided earlier). That was the main reason behind setting
config.STUMPY_STDDEV_THRESHOLD
to lower value helped me to resolve imprecision in one case. I cannot remember that particular case, but I can remember a test failed and the reason was the imprecision. The naive version gave0
as the distance between the two identical case, but the performant version treated one of the identical subseq as constant and gaveconfig.STUMPY_STDDEV_THRESHOLD=1e-20
. We can even set it lower, like1e-100
but I am not sure if it makes sense or not.Did you mean if it is reasonable or not to allow user to set these two parameters?
Yeah... that's a good idea! I will go and check out the previous issues.
You are right! So, to wrap up, we may do:
T_subseq_isconstant
incore._calculate_squared_distance
to make it consistent withstumpy/stump.py
config.STUMPY_STDDEV_THRESHOLD
to lower valueWhat do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that summary sounds good. There is one small note as to why we chose to use the STDDEV to infer a constant region (rather than adding
T_subseq_isconstant
) and that was to save on memory.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seanlaw
I provided a recap in the main page of PR. But, I will leave this conversation open for now just in case you want to mention something.