
Fix precision #668

Closed
wants to merge 25 commits into from

Conversation

NimaSarajpoor
Collaborator

This PR is a replacement for PR #657, which tackles issue #610.

Regarding PR #657: There was a discrepancy between the files in the local repo and the remote repo, but I couldn't figure out the cause. That is why I am submitting a new PR.

@NimaSarajpoor
Collaborator Author

@seanlaw
This is strange! Now, when I run the tests, it says "83 files would be left unchanged", which is the same as what we see here in GitHub Actions. And I get no error when I run the tests in my local repo.

$ ./test.sh custom 1 tests/test_stump.py
Cleaning Up
Checking Black Code Formatting
All done! ✨ 🍰 ✨
83 files would be left unchanged.
Checking Flake8 Style Guide Enforcement
Executing Custom User-Defined Tests Only
Custom Test: 1 / 1
============================= test session starts =============================
platform win32 -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\AS17983\Desktop\NIMA\stump\stumpy, configfile: pytest.ini
plugins: anyio-3.6.1, cython-0.2.0
collected 27 items

tests\test_stump.py ...........................                          [100%]

============================= 27 passed in 19.44s =============================

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

@NimaSarajpoor Instead of running test.sh, can you simply run this in a Jupyter notebook:

import numpy as np
import naive  # from the stumpy tests/ directory
from stumpy import stump

m = 3
zone = int(np.ceil(m / 4))
seed = 1
np.random.seed(seed)
T = np.random.rand(64)
scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
T[:] = T * scale

ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
comp_mp = stump(T, m, ignore_trivial=True)

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    if np.abs(ref - cmp) > 1e-7:
        print(i, ref, cmp)

I get:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

If I do:

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    print(i, ref, cmp)

Then I get the following:

0 1.5700924586837752e-16 0.0
1 1.5700924586837752e-16 0.0
2 1.658550663133749e-07 1.6323404237781945e-07
3 2.4257150105478956e-07 2.421152116249788e-07
4 6.454152836022342e-07 6.452392069879462e-07
5 0.22155833569278333 0.2215583356927852
6 0.09204964741053696 0.09204964741054301
7 1.9109645596859633e-07 1.8966081829902314e-07
8 1.5700924586837752e-16 0.0
9 2.220446049250313e-16 0.0
10 0.012944862817693177 0.012944867345093258
11 0.49025821651994234 0.4902582165199456
12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
14 3.3457789477422945e-05 3.3457703025089785e-05
15 0.26636729814247995 0.26636729659228936
16 3.656951842685141e-08 0.0
17 0.46944298605172013 0.4694429860517184
18 0.03103461052726588 0.031034609854201676
19 6.434999199613844e-07 0.0
20 5.535501007487065e-08 0.0
21 8.19261056164571e-10 0.0
22 5.364470225196175e-07 0.0
23 8.030629626643827e-07 8.021753331240875e-07
24 1.9109645596859633e-07 1.8966081829902314e-07
25 1.5700924586837752e-16 0.0
26 0.0 0.0
27 2.53526511563009e-07 2.0157917229039495e-07
28 8.553525901666268e-08 0.0
29 4.6884218468212213e-07 4.6814316726937914e-07
30 0.10353933637363609 0.10353933656145922
31 0.1051316158248435 0.10513161293237559
32 0.48220646477596524 0.48220646544989193
33 8.523796584076095e-07 0.0
34 1.7325318446717188e-06 1.7323200730701582e-06
35 1.5414353887031432e-06 0.0
36 0.04048989128516726 0.040489912232822924
37 3.656951842685141e-08 0.0
38 3.656951842685141e-08 0.0
39 0.26636729814247995 0.26636729659228936
40 1.658550663133749e-07 0.0
41 4.6884218468212213e-07 4.6814316726937914e-07
42 6.460441508916396e-06 6.458583395855192e-06
43 0.03103461052726588 0.031034609854201676
44 0.10353933637363609 0.10353933656145922
45 0.09204964741053696 0.09204964741054301
46 8.19261056164571e-10 0.0
47 8.523796584076095e-07 8.218635459624336e-07
48 0.1051316158248435 0.10513161293237559
49 0.07416506824467556 0.07416506824466002
50 0.012944862817693177 0.012944867345093258
51 0.48220646477596524 0.48220646544989193
52 0.42052087653681697 0.42052087653681747
53 5.535501007487065e-08 0.0
54 2.7194799110210365e-16 0.0
55 0.0 0.0
56 2.760744472197097e-06 2.7529265174893867e-06
57 0.07416506824467556 0.07416506824466002
58 2.3357370178318246e-07 2.264780425067345e-07
59 0.45226821973180753 0.4522682194826431
60 2.486220486843771e-07 0.0
61 1.5700924586837752e-16 0.0

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I get:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

I got a different result:

33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
41 4.6884218468212213e-07 0.0
42 6.460441508916396e-06 0.0
56 2.760744472197097e-06 0.0
60 2.486220486843771e-07 0.0

Btw, I compared my local branch with the remote branch in origin (for both main and fix_precision) to make sure everything is the same. I used the instructions provided in this Stack Overflow post. Everything is okay.

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

I got a different result:

I have a few observations:

  1. For testing stump, should we really be setting row_wise=True in naive.stump? Since stump is diagonal-wise, shouldn't we set row_wise=False? This doesn't solve the problem, but it is inconsistent.
  2. 12 2.950287948697726e-05 0.0 means that naive.stump is returning a non-zero distance for idx = 12 in Github Actions (and in my local branch of your repo). Can you verify what the pairwise distances are for idx = 12? Here are my pairwise distances generated from naive.stump:
0 3.0000184482883
1 3.689691694550767e-05
2 2.999981187305404
3 3.0000188123406852
4 3.807877925156031e-05
5 0.9716146011928827
6 3.365385804112601
7 3.00001930941092
8 3.0000184482883
9 3.689691694550766e-05
10 2.7130038423554645
11 1.293751201679764
12 0.0
13 2.9999872380440378
14 3.0000368085323212
15 1.9609012370025143
16 2.999981551371355
17 2.644560505532764
18 0.4142995718310496
19 3.5963320995608045e-05
20 2.9999819288490794
21 3.0000186910572095
22 3.743336396799641e-05
23 2.9999807857662883
24 3.0000192138645323
25 3.0000184482883
26 3.689691694550767e-05
27 3.660682091553413e-05
28 2.9999816124248677
29 3.0000181726311808
30 0.6555850486723159
31 2.8129002184921874
32 3.373545663525392
33 3.9431652231321966e-05
34 2.9999799194833465
35 3.0000200801134347
36 2.979568418597036
37 3.686034742708289e-05
38 2.999981569656451
39 1.735512606848563
40 2.999981270234494
41 3.000017938214321
42 2.950287948697726e-05
43 0.3834710886174519
44 0.7569603930026928
45 3.342379253523891
46 3.000018690647586
47 4.028403188967316e-05
48 2.8729632528630487
49 0.27224456911612394
50 2.7210340045887635
51 3.2311659747454886
52 0.7619746520574434
53 2.9999819565270847
54 3.0000184482882997
55 3.689691694550767e-05
56 2.9999785390824387
57 0.3461178433786073
58 2.9999820733160356
59 2.316949684170337
60 2.9999817367381696
61 3.0000184482883

I generated this by adding the following three lines to naive.stump:

def stump(T_A, m, T_B=None, exclusion_zone=None, row_wise=False):
    if T_B is None:  # self-join:
        ignore_trivial = True
        distance_matrix = np.array(
            [distance_profile(Q, T_A, m) for Q in core.rolling_window(T_A, m)]
        )
        T_B = T_A.copy()
    else:
        ignore_trivial = False
        distance_matrix = np.array(
            [distance_profile(Q, T_B, m) for Q in core.rolling_window(T_A, m)]
        )

    distance_matrix[np.isnan(distance_matrix)] = np.inf

    # Added these three lines
    if distance_matrix.shape[0] > 12:  # row 12 must exist
        for i in range(distance_matrix.shape[1]):
            print(i, distance_matrix[12, i])

According to this, the subsequence at idx = 42 is the one-nearest neighbor of idx = 12.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

  1. For testing stump, should we really be setting row_wise=True in naive.stump? Since stump is diagonal-wise, shouldn't we set row_wise=False? This doesn't solve the problem, but it is inconsistent.

I think we should. Btw, I tried row_wise=False before, and then I realized we can set row_wise=True IF all we care about is the matrix profile (and not the matrix profile indices). In that case, row_wise=True should be okay. However, as you said, we had better set row_wise=False for the sake of consistency.

PLEASE IGNORE THE FOLLOWING COMMENT

2. Here are my pairwise distances generated from naive.stump:

from naive.stump or stumpy.stump? Because, the value you provided for idx=12 is 0.0, which is not the same as what you provided earlier: 12 2.950287948697726e-05 0.0 (I think the last value here, i.e. 0.0, is from stumpy.stump)

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.


UPDATE:
Okay, I think you set row_wise=False now. Right?

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

from naive.stump or stumpy.stump? Because, the value you provided for idx=12 is 0.0, which is not the same as what you provided earlier: 12 2.950287948697726e-05 0.0 (I think the last value here, i.e. 0.0, is from stumpy.stump)

From naive.stump. If you look at my last output, you should NOT be looking at 12 0.0, which is the self-match/trivial match. Instead, you should be looking at 42 2.950287948697726e-05

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.

This suggests that stumpy.stump is different and not returning 0.0 for some reason

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

From naive.stump. If you look at my last output, you should NOT be looking at 12 0.0, which is the self-match/trivial match. Instead, you should be looking at 42 2.950287948697726e-05

Just got that! You are right!

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.

This suggests that stumpy.stump is different and not returning 0.0 for some reason

That is probably correct. I am going to do what you suggested... in the meantime, could you please remove the @njit decorator from stump and provide me with the results of:

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    if np.abs(ref - cmp) > 1e-7:
        print(i, ref, cmp)

?

Maybe numba performs differently on different platforms... Btw, the code in this branch (in my local repo) still fails for seed=26!


Do you feel we are putting too much energy into this? I mean... how do you feel about it? :)

@NimaSarajpoor
Collaborator Author

2. Can you verify what the pairwise distances are for idx = 12? Here are my pairwise distances generated from naive.stump:

0 3.0000184482883
1 3.689691694550767e-05
[... identical to the full list of pairwise distances quoted above ...]
60 2.9999817367381696
61 3.0000184482883

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

in the meantime, could you please remove the @njit decorator from stump and provide me with the results of:

After commenting out @njit for _stump and _compute_diagonal, I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

Do you feel we are putting too much energy on this? I mean....how do you feel about it? :)

I do not feel this way at all! This shouldn't be happening and it would be good to track it down

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

The good thing is that we've isolated it to the idx=12 and idx=42 pair. Somehow, your stump is computing 2.950306866348452e-05, which is very close to the answer from naive.stump. Note that we are seeing the same test failures in Github Actions, but all of them are only failing (first) for Ubuntu and Mac OS. Can you check if it fails for Windows in Github Actions (i.e., does it get past test_stump.py for Windows)?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

The good thing is that we've isolated it to the idx=12 and idx=42 pair. Somehow, your stump is computing 2.950306866348452e-05, which is very close to the answer from naive.stump. Note that we are seeing the same test failures in Github Actions, but all of them are only failing (first) for Ubuntu and Mac OS.

Can you check if it fails for Windows in Github Actions (i.e., does it get past test_stump.py for Windows)?

I think it passes

============================= test session starts =============================
platform win32 -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0
rootdir: D:\a\stumpy\stumpy, configfile: pytest.ini
collected 87 items

tests\test_stump.py ...........................                          [ 31%]
tests\test_mstump.py ............................................        [ 81%]
tests\test_stumpi.py ................                                    [100%]

============================= 87 passed in 29.12s =============================

The operation got cancelled for the other versions of Python.

@NimaSarajpoor
Collaborator Author

in the meantime, could you please remove the @njit decorator from stump and provide me with the results of:

After commenting out @njit for _stump and _compute_diagonal, I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

And I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
47 8.523796584076095e-07 0.0
56 2.760744472197097e-06 0.0
60 2.486220486843771e-07 0.0

Now, without Numba, the values that correspond to idx=12 are the same as what you got. (Note that we STILL use numba to get the mean and std. So, MAYBE that is the reason behind the discrepancy between your result and mine.) But this suggests that something is happening in numba. Right?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

@seanlaw
It might be a good idea to create a new branch and have a tests directory that only contains test_stump.py with only the test function of interest. I think this can help us detect the problem more quickly and come up with an MWE if needed.

@codecov-commenter

codecov-commenter commented Sep 7, 2022

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (49651ff) compared to base (cd9c3ff).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #668   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11453    11538   +85     
=======================================
+ Hits        11441    11526   +85     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/scrump.py 100.00% <ø> (ø)
stumpy/config.py 100.00% <100.00%> (ø)
stumpy/core.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)
stumpy/stumped.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)
tests/test_core.py 100.00% <100.00%> (ø)
tests/test_stump.py 100.00% <100.00%> (ø)



@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I get:

12 2.950287948697726e-05 0.0

Could you please let me know what the nn of 12 is in your stumpy.stump? (I think you are using a Mac, right?)

Explanation
In naive, the nn is 42. In my local repo, I got 42 as well when I used stumpy.stump. So, I used the subseqs at indices 12 and 42 to create a new time series T that is used in the new test function that I pushed. However, it seems all tests are passing.

So, maybe the T that I used is not a good input to reproduce that error. Or, maybe 42 is NOT the nn of 12 on your end! That is why I asked if you could let me know the nn index of the subseq at index 12.

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

Could you please let me know what the nn of 12 is in your stumpy.stump? (I think you are using a Mac, right?)

When I do:

comp_mp[12]

I get:

array([0.0, 9, 9, 42], dtype=object)

So, stumpy.stump (with @njit ON) produces a left and global nearest neighbor of 9 while the right nearest neighbor is 42

When I print their z-norm subsequences, I get:

# idx = 9
stumpy.core.z_norm(T[9:9+m])
array([-0.70710678, -0.70710678,  1.41421356])

# idx = 12
stumpy.core.z_norm(T[12:12+m])
array([-0.70713287, -0.70708069,  1.41421356])

# idx = 42
stumpy.core.z_norm(T[42:42+m])
array([-0.70711201, -0.70710155,  1.41421356])

You can see that they are somewhat close. The last (3rd) element is identical but the first two elements are only slightly different. stumpy.stump is interpreting idx = 9 as being essentially identical to idx = 12
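
A quick way to quantify "essentially identical" (a sketch; T and m are from the reproduction snippet at the top of this thread):

import numpy as np
import stumpy

# z-normalized Euclidean distance from idx = 12 to each candidate neighbor
for j in (9, 42):
    d = np.linalg.norm(
        stumpy.core.z_norm(T[12 : 12 + m]) - stumpy.core.z_norm(T[j : j + m])
    )
    print(j, d)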

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 13, 2022

@seanlaw

We just got an error caused by slight imprecision in an identical-subsequence case in an about-to-be-merged PR (see error).

I was wondering if we should (at least partially) solve this, as this error appears from time to time in the merging process. I have a new idea but have not implemented it yet. The concept goes as follows:

if pearson > pearson_threshold:
    # S_A: z-norm of subseq in T_A
    # S_B: z-norm of subseq in T_B
    if np.all(S_A == S_B):
        pearson = 1.0

Do you think the np.all(...) check could be time consuming?

@seanlaw
Contributor

seanlaw commented Sep 13, 2022

Do you think the np.all(...) check could be time consuming?

Instead of np.all(S_A == S_B), I think you'd rather use np.allclose (see docs). The S_A == S_B comparison can be expensive because you are forced to always compare EVERY element of S_A with S_B. I'd be willing to bet that np.allclose short circuits and stops checking if the first elements are not equal, and is therefore faster.

Regarding the timing, I don't know how time consuming it would be and we should simply check with some different inputs. First:

T_random = np.random.rand(100_000)

then:

t = np.linspace(0, 50*(2*np.pi), 100_000)
T_sine = np.sin(t)  # Essentially, a perfect sine wave repeated 50 times so the check above is very costly

Finally, a mix of both:

percent = 0.1
T_random_sine = np.where(T_random < percent, T_random, T_sine)  # Essentially, insert random values only a `percent` of the time

Compare their performance with the check.
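
A micro-benchmark along these lines might do (a sketch; the subsequence pair and the window size m = 50 are arbitrary choices, and timings will vary by machine):

import timeit

import numpy as np

m = 50
for name, T in [("random", T_random), ("sine", T_sine), ("mixed", T_random_sine)]:
    S_A, S_B = T[:m], T[m : 2 * m]  # an arbitrary pair of subsequences
    t_eq = timeit.timeit(lambda: np.all(S_A == S_B), number=100_000)
    t_close = timeit.timeit(lambda: np.allclose(S_A, S_B), number=100_000)
    print(name, t_eq, t_close)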

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 9, 2022

then the cov_a[i + k] * cov_b[i] - cov_c[i + k] * cov_d[i] might be the problem. Loss of precision (significant digits) often comes from adding/subtracting numbers that are very different in magnitude and this looks like a (potentially) classic case. And then, there's the difference between adding cov to this number to get the updated cov.

I can see your point. I need to investigate this further. Btw, if you take a look at the cov_diff plot, in which the imprecision of cov is illustrated, you can see that the imprecision is on the order of 1e-11.

Also, note that in this case, the denom in the formula rho = cov / denom is 4.96261359e-10. So, we need to significantly improve the precision of cov. Even if we bring the imprecision of cov down to around 1e-14, we will still have around 2e-5 (i.e. 1e-14 / 4.96e-10) imprecision in the calculation of pearson.


Also, I had a slight miscalculation in the notebook. I fixed it. I also added a section to understand the relationship between denom and cov more easily.

I cannot guarantee that all of my calculations are flawless. I just wanted to share with you my thoughts. We might be able to use them at some point.

@seanlaw
Contributor

seanlaw commented Oct 9, 2022

@NimaSarajpoor For this T, can you tell me which i and j pair is producing the wrong pearson? It's actually really hard for me to follow/find the values that match your outputs from the inputs since you are combining values together. So, I can't reproduce it.

@seanlaw
Contributor

seanlaw commented Oct 9, 2022

Interesting, I wonder if this discussion might help. Additionally, this other conversation seems insightful as to how we should be thinking about these values.

@seanlaw
Contributor

seanlaw commented Oct 10, 2022

Btw @NimaSarajpoor, I'm starting to think it might make more sense to move these precision-related tests to test_precision.py. This will help us isolate/identify all of these important precision-related edge cases.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 10, 2022

@seanlaw
So, I was re-running the code again to make sure my outputs are correct. I am sharing my findings below:

  1. Input T and m
# m and zone
m = 3
zone = int(np.ceil(m / 4))

# T
seed = 2
np.random.seed(seed)
T = np.random.rand(64)
scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
T[:] = T * scale
  2. Plot of T
    [image: plot of T]

Note that the flat segments may not actually be flat! (I mean... their values might just be very, very small.)

  3. Imprecision in pearson (cov with np.dot vs cov with rolling approach)
    [image: imprecision in pearson for each pair of subsequences]

Note1: The length of the x-axis is the number of pairs. So, if it is confusing, just pay attention to the y-axis, which shows the imprecision in pearson for the pairs of subsequences.

Note2: In the stumpy/stump.py of this branch, we refine cov (by using np.dot) if std < 1.0. However, to get this imprecision, I stored the rolling cov and the exact cov (i.e. the cov obtained directly by np.dot), and then did:

cov_imprecision = cov_exact - cov_roll
pearson_imprecision = cov_imprecision / stddev

# note: pearson = cov / stddev, where stddev is the product of the std values of the two subseqs
  4. Pairs of subsequences, sorted in descending order by their pearson imprecision:
idx_A idx_B imprecision
49 55 0.030831
33 55 0.001694
33 49 0.001385
42 55 0.001367
42 49 0.001250
22 55 0.000808
22 33 0.000259
22 49 0.000206
33 42 0.000190
22 42 0.000014

Let's investigate the first pair idx = 49 and idx_nn = 55

NOTE: this pair is on the k=6 diagonal. So, the plots I am about to provide correspond to the pairs on this diagonal. For instance, the point at index 0 corresponds to the 0-th pair on this diagonal, which is the pair (subseq_0, subseq_6).

  5. Visualizing the subseq and its pair
    [image: the subsequences at idx = 49 and idx_nn = 55]

  6. The subseqs are:

# subseq at idx=49
array([0.00000000e+00, 0.00000000e+00, 8.20949223e-05])

# subseq at idx_nn = 55
array([2.72023659e-05, 0.00000000e+00, 0.00000000e+00])
  7. To better see what is going on, I tried to update cov via a one-step rolling approach. I mean... I updated cov(i + 1) as follows:
cov_updater = constant * (cov_a[i + k] * cov_b[i] - cov_c[i + k] * cov_d[i])
cov(i + 1) = cov_exact(i) + cov_updater

So, now I have three versions of cov:

  • cov_exact: obtained by using np.dot
  • cov_roll: obtained by using rolling approach
  • cov_roll_onestep: obtained by using rolling approach by using the exact previous cov

Now, I am interested in the error cov_diff (which is cov_exact - cov_roll) and cov_diff_onestep (which is cov_exact - cov_roll_onestep); the latter shows the error of the cov_updater term.

[image: cov_diff and cov_diff_onestep along the k=6 diagonal]

Note that the magnitude of the error at index 47 is around 0. At indices 48 and 49, the errors of cov_updater (blue curve) are around 0.5e-11. So, they are being accumulated and the error of cov_diff becomes around 1e-11.

Also, note that the errors of cov at indices 48 and 49 are of the same order. So, why do we get imprecision in pearson at index 49? Because of the stddev denominator in the pearson formula. Let's check out their values:

[image: stddev values along the k=6 diagonal]

As you can see, the stddev at index 48 is very large. However, the stddev at index 49 is very small (4.962613586298805e-10). Hence, its inverse can amplify the error in cov.

  8. Let's dig deeper and investigate the values in the cov_updater term.

Note: There is no imprecision in the sliding mean. I checked.

import numpy as np
import stumpy

# T and m come from the input defined in item 1 above
T_A = T.copy()
T_B = T_A.copy()

μ_Q_m_1 = np.mean(stumpy.core.rolling_window(T_A, m-1), axis=1)
M_T_m_1 = np.mean(stumpy.core.rolling_window(T_B, m-1), axis=1)

std_vec_A = np.std(stumpy.core.rolling_window(T_A, m-1), axis=1)
std_vec_B = np.std(stumpy.core.rolling_window(T_B, m-1), axis=1)
#################
cov_a = T_B[m - 1 :] - M_T_m_1[:-1]
cov_b = T_A[m - 1 :] - μ_Q_m_1[:-1]

# The next lines are equivalent and left for reference
# cov_c = np.roll(T_A, 1)
# cov_c = cov_c[:M_T_m_1.shape[0]] - M_T_m_1[:]
cov_c_L = np.empty(M_T_m_1.shape[0], dtype=np.float64)
cov_c_L[1:] = T_B[: M_T_m_1.shape[0] - 1]
cov_c_L[0] = T_B[-1]
cov_c = cov_c_L - M_T_m_1


# The next lines are equivalent and left for reference
# cov_d = np.roll(T_B, 1)
# cov_d = cov_d[:μ_Q_m_1.shape[0]] - μ_Q_m_1[:]
cov_d_L = np.empty(μ_Q_m_1.shape[0], dtype=np.float64)
cov_d_L[1:] = T_A[: μ_Q_m_1.shape[0] - 1]
cov_d_L[0] = T_A[-1]
cov_d = cov_d_L - μ_Q_m_1

######################

def investigate_cov_updater(cov_a, cov_b, cov_c, cov_d, idx_A, idx_B, m=3):
    m_inverse = 1.0 / m
    constant = (m - 1) * m_inverse * m_inverse
    
    cov_updater = constant * (
                    cov_a[idx_B] * cov_b[idx_A] - cov_c[idx_B] * cov_d[idx_A]
                )
    
    print('cov_a: ', cov_a[idx_B])
    print('cov_b: ', cov_b[idx_A])
    print('cov_c: ', cov_c[idx_B])
    print('cov_d: ', cov_d[idx_A])
    
    print('\n')
    print('cov_updater: ', cov_updater)
    
    print('-----------------------------------------')
    
    print('cov_a L: ', T_B[m - 1 :][idx_B])
    print('cov_a R: ', M_T_m_1[:-1][idx_B])
    print('\n')
    print('cov_b L: ', T_A[m - 1 :][idx_A])
    print('cov_b R: ', μ_Q_m_1[:-1][idx_A])
    print('\n')
    print('cov_c L: ', cov_c_L[idx_B])
    print('cov_c R: ', M_T_m_1[idx_B])
    print('\n')
    print('cov_d L: ', cov_d_L[idx_A])
    print('cov_d R: ', μ_Q_m_1[idx_A])
    
    print('-----------------------------------------')
    
    print('std_vec_A in idx_A: ', std_vec_A[idx_A])
    print('std_vec_B in idx_B: ', std_vec_B[idx_B])

And, for idx_A = 49 (i.e. idx) and idx_B = 55 (i.e. idx_nn), let us see the values:

# outputs
# NOTE: The letters L and R denote the left and right terms.
# For example: cov_a = T_B[m - 1 :] - M_T_m_1[:-1], so the left term is `T_B[m - 1 :]` and the right one is `M_T_m_1[:-1]`

cov_a:  -1.3601182947424795e-05
cov_b:  8.20949222750248e-05
cov_c:  406.27502944676786
cov_d:  535.6041734976563


cov_updater:  -48356.133635460705
-----------------------------------------
cov_a L:  0.0
cov_a R:  1.3601182947424795e-05


cov_b L:  8.20949222750248e-05
cov_b R:  0.0


cov_c L:  406.2750430479508
cov_c R:  1.3601182947424795e-05


cov_d L:  535.6041734976563
cov_d R:  0.0
-----------------------------------------
std_vec_A in idx_A:  0.0
std_vec_B in idx_B:  1.3601182947424795e-05

Note that cov_a and cov_b are small values, so their product becomes very small. On the other hand, cov_c and cov_d are large, so their product becomes very large. Hence, subtracting a very small value from a very large one may create imprecision. But how can we check that? As I said before, the sliding mean has no imprecision, so it seems each individual term cov_a, cov_b, cov_c, and cov_d is precise by itself.
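
One way we could check it (a sketch): redo the two products and the subtraction in extended precision and compare with plain float64, using the values printed above. (Caveat: np.longdouble gives extra precision on Linux/Mac but is identical to float64 on Windows, where the residual would print as 0.0.)

import numpy as np

# values copied from the output above (idx_A = 49, idx_B = 55)
cov_a = -1.3601182947424795e-05
cov_b = 8.20949222750248e-05
cov_c = 406.27502944676786
cov_d = 535.6041734976563

f64 = np.float64(cov_a) * np.float64(cov_b) - np.float64(cov_c) * np.float64(cov_d)
f128 = (
    np.longdouble(cov_a) * np.longdouble(cov_b)
    - np.longdouble(cov_c) * np.longdouble(cov_d)
)
# any nonzero residual is rounding/cancellation error introduced by float64
print(np.float64(f128) - f64)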

Maybe we should think of a way to apply stddev throughout the updating process. The only difference between the direct approach and the rolling approach is that, in the former, we divide by std first and then do the multiplication. This can help with scaling values up (when they are very small) and scaling them down (when they are very large), and maybe that is why we get better precision with the direct approach.


I wonder if this discussion might help.

The log might help. We can probably try it on the cov_a * cov_b and cov_c * cov_d terms in cov_updater. Other than that, we only have sum operations (not multiplications) in the updating process of cov, and I am not sure how we can use log there.

So, we can try:

exp(log(cov_a) + log(cov_b))  # cov_a * cov_b

exp(log(cov_c) + log(cov_d))  # cov_c * cov_d

I think applying it on cov/stddev may not be that useful because the imprecision is in cov itself, and it is not caused by the division operation.
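
Also, if we do try the log route, note that cov_a and cov_c can be negative (see the values above), so we would need a sign-aware variant, something like this (hypothetical helper; zero inputs would need special handling):

import numpy as np

def signed_log_product(x, y):
    # computes x * y via exp(log|x| + log|y|), carrying the sign separately,
    # since log() is only defined for positive inputs
    sign = np.sign(x) * np.sign(y)
    return sign * np.exp(np.log(np.abs(x)) + np.log(np.abs(y)))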

Additionally, this other conversation seems insightful as to how we should be thinking about these values.

I took a quick look. I will read it more carefully to see if we can get some insight from it.

Sorry for the long comment :) I have been working on this and I thought it would be useful to share with you the things that I found

@seanlaw
Contributor

seanlaw commented Oct 10, 2022

Sorry for the long comment :) I have been working on this and I thought it would be useful to share with you the things that I found

Thank you for working on this @NimaSarajpoor! I don't know if any of my suggestions will help but it feels like we are making slow but steady progress towards understanding what the root of our problem is. I definitely don't know the "right" answer :)

@NimaSarajpoor
Collaborator Author

Thank you for working on this @NimaSarajpoor! I don't know if any of my suggestions will help but it feels like we are making slow but steady progress towards understanding what the root of our problem is. I definitely don't know the "right" answer :)

Me neither! I just hope that we get to a better point in terms of understanding the nature of the problem.

Btw @NimaSarajpoor, I'm starting to think it might make more sense to move these precision-related tests to test_precision.py. This will help us isolate/identify all of these important precision-related edge cases.

I think that would be a good idea, because (1) we can move forward with resolving other issues, and (2) as you said, it puts all of these edge cases together in one place.

@seanlaw
Contributor

seanlaw commented Oct 10, 2022

Also: https://www.soa.org/news-and-publications/newsletters/compact/2014/may/com-2014-iss51/losing-my-precision-tips-for-handling-tricky-floating-point-arithmetic/

Yes! Especially the section on adding/subtracting/multiplying/dividing. Note that I try to say "loss of precision" rather than "imprecision" as the former implies that precision is lost due to, say, combining numbers with different magnitudes (see numerical analysis). So, both numbers being combined can have high precision, but precision is lost due to floating point issues. I don't know how I'd use "imprecision" but, to me, it feels like somebody failed to record the initial number to enough significant digits and is, therefore, "imprecise". Maybe it's just me over-analyzing things and I could be completely "off".

@NimaSarajpoor
Collaborator Author

I think I get your point. In my head, "imprecision" might imply inaccuracy. So, "loss of precision" is better as it avoids creating such confusion.

@seanlaw
Contributor

seanlaw commented Oct 25, 2022

@NimaSarajpoor I've been thinking about this a little more. There are two reasons why I chose to use Welford to compute the stddev:

  1. For large time series, Welford is really, really fast for computing a rolling stddev
  2. For a large rolling window (2D) array, np.nanstd is basically unusable as it takes up a ton of memory, and so Welford becomes the only choice that balances speed and precision

Or so I thought... Then, I realized that there is another option that is:

  1. much more precise than Welford
  2. but that takes a bit more time to compute
  3. but that doesn't run out of memory for large arrays

It's nothing but simple parallelization via numba:

import stumpy
from stumpy import core
import numpy as np
from numba import njit, prange
import time
import numpy.testing as npt


@njit(parallel=True, fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def _rolling_nanstd(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = np.nanstd(a[i])

    return out


def rolling_nanstd(a, w):
    return _rolling_nanstd(core.rolling_window(a, w))

And then we can test with:

T = np.random.rand(100_000_000)  # large time series that fails with `np.nanstd`
m = 50

start = time.time()
rolling_std = rolling_nanstd(T, m)
print(time.time()-start)

start = time.time()
welford_std = core.rolling_nanstd(T, m)
print(time.time()-start)

# On my machine, it is 2.9x slower than Welford but the matrix profile compute time is already several order of magnitude larger

npt.assert_almost_equal(rolling_std, welford_std, decimal=12)

Seems to be good up to 12 decimal places but only tested on a handful of random time series. This may be the best way forward to ensure that we get the same precision as the naive version or, at least, we are guaranteed that the stddev will be consistent. What do you think?

To be 100% clear, I am saying that we replace rolling_nanstd with this version (i.e., avoid using Welford altogether when computing the stddev)

@NimaSarajpoor
Collaborator Author

Seems to be good up to 12 decimal places but only tested on a handful of random time series. This may be the best way forward to ensure that we get the same precision as the naive version or, at least, we are guaranteed that the stddev will be consistent. What do you think?

To be 100% clear, I am saying that we replace rolling_nanstd with this version (i.e., avoid using Welford altogether when computing the stddev)

This is a great idea! As you said, the computing time is mostly dominated by the matrix profile calculation, and sacrificing a little bit of time on the computation of std should be okay! If I find some time, I will implement and push it.

@seanlaw
Contributor

seanlaw commented Nov 5, 2022

@NimaSarajpoor Please pull the latest changes/commits into your branch

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 5, 2022

@seanlaw
Quick update:

It's nothing but simple parallelization via numba:

import stumpy
from stumpy import core
import numpy as np
from numba import njit, prange
import time
import numpy.testing as npt


@njit(parallel=True, fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def _rolling_nanstd(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = np.nanstd(a[i])

    return out


def rolling_nanstd(a, w):
    return _rolling_nanstd(core.rolling_window(a, w))

I used your approach but without core.rolling_window, as I thought it was not necessary to store all the subseqs. For a time series of length 100_000_000 and m=50, the computing time increases by 36%. I think that is good :) It was 1.67 sec and now it takes 2.27 sec.

@seanlaw
Contributor

seanlaw commented Dec 6, 2022

I used your approach but without core.rolling_window, as I thought it was not necessary to store all the subseqs

I don't think I understand what you did then. Technically, core.rolling_window returns a view and so we aren't actually storing any subsequences.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 6, 2022

Technically, core.rolling_window returns a view and so we aren't actually storing any subsequences.

Oh! Okay... I thought I had been saving some memory here :)

I don't think I understand what you did then

I didn't do anything new. I just do np.nanstd(T[i:i+m]) in each iteration of prange. I mean, I get each subsequence right when I need it instead of using core.rolling_window.
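
In other words, something like this (a sketch; the function name is hypothetical):

import numpy as np
from numba import njit, prange


@njit(parallel=True, fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def _rolling_nanstd_direct(T, m):
    # slice each subsequence inside the loop instead of passing in
    # a rolling-window view
    out = np.empty(T.shape[0] - m + 1)
    for i in prange(T.shape[0] - m + 1):
        out[i] = np.nanstd(T[i : i + m])
    return out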

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 6, 2022

Two notes:

(1) Avoiding core.rolling_window in the calculation of the "std" values can slightly reduce the computing time.

(2) Since we avoid the "rolling approach" and calculate the std for each subsequence directly, the computing time increases linearly with the window size m.

seed = 0
np.random.seed(seed)
n = 100_000_000
m = 1000

T = np.random.rand(n)

The existing approach (i.e. the "rolling approach") takes about 2 sec, but the "parallelizing approach" takes about 40 sec!


I measured the computing time for m in range(50, 501, 50) when len(T) == 10_000_000
[image: computing time vs. m for m in range(50, 501, 50), len(T) == 10_000_000]

I think this is still negligible compared to the computational load of matrix profile. What do you think @seanlaw ?

@seanlaw
Contributor

seanlaw commented Dec 6, 2022

The existing approach (i.e. the "rolling approach") takes about 2 sec, but the "parallelizing approach" takes about 40 sec!

40 seconds?? Did you mean 4 seconds? Is the "parallelization" approach still using _rolling_nanstd(core.rolling_window(a, w))?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 6, 2022

40 seconds?? Did you mean 4 seconds?

I meant 40 (forty) seconds. This is for my code where I avoided using core.rolling_window(a, w). If I use core.rolling_window(a, w), the computing time will be around 47 seconds.

I think my PC is slower than yours. But I think you should see the same jump if you run it on your end.

n = 100_000_000 # note that for the figure plotted in my previous comment, I used `n=10M`, NOT `100M`.
m = 1000 vs. 50

And, I think it makes sense. The time complexity is O(n·m) if we go with the "parallelizing approach", while the time complexity of the rolling approach is O(n).

@seanlaw
Contributor

seanlaw commented Dec 6, 2022

I think this is still negligible compared to the computational load of matrix profile. What do you think @seanlaw ?

Sure, it's probably fine

@seanlaw
Contributor

seanlaw commented Aug 24, 2023

@NimaSarajpoor Is this still needing more work?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Aug 24, 2023

@NimaSarajpoor Is this still needing more work?

I think so. A few days ago I tried to see whether the test functions in test_precision.py [of this PR] can now pass. IIRC I noticed some improvements, but some tests still fail.

Note 1:
This PR mostly tries to enhance precision for the cases where there are identical subsequences [in their z-normalized space]. [Note] When I look at the test functions in tests/test_precision.py [of this PR], I see test_snippet_fixed_seeds as a new change. So, we may have mixed things in this PR, i.e. handling both the snippet and the identical-subsequence cases. We can either pull the latest changes into this PR and change its title to reflect handling identical cases, OR we may just create a new PR that handles the loss of precision when there are identical subsequences.

Note 2:
I created identical subsequences by scaling subsequences with different values. So, for instance, I did:

T[i:i+m] = identical_subseq * 100
T[j:j+m] = identical_subseq * 0.01

Now, if we enhance the precision for the aforementioned case, does that mean there will be no issue for the following case?

T[i:i+m] = identical_subseq * 1000000
T[j:j+m] = identical_subseq * 0.000001

We can test it. Ideally, the distance between the two aforementioned subsequences should be exactly 0. [In this case, one may treat T[j:j+m] as a constant subseq. However, that is a different story, and that option is currently supported in the API of the current version, i.e. stumpy 1.12.0]
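
For example, a quick check might look like this (a sketch; the positions, the base pattern, and the seed are arbitrary):

import numpy as np
import stumpy

np.random.seed(0)
m = 3
T = np.random.rand(64)
identical_subseq = np.random.rand(m)
i, j = 10, 40  # arbitrary, non-overlapping locations
T[i : i + m] = identical_subseq * 1_000_000
T[j : j + m] = identical_subseq * 0.000_001

# after z-normalization the two subsequences should coincide exactly,
# so this distance should ideally be 0.0
d = np.linalg.norm(
    stumpy.core.z_norm(T[i : i + m]) - stumpy.core.z_norm(T[j : j + m])
)
print(d)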

@seanlaw
Contributor

seanlaw commented Aug 24, 2023

What do you think is the best thing to do?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Aug 24, 2023

@seanlaw

What do you think is the best thing to do?

I can see many code changes in this PR that should be reverted. Also, some part of the discussion is about using parallelized computation for the std instead of the Welford technique [which is already taken care of]. Thus, I think it is better to close this one, create a new one just for handling the loss of precision in cases with identical subsequences, and add new test functions to test_precision.py one by one.

@seanlaw
Contributor

seanlaw commented Aug 24, 2023

@seanlaw

What do you think is the best thing to do?

I can see many code changes in this PR that should be reverted. Also, some part of the discussion is about using parallelized computation for the std instead of the Welford technique [which is already taken care of]. Thus, I think it is better to close this one, create a new one just for handling the loss of precision in cases with identical subsequences, and add new test functions to test_precision.py one by one.

Sounds good. Let's do that!

@NimaSarajpoor NimaSarajpoor mentioned this pull request Aug 25, 2023
@NimaSarajpoor NimaSarajpoor deleted the fix_precision branch September 4, 2023 03:22