
Fix precision #668

Closed
wants to merge 25 commits into from

Conversation

NimaSarajpoor
Collaborator

This PR is a replacement for PR #657, which tackles issue #610.

Regarding PR #657: There was a discrepancy between the files in the local repo and the remote repo, but I couldn't figure out the cause. That is why I am submitting a new PR.

@NimaSarajpoor
Collaborator Author

@seanlaw
This is strange! Now, when I run the tests, it says "83 files would be left unchanged", which is the same as what we see here in GitHub Actions. And I get no error when I run the tests in my local repo.

$ ./test.sh custom 1 tests/test_stump.py
Cleaning Up
Checking Black Code Formatting
All done! ✨ 🍰 ✨
83 files would be left unchanged.
Checking Flake8 Style Guide Enforcement
Executing Custom User-Defined Tests Only
Custom Test: 1 / 1
============================= test session starts =============================
platform win32 -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\AS17983\Desktop\NIMA\stump\stumpy, configfile: pytest.ini
plugins: anyio-3.6.1, cython-0.2.0
collected 27 items

tests\test_stump.py ...........................                          [100%]

============================= 27 passed in 19.44s =============================

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

@NimaSarajpoor Instead of running test.sh, can you simply run this in a Jupyter notebook:

import numpy as np
import naive  # from the stumpy tests/ directory
from stumpy import stump

m = 3
zone = int(np.ceil(m / 4))
seed = 1
np.random.seed(seed)
T = np.random.rand(64)
scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
T[:] = T * scale

ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
comp_mp = stump(T, m, ignore_trivial=True)

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    if np.abs(ref - cmp) > 1e-7:
        print(i, ref, cmp)

I get:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

If I do:

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    print(i, ref, cmp)

Then I get the following:

0 1.5700924586837752e-16 0.0
1 1.5700924586837752e-16 0.0
2 1.658550663133749e-07 1.6323404237781945e-07
3 2.4257150105478956e-07 2.421152116249788e-07
4 6.454152836022342e-07 6.452392069879462e-07
5 0.22155833569278333 0.2215583356927852
6 0.09204964741053696 0.09204964741054301
7 1.9109645596859633e-07 1.8966081829902314e-07
8 1.5700924586837752e-16 0.0
9 2.220446049250313e-16 0.0
10 0.012944862817693177 0.012944867345093258
11 0.49025821651994234 0.4902582165199456
12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
14 3.3457789477422945e-05 3.3457703025089785e-05
15 0.26636729814247995 0.26636729659228936
16 3.656951842685141e-08 0.0
17 0.46944298605172013 0.4694429860517184
18 0.03103461052726588 0.031034609854201676
19 6.434999199613844e-07 0.0
20 5.535501007487065e-08 0.0
21 8.19261056164571e-10 0.0
22 5.364470225196175e-07 0.0
23 8.030629626643827e-07 8.021753331240875e-07
24 1.9109645596859633e-07 1.8966081829902314e-07
25 1.5700924586837752e-16 0.0
26 0.0 0.0
27 2.53526511563009e-07 2.0157917229039495e-07
28 8.553525901666268e-08 0.0
29 4.6884218468212213e-07 4.6814316726937914e-07
30 0.10353933637363609 0.10353933656145922
31 0.1051316158248435 0.10513161293237559
32 0.48220646477596524 0.48220646544989193
33 8.523796584076095e-07 0.0
34 1.7325318446717188e-06 1.7323200730701582e-06
35 1.5414353887031432e-06 0.0
36 0.04048989128516726 0.040489912232822924
37 3.656951842685141e-08 0.0
38 3.656951842685141e-08 0.0
39 0.26636729814247995 0.26636729659228936
40 1.658550663133749e-07 0.0
41 4.6884218468212213e-07 4.6814316726937914e-07
42 6.460441508916396e-06 6.458583395855192e-06
43 0.03103461052726588 0.031034609854201676
44 0.10353933637363609 0.10353933656145922
45 0.09204964741053696 0.09204964741054301
46 8.19261056164571e-10 0.0
47 8.523796584076095e-07 8.218635459624336e-07
48 0.1051316158248435 0.10513161293237559
49 0.07416506824467556 0.07416506824466002
50 0.012944862817693177 0.012944867345093258
51 0.48220646477596524 0.48220646544989193
52 0.42052087653681697 0.42052087653681747
53 5.535501007487065e-08 0.0
54 2.7194799110210365e-16 0.0
55 0.0 0.0
56 2.760744472197097e-06 2.7529265174893867e-06
57 0.07416506824467556 0.07416506824466002
58 2.3357370178318246e-07 2.264780425067345e-07
59 0.45226821973180753 0.4522682194826431
60 2.486220486843771e-07 0.0
61 1.5700924586837752e-16 0.0

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I get:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

I got a different result:

33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
41 4.6884218468212213e-07 0.0
42 6.460441508916396e-06 0.0
56 2.760744472197097e-06 0.0
60 2.486220486843771e-07 0.0

Btw, I compared my local branch with the remote branch in origin (for both main and fix_precision) to make sure everything is the same. I used the instructions provided in this Stack Overflow post. Everything is okay.

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

I got a different result:

I have a few observations:

  1. For testing stump, should we really be setting row_wise=True in naive.stump? Since stump is diagonal-wise, shouldn't we set row_wise=False? This doesn't solve the problem, but it is inconsistent.
  2. 12 2.950287948697726e-05 0.0 means that naive.stump is returning a non-zero distance for idx = 12 in Github Actions (and in my local branch of your repo). Can you verify what the pairwise distances are for idx = 12? Here are my pairwise distances generated from naive.stump:
0 3.0000184482883
1 3.689691694550767e-05
2 2.999981187305404
3 3.0000188123406852
4 3.807877925156031e-05
5 0.9716146011928827
6 3.365385804112601
7 3.00001930941092
8 3.0000184482883
9 3.689691694550766e-05
10 2.7130038423554645
11 1.293751201679764
12 0.0
13 2.9999872380440378
14 3.0000368085323212
15 1.9609012370025143
16 2.999981551371355
17 2.644560505532764
18 0.4142995718310496
19 3.5963320995608045e-05
20 2.9999819288490794
21 3.0000186910572095
22 3.743336396799641e-05
23 2.9999807857662883
24 3.0000192138645323
25 3.0000184482883
26 3.689691694550767e-05
27 3.660682091553413e-05
28 2.9999816124248677
29 3.0000181726311808
30 0.6555850486723159
31 2.8129002184921874
32 3.373545663525392
33 3.9431652231321966e-05
34 2.9999799194833465
35 3.0000200801134347
36 2.979568418597036
37 3.686034742708289e-05
38 2.999981569656451
39 1.735512606848563
40 2.999981270234494
41 3.000017938214321
42 2.950287948697726e-05
43 0.3834710886174519
44 0.7569603930026928
45 3.342379253523891
46 3.000018690647586
47 4.028403188967316e-05
48 2.8729632528630487
49 0.27224456911612394
50 2.7210340045887635
51 3.2311659747454886
52 0.7619746520574434
53 2.9999819565270847
54 3.0000184482882997
55 3.689691694550767e-05
56 2.9999785390824387
57 0.3461178433786073
58 2.9999820733160356
59 2.316949684170337
60 2.9999817367381696
61 3.0000184482883

I generated this by adding the following three lines to naive.stump:

def stump(T_A, m, T_B=None, exclusion_zone=None, row_wise=False):
    if T_B is None:  # self-join:
        ignore_trivial = True
        distance_matrix = np.array(
            [distance_profile(Q, T_A, m) for Q in core.rolling_window(T_A, m)]
        )
        T_B = T_A.copy()
    else:
        ignore_trivial = False
        distance_matrix = np.array(
            [distance_profile(Q, T_B, m) for Q in core.rolling_window(T_A, m)]
        )

    distance_matrix[np.isnan(distance_matrix)] = np.inf

    # Added these three lines
    if distance_matrix.shape[0] > 12:  # row 12 must exist
        for i in range(distance_matrix.shape[1]):
            print(i, distance_matrix[12, i])

According to this, the subsequence at idx = 42 is the one-nearest neighbor of idx = 12.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

  1. For testing stump, should we really be setting row_wise=True in naive.stump? Since stump is diagonal-wise, shouldn't we set row_wise=False? This doesn't solve the problem, but it is inconsistent.

I think we should. Btw, I tried row_wise=False before, and then I realized we can set row_wise=True IF all we care about is the matrix profile (and not the matrix profile indices). In that case, row_wise=True should be okay. However, as you said, we had better set row_wise=False for the sake of consistency.

PLEASE IGNORE THE FOLLOWING COMMENT

2. Here are my pairwise distances generated from naive.stump:

from naive.stump or stumpy.stump? Because, the value you provided for idx=12 is 0.0, which is not the same as what you provided earlier: 12 2.950287948697726e-05 0.0 (I think the last value here, i.e. 0.0, is from stumpy.stump)

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.


UPDATE:
Okay, I think you set row_wise=False now. Right?

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

from naive.stump or stumpy.stump? Because, the value you provided for idx=12 is 0.0, which is not the same as what you provided earlier: 12 2.950287948697726e-05 0.0 (I think the last value here, i.e. 0.0, is from stumpy.stump)

From naive.stump. If you look at my last output, you should NOT be looking at 12 0.0, which is the self-match/trivial match. Instead, you should be looking at 42 2.950287948697726e-05

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.

This suggests that stumpy.stump is different and not returning 0.0 for some reason

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

From naive.stump. If you look at my last output, you should NOT be looking at 12 0.0, which is the self-match/trivial match. Instead, you should be looking at 42 2.950287948697726e-05

Just got that! You are right!

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.

This suggests that stumpy.stump is different and not returning 0.0 for some reason

That is probably correct. I am going to do what you suggested... in the meantime, could you please remove the @njit decorator from stump and provide me with the results of:

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    if np.abs(ref - cmp) > 1e-7:
        print(i, ref, cmp)

?

Maybe numba performs differently on different platforms... Btw, the code in this branch (in my local repo) still fails for seed=26!


Do you feel we are putting too much energy into this? I mean... how do you feel about it? :)

@NimaSarajpoor
Collaborator Author

2. Can you verify what the pairwise distances are for idx = 12? Here are my pairwise distances generated from naive.stump:

0 3.0000184482883
1 3.689691694550767e-05
[... identical to the full list of pairwise distances quoted above ...]
60 2.9999817367381696
61 3.0000184482883

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

in the meantime, could you please remove the @njit decorator from stump and provide me with the results of:

After commenting out @njit for _stump and _compute_diagonal, I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

Do you feel we are putting too much energy on this? I mean....how do you feel about it? :)

I do not feel this way at all! This shouldn't be happening and it would be good to track it down

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

The good thing is that we've isolated it to the idx=12 and idx=42 pair. Somehow, your stump is computing 2.950306866348452e-05, which is very close to the answer from naive.stump. Note that we are seeing the same test failures in Github Actions, but all of them are only failing (first) for Ubuntu and Mac OS. Can you check if it fails for Windows in Github Actions (i.e., does it get past test_stump.py for Windows)?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

The good thing is that we've isolated it to the idx=12 and idx=42 pair. Somehow, your stump is computing 2.950306866348452e-05, which is very close to the answer from naive.stump. Note that we are seeing the same test failures in Github Actions, but all of them are only failing (first) for Ubuntu and Mac OS.

Can you check if it fails for Windows in Github Actions (i.e., does it get past test_stump.py for Windows)?

I think it passes

============================= test session starts =============================
platform win32 -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0
rootdir: D:\a\stumpy\stumpy, configfile: pytest.ini
collected 87 items

tests\test_stump.py ...........................                          [ 31%]
tests\test_mstump.py ............................................        [ 81%]
tests\test_stumpi.py ................                                    [100%]

============================= 87 passed in 29.12s =============================

The operation got cancelled for the other versions of Python.

@NimaSarajpoor
Collaborator Author

in the meantime, could you please remove the @njit decorator from stump and provide me with the results of:

After commenting out @njit for _stump and _compute_diagonal, I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

And I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
47 8.523796584076095e-07 0.0
56 2.760744472197097e-06 0.0
60 2.486220486843771e-07 0.0

Now, without Numba, the values that correspond to idx=12 are the same as what you got. (Note that we STILL use numba to get the mean and std. So, MAYBE that is the reason behind the discrepancy between your result and mine.) But this suggests that something is happening in numba. Right?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

@seanlaw
It might be a good idea to create a new branch and have a tests directory that only contains test_stump.py with only the test function of interest. I think this can help us detect the problem more quickly and come up with an MWE if needed.

@codecov-commenter

codecov-commenter commented Sep 7, 2022

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (49651ff) compared to base (cd9c3ff).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #668   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11453    11538   +85     
=======================================
+ Hits        11441    11526   +85     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/scrump.py 100.00% <ø> (ø)
stumpy/config.py 100.00% <100.00%> (ø)
stumpy/core.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)
stumpy/stumped.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)
tests/test_core.py 100.00% <100.00%> (ø)
tests/test_stump.py 100.00% <100.00%> (ø)



@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 7, 2022

I get:

12 2.950287948697726e-05 0.0

Could you please let me know what the nn of 12 is in your stumpy.stump? (I think you are using a Mac, right?)

Explanation
In naive, the nn is 42. In my local repo, I got 42 as well when I used stumpy.stump. So, I used the subseqs at indices 12 and 42 to create a new time series T that is used in the new test function that I pushed. However, it seems all tests are passing.

So, maybe the T that I used is not a good input to reproduce that error. Or, maybe 42 is NOT the nn of 12 on your end! That is why I asked if you could let me know the nn index of the subseq at index 12.

@seanlaw
Contributor

seanlaw commented Sep 7, 2022

Could you please let me know what the nn of 12 is in your stumpy.stump? (I think you are using a Mac, right?)

When I do:

comp_mp[12]

I get:

array([0.0, 9, 9, 42], dtype=object)

So, stumpy.stump (with @njit ON) produces a left and global nearest neighbor of 9 while the right nearest neighbor is 42

When I print their z-norm subsequences, I get:

# idx = 9
stumpy.core.z_norm(T[9:9+m])
array([-0.70710678, -0.70710678,  1.41421356])

# idx = 12
stumpy.core.z_norm(T[12:12+m])
array([-0.70713287, -0.70708069,  1.41421356])

# idx = 42
stumpy.core.z_norm(T[42:42+m])
array([-0.70711201, -0.70710155,  1.41421356])

You can see that they are somewhat close. The last (3rd) element is identical but the first two elements are only slightly different. stumpy.stump is interpreting idx = 9 as being essentially identical to idx = 12
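
A quick way to quantify "essentially identical" (a sketch; T and m are from the reproduction snippet at the top of this thread):

import numpy as np
import stumpy

# z-normalized Euclidean distance from idx = 12 to each candidate neighbor
for j in (9, 42):
    d = np.linalg.norm(
        stumpy.core.z_norm(T[12 : 12 + m]) - stumpy.core.z_norm(T[j : j + m])
    )
    print(j, d)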

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Sep 13, 2022

@seanlaw

We just got an error caused by slight imprecision in an identical-subsequence case in an about-to-be-merged PR (see error).

I was wondering if we should (at least partially) solve this, as this error appears from time to time in the merging process. I have a new idea but have not implemented it yet. The concept goes as follows:

if pearson > pearson_threshold:
    # S_A: z-norm of subseq in T_A
    # S_B: z-norm of subseq in T_B
    if np.all(S_A == S_B):
        pearson = 1.0

Do you think the np.all(...) check could be time consuming?

@seanlaw
Contributor

seanlaw commented Sep 13, 2022

Do you think the np.all(...) check could be time consuming?

Instead of np.all(S_A == S_B), I think you'd rather use np.allclose (see docs). The S_A == S_B comparison can be expensive because you are forced to always compare EVERY element of S_A with S_B. I'd be willing to bet that np.allclose short circuits and stops checking if the first elements are not equal, and is therefore faster.

Regarding the timing, I don't know how time consuming it would be and we should simply check with some different inputs. First:

T_random = np.random.rand(100_000)

then:

t = np.linspace(0, 50*(2*np.pi), 100_000)
T_sine = np.sin(t)  # Essentially, a perfect sine wave repeated 50 times so the check above is very costly

Finally, a mix of both:

percent = 0.1
T_random_sine = np.where(T_random < percent, T_random, T_sine)  # Essentially, insert random values only a `percent` of the time

Compare their performance with the check.
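
A micro-benchmark along these lines might do (a sketch; the subsequence pair and the window size m = 50 are arbitrary choices, and timings will vary by machine):

import timeit

import numpy as np

m = 50
for name, T in [("random", T_random), ("sine", T_sine), ("mixed", T_random_sine)]:
    S_A, S_B = T[:m], T[m : 2 * m]  # an arbitrary pair of subsequences
    t_eq = timeit.timeit(lambda: np.all(S_A == S_B), number=100_000)
    t_close = timeit.timeit(lambda: np.allclose(S_A, S_B), number=100_000)
    print(name, t_eq, t_close)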

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 9, 2022

then the cov_a[i + k] * cov_b[i] - cov_c[i + k] * cov_d[i] might be the problem. Loss of precision (significant digits) often comes from adding/subtracting numbers that are very different in magnitude and this looks like a (potentially) classic case. And then, there's the difference between adding cov to this number to get the updated cov.

I can see your point. I need to investigate this further. Btw, if you take a look at the cov_diff plot, in which the imprecision of cov is illustrated, you can see that the imprecision is on the order of 1e-11.

Also, note that in this case, the denom in the formula rho = cov / denom is 4.96261359e-10. So, we need to significantly improve the precision of cov. Even if we bring the imprecision of cov down to around 1e-14, we will still have around 2e-5 (i.e. 1e-14 / 4.96e-10) imprecision in the calculation of pearson.


Also, I had a slight miscalculation in the notebook. I fixed it. I also added a section to understand the relationship between denom and cov more easily.

I cannot guarantee that all of my calculations are flawless. I just wanted to share with you my thoughts. We might be able to use them at some point.

@seanlaw
Contributor

seanlaw commented Oct 9, 2022

@NimaSarajpoor For this T, can you tell me which i and j pair is producing the wrong pearson? It's actually really hard for me to follow/find the values that match your outputs from the inputs since you are combining values together. So, I can't reproduce it.

@seanlaw
Contributor

seanlaw commented Oct 9, 2022

Interesting, I wonder if this discussion might help. Additionally, this other conversation seems insightful as to how we should be thinking about these values.

@seanlaw
Contributor

seanlaw commented Oct 10, 2022

Btw @NimaSarajpoor, I'm starting to think it might make more sense to move these precision-related tests to test_precision.py. This will help us isolate/identify all of these important precision-related edge cases.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 10, 2022

@seanlaw
So, I was re-running the code again to make sure my outputs are correct. I am sharing my findings below:

  1. Input T and m
# m and zone
m = 3
zone = int(np.ceil(m / 4))

# T
seed = 2
np.random.seed(seed)
T = np.random.rand(64)
scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
T[:] = T * scale
  2. Plot of T
    [image: plot of T]

Note that the flat segments may not actually be flat! (I mean... their values might just be very, very small.)

  3. Imprecision in pearson (cov with np.dot vs cov with rolling approach)
    [image: imprecision in pearson for each pair of subsequences]

Note1: The length of the x-axis is the number of pairs. So, if it is confusing, just pay attention to the y-axis, which shows the imprecision in pearson for the pairs of subsequences.

Note2: In the stumpy/stump.py of this branch, we refine cov (by using np.dot) if std < 1.0. However, to get this imprecision, I stored the rolling cov and the exact cov (i.e. the cov obtained directly by np.dot), and then did:

cov_imprecision = cov_exact - cov_roll
pearson_imprecision = cov_imprecision / stddev

# note: pearson = cov / stddev, where stddev is the product of the std values of the two subseqs
  4. Pairs of subsequences, sorted in descending order by their pearson imprecision:
idx_A idx_B imprecision
49 55 0.030831
33 55 0.001694
33 49 0.001385
42 55 0.001367
42 49 0.001250
22 55 0.000808
22 33 0.000259
22 49 0.000206
33 42 0.000190
22 42 0.000014

Let's investigate the first pair idx = 49 and idx_nn = 55

NOTE: this pair is on the k=6 diagonal. So, the plots I am about to provide correspond to the pairs on this diagonal. For instance, the point at index 0 corresponds to the 0-th pair on this diagonal, which is the pair (subseq_0, subseq_6).

  5. Visualizing the subseq and its pair
    [image: the subsequences at idx = 49 and idx_nn = 55]

  6. The subseqs are:

# subseq at idx=49
array([0.00000000e+00, 0.00000000e+00, 8.20949223e-05])

# subseq at idx_nn = 55
array([2.72023659e-05, 0.00000000e+00, 0.00000000e+00])
  7. To better see what is going on, I tried to update cov via a one-step rolling approach. I mean... I updated cov(i + 1) as follows:
cov_updater = constant * (cov_a[i + k] * cov_b[i] - cov_c[i + k] * cov_d[i])
cov(i + 1) = cov_exact(i) + cov_updater

So, now I have three versions of cov:

  • cov_exact: obtained by using np.dot
  • cov_roll: obtained by using rolling approach
  • cov_roll_onestep: obtained by using rolling approach by using the exact previous cov

Now, I am interested in the error cov_diff (which is cov_exact - cov_roll) and cov_diff_onestep (which is cov_exact - cov_roll_onestep); the latter shows the error of the cov_updater term.

[image: cov_diff and cov_diff_onestep along the k=6 diagonal]

Note that the magnitude of the error at index 47 is around 0. At indices 48 and 49, the errors of cov_updater (blue curve) are around 0.5e-11. So, they are being accumulated and the error of cov_diff becomes around 1e-11.

Also, note that the errors of cov at indices 48 and 49 are of the same order. So, why do we get imprecision in pearson at index 49? Because of the stddev denominator in the pearson formula. Let's check out their values:

[image: stddev values along the k=6 diagonal]

As you can see, the stddev at index 48 is very large. However, the stddev at index 49 is very small (4.962613586298805e-10). Hence, its inverse can amplify the error in cov.

  8. Let's dig deeper and investigate the values in the cov_updater term.

Note: There is no imprecision in the sliding mean. I checked.

import numpy as np
import stumpy

# T and m come from the input defined in item 1 above
T_A = T.copy()
T_B = T_A.copy()

μ_Q_m_1 = np.mean(stumpy.core.rolling_window(T_A, m-1), axis=1)
M_T_m_1 = np.mean(stumpy.core.rolling_window(T_B, m-1), axis=1)

std_vec_A = np.std(stumpy.core.rolling_window(T_A, m-1), axis=1)
std_vec_B = np.std(stumpy.core.rolling_window(T_B, m-1), axis=1)
#################
cov_a = T_B[m - 1 :] - M_T_m_1[:-1]
cov_b = T_A[m - 1 :] - μ_Q_m_1[:-1]

# The next lines are equivalent and left for reference
# cov_c = np.roll(T_A, 1)
# cov_c = cov_c[:M_T_m_1.shape[0]] - M_T_m_1[:]
cov_c_L = np.empty(M_T_m_1.shape[0], dtype=np.float64)
cov_c_L[1:] = T_B[: M_T_m_1.shape[0] - 1]
cov_c_L[0] = T_B[-1]
cov_c = cov_c_L - M_T_m_1


# The next lines are equivalent and left for reference
# cov_d = np.roll(T_B, 1)
# cov_d = cov_d[:μ_Q_m_1.shape[0]] - μ_Q_m_1[:]
cov_d_L = np.empty(μ_Q_m_1.shape[0], dtype=np.float64)
cov_d_L[1:] = T_A[: μ_Q_m_1.shape[0] - 1]
cov_d_L[0] = T_A[-1]
cov_d = cov_d_L - μ_Q_m_1

######################

def investigate_cov_updater(cov_a, cov_b, cov_c, cov_d, idx_A, idx_B, m=3):
    m_inverse = 1.0 / m
    constant = (m - 1) * m_inverse * m_inverse
    
    cov_updater = constant * (
                    cov_a[idx_B] * cov_b[idx_A] - cov_c[idx_B] * cov_d[idx_A]
                )
    
    print('cov_a: ', cov_a[idx_B])
    print('cov_b: ', cov_b[idx_A])
    print('cov_c: ', cov_c[idx_B])
    print('cov_d: ', cov_d[idx_A])
    
    print('\n')
    print('cov_updater: ', cov_updater)
    
    print('-----------------------------------------')
    
    print('cov_a L: ', T_B[m - 1 :][idx_B])
    print('cov_a R: ', M_T_m_1[:-1][idx_B])
    print('\n')
    print('cov_b L: ', T_A[m - 1 :][idx_A])
    print('cov_b R: ', μ_Q_m_1[:-1][idx_A])
    print('\n')
    print('cov_c L: ', cov_c_L[idx_B])
    print('cov_c R: ', M_T_m_1[idx_B])
    print('\n')
    print('cov_d L: ', cov_d_L[idx_A])
    print('cov_d R: ', μ_Q_m_1[idx_A])
    
    print('-----------------------------------------')
    
    print('std_vec_A in idx_A: ', std_vec_A[idx_A])
    print('std_vec_B in idx_B: ', std_vec_B[idx_B])

And, for idx_A = 49 (i.e. idx) and idx_B = 55 (i.e. idx_nn), let us see the values:

# outputs
# NOTE: The letters L and R denote the left and right terms.
# For example: cov_a = T_B[m - 1 :] - M_T_m_1[:-1], so the left term is `T_B[m - 1 :]` and the right one is `M_T_m_1[:-1]`

cov_a:  -1.3601182947424795e-05
cov_b:  8.20949222750248e-05
cov_c:  406.27502944676786
cov_d:  535.6041734976563


cov_updater:  -48356.133635460705
-----------------------------------------
cov_a L:  0.0
cov_a R:  1.3601182947424795e-05


cov_b L:  8.20949222750248e-05
cov_b R:  0.0


cov_c L:  406.2750430479508
cov_c R:  1.3601182947424795e-05


cov_d L:  535.6041734976563
cov_d R:  0.0
-----------------------------------------
std_vec_A in idx_A:  0.0
std_vec_B in idx_B:  1.3601182947424795e-05

Note that cov_a and cov_b are small values, so their product becomes very small. On the other hand, cov_c and cov_d are large, so their product becomes very large. Hence, subtracting a very small value from a very large one may create imprecision. But how can we check that? As I said before, the sliding mean has no imprecision, so it seems each individual term cov_a, cov_b, cov_c, and cov_d is precise by itself.
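
One way we could check it (a sketch): redo the two products and the subtraction in extended precision and compare with plain float64, using the values printed above. (Caveat: np.longdouble gives extra precision on Linux/Mac but is identical to float64 on Windows, where the residual would print as 0.0.)

import numpy as np

# values copied from the output above (idx_A = 49, idx_B = 55)
cov_a = -1.3601182947424795e-05
cov_b = 8.20949222750248e-05
cov_c = 406.27502944676786
cov_d = 535.6041734976563

f64 = np.float64(cov_a) * np.float64(cov_b) - np.float64(cov_c) * np.float64(cov_d)
f128 = (
    np.longdouble(cov_a) * np.longdouble(cov_b)
    - np.longdouble(cov_c) * np.longdouble(cov_d)
)
# any nonzero residual is rounding/cancellation error introduced by float64
print(np.float64(f128) - f64)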

Maybe we should think of a way to apply stddev throughout the updating process. The only difference between the direct approach and the rolling approach is that, in the former, we divide by std first and then do the multiplication. This can help with scaling values up (when they are very small) and scaling them down (when they are very large), and maybe that is why we get better precision with the direct approach.


I wonder if this discussion might help.

The log might help. We can probably try it on the cov_a * cov_b and cov_c * cov_d terms in cov_updater. Other than that, we only have sum operations (not multiplications) in the updating process of cov, and I am not sure how we can use log there.

So, we can try:

exp(log(cov_a) + log(cov_b))  # cov_a * cov_b

exp(log(cov_c) + log(cov_d))  # cov_c * cov_d

I think applying it on cov/stddev may not be that useful because the imprecision is in cov itself, and it is not caused by the division operation.
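
Also, if we do try the log route, note that cov_a and cov_c can be negative (see the values above), so we would need a sign-aware variant, something like this (hypothetical helper; zero inputs would need special handling):

import numpy as np

def signed_log_product(x, y):
    # computes x * y via exp(log|x| + log|y|), carrying the sign separately,
    # since log() is only defined for positive inputs
    sign = np.sign(x) * np.sign(y)
    return sign * np.exp(np.log(np.abs(x)) + np.log(np.abs(y)))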

Additionally, this other conversation seems insightful as to how we should be thinking about these values.

I took a quick look. I will read it more carefully to see if we can get some insight from it.

Sorry for the long comment :) I have been working on this and I thought it would be useful to share with you the things that I found

@seanlaw
Contributor

seanlaw commented Oct 10, 2022

Sorry for the long comment :) I have been working on this and I thought it would be useful to share with you the things that I found

Thank you for working on this @NimaSarajpoor! I don't know if any of my suggestions will help but it feels like we are making slow but steady progress towards understanding what the root of our problem is. I definitely don't know the "right" answer :)

@NimaSarajpoor
Collaborator Author

Thank you for working on this @NimaSarajpoor! I don't know if any of my suggestions will help but it feels like we are making slow but steady progress towards understanding what the root of our problem is. I definitely don't know the "right" answer :)

Me neither! I just hope that we get to a better point in terms of understanding the nature of the problem.

Btw @NimaSarajpoor, I'm starting to think it might make more sense to move these precision-related tests to test_precision.py. This will help us isolate/identify all of these important precision-related edge cases.

I think that would be a good idea, because (1) we can move forward with resolving other issues, and (2) as you said, it puts all of these edge cases together in one place.

@seanlaw
Contributor

seanlaw commented Oct 10, 2022

Also: https://www.soa.org/news-and-publications/newsletters/compact/2014/may/com-2014-iss51/losing-my-precision-tips-for-handling-tricky-floating-point-arithmetic/

Yes! Especially the section on adding/subtracting/multiplying/dividing. Note that I try to say "loss of precision" rather than "imprecision" as the former implies that precision is lost due to, say, combining numbers with different magnitudes (see numerical analysis). So, both numbers being combined can have high precision, but precision is lost due to floating point issues. I don't know how I'd use "imprecision" but, to me, it feels like somebody failed to record the initial number to enough significant digits and is, therefore, "imprecise". Maybe it's just me over-analyzing things and I could be completely "off".

@NimaSarajpoor
Collaborator Author

I think I get your point. In my head, "imprecision" might imply inaccuracy. So, "loss of precision" is better as it avoids creating such confusion.

@seanlaw
Contributor

seanlaw commented Oct 25, 2022

@NimaSarajpoor I've been thinking about this a little more. There are two reasons why I chose to use Welford to compute the stddev:

  1. For large time series, Welford is really, really fast for computing a rolling stddev
  2. For a large rolling window (2D) array, np.nanstd is basically unusable as it takes up a ton of memory, and so Welford becomes the only choice that balances speed and precision

Or so I thought... Then, I realized that there is another option that is:

  1. much more precise than Welford
  2. but that takes a bit more time to compute
  3. but that doesn't run out of memory for large arrays

It's nothing but simple parallelization via numba:

import stumpy
from stumpy import core
import numpy as np
from numba import njit, prange
import time
import numpy.testing as npt


@njit(parallel=True, fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def _rolling_nanstd(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = np.nanstd(a[i])

    return out


def rolling_nanstd(a, w):
    return _rolling_nanstd(core.rolling_window(a, w))

And then we can test with:

T = np.random.rand(100_000_000)  # large time series that fails with `np.nanstd`
m = 50

start = time.time()
rolling_std = rolling_nanstd(T, m)
print(time.time()-start)

start = time.time()
welford_std = core.rolling_nanstd(T, m)
print(time.time()-start)

# On my machine, it is 2.9x slower than Welford but the matrix profile compute time is already several order of magnitude larger

npt.assert_almost_equal(rolling_std, welford_std, decimal=12)

Seems to be good up to 12 decimal places but only tested on a handful of random time series. This may be the best way forward to ensure that we get the same precision as the naive version or, at least, we are guaranteed that the stddev will be consistent. What do you think?

To be 100% clear, I am saying that we replace rolling_nanstd with this version (i.e., avoid using Welford altogether when computing the stddev)

@NimaSarajpoor
Collaborator Author

Seems to be good up to 12 decimal places but only tested on a handful of random time series. This may be the best way forward to ensure that we get the same precision as the naive version or, at least, we are guaranteed that the stddev will be consistent. What do you think?

To be 100% clear, I am saying that we replace rolling_nanstd with this version (i.e., avoid using Welford altogether when computing the stddev)

This is a great idea! As you said, the computing time is mostly dominated by the matrix profile calculation, and sacrificing a little bit of time on the computation of std should be okay! If I find some time, I will implement and push it.

@seanlaw
Contributor

seanlaw commented Nov 5, 2022

@NimaSarajpoor Please pull the latest changes/commits into your branch

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 5, 2022

@seanlaw
Quick update:

It's nothing but simple parallelization via numba:

import stumpy
from stumpy import core
import numpy as np
from numba import njit, prange
import time
import numpy.testing as npt


@njit(parallel=True, fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def _rolling_nanstd(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = np.nanstd(a[i])

    return out


def rolling_nanstd(a, w):
    return _rolling_nanstd(core.rolling_window(a, w))

I used your approach but without core.rolling_window, as I thought it was not necessary to store all the subseqs. For a time series of length 100_000_000 and m=50, the computing time increases by 36%. I think that is good :) It was 1.67 sec and now it takes 2.27 sec.

@seanlaw
Contributor

seanlaw commented Dec 6, 2022

I used your approach but without core.rolling_window, as I thought it was not necessary to store all the subseqs

I don't think I understand what you did then. Technically, core.rolling_window returns a view and so we aren't actually storing any subsequences.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 6, 2022

Technically, core.rolling_window returns a view and so we aren't actually storing any subsequences.

Oh! Okay... I thought I had been saving some memory here :)

I don't think I understand what you did then

I didn't do anything new. I just do np.nanstd(T[i:i+m]) in each iteration of prange. I mean, I get each subsequence right when I need it instead of using core.rolling_window.
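
In other words, something like this (a sketch; the function name is hypothetical):

import numpy as np
from numba import njit, prange


@njit(parallel=True, fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def _rolling_nanstd_direct(T, m):
    # slice each subsequence inside the loop instead of passing in
    # a rolling-window view
    out = np.empty(T.shape[0] - m + 1)
    for i in prange(T.shape[0] - m + 1):
        out[i] = np.nanstd(T[i : i + m])
    return out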

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 6, 2022

Two notes:

(1) Avoiding core.rolling_window in the calculation of the "std" values can slightly reduce the computing time.

(2) Since we avoid the "rolling approach" and calculate the std for each subsequence directly, the computing time increases linearly with the window size m.

seed = 0
np.random.seed(seed)
n = 100_000_000
m = 1000

T = np.random.rand(n)

The existing approach (i.e. the "rolling approach") takes about 2 sec, but the "parallelizing approach" takes about 40 sec!


I measured the computing time for m in range(50, 501, 50) when len(T) == 10_000_000
[image: computing time vs. m for m in range(50, 501, 50), len(T) == 10_000_000]

I think this is still negligible compared to the computational load of matrix profile. What do you think @seanlaw ?

@seanlaw
Contributor

seanlaw commented Dec 6, 2022

The existing approach (i.e. the "rolling approach") takes about 2 sec, but the "parallelizing approach" takes about 40 sec!

40 seconds?? Did you mean 4 seconds? Is the "parallelization" approach still using _rolling_nanstd(core.rolling_window(a, w))?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Dec 6, 2022

40 seconds?? Did you mean 4 seconds?

I meant 40 (forty) seconds. This is for my code where I avoided using core.rolling_window(a, w). If I use core.rolling_window(a, w), the computing time will be around 47 seconds.

I think my PC is slower than yours. But I think you should see the same jump if you run it on your end.

n = 100_000_000 # note that for the figure plotted in my previous comment, I used `n=10M`, NOT `100M`.
m = 1000 vs. 50

And, I think it makes sense. The time complexity is O(n·m) if we go with the "parallelizing approach", while the time complexity of the rolling approach is O(n).

@seanlaw
Contributor

seanlaw commented Dec 6, 2022

I think this is still negligible compared to the computational load of matrix profile. What do you think @seanlaw ?

Sure, it's probably fine

@seanlaw
Contributor

seanlaw commented Aug 24, 2023

@NimaSarajpoor Is this still needing more work?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Aug 24, 2023

@NimaSarajpoor Is this still needing more work?

I think so. A few days ago I tried to see whether the test functions in test_precision.py [of this PR] can now pass. IIRC I noticed some improvements, but some tests still fail.

Note 1:
This PR mostly tries to enhance precision for the cases where there are identical subsequences [in their z-normalized space]. [Note] When I look at the test functions in tests/test_precision.py [of this PR], I see test_snippet_fixed_seeds as a new change. So, we may have mixed things in this PR, i.e. handling both the snippet and the identical-subsequence cases. We can either pull the latest changes into this PR and change its title to reflect handling identical cases, OR we may just create a new PR that handles the loss of precision when there are identical subsequences.

Note 2:
I created identical subsequences by scaling subsequences with different values. So, for instance, I did:

T[i:i+m] = identical_subseq * 100
T[j:j+m] = identical_subseq * 0.01

Now, if we enhance the precision for the aforementioned case, does that mean there will be no issue for the following case?

T[i:i+m] = identical_subseq * 1000000
T[j:j+m] = identical_subseq * 0.000001

We can test it. Ideally, the distance between the two aforementioned subsequences should be exactly 0. [In this case, one may treat T[j:j+m] as a constant subseq. However, that is a different story, and that option is currently supported in the API of the current version, i.e. stumpy 1.12.0]
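
For example, a quick check might look like this (a sketch; the positions, the base pattern, and the seed are arbitrary):

import numpy as np
import stumpy

np.random.seed(0)
m = 3
T = np.random.rand(64)
identical_subseq = np.random.rand(m)
i, j = 10, 40  # arbitrary, non-overlapping locations
T[i : i + m] = identical_subseq * 1_000_000
T[j : j + m] = identical_subseq * 0.000_001

# after z-normalization the two subsequences should coincide exactly,
# so this distance should ideally be 0.0
d = np.linalg.norm(
    stumpy.core.z_norm(T[i : i + m]) - stumpy.core.z_norm(T[j : j + m])
)
print(d)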

@seanlaw
Contributor

seanlaw commented Aug 24, 2023

What do you think is the best thing to do?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Aug 24, 2023

@seanlaw

What do you think is the best thing to do?

I can see many code changes in this PR that should be reverted. Also, some part of the discussion is about using parallelized computation for the std instead of the Welford technique [which is already taken care of]. Thus, I think it is better to close this one, create a new one just for handling the loss of precision in cases with identical subsequences, and add new test functions to test_precision.py one by one.

@seanlaw
Contributor

seanlaw commented Aug 24, 2023

@seanlaw

What do you think is the best thing to do?

I can see many code changes in this PR that should be reverted. Also, some part of the discussion is about using parallelized computation for the std instead of the Welford technique [which is already taken care of]. Thus, I think it is better to close this one, create a new one just for handling the loss of precision in cases with identical subsequences, and add new test functions to test_precision.py one by one.

Sounds good. Let's do that!

@NimaSarajpoor NimaSarajpoor mentioned this pull request Aug 25, 2023
@NimaSarajpoor NimaSarajpoor deleted the fix_precision branch September 4, 2023 03:22