
Fixed #610 Imprecision in identical subsequence case #657

Closed
wants to merge 46 commits

Conversation

NimaSarajpoor (Collaborator)

This PR resolves issue #610 by adding a new config variable and resetting an already-calculated pearson value to 1.0 whenever it exceeds the threshold set by that config variable.

NimaSarajpoor (Collaborator, Author) commented Aug 2, 2022

So, this problem is more complicated than I thought :) I think I found an alternative approach, and it seems to be working. HOWEVER, it does not work for the new test function I designed:

def test_stump_identical_subsequence_self_join_rare_cases_2():
    # This test function is designed to capture the errors that might be raised
    # due to the imprecision in the calculation of pearson values in the edge case
    # where two subsequences are identical (i.e. their pearson value is 1.0).
    # This is resolved by setting config.STUMPY_PERFECT_CORRELATION
    m = 3
    zone = int(np.ceil(m / 4))

    seed_values = [27343, 84451]
    for seed in seed_values:
        np.random.seed(seed)

        identical = np.random.rand(8)
        T_A = np.random.rand(20)
        T_A[1 : 1 + identical.shape[0]] = identical * 0.001 # NEW
        T_A[11 : 11 + identical.shape[0]] = identical * 1000 # NEW

        ref_mp = naive.stump(T_A, m, exclusion_zone=zone, row_wise=True)
        comp_mp = stump(T_A, m, ignore_trivial=True)
        naive.replace_inf(ref_mp)
        naive.replace_inf(comp_mp)
        npt.assert_almost_equal(
            ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
        )  # ignore indices

        comp_mp = stump(pd.Series(T_A), m, ignore_trivial=True)
        naive.replace_inf(comp_mp)
        npt.assert_almost_equal(
            ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
        )  # ignore indices

There are still two identical subsequences... let's see the error:

 Mismatched elements: 2 / 18 (11.1%)
E           Max absolute difference: 0.002743917738525453
E           Max relative difference: 0.9999999999545488
E            x: array([0.06758104087691376, 2.7194799110210365e-16,
E                  1.2471461933214347e-13, 3.1401849173675503e-16,
E                  2.482534153247273e-16, 2.482534153247273e-16,...
E            y: array([0.06758104087691354, 0.0, 0.0027439177386501673, 0.0, 0.0, 0.0,
E                  0.0, 0.0021481232219631354, 0.7633511355203079,
E                  0.0021481232219631354, 0.46425476469467025, 0.0,...

Now, the third element in y is 0.0027439177386501673 while the actual value (i.e. ref) is 1.2471461933214347e-13. I am trying to see if I can find a solution for this.

(I think the pearson value is about 0.999998, which is below the threshold)
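For what it's worth, here is a quick sanity check of that sensitivity. Assuming the usual conversion between a pearson correlation rho and a z-normalized Euclidean distance, d = sqrt(2 * m * (1 - rho)), a deviation of rho from 1.0 at the sixth decimal already produces a distance on the order of the mismatch above (the function name is mine, for illustration):

```python
import numpy as np

def pearson_to_dist(rho, m):
    # z-normalized Euclidean distance from a pearson correlation
    # (d = sqrt(2 * m * (1 - rho)); abs() guards against tiny negative
    # round-off when rho lands a hair above 1.0)
    return np.sqrt(np.abs(2 * m * (1.0 - rho)))

m = 3  # subsequence length used in the test above
print(pearson_to_dist(0.999998, m))  # ~0.0035: same order as the 0.0027 mismatch
print(pearson_to_dist(1.0, m))       # 0.0: a truly perfect correlation
```

So a pearson value that is "almost" 1.0 but just under the threshold still yields a visibly nonzero distance, which matches the mismatch seen in the test output.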

seanlaw changed the title from "Fixed #610 Imprecision in identical case" to "Fixed #610 Imprecision in identical subsequence case" on Aug 11, 2022
codecov-commenter commented Sep 2, 2022

Codecov Report

Merging #657 (c57f662) into main (bd1beb4) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head c57f662 differs from pull request most recent head 566638a. Consider uploading reports for the commit 566638a to get more accurate results

@@           Coverage Diff           @@
##             main     #657   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11434    11531   +97     
=======================================
+ Hits        11422    11519   +97     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/config.py 100.00% <100.00%> (ø)
stumpy/core.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)
tests/test_stump.py 100.00% <100.00%> (ø)
stumpy/motifs.py 100.00% <0.00%> (ø)
stumpy/mpdist.py 100.00% <0.00%> (ø)
stumpy/scrump.py 100.00% <0.00%> (ø)
... and 4 more


NimaSarajpoor (Collaborator, Author)

@seanlaw
As you correctly said, this "precision in numerical calculation" business is waaay nastier than I thought! I fix one thing and get an error somewhere else (I mean, from another test function). I will let you know if I fix it.

NimaSarajpoor (Collaborator, Author) commented Sep 2, 2022

Notes:

(1)

# in config
STUMPY_PERFECT_CORRELATION = 0.99999999 

# in stumpy/stump.py
if pearson > config.STUMPY_PERFECT_CORRELATION:
    pearson = 1.0

This approach sometimes results in errors. In one case, pearson = 0.9999_9999_01 was mistakenly reset to 1.0, which caused a test function to fail. So, I had to abandon this approach.
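To make that failure mode concrete, here is an illustrative reproduction. The threshold is the one from the snippet above; the distance conversion d = sqrt(2 * m * (1 - rho)) is my assumption about how the error surfaces in the matrix profile:

```python
import numpy as np

STUMPY_PERFECT_CORRELATION = 0.99999999
pearson = 0.9999999901  # i.e., 0.9999_9999_01, legitimately imperfect

clamped = 1.0 if pearson > STUMPY_PERFECT_CORRELATION else pearson
print(clamped)  # 1.0 -- the value is wrongly treated as a perfect match

# The distance error the clamp introduces
m = 3
print(np.sqrt(2 * m * (1.0 - pearson)))  # ~2.4e-4, far above machine epsilon
```

In other words, no fixed threshold can separate "exactly 1.0 up to round-off" from "genuinely close to, but below, 1.0".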


(2)
I am refining the variance in the welford function.

NimaSarajpoor (Collaborator, Author) commented Sep 2, 2022

@seanlaw

(1) I decided not to pursue a new approach for calculating the std. As you said, Welford's method is probably the best one can get considering both speed and accuracy.

(2) The two new test functions related to identical subsequences are now passing. Also, as opposed to our first attempt, all other test functions are passing as well.

(3) I am refining "var" in _welford_nanvar, as I thought it might help improve the precision of the other var values that are calculated in the same iteration throughout the rolling process.
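For context, here is a minimal sketch of the kind of rolling Welford-style variance update I believe _welford_nanvar is built on. This is illustrative only (no NaN handling), not STUMPY's actual code:

```python
import numpy as np

def rolling_welford_var(T, m):
    # Population variance of every length-m window of T, updated
    # incrementally (Welford style) to limit round-off error
    n = len(T) - m + 1
    var = np.empty(n)
    mean = T[:m].mean()
    v = ((T[:m] - mean) ** 2).mean()
    var[0] = v
    for i in range(1, n):
        old, new = T[i - 1], T[i + m - 1]
        new_mean = mean + (new - old) / m
        # Exact incremental update for replacing `old` with `new`
        v = v + (new - old) * (new - new_mean + old - mean) / m
        mean = new_mean
        var[i] = v
    return var

np.random.seed(42)
T = np.random.rand(16)
ref = [np.var(T[i : i + 4]) for i in range(13)]
print(np.allclose(rolling_welford_var(T, 4), ref))  # True
```

The update is exact in real arithmetic, so any post-hoc "refinement" is purely about reducing accumulated floating-point error, not correcting the recurrence itself.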

(4) I refine pearson only when it exceeds a threshold and only if it is going to be used for updating the matrix profile. Alternatively, we could do the refinement close to the end of the function _stump. I remember you had concerns about whether this latter approach works, because you asked "can we be sure that the matrix profile indices ref_I and comp_I are the same?". I do not have the answer to this question. However, I tried it out, and it seems that doing the refinement in _stump works.

The downside of doing the refinement in _stump (after merging the results of all threads) is that we might(?) get into trouble with the top-k matrix profile, because in that case we need to make sure the refined top-k pearsons are still sorted! For instance, say the top-3 pearson values of a subsequence are [0.9999, 0.99998, 1.0]; after the refinement, this might change to [0.99999, 0.99998, 1.0], which is not sorted!
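The concern can be demonstrated directly with the numbers above (the indices are made up, and np.argsort-based re-sorting is just one possible remedy, not necessarily what we would ship):

```python
import numpy as np

P = np.array([0.9999, 0.99998, 1.0])  # top-3 pearsons, sorted ascending
I = np.array([17, 5, 42])             # hypothetical matching indices

# A non-monotone refinement can perturb entries independently
P_refined = np.array([0.99999, 0.99998, 1.0])
print(np.all(np.diff(P_refined) >= 0))  # False -- no longer sorted

# One possible remedy: re-sort values and indices together afterwards
order = np.argsort(P_refined)
P_sorted, I_sorted = P_refined[order], I[order]
print(np.all(np.diff(P_sorted) >= 0))  # True
```

Note that a simple clamp-to-1.0 is monotone and would preserve the order; the problem only arises when the refinement can move values in either direction.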

(5) I did not record the running time of stump after these changes. This needs to be done, and the running time should be compared against the original stump to see how much this process adds to the computation.
(update: I tested with lengths 10_000 and 50_000. It causes a 10% increase in computing time, which is not good)

(6) In case it matters, the test function below has not been added to tests/test_stump.py:

def test_stump_new():
    seed = 0
    np.random.seed(seed)
    T = np.random.rand(64)
    
    scale = np.random.choice(np.array([0.001, 1.0, 1000]), len(T), replace=True)
    T[:] = T * scale

    m = 3
    zone = int(np.ceil(m / 4))

    ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices

This test function fails, and I have not been able to resolve it yet.

seanlaw (Contributor) left a comment:
@NimaSarajpoor Please consider the following suggestions

Outdated review comments (now resolved) on: stumpy/config.py, stumpy/core.py, and stumpy/stump.py (four threads)
seanlaw (Contributor) commented Sep 3, 2022

(4) I refine pearson only when it exceeds a threshold and if it is going to be used for updating matrix profile. Alternatively, we may do the refinement close to the end of function _stump. I remember you had concerns about whether this latter approach works or not because you said "can we be sure that the matrix profile indices ref_I and comp_I are the same?". I do not have the answer for this question. However, I tried it out and it seems that doing the refinement in _stump works.

I think I like where pearson is refined in the current iteration. I don't think we should do it at the end of _stump

The downside of doing the refinement in _stump (after merging results of all threads) is that we might(?) get into trouble in top-k mp. Because in that case, we need to make sure the refined top-k pearsons are still sorted! For instance, let's say the top-3 pearson values of a subsequence is: [0.9999, 0.99998, 1.0], then, after the refinement, this might be changed to [0.99999, 0.99998, 1.0], which is not sorted!

Right. This is why I would keep the refinement in _compute_diagonal

I tested for length 10_000 and 50_000. It causes 10% increase in computing time which is not good

Where is the 10% increase coming from? Which line is causing such a huge increase? If there are no identical subsequences, does the time go back to what it was before?

(6) In case that matters, there is this test function below that is not added to tests/test_stump.py .

It seems important to resolve :)

NimaSarajpoor (Collaborator, Author)

@seanlaw
I am still working on this. I will update you.

NimaSarajpoor (Collaborator, Author)

@seanlaw
Could you please help me understand something? In my local repo, the test functions in tests/test_stump.py are all passing. However, here, the test function test_stump_volatile fails. Note that the time series generated in this test function corresponds to a specified random seed, so the randomly-generated input should not change between runs. Why do we get a failure here? Am I missing something?

seanlaw (Contributor) commented Sep 5, 2022

@NimaSarajpoor Let me try to take a look and see what happens when I run your latest commit locally as well. The fact that the same test is failing in three independent environments is concerning.

seanlaw (Contributor) commented Sep 5, 2022

@NimaSarajpoor When I cloned your branch and executed the test suite, I saw the same failed test:

tests/test_stump.py ..........................F

=================================== FAILURES ===================================
_____________________________ test_stump_volatile ______________________________

    def test_stump_volatile():
        m = 3
        zone = int(np.ceil(m / 4))

        seed_values = [0, 1]
        for seed in seed_values:
            np.random.seed(seed)
            T = np.random.rand(64)
            scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
            T[:] = T * scale

            ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
            comp_mp = stump(T, m, ignore_trivial=True)
            naive.replace_inf(ref_mp)
            naive.replace_inf(comp_mp)

>           npt.assert_almost_equal(
                ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
            )  # ignore indices
E           AssertionError:
E           Arrays are not almost equal to 5 decimals
E
E           Mismatched elements: 1 / 62 (1.61%)
E           Max absolute difference: 2.950287948697726e-05
E           Max relative difference: 0.257701917724807
E            x: array([1.5700924586837752e-16, 1.5700924586837752e-16,
E                  1.658550663133749e-07, 2.4257150105478956e-07,
E                  6.454152836022342e-07, 0.22155833569278333, 0.09204964741053696,...
E            y: array([0.0, 0.0, 1.6323404237781945e-07, 2.421152116249788e-07,
E                  6.452392069879462e-07, 0.2215583356927852, 0.09204964741054301,
E                  1.8966081829902314e-07, 0.0, 0.0, 0.012944867345093258,...

tests/test_stump.py:348: AssertionError
=========================== short test summary info ============================
FAILED tests/test_stump.py::test_stump_volatile - AssertionError:
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 failed, 26 passed in 5.01s =========================

This failed 5/5 times in a row with the same error. By any chance, are you forgetting to reinstall the latest version of your local branch?

NimaSarajpoor (Collaborator, Author) commented Sep 5, 2022

@seanlaw

This failed 5/5 times in a row with the same error. By any chance, are you forgetting to reinstall the latest version of your local branch?

I think I am installing it. Please see below:

(base)
AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$ ./setup.sh
Found existing installation: stumpy 1.11.1
Uninstalling stumpy-1.11.1:
  Would remove:
    c:\users\as17983\miniconda3\lib\site-packages\stumpy-1.11.1.dist-info\*
    c:\users\as17983\miniconda3\lib\site-packages\stumpy\*
Proceed (Y/n)?   Successfully uninstalled stumpy-1.11.1
Processing c:\users\as17983\desktop\nima\stump\stumpy
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: numpy>=1.17 in c:\users\as17983\miniconda3\lib\site-packages (from stumpy==1.11.1) (1.21.5)
Requirement already satisfied: scipy>=1.5 in c:\users\as17983\miniconda3\lib\site-packages (from stumpy==1.11.1) (1.8.1)
Requirement already satisfied: numba>=0.54 in c:\users\as17983\miniconda3\lib\site-packages (from stumpy==1.11.1) (0.55.2)
Requirement already satisfied: setuptools in c:\users\as17983\miniconda3\lib\site-packages (from numba>=0.54->stumpy==1.11.1) (65.3.0)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in c:\users\as17983\miniconda3\lib\site-packages (from numba>=0.54->stumpy==1.11.1) (0.38.1)
Building wheels for collected packages: stumpy
  Building wheel for stumpy (setup.py): started
  Building wheel for stumpy (setup.py): finished with status 'done'
  Created wheel for stumpy: filename=stumpy-1.11.1-py3-none-any.whl size=139726 sha256=5260163c8726bb0e6df21c85ad98bec0f1399ed4de6f2d1429efb8597873657d
  Stored in directory: C:\Users\AS17983\AppData\Local\Temp\pip-ephem-wheel-cache-hhrovqh5\wheels\4e\11\b5\98d6a9e131f3ea63e20ef7e65380d0a3082c1fefbbc1dfe81b
Successfully built stumpy
Installing collected packages: stumpy
Successfully installed stumpy-1.11.1
(base)
AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$ ./test.sh custom 1 tests/test_stump.py
Cleaning Up
Checking Black Code Formatting
All done! ✨ 🍰 ✨
84 files would be left unchanged.
Checking Flake8 Style Guide Enforcement
Executing Custom User-Defined Tests Only
Custom Test: 1 / 1
============================= test session starts =============================
platform win32 -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\AS17983\Desktop\NIMA\stump\stumpy, configfile: pytest.ini
plugins: anyio-3.6.1, cython-0.2.0
collected 27 items

tests\test_stump.py ...........................                          [100%]

============================= 27 passed in 17.42s =============================
Cleaning Up
(base)

Note that all 27 tests are passing. As you can see, I installed it. I am going to push again just in case I missed something before.


AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$ git push origin identical_case
Everything up-to-date
(base)
AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$

seanlaw (Contributor) commented Sep 6, 2022

@NimaSarajpoor I'm not sure why you aren't seeing the failed test. The only thing that I noticed was that, in your case, it says:

84 files would be left unchanged.

But, in the Github Actions (as well as my local copy of your branch), I see:

83 files would be left unchanged.

We should both have the same set of files

NimaSarajpoor (Collaborator, Author) commented Sep 6, 2022

The only thing that I noticed was that, in your case, it says:

84 files would be left unchanged.

But, in the Github Actions (as well as my local copy of your branch), I see:

83 files would be left unchanged.

Thanks for bringing that to my attention. I tried a git push -f (force) to force my local changes onto the remote.


Update:
Okay... that did not help. There is still a discrepancy in the number of unchanged files. I will work on this and let you know if I figure it out.

seanlaw (Contributor) commented Sep 6, 2022

I don't know if that is the source of the failed test, but it is clear that there are differences between what has been pushed and what you have installed locally.

NimaSarajpoor (Collaborator, Author)

Yeah... if I cannot find the file, or the cause of this problem, I may create a new, clean branch, apply the changes there, and then submit another PR. If it goes well, we can close this one. Is that okay?

seanlaw (Contributor) commented Sep 6, 2022

Yeah... if I cannot find the file, or the cause of this problem, I may create a new, clean branch, apply the changes there, and then submit another PR. If it goes well, we can close this one. Is that okay?

In case it matters, to recreate your branch and trigger the failed test, I simply did:

  1. install the Github CLI (e.g., conda install -c conda-forge gh)
  2. clone the main (parent) repo (e.g., git clone https://github.com/TDAmeritrade/stumpy stumpy.git)
  3. navigate to this PR page and copy the Github CLI command provided in the <> Code button in the upper right (e.g., gh pr checkout 657)
  4. navigate to the cloned repo in your terminal & execute the CLI cmd in Step 3
  5. finally, run setup.sh && ./test.sh in this repo branch

NimaSarajpoor mentioned this pull request Sep 7, 2022
seanlaw (Contributor) commented Aug 24, 2023

@NimaSarajpoor Can this be closed or is there more work to be done?

NimaSarajpoor (Collaborator, Author)

@NimaSarajpoor Can this be closed or is there more work to be done?

This PR was replaced with PR #668 (see #668 (comment))

So, I think we can just close this PR and track the progress in the PR #668

seanlaw closed this Aug 24, 2023
NimaSarajpoor deleted the identical_case branch September 4, 2023 03:26