
Fixed #610 Imprecision in identical subsequence case #657

Closed
wants to merge 46 commits

Conversation

NimaSarajpoor (Collaborator)

This PR resolves issue #610 by adding a new config variable and resetting an already-calculated pearson value to 1.0 whenever it exceeds the threshold set by that config variable.

NimaSarajpoor (Collaborator, Author) commented Aug 2, 2022

So, this problem is more complicated than I thought :) I think I found an alternative approach, and it seems to be working. HOWEVER, it does not work for the new test function I designed:

def test_stump_identical_subsequence_self_join_rare_cases_2():
    # This test function is designed to capture the errors that might be raised
    # due to the imprecision in the calculation of pearson values in the edge case
    # where two subsequences are identical (i.e. their pearson value is 1.0).
    # This is resolved by setting config.STUMPY_PERFECT_CORRELATION
    m = 3
    zone = int(np.ceil(m / 4))

    seed_values = [27343, 84451]
    for seed in seed_values:
        np.random.seed(seed)

        identical = np.random.rand(8)
        T_A = np.random.rand(20)
        T_A[1 : 1 + identical.shape[0]] = identical * 0.001 # NEW
        T_A[11 : 11 + identical.shape[0]] = identical * 1000 # NEW

        ref_mp = naive.stump(T_A, m, exclusion_zone=zone, row_wise=True)
        comp_mp = stump(T_A, m, ignore_trivial=True)
        naive.replace_inf(ref_mp)
        naive.replace_inf(comp_mp)
        npt.assert_almost_equal(
            ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
        )  # ignore indices

        comp_mp = stump(pd.Series(T_A), m, ignore_trivial=True)
        naive.replace_inf(comp_mp)
        npt.assert_almost_equal(
            ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
        )  # ignore indices

There are still two identical subsequences... let's see the error:

 Mismatched elements: 2 / 18 (11.1%)
E           Max absolute difference: 0.002743917738525453
E           Max relative difference: 0.9999999999545488
E            x: array([0.06758104087691376, 2.7194799110210365e-16,
E                  1.2471461933214347e-13, 3.1401849173675503e-16,
E                  2.482534153247273e-16, 2.482534153247273e-16,...
E            y: array([0.06758104087691354, 0.0, 0.0027439177386501673, 0.0, 0.0, 0.0,
E                  0.0, 0.0021481232219631354, 0.7633511355203079,
E                  0.0021481232219631354, 0.46425476469467025, 0.0,...

Now, the third element in y is 0.0027439177386501673 while the actual value (i.e. ref) is 1.2471461933214347e-13. I am trying to see if I can find a solution for this.

(I think the pearson value is about 0.999998, which is below the threshold)
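For what it's worth, here is a quick sanity check of that sensitivity. Assuming the usual conversion between a pearson correlation rho and a z-normalized Euclidean distance, d = sqrt(2 * m * (1 - rho)), a deviation of rho from 1.0 at the sixth decimal already produces a distance on the order of the mismatch above (the function name is mine, for illustration):

```python
import numpy as np

def pearson_to_dist(rho, m):
    # z-normalized Euclidean distance from a pearson correlation
    # (d = sqrt(2 * m * (1 - rho)); abs() guards against tiny negative
    # round-off when rho lands a hair above 1.0)
    return np.sqrt(np.abs(2 * m * (1.0 - rho)))

m = 3  # subsequence length used in the test above
print(pearson_to_dist(0.999998, m))  # ~0.0035: same order as the 0.0027 mismatch
print(pearson_to_dist(1.0, m))       # 0.0: a truly perfect correlation
```

So a pearson value that is "almost" 1.0 but just under the threshold still yields a visibly nonzero distance, which matches the mismatch seen in the test output.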

seanlaw changed the title from "Fixed #610 Imprecision in identical case" to "Fixed #610 Imprecision in identical subsequence case" on Aug 11, 2022
codecov-commenter commented Sep 2, 2022

Codecov Report

Merging #657 (c57f662) into main (bd1beb4) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head c57f662 differs from pull request most recent head 566638a. Consider uploading reports for the commit 566638a to get more accurate results

@@           Coverage Diff           @@
##             main     #657   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11434    11531   +97     
=======================================
+ Hits        11422    11519   +97     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/config.py 100.00% <100.00%> (ø)
stumpy/core.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)
tests/test_stump.py 100.00% <100.00%> (ø)
stumpy/motifs.py 100.00% <0.00%> (ø)
stumpy/mpdist.py 100.00% <0.00%> (ø)
stumpy/scrump.py 100.00% <0.00%> (ø)
... and 4 more


NimaSarajpoor (Collaborator, Author)

@seanlaw
As you correctly said, this "precision in numerical calculation" business is waaay nastier than I thought! I fix one thing and get an error somewhere else (I mean, from another test function). I will let you know if I fix it.

NimaSarajpoor (Collaborator, Author) commented Sep 2, 2022

Notes:

(1)

# in config
STUMPY_PERFECT_CORRELATION = 0.99999999 

# in stumpy/stump.py
if pearson > config.STUMPY_PERFECT_CORRELATION:
    pearson = 1.0

This approach sometimes results in errors. In one case, pearson = 0.9999_9999_01 was mistakenly reset to 1.0, which caused a test function to fail. So, I had to abandon this approach.
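To make that failure mode concrete, here is an illustrative reproduction. The threshold is the one from the snippet above; the distance conversion d = sqrt(2 * m * (1 - rho)) is my assumption about how the error surfaces in the matrix profile:

```python
import numpy as np

STUMPY_PERFECT_CORRELATION = 0.99999999
pearson = 0.9999999901  # i.e., 0.9999_9999_01, legitimately imperfect

clamped = 1.0 if pearson > STUMPY_PERFECT_CORRELATION else pearson
print(clamped)  # 1.0 -- the value is wrongly treated as a perfect match

# The distance error the clamp introduces
m = 3
print(np.sqrt(2 * m * (1.0 - pearson)))  # ~2.4e-4, far above machine epsilon
```

In other words, no fixed threshold can separate "exactly 1.0 up to round-off" from "genuinely close to, but below, 1.0".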


(2)
I am refining the variance in the welford function.

NimaSarajpoor (Collaborator, Author) commented Sep 2, 2022

@seanlaw

(1) I decided not to pursue a new approach for calculating the std. As you said, Welford's method is probably the best one can get considering both speed and accuracy.

(2) The two new test functions related to identical subsequences are now passing. Also, as opposed to our first attempt, all other test functions are passing as well.

(3) I am refining "var" in _welford_nanvar, as I thought it might help improve the precision of the other var values that are calculated in the same iteration throughout the rolling process.
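For context, here is a minimal sketch of the kind of rolling Welford-style variance update I believe _welford_nanvar is built on. This is illustrative only (no NaN handling), not STUMPY's actual code:

```python
import numpy as np

def rolling_welford_var(T, m):
    # Population variance of every length-m window of T, updated
    # incrementally (Welford style) to limit round-off error
    n = len(T) - m + 1
    var = np.empty(n)
    mean = T[:m].mean()
    v = ((T[:m] - mean) ** 2).mean()
    var[0] = v
    for i in range(1, n):
        old, new = T[i - 1], T[i + m - 1]
        new_mean = mean + (new - old) / m
        # Exact incremental update for replacing `old` with `new`
        v = v + (new - old) * (new - new_mean + old - mean) / m
        mean = new_mean
        var[i] = v
    return var

np.random.seed(42)
T = np.random.rand(16)
ref = [np.var(T[i : i + 4]) for i in range(13)]
print(np.allclose(rolling_welford_var(T, 4), ref))  # True
```

The update is exact in real arithmetic, so any post-hoc "refinement" is purely about reducing accumulated floating-point error, not correcting the recurrence itself.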

(4) I refine pearson only when it exceeds a threshold and only if it is going to be used for updating the matrix profile. Alternatively, we could do the refinement close to the end of the function _stump. I remember you had concerns about whether this latter approach works, because you asked "can we be sure that the matrix profile indices ref_I and comp_I are the same?". I do not have the answer to this question. However, I tried it out, and it seems that doing the refinement in _stump works.

The downside of doing the refinement in _stump (after merging the results of all threads) is that we might(?) get into trouble with the top-k matrix profile, because in that case we need to make sure the refined top-k pearsons are still sorted! For instance, say the top-3 pearson values of a subsequence are [0.9999, 0.99998, 1.0]; after the refinement, this might change to [0.99999, 0.99998, 1.0], which is not sorted!
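The concern can be demonstrated directly with the numbers above (the indices are made up, and np.argsort-based re-sorting is just one possible remedy, not necessarily what we would ship):

```python
import numpy as np

P = np.array([0.9999, 0.99998, 1.0])  # top-3 pearsons, sorted ascending
I = np.array([17, 5, 42])             # hypothetical matching indices

# A non-monotone refinement can perturb entries independently
P_refined = np.array([0.99999, 0.99998, 1.0])
print(np.all(np.diff(P_refined) >= 0))  # False -- no longer sorted

# One possible remedy: re-sort values and indices together afterwards
order = np.argsort(P_refined)
P_sorted, I_sorted = P_refined[order], I[order]
print(np.all(np.diff(P_sorted) >= 0))  # True
```

Note that a simple clamp-to-1.0 is monotone and would preserve the order; the problem only arises when the refinement can move values in either direction.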

(5) I did not record the running time of stump after these changes. This needs to be done, and the running time should be compared against the original stump to see how much this process adds to the computation.
(update: I tested with lengths 10_000 and 50_000. It causes a 10% increase in computing time, which is not good)

(6) In case it matters, the test function below has not been added to tests/test_stump.py:

def test_stump_new():
    seed = 0
    np.random.seed(seed)
    T = np.random.rand(64)
    
    scale = np.random.choice(np.array([0.001, 1.0, 1000]), len(T), replace=True)
    T[:] = T * scale

    m = 3
    zone = int(np.ceil(m / 4))

    ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices

This test function fails, and I have not been able to resolve it yet.

seanlaw (Contributor) left a comment:
@NimaSarajpoor Please consider the following suggestions

Outdated review comments (now resolved) on: stumpy/config.py, stumpy/core.py, and stumpy/stump.py (four threads)
seanlaw (Contributor) commented Sep 3, 2022

(4) I refine pearson only when it exceeds a threshold and if it is going to be used for updating matrix profile. Alternatively, we may do the refinement close to the end of function _stump. I remember you had concerns about whether this latter approach works or not because you said "can we be sure that the matrix profile indices ref_I and comp_I are the same?". I do not have the answer for this question. However, I tried it out and it seems that doing the refinement in _stump works.

I think I like where pearson is refined in the current iteration. I don't think we should do it at the end of _stump

The downside of doing the refinement in _stump (after merging results of all threads) is that we might(?) get into trouble in top-k mp. Because in that case, we need to make sure the refined top-k pearsons are still sorted! For instance, let's say the top-3 pearson values of a subsequence is: [0.9999, 0.99998, 1.0], then, after the refinement, this might be changed to [0.99999, 0.99998, 1.0], which is not sorted!

Right. This is why I would keep the refinement in _compute_diagonal

I tested for length 10_000 and 50_000. It causes 10% increase in computing time which is not good

Where is the 10% increase coming from? Which line is causing such a huge increase? If there are no identical subsequences, does the time go back to what it was before?

(6) In case that matters, there is this test function below that is not added to tests/test_stump.py .

It seems important to resolve :)

NimaSarajpoor (Collaborator, Author)

@seanlaw
I am still working on this. I will update you.

NimaSarajpoor (Collaborator, Author)

@seanlaw
Could you please help me understand something? In my local repo, the test functions in tests/test_stump.py are all passing. However, here, the test function test_stump_volatile fails. Note that the time series generated in this test function corresponds to a specified random seed, so the randomly-generated input should not change between runs. Why do we get a failure here? Am I missing something?

seanlaw (Contributor) commented Sep 5, 2022

@NimaSarajpoor Let me try to take a look and see what happens when I run your latest commit locally as well. The fact that the same test is failing in three independent environments is concerning.

seanlaw (Contributor) commented Sep 5, 2022

@NimaSarajpoor When I cloned your branch and executed the test suite, I saw the same failed test:

tests/test_stump.py ..........................F

=================================== FAILURES ===================================
_____________________________ test_stump_volatile ______________________________

    def test_stump_volatile():
        m = 3
        zone = int(np.ceil(m / 4))

        seed_values = [0, 1]
        for seed in seed_values:
            np.random.seed(seed)
            T = np.random.rand(64)
            scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
            T[:] = T * scale

            ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
            comp_mp = stump(T, m, ignore_trivial=True)
            naive.replace_inf(ref_mp)
            naive.replace_inf(comp_mp)

>           npt.assert_almost_equal(
                ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
            )  # ignore indices
E           AssertionError:
E           Arrays are not almost equal to 5 decimals
E
E           Mismatched elements: 1 / 62 (1.61%)
E           Max absolute difference: 2.950287948697726e-05
E           Max relative difference: 0.257701917724807
E            x: array([1.5700924586837752e-16, 1.5700924586837752e-16,
E                  1.658550663133749e-07, 2.4257150105478956e-07,
E                  6.454152836022342e-07, 0.22155833569278333, 0.09204964741053696,...
E            y: array([0.0, 0.0, 1.6323404237781945e-07, 2.421152116249788e-07,
E                  6.452392069879462e-07, 0.2215583356927852, 0.09204964741054301,
E                  1.8966081829902314e-07, 0.0, 0.0, 0.012944867345093258,...

tests/test_stump.py:348: AssertionError
=========================== short test summary info ============================
FAILED tests/test_stump.py::test_stump_volatile - AssertionError:
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 failed, 26 passed in 5.01s =========================

This failed 5/5 times in a row with the same error. By any chance, are you forgetting to reinstall the latest version of your local branch?

NimaSarajpoor (Collaborator, Author) commented Sep 5, 2022

@seanlaw

This failed 5/5 times in a row with the same error. By any chance, are you forgetting to reinstall the latest version of your local branch?

I think I am installing it. Please see below:

(base)
AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$ ./setup.sh
Found existing installation: stumpy 1.11.1
Uninstalling stumpy-1.11.1:
  Would remove:
    c:\users\as17983\miniconda3\lib\site-packages\stumpy-1.11.1.dist-info\*
    c:\users\as17983\miniconda3\lib\site-packages\stumpy\*
Proceed (Y/n)?   Successfully uninstalled stumpy-1.11.1
Processing c:\users\as17983\desktop\nima\stump\stumpy
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: numpy>=1.17 in c:\users\as17983\miniconda3\lib\site-packages (from stumpy==1.11.1) (1.21.5)
Requirement already satisfied: scipy>=1.5 in c:\users\as17983\miniconda3\lib\site-packages (from stumpy==1.11.1) (1.8.1)
Requirement already satisfied: numba>=0.54 in c:\users\as17983\miniconda3\lib\site-packages (from stumpy==1.11.1) (0.55.2)
Requirement already satisfied: setuptools in c:\users\as17983\miniconda3\lib\site-packages (from numba>=0.54->stumpy==1.11.1) (65.3.0)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in c:\users\as17983\miniconda3\lib\site-packages (from numba>=0.54->stumpy==1.11.1) (0.38.1)
Building wheels for collected packages: stumpy
  Building wheel for stumpy (setup.py): started
  Building wheel for stumpy (setup.py): finished with status 'done'
  Created wheel for stumpy: filename=stumpy-1.11.1-py3-none-any.whl size=139726 sha256=5260163c8726bb0e6df21c85ad98bec0f1399ed4de6f2d1429efb8597873657d
  Stored in directory: C:\Users\AS17983\AppData\Local\Temp\pip-ephem-wheel-cache-hhrovqh5\wheels\4e\11\b5\98d6a9e131f3ea63e20ef7e65380d0a3082c1fefbbc1dfe81b
Successfully built stumpy
Installing collected packages: stumpy
Successfully installed stumpy-1.11.1
(base)
AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$ ./test.sh custom 1 tests/test_stump.py
Cleaning Up
Checking Black Code Formatting
All done! ✨ 🍰 ✨
84 files would be left unchanged.
Checking Flake8 Style Guide Enforcement
Executing Custom User-Defined Tests Only
Custom Test: 1 / 1
============================= test session starts =============================
platform win32 -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\AS17983\Desktop\NIMA\stump\stumpy, configfile: pytest.ini
plugins: anyio-3.6.1, cython-0.2.0
collected 27 items

tests\test_stump.py ...........................                          [100%]

============================= 27 passed in 17.42s =============================
Cleaning Up
(base)

Note that all 27 tests are passing. As you can see, I installed it. I am going to push again just in case I missed something before.


AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$ git push origin identical_case
Everything up-to-date
(base)
AS17983@MCOEELAS1798306 MINGW64 ~/Desktop/NIMA/stump/stumpy (identical_case)
$

seanlaw (Contributor) commented Sep 6, 2022

@NimaSarajpoor I'm not sure why you aren't seeing the failed test. The only thing that I noticed was that, in your case, it says:

84 files would be left unchanged.

But, in the Github Actions (as well as my local copy of your branch), I see:

83 files would be left unchanged.

We should both have the same set of files

NimaSarajpoor (Collaborator, Author) commented Sep 6, 2022

The only thing that I noticed was that, in your case, it says:

84 files would be left unchanged.

But, in the Github Actions (as well as my local copy of your branch), I see:

83 files would be left unchanged.

Thanks for bringing that to my attention. I tried a git push -f (force) to force my local changes onto the remote.


Update:
Okay... that did not help. There is still a discrepancy in the number of unchanged files. I will work on this and let you know if I figure it out.

seanlaw (Contributor) commented Sep 6, 2022

I don't know if that is the source of the failed test, but it is clear that there are differences between what has been pushed and what you have installed locally.

NimaSarajpoor (Collaborator, Author)

Yeah... if I cannot find the file, or the cause of this problem, I may create a new, clean branch, apply the changes there, and then submit another PR. If it goes well, we can close this one. Is that okay?

seanlaw (Contributor) commented Sep 6, 2022

Yeah... if I cannot find the file, or the cause of this problem, I may create a new, clean branch, apply the changes there, and then submit another PR. If it goes well, we can close this one. Is that okay?

In case it matters, to recreate your branch and trigger the failed test, I simply did:

  1. install the Github CLI (e.g., conda install -c conda-forge gh)
  2. clone the main (parent) repo (e.g., git clone https://github.com/TDAmeritrade/stumpy stumpy.git)
  3. navigate to this PR page and copy the Github CLI command provided in the <> Code button in the upper right (e.g., gh pr checkout 657)
  4. navigate to the cloned repo in your terminal & execute the CLI cmd in Step 3
  5. finally, run setup.sh && ./test.sh in this repo branch

NimaSarajpoor mentioned this pull request Sep 7, 2022
seanlaw (Contributor) commented Aug 24, 2023

@NimaSarajpoor Can this be closed or is there more work to be done?

NimaSarajpoor (Collaborator, Author)

@NimaSarajpoor Can this be closed or is there more work to be done?

This PR was replaced with PR #668 (see #668 (comment))

So, I think we can just close this PR and track the progress in the PR #668

seanlaw closed this Aug 24, 2023
NimaSarajpoor deleted the identical_case branch September 4, 2023 03:26