
Extend whitening tests #3531

Open · wants to merge 10 commits into main
Conversation

@JoeZiminski (Collaborator) commented Nov 12, 2024

Further to #3505, this PR extends the whitening tests to check that the values are correct for both int16 and float32, as well as checking most of the parameters. Also, some functions are factored out of whiten.py to help with testing.

For the most part, the approach is to generate test data with known covariance, then compute whitening and check that a) the covariance matrix is recovered correctly and b) the whitened data is indeed white.
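As a rough standalone sketch of that approach (illustrative numpy only, not the actual test code; the covariance values and tolerances here are made up):

```python
# Sketch of the testing approach: draw samples with a known covariance,
# whiten with W = E diag(1/sqrt(lambda)) E^T, and check the result is white.
import numpy as np

rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 0.8], [0.8, 1.0]])
data = rng.multivariate_normal(mean=[0, 0], cov=true_cov, size=50_000)

# a) the empirical covariance recovers the known covariance
est_cov = np.cov(data.T)
assert np.allclose(est_cov, true_cov, atol=0.05)

# b) whitening with the inverse square root of the covariance
# makes the channel covariance the identity
eigvals, eigvecs = np.linalg.eigh(est_cov)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
whitened = data @ W
assert np.allclose(np.cov(whitened.T), np.eye(2), atol=1e-6)
```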

Some points for discussion (@yger, it would be great to get your thoughts):

  1. apply_mean=False by default: I don't think (?) there is any reason not to remove the mean prior to computing the covariance, which will reduce bias. Indeed, the whitened data is not really white when apply_mean=False, but it is when apply_mean=True. Should the default be True?

  2. For regularizing, regularize_kwargs["assume_centered"] = True is always set in the code, even though the default is apply_mean=False, as above. I added an error in this case, to force the user to switch to apply_mean=True. Also, if the user explicitly sets regularize_kwargs["assume_centered"] = False, an error is raised, as this would be overwritten to True. However, maybe a better approach is to set regularize_kwargs["assume_centered"] directly from apply_mean?

  3. sklearn's covariance module provides some estimators as classes and others as functions. At the moment the regularize kwargs can only accept the class-based ones and will crash for the function-based ones. I just added a note to the docstring, but wanted to raise this in case it is important to support both.
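To illustrate point 1, a small standalone numpy sketch (hypothetical values, not the PR's code) of how an uncentered covariance estimate fails to fully whiten data with a nonzero channel mean:

```python
# With a nonzero channel mean, the uncentered second moment E[x x^T]
# differs from the true covariance by mean * mean^T, so whitening with it
# does not produce an identity covariance.
import numpy as np

rng = np.random.default_rng(1)
mean = np.array([5.0, -3.0])
true_cov = np.array([[1.5, 0.5], [0.5, 1.0]])
data = rng.multivariate_normal(mean, true_cov, size=50_000)

# apply_mean=False analogue: covariance-like matrix from raw data
uncentered = data.T @ data / len(data)
# apply_mean=True analogue: properly centered covariance
centered = np.cov(data.T)

eigvals, eigvecs = np.linalg.eigh(uncentered)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
cov_whitened = np.cov((data @ W).T)

# the centered estimate matches the true covariance; whitening with the
# uncentered estimate leaves the covariance far from identity
assert np.allclose(centered, true_cov, atol=0.05)
assert not np.allclose(cov_whitened, np.eye(2), atol=0.1)
```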

Question on CustomRecording

@alejoe91 @samuelgarcia I wanted to make a quick recording object and set some custom data on it for testing purposes. The result is the small CustomRecording (top of test_whiten.py), which basically exposes a set_data function, while get_traces() just returns the data that was set, without processing. Is this okay for you both? If so, I will factor this out into a separate PR.

@JoeZiminski force-pushed the extend_whitening_tests branch from 5e91f67 to 19105e2 on November 12, 2024 at 13:45
@JoeZiminski force-pushed the extend_whitening_tests branch from 19105e2 to 602e6ce on November 12, 2024 at 14:16
@JoeZiminski force-pushed the extend_whitening_tests branch from dda72da to 8841f3d on November 12, 2024 at 14:52
@JoeZiminski force-pushed the extend_whitening_tests branch from 8841f3d to 14c83af on November 12, 2024 at 14:54
@@ -223,27 +198,88 @@ def compute_whitening_matrix(

```python
# type and we estimate a more reasonable eps in the case
# where the data is on a scale less than 1.
if eps is None:
    eps = 1e-8
if data.dtype.kind == "f":
```
@JoeZiminski (Collaborator, Author) commented on this diff:
As the random_chunk -> data is always float, I removed the int16 option; let me know if this makes sense.

Out of interest, what is the purpose of this scaling? Will it have much effect if it is capped in the maximum direction by 1e-16?
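To illustrate the question, a standalone numpy sketch (made-up covariance and eps values; not the actual whiten.py code) of why some eps is needed when taking the inverse square root of the covariance:

```python
# The whitening matrix applies 1/sqrt(S) to the covariance eigenvalues,
# so a near-zero eigenvalue makes W blow up; adding eps caps that
# amplification.
import numpy as np

cov = np.array([[1.0, 0.999999], [0.999999, 1.0]])  # nearly singular
U, S, Ut = np.linalg.svd(cov)

W_no_eps = (U * (1.0 / np.sqrt(S))) @ Ut
W_eps = (U * (1.0 / np.sqrt(S + 1e-3))) @ Ut

print(np.abs(W_no_eps).max())  # huge amplification along the small axis
print(np.abs(W_eps).max())     # bounded by ~1/sqrt(eps)
```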

@JoeZiminski force-pushed the extend_whitening_tests branch from 0070301 to 1d84e5d on November 12, 2024 at 15:04
@JoeZiminski requested a review from yger on November 12, 2024 at 15:24
@alejoe91 (Member) commented:
@JoeZiminski for the CustomRecording, wouldn't a NumpyRecording work as well? What would be the added value of the CustomRecording?

@JoeZiminski (Collaborator, Author) commented:
Thanks @alejoe91, there is no added value! I just didn't know NumpyRecording existed 😄 thanks, I'll use that.

@h-mayorquin (Collaborator) commented:
The NoiseGeneratorRecording already has the option of passing the covariance matrix:

```python
class NoiseGeneratorRecording(BaseRecording):
    """
    A lazy recording that generates white noise samples if and only if `get_traces` is called.

    This is done by tiling a small noise chunk.

    Two strategies are available to be reproducible across different start/end frame calls:
      * "tile_pregenerated": pregenerate a small noise block and tile it depending on the start_frame/end_frame
      * "on_the_fly": generate small noise chunks on the fly and tile them. The seed also depends on the noise block.

    Parameters
    ----------
    num_channels : int
        The number of channels.
    sampling_frequency : float
        The sampling frequency of the recorder.
    durations : list[float]
        The durations of each segment in seconds. Note that the length of this list is the number of segments.
    noise_levels : float | np.array, default: 1.0
        Std of the white noise (if an array, defined per channel)
    cov_matrix : np.array | None, default: None
        The covariance matrix of the noise
    dtype : np.dtype | str | None, default: "float32"
        The dtype of the recording. Note that only np.float32 and np.float64 are supported.
    seed : int | None, default: None
        The seed for np.random.default_rng.
    strategy : "tile_pregenerated" | "on_the_fly", default: "tile_pregenerated"
        The strategy for generating noise chunks:
          * "tile_pregenerated": pregenerate a noise chunk of noise_block_size samples and repeat it;
            very fast and consumes only one noise block.
          * "on_the_fly": generate a new noise block on the fly by combining seed + noise block index;
            no memory preallocation but a bit more computation (random)
    noise_block_size : int, default: 30000
        Size in samples of the noise block.

    Notes
```
Wouldn't that eliminate some of your scaffolding? Is there something you are missing there?

Also, maybe the following would be useful:
https://docs.pytest.org/en/stable/reference/reference.html#pytest-importorskip

@JoeZiminski (Collaborator, Author) commented Nov 13, 2024

Thanks @h-mayorquin! I did look at that, but I think the one thing I need which is not there is the option to cast the data to int16, and to pass some custom means. The latter would be easy to extend, and I think the former would be too, but would it be useful? If so, I could extend NoiseGeneratorRecording to take int16 as a dtype and cast the MVN data using this strategy (unless there is a better one).

Awesome, thanks for the importorskip tip, I'll check it out! 🤤

@alejoe91 alejoe91 added the preprocessing Related to preprocessing module label Nov 13, 2024
@alejoe91 (Member) commented:
> Thanks @h-mayorquin! I did look at that but I think the one thing I need which is not there is the option to cast the data to int16, and to pass some custom means. The latter would be easy to extend, the former also I think actually, but would it be useful? If so I could extend the NoiseRecording to take int16 as a dtype and cast the MVN data using this strategy? (unless there is a better one).
>
> Awesome thanks for that importdata skip will check it out! 🤤

I think that since you need to craft specific sets of traces, you'd be better off using the NumpyRecording directly :)

@h-mayorquin (Collaborator) commented:
Yeah, I think @alejoe91's suggestion is the simpler one. I would vote for that unless you need considerably memory-heavy traces, as it would simplify your scaffolding and would not require any other changes.

For the NoiseGeneratorRecording, you could use the `astype` preprocessing if you need that dtype (the problem is that we use white noise under the hood, and that can be generated on the fly; I can give you more details on Slack if you are curious). Extending NoiseGeneratorRecording to accept a mean seems like a good thing though, maybe something to do, but it is outside the scope of this PR.

@JoeZiminski (Collaborator, Author) commented:
Great, thanks both! I went for NumpyRecording in the end as suggested. I think this is ready for review.

@samuelgarcia (Member) commented:
Hi.
About apply_mean, I think it is important to keep it False.
Traces with spikes are biased because the distribution is not symmetric (more negative values); removing the real mean is often not the best choice because it is not the middle of the noise.
Also, the filtering is already supposed to remove the mean.
Let's discuss this on a call.

@JoeZiminski (Collaborator, Author) commented:
Hi @samuelgarcia, this is a good point, I will think about it more and we should discuss on a call. I will leave some thoughts here as notes.

For sure the mean is biased by the spikes when estimating correlated noise. But the whole procedure is biased by the spikes, as the covariance is computed over data with the spikes included. To estimate a covariance unbiased by the spikes, we would need to remove the spikes explicitly at the start of the procedure. Otherwise, there is no way around the fact that we are whitening the traces with a covariance that includes both noise and signal (spikes), and including vs. not including the mean does not get around that. The only way would be to cut out the spikes first.
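A small standalone numpy sketch of this point (illustrative only, with made-up spike amplitudes and rates; not code from the PR):

```python
# Adding shared "spike"-like transients to correlated noise inflates the
# estimated covariance, so the whitening target mixes noise and signal.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
noise = rng.multivariate_normal([0, 0], [[1.0, 0.2], [0.2, 1.0]], size=n)

traces = noise.copy()
spike_times = rng.choice(n, size=n // 100, replace=False)  # ~1% of samples
traces[spike_times] += np.array([-8.0, -6.0])  # same waveform on both channels

print(np.cov(noise.T))   # ~ the true noise covariance
print(np.cov(traces.T))  # inflated by the (correlated) spike term
```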

As an aside, I wonder whether the first Kilosort paper says the covariance is estimated from traces with spikes removed, while all later versions include the spikes in the covariance estimate. It may be that this was on purpose, so that the whitening step not only removes correlated noise but also removes correlated signal, i.e. forces spike waveforms onto a subset of channels. Maybe in the end this is better for the sorter, and is an explicit aim of the whitening step.

The above is slightly philosophical; I think a more convincing point is that later preprocessing steps will re-introduce noise, so the means might no longer be close to zero, as they are after filtering. For example, CMR (computed at separate timepoints) and drift correction (interpolating separate time chunks) will add noise across the timeseries of a single channel. So the bias introduced by not removing the mean may outweigh the bias introduced by the spikes.

I think some benchmarking could be useful, mainly:

  1. How white is real-life data with apply_mean=True vs. apply_mean=False? I think, simply from the mathematical definition, it will be whiter with apply_mean=True. Then maybe the question is: how white is this data with the spikes removed, i.e. just looking at the correlated noise? But I think if that is what we are interested in, we should be removing the spikes at the start.

  2. What are the per-channel means after filter -> CMR -> drift correction?

  3. Even if the mean is biased, it is the least-bad estimate short of explicitly removing the spikes. Not removing the mean will just result in an even worse bias, such as that introduced by later preprocessing steps. Unless we compute and remove our own estimate of central tendency; but I guess this would interact badly with the rest of the covariance computation, and we would have to completely reformulate the covariance estimation, e.g. a median covariance.
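For the benchmarking in point 1, one possible "whiteness" metric (my own suggestion, not something in the PR) is the mean absolute off-diagonal correlation across channels, which is 0 for perfectly white data:

```python
# A simple whiteness metric: mean absolute off-diagonal entry of the
# channel correlation matrix (traces are samples x channels).
import numpy as np

def whiteness_error(traces: np.ndarray) -> float:
    """Mean absolute off-diagonal correlation across channels (columns)."""
    corr = np.corrcoef(traces.T)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())

rng = np.random.default_rng(3)
white = rng.standard_normal((50_000, 4))
mixing = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.0, 1.0]])
correlated = white @ mixing

print(whiteness_error(white))       # near 0 (only sampling noise)
print(whiteness_error(correlated))  # clearly larger
```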
