A detailed guide on preprocessing #2996

Open

JoeZiminski opened this issue Jun 7, 2024 · 16 comments
Labels: documentation (Improvements or additions to documentation), preprocessing (Related to preprocessing module)

Comments

@JoeZiminski (Collaborator) commented Jun 7, 2024

For ephys pipelines there are a number of quite complex preprocessing steps that need to be done in a specific order. There are also some nuances to certain preprocessing steps, particularly around multi-shank probes and bad channel detection. For example, CAR, IBL bad channel detection*, the highpass spatial filter, etc. make assumptions about channel locations that are broken in the multi-shank context. Bad channel detection should not be done after CAR, etc.
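To make the ordering concrete, here is a minimal sketch of one reasonable multi-shank pipeline using SpikeInterface's preprocessing functions (treat it as one plausible ordering under these assumptions, not a definitive recipe):

import spikeinterface.full as si

# recording: a multi-shank recording loaded elsewhere, with the "group"
# channel property set per shank.

# Detect bad channels on data that has NOT been re-referenced: CAR mixes
# channels together and can hide dead/noisy channels from the detector.
bad_channel_ids, channel_labels = si.detect_bad_channels(recording, method="coherence+psd")
recording = recording.remove_channels(bad_channel_ids)

# Filtering is applied per individual channel, so grouping does not matter here.
recording = si.bandpass_filter(recording, freq_min=300, freq_max=6000)

# Split by shank so that location-sensitive steps (CAR, highpass spatial
# filter) only ever see channels from a single shank.
recordings_by_shank = recording.split_by("group")
preprocessed = {
    group: si.common_reference(rec, reference="global", operator="median")
    for group, rec in recordings_by_shank.items()
}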

In general I think these preprocessing steps are all quite mysterious if you are not familiar with them. I've been meaning for ages to write a detailed guide on preprocessing steps (what and why for each step) but have never gotten around to it, and after speaking to people at the conference I think it would be useful. So I'll post this issue to get the ball rolling and invite contributions, maybe opening a PR next week!

*BTW I just realised this yesterday for IBL bad channel detection; maybe we should add an assert / warning?
*Also whitening, potentially.

@alejoe91 added the documentation and preprocessing labels Jun 7, 2024
@h-mayorquin (Collaborator) commented Jun 7, 2024

This would be great!

Let's keep the issues that are about preprocessing collected here:

@chiyu1203 commented

Thanks for initiating this! I have not yet been able to help with the documentation, but I would like to point out another preprocessing step that confuses me. I learned about preprocessing from the latest SpikeInterface_Tutorial.ipynb, where Saving SpikeInterface objects is introduced after Preprocessing, so I followed those steps for my analysis. However, I read @alejoe91's 2023 paper about compressing ephys data. In the paper, they discuss the benefits (e.g. a boosted compression ratio) and drawbacks (e.g. making the compression lossy) of applying preprocessing steps (e.g. a band-pass filter) before compressing the raw data. So, is it better to compress and save the raw data directly and only do preprocessing steps when applying spike sorting to the data?

@JoeZiminski (Collaborator, Author) commented Jun 21, 2024

Hmm, good question @chiyu1203. I would be interested in hearing other perspectives, but I think it depends a little on your own preference: whether to store the 'true' raw data forever, or a slightly preprocessed version.

If you are using NP probes, I guess you would want to do phase_shift, bandpass_filter, then store. These are minimal preprocessing steps, so you may not regret keeping your data in this format. However, things do change and new preprocessing methods are sometimes introduced, for example the introduction of the phase_shift step itself, or the ongoing discussion on accounting for headstage filter nonlinear phase distortions (#2943). So you never know what the preprocessing pipeline might look like in a few years. Maybe one day you will want to revisit the data with a more up-to-date pipeline and will wish you had the 'true' raw data. For me, I would err towards storing the raw data, considering the improvement in compression you get after bandpass filtering is probably relatively minimal (figure 7 in the paper). But both approaches have merit!
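To illustrate the two options (a sketch; folder names are placeholders and compression settings are up to you):

import spikeinterface.full as si

# Option A: archive the raw data losslessly (the Zarr format applies a
# lossless compressor) and re-run preprocessing from scratch when needed.
recording.save(folder="raw_archive", format="zarr")

# Option B: store a minimally preprocessed copy (phase shift + bandpass)
# for day-to-day work.
preprocessed = si.bandpass_filter(si.phase_shift(recording), freq_min=300, freq_max=6000)
preprocessed.save(folder="preprocessed_archive", format="binary")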

@h-mayorquin (Collaborator) commented

For long-term storage and data sharing, such as on DANDI, you want the data to be as raw as possible. You don't know how future users will want to use your data, and preprocessing steps might fall out of fashion, as @JoeZiminski indicates.

For working projects you may want to keep both for the sake of computational efficiency, but be sure to keep the raw data. Then, when you feel your dataset is in a state where you want to share it with someone other than yourself (a publication or an archive, for example), you go back to step 1.

Note that this only applies when the processed versions of the data are very large. For results that are really compute-heavy or hard to reproduce, it is a good idea to share the most direct input that supports a hypothesis (spike times, for example).

@JoeZiminski (Collaborator, Author) commented

(Note) there is no way to skip the highpass filter (300 Hz corner frequency) in KS4.

@JoeZiminski (Collaborator, Author) commented Jul 18, 2024

I'm still well up for this, just needing to prioritise some other things ahead of it. Until then we can use this issue as a dumping ground for potentially confusing things about preprocessing pipelines.

@jonpedros commented Oct 4, 2024

I'm curious about the benefits of applying each preprocessing step to grouped vs. non-grouped data when dealing with tetrode datasets. In my case, our recordings have plenty of EMG artifacts from a skull reference screw that are shared across all tetrodes, so it seems more effective to apply the common reference globally. I am not aware of any benefit to applying bandpass filters per group. For whitening, however, I am not so sure.
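Concretely, the two variants I am comparing look something like this (a sketch; as I understand it, the groups argument of common_reference is how per-tetrode referencing is done, but correct me if I'm misusing it):

import numpy as np
import spikeinterface.full as si

# Global CMR: the median across ALL channels is subtracted from every
# channel, so an artifact shared across tetrodes is removed everywhere.
rec_global = si.common_reference(rec, reference="global", operator="median")

# Per-tetrode CMR: each tetrode is referenced only against its own channels.
channel_ids = np.asarray(rec.get_channel_ids())
labels = rec.get_channel_groups()
groups = [list(channel_ids[labels == g]) for g in np.unique(labels)]
rec_grouped = si.common_reference(rec, reference="global", operator="median", groups=groups)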

I also have some doubts about whitening itself, regardless of the question of grouping before applying it, especially since some sorters highly recommend it before sorting.

I know that whitened data is not supposed to preserve the real waveforms and that it is supposed to highlight channel-specific signals. However, whitening tetrode-grouped data makes some channels seem to go completely silent while others become very noisy, and some likely spikes visible in the non-whitened data are erased from all channels of that tetrode.

300-6000 Hz bandpass + CMR: [image: pre-cmr]

300-6000 Hz bandpass + CMR + whitening on tetrode-grouped data: [image: pre-cmr-whit]

When whitening is applied locally but on non-grouped data, the results are less strange than with group-based whitening, but spikes still seem to be gone from some of the tetrodes.

300-6000 Hz bandpass + CMR + whitening with local covariance on non-grouped data: [image: pre-cmr-non-group-local-whit]

Finally, non-grouped whitening results look alright, but I'm not so sure. Running automatic sorting with mountainsort5 on this data yields spike-like units on only one tetrode, versus valid units on all tetrodes with non-whitened data.

300-6000 Hz bandpass + CMR + whitening on non-grouped data: [image: pre-cmr-non-group-whit]

@jonpedros commented Oct 15, 2024

@JoeZiminski I was wondering if I could get your input on this matter. Please tell me if it's not within the scope of this issue. Thanks!

@JoeZiminski (Collaborator, Author) commented Oct 17, 2024

Hi @jonpedros, thanks for this, definitely within the scope of this issue! This is a great question, and it's very interesting to see these plots. For reference, here is a similar plot from a Neuropixels recording showing the effect of whitening. Please also see this (rushed, will revise) notebook illustrating some of the points below.

Could I just check your setup? I am not that familiar with multi-tetrode recordings. So you have 6 tetrodes implanted in the brain; how far apart are they? Could you post the code you are running on the data?

I agree that filtering will not be affected by grouped vs. non-grouped analysis (it is applied per individual channel). Do the EMG artefacts persist in the data after filtering + CMR, or has that pretty much taken care of them? The raw data looks pretty good to me in terms of noise.

Just in case the background is useful: the intuition of the whitening step is to first estimate the covariance between all channels and store this in a covariance matrix $\Sigma$, a (num_channels x num_channels) matrix. For example, the first row captures how much channel 1 covaries with every channel (channel 1 vs. channel 1, channel 1 vs. channel 2, ..., channel 1 vs. channel n). Next, we would like to remove this covariance from the channels, which is achieved by applying a transformation such that the transformed data has an identity covariance matrix. If you follow the operation (e.g. track what happens in a code example, see notebook), we end up essentially subtracting from each channel its surrounding channels, weighted by how much the channels covary. So, say we are whitening channel 4: we take the channel 4 signal and subtract channel 3, weighted by how much channel 3 covaries with channel 4. If the channels are strongly correlated, this removes both signal and noise, leading to units becoming isolated on their strongest-loading channel.
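In plain NumPy the whole operation is only a few lines (a toy sketch of symmetric/ZCA whitening on simulated data; the real implementation estimates the covariance robustly and in chunks):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 correlated channels (like one tetrode), shape (num_samples, num_channels)
shared = rng.normal(size=(10_000, 1))
traces = shared + 0.5 * rng.normal(size=(10_000, 4))

cov = np.cov(traces, rowvar=False)  # Sigma, shape (num_channels, num_channels)

# W = Sigma^(-1/2), via eigendecomposition (the epsilon avoids dividing by ~0)
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T

whitened = traces @ W
print(np.cov(whitened, rowvar=False).round(2))  # ~ identity matrix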

In this case, each tetrode has a lot of shared signal (spikes + noise), but not much signal seems to be shared between tetrodes, at least assessed visually. It looks like, when only a few highly correlated signals are included, whitening removes a lot of signal and noise from channels in fairly unpredictable ways. You can play around with the second part of the attached notebook to try and emulate this. For example, in the grouped whitening of the first tetrode, the signal across the four channels is basically all isolated to two non-neighbouring channels. I'm not sure how useful this would be for the sorting. It will also be worth taking a look at the whitening matrices; maybe in some cases issues in the estimation lead to the values becoming very small.

That being said, whitening heavily rescales the data, and if you whiten per tetrode and then plot all tetrodes together, the signal on one tetrode's channels might just look very small in comparison to that on other tetrodes. If you blow up the scaling, maybe they look a little more normal, with spikes nicely isolated to certain channels within the tetrode.

It would be interesting to see the whitening matrix produced for the 4th tetrode down in the grouped data. I have no idea why so much noise is introduced!

When you estimate the whitening on all channels together, the noise correlations that exist between tetrodes enter the estimation. Maybe the transform can then spend its effort reducing the noise correlations between tetrodes and doesn't need to aggressively zero the spikes. Alternatively, the scaling will be better normalised across tetrodes, so maybe this is why it looks less weird. There are still definitely some strange effects, such as units appearing to completely disappear.

In sum, I am not sure whitening is a good idea for this data. Because of the setup, the neurons are already quite isolated without whitening. The correlated noise across tetrodes doesn't look that strong (although it is hard to see by eye at this scale). I would go ahead and sort with and without whitening and see how interpretable the results look; I would imagine they will be better without whitening. It is also worth exploring which sorters suit this dataset best. I think the use of Neuropixels / high-density probes is in many ways built into Kilosort's design and its defaults, and I wonder how much this is the case for other sorters (e.g. herdingspikes). It might even make sense to sort per tetrode (see the sketch below). @chrishalcrow recently mentioned herdingspikes is very fast and should be able to advise.
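If you want to try per-tetrode sorting, something like this should work (a sketch; parameter names as in recent SpikeInterface versions, and the sorter is just an example):

import spikeinterface.full as si

# Runs the sorter independently on each tetrode, using the "group"
# channel property, and aggregates the outputs into one sorting.
sorting = si.run_sorter_by_property(
    sorter_name="mountainsort5",
    recording=recording,
    grouping_property="group",
    folder="sorting_by_tetrode",
)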

It would also be interesting to hear @yger's thoughts on this, as he has been working on improving the stability of covariance estimation during whitening.

@chrishalcrow (Collaborator) commented

Yes, if you can get herdingspikes to install (https://github.com/mhhennig/HS2), it's really, really fast for sorting. So it's good for this kind of analysis, where you try out lots of different preprocessing pipelines.

Another note: someone advised me that whitening can be good for sorting, but you then want the unwhitened recording for extracting waveforms etc. So you might want to try that out, doing something like:

import spikeinterface.full as si

unwhitened_recording = ...  # your preprocessed, but not whitened, recording
whitened_recording = si.whiten(unwhitened_recording)

# sort on the whitened data, but extract waveforms etc. from the unwhitened data
sorting = si.run_sorter(sorter_name="...", recording=whitened_recording)
sorting_analyzer = si.create_sorting_analyzer(sorting=sorting, recording=unwhitened_recording)

This is second-hand advice, so don't take it too seriously, but it might be worth a try!

@jonpedros commented Oct 24, 2024

Thank you very much for the thorough responses @JoeZiminski and @chrishalcrow!

MountainSort takes ages to run right now, so it's good to have something quick to run on different iterations of the preprocessed data. I will also try getting the waveforms from the unwhitened data.

I'm currently always sorting by tetrode, except when I tried KS2, which really didn't seem tuned to that kind of data at all. MountainSort seemed to give the best results. My reason for insisting on whitening a bit longer (even though filtering and CMR seem to take care of the noise) before declaring it inadequate for my dataset was, first, the recommendation to use it without caveats before sorting with MountainSort 5.

Second, I had issues with MS5 finding only what seemed to be a noise cluster on one of the tetrodes, even though I could isolate likely units from that single cluster with manual sorting and could see likely waveforms in the raw data (it's the second tetrode from above, by the way). So I was wondering if that might be a result of not whitening the data. Checking the extracted waveforms showed that the cluster seemed to result from extracting almost every oscillation in the traces, which obviously drowns real spikes in noise.

I will also try plotting whitened data separately for each tetrode, although that might not "solve" it for some of the tetrodes.

In my understanding, a partial approach would be to simply z-score the data so that it has unit variance, so that the standard-deviation-based parameters used by many sorters make more sense. What do you think? Something like the sketch below is what I have in mind.
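For example (assuming the zscore preprocessing step works the way I think it does):

import spikeinterface.full as si

# Scale each channel to zero mean / unit variance without mixing channels,
# so thresholds expressed in standard deviations behave consistently.
rec_z = si.zscore(rec)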

At any rate, @JoeZiminski would you like to check my dataset if you have time? I've uploaded my data to figshare here and adapted my script so that (I'm hoping) it will easily run as is.

@JoeZiminski (Collaborator, Author) commented

Hey @jonpedros, thanks for your message. Sorry, I don't have a chance to respond in full at the moment, but I wanted to mention that, looking at your shared data, if you are on a version > 0.100.8 you might be affected by #3505. If so, as a workaround, casting with recording = si.astype(recording, np.float32) before running whitening should fix it.

@jonpedros commented Oct 30, 2024

@JoeZiminski Thank you for catching that issue! Casting the dtype completely solved the strange outputs, and I also seem to have gotten better sorting results overall with whitening. With the dtype issue fixed, whitening with local covariance versus simply grouping by tetrode also returns the same results, which is what I expected initially. Running it on all the traces with global covariance yields almost identical results, at least assessing visually.

@JoeZiminski (Collaborator, Author) commented

Hey @jonpedros, great! If it's not too much hassle, out of interest, could you re-make the plots above with the fixed whitening? I am curious to see how it looks when everything is working.

@jonpedros commented Nov 19, 2024

Sure! Sorry for the delay.

If I may ask a related question: I noticed the principal_components extension has a whiten parameter, "If True, waveforms are pre-whitened" (default True). Should this be turned off when postprocessing whitened waveforms?

Raw traces: [image: Picture1]
Bandpassed traces (300-6000 Hz): [image: Picture2]
CMR re-referenced traces: [image: Picture3]
Whitening with global covariance: [image: Picture4]

@JoeZiminski (Collaborator, Author) commented

Very cool, yes, that looks much more as expected. Thanks @jonpedros!

I was also curious about this argument, which I saw for the first time the other day. I would expect the principal components to be better computed from the raw, rather than whitened, waveforms when computing quality metrics. I guess a benefit of computing them whitened is that it better matches the conditions during sorting (where everything is white), but I wouldn't have thought that is very important. @alejoe91 what do you think?
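For anyone who wants to compare, the parameter can be toggled when the extension is computed (a sketch on an existing SortingAnalyzer with waveforms already computed; recomputing overwrites the previous result):

# PCs fitted on the raw waveforms
sorting_analyzer.compute("principal_components", whiten=False)

# PCs fitted on pre-whitened waveforms (the default)
sorting_analyzer.compute("principal_components", whiten=True)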
