A detailed guide on preprocessing #2996

Open

JoeZiminski opened this issue Jun 7, 2024 · 16 comments
Labels: documentation (Improvements or additions to documentation), preprocessing (Related to preprocessing module)

Comments

@JoeZiminski (Collaborator) commented Jun 7, 2024

For ephys pipelines there are a number of quite complex preprocessing steps that need to be done in a specific order. There are also some nuances to certain preprocessing steps, particularly around multi-shank probes and bad channel detection. For example, CAR, IBL bad channel detection*, the highpass spatial filter, etc. make assumptions about channel locations that are broken in the multi-shank context. Bad channel detection should not be done after CAR, etc.
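To make the ordering concrete, here is a minimal sketch of one reasonable multi-shank pipeline using SpikeInterface's preprocessing functions (treat it as one plausible ordering under these assumptions, not a definitive recipe):

import spikeinterface.full as si

# recording: a multi-shank recording loaded elsewhere, with the "group"
# channel property set per shank.

# Detect bad channels on data that has NOT been re-referenced: CAR mixes
# channels together and can hide dead/noisy channels from the detector.
bad_channel_ids, channel_labels = si.detect_bad_channels(recording, method="coherence+psd")
recording = recording.remove_channels(bad_channel_ids)

# Filtering is applied per individual channel, so grouping does not matter here.
recording = si.bandpass_filter(recording, freq_min=300, freq_max=6000)

# Split by shank so that location-sensitive steps (CAR, highpass spatial
# filter) only ever see channels from a single shank.
recordings_by_shank = recording.split_by("group")
preprocessed = {
    group: si.common_reference(rec, reference="global", operator="median")
    for group, rec in recordings_by_shank.items()
}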

In general I think these preprocessing steps are all quite mysterious if you are not familiar with them. I've been meaning for ages to write a detailed guide on preprocessing steps (what and why for each step) but have never gotten around to it, and after speaking to people at the conference I think it would be useful. So I'll post this issue to get the ball rolling and invite contributions, maybe opening a PR next week!

*BTW I just realised this yesterday for IBL bad channel detection; maybe we should add an assert / warning?
*Also whitening, potentially.

@alejoe91 added the documentation and preprocessing labels Jun 7, 2024
@h-mayorquin (Collaborator) commented Jun 7, 2024

This would be great!

Let's keep the issues that are about preprocessing collected here:

@chiyu1203 commented

Thanks for initiating this! I have not yet been able to help with the documentation, but I would like to point out another preprocessing step that confuses me. I learned about preprocessing from the latest SpikeInterface_Tutorial.ipynb, where Saving SpikeInterface objects is introduced after Preprocessing, so I followed those steps for my analysis. However, I read @alejoe91's 2023 paper about compressing ephys data. In the paper, they discuss the benefits (e.g. a boosted compression ratio) and drawbacks (e.g. making the compression lossy) of applying preprocessing steps (e.g. a band-pass filter) before compressing the raw data. So, is it better to compress and save the raw data directly and only do preprocessing steps when applying spike sorting to the data?

@JoeZiminski (Collaborator, Author) commented Jun 21, 2024

Hmm, good question @chiyu1203. I would be interested in hearing other perspectives, but I think it depends a little on your own preference: whether to store the 'true' raw data forever, or a slightly preprocessed version.

If you are using NP probes, I guess you would want to do phase_shift, bandpass_filter, then store. These are minimal preprocessing steps, so you may not regret keeping your data in this format. However, things do change and new preprocessing methods are sometimes introduced, for example the introduction of the phase_shift step itself, or the ongoing discussion on accounting for headstage filter nonlinear phase distortions (#2943). So you never know what the preprocessing pipeline might look like in a few years. Maybe one day you will want to revisit the data with a more up-to-date pipeline and will wish you had the 'true' raw data. For me, I would err towards storing the raw data, considering the improvement in compression you get after bandpass filtering is probably relatively minimal (figure 7 in the paper). But both approaches have merit!
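To illustrate the two options (a sketch; folder names are placeholders and compression settings are up to you):

import spikeinterface.full as si

# Option A: archive the raw data losslessly (the Zarr format applies a
# lossless compressor) and re-run preprocessing from scratch when needed.
recording.save(folder="raw_archive", format="zarr")

# Option B: store a minimally preprocessed copy (phase shift + bandpass)
# for day-to-day work.
preprocessed = si.bandpass_filter(si.phase_shift(recording), freq_min=300, freq_max=6000)
preprocessed.save(folder="preprocessed_archive", format="binary")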

@h-mayorquin (Collaborator) commented

For long-term storage and data sharing, such as on DANDI, you want the data to be as raw as possible. You don't know how future users will want to use your data, and preprocessing steps might fall out of fashion, as @JoeZiminski indicates.

For working projects you may want to keep both for the sake of computational efficiency, but be sure to keep the raw data. Then, when you feel your dataset is in a state where you want to share it with someone other than yourself (a publication or an archive, for example), you go back to step 1.

Note that this only applies when the processed versions of the data are very large. For results that are really compute-heavy or hard to reproduce, it is a good idea to share the most direct input that supports a hypothesis (spike times, for example).

@JoeZiminski (Collaborator, Author) commented

(Note) there is no way to skip the highpass filter (300 Hz corner frequency) in KS4.

@JoeZiminski (Collaborator, Author) commented Jul 18, 2024

I'm still well up for this, just needing to prioritise some other things ahead of it. Until then we can use this issue as a dumping ground for potentially confusing things about preprocessing pipelines.

@jonpedros commented Oct 4, 2024

I'm curious about the benefits of applying each preprocessing step to grouped vs. non-grouped data when dealing with tetrode datasets. In my case, our recordings have plenty of EMG artifacts from a skull reference screw that are shared across all tetrodes, so it seems more effective to apply the common reference globally. I am not aware of any benefit to applying bandpass filters per group. For whitening, however, I am not so sure.
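Concretely, the two variants I am comparing look something like this (a sketch; as I understand it, the groups argument of common_reference is how per-tetrode referencing is done, but correct me if I'm misusing it):

import numpy as np
import spikeinterface.full as si

# Global CMR: the median across ALL channels is subtracted from every
# channel, so an artifact shared across tetrodes is removed everywhere.
rec_global = si.common_reference(rec, reference="global", operator="median")

# Per-tetrode CMR: each tetrode is referenced only against its own channels.
channel_ids = np.asarray(rec.get_channel_ids())
labels = rec.get_channel_groups()
groups = [list(channel_ids[labels == g]) for g in np.unique(labels)]
rec_grouped = si.common_reference(rec, reference="global", operator="median", groups=groups)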

I also have some doubts about whitening itself, regardless of the question of grouping before applying it, especially since some sorters highly recommend it before sorting.

I know that whitened data is not supposed to preserve the real waveforms and that it is supposed to highlight channel-specific signals. However, whitening tetrode-grouped data makes some channels seem to go completely silent while others become very noisy, and some likely spikes visible in the non-whitened data are erased from all channels of that tetrode.

300-6000 Hz bandpass + CMR: [image: pre-cmr]

300-6000 Hz bandpass + CMR + whitening on tetrode-grouped data: [image: pre-cmr-whit]

When whitening is applied locally but on non-grouped data, the results are less strange than with group-based whitening, but spikes still seem to be gone from some of the tetrodes.

300-6000 Hz bandpass + CMR + whitening with local covariance on non-grouped data: [image: pre-cmr-non-group-local-whit]

Finally, non-grouped whitening results look alright, but I'm not so sure. Running automatic sorting with mountainsort5 on this data yields spike-like units on only one tetrode, versus valid units on all tetrodes with non-whitened data.

300-6000 Hz bandpass + CMR + whitening on non-grouped data: [image: pre-cmr-non-group-whit]

@jonpedros commented Oct 15, 2024

@JoeZiminski I was wondering if I could get your input on this matter. Please tell me if it's not within the scope of this issue. Thanks!

@JoeZiminski (Collaborator, Author) commented Oct 17, 2024

Hi @jonpedros, thanks for this, definitely within the scope of this issue! This is a great question, and it's very interesting to see these plots. For reference, here is a similar plot from a Neuropixels recording showing the effect of whitening. Please also see this (rushed, will revise) notebook illustrating some of the points below.

Could I just check your setup? I am not that familiar with multi-tetrode recordings. So you have 6 tetrodes implanted in the brain; how far apart are they? Could you post the code you are running on the data?

I agree that filtering will not be affected by grouped vs. non-grouped analysis (it is applied per individual channel). Do the EMG artefacts persist in the data after filtering + CMR, or has that pretty much taken care of them? The raw data looks pretty good to me in terms of noise.

Just in case the background is useful: the intuition of the whitening step is to first estimate the covariance between all channels and store this in a covariance matrix $\Sigma$, a (num_channels x num_channels) matrix. For example, the first row captures how much channel 1 covaries with every channel (channel 1 vs. channel 1, channel 1 vs. channel 2, ..., channel 1 vs. channel n). Next, we would like to remove this covariance from the channels, which is achieved by applying a transformation such that the transformed data has an identity covariance matrix. If you follow the operation (e.g. track what happens in a code example, see notebook), we end up essentially subtracting from each channel its surrounding channels, weighted by how much the channels covary. So, say we are whitening channel 4: we take the channel 4 signal and subtract channel 3, weighted by how much channel 3 covaries with channel 4. If the channels are strongly correlated, this removes both signal and noise, leading to units becoming isolated on their strongest-loading channel.
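In plain NumPy the whole operation is only a few lines (a toy sketch of symmetric/ZCA whitening on simulated data; the real implementation estimates the covariance robustly and in chunks):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 correlated channels (like one tetrode), shape (num_samples, num_channels)
shared = rng.normal(size=(10_000, 1))
traces = shared + 0.5 * rng.normal(size=(10_000, 4))

cov = np.cov(traces, rowvar=False)  # Sigma, shape (num_channels, num_channels)

# W = Sigma^(-1/2), via eigendecomposition (the epsilon avoids dividing by ~0)
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T

whitened = traces @ W
print(np.cov(whitened, rowvar=False).round(2))  # ~ identity matrix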

In this case, each tetrode has a lot of shared signal (spikes + noise), but not much signal seems to be shared between tetrodes, at least assessed visually. It looks like, when only a few highly correlated signals are included, whitening removes a lot of signal and noise from channels in fairly unpredictable ways. You can play around with the second part of the attached notebook to try and emulate this. For example, in the grouped whitening of the first tetrode, the signal across the four channels is basically all isolated to two non-neighbouring channels. I'm not sure how useful this would be for the sorting. It will also be worth taking a look at the whitening matrices; maybe in some cases issues in the estimation lead to the values becoming very small.

That being said, whitening heavily rescales the data, and if you whiten per tetrode and then plot all tetrodes together, the signal on one tetrode's channels might just look very small in comparison to that on other tetrodes. If you blow up the scaling, maybe they look a little more normal, with spikes nicely isolated to certain channels within the tetrode.

It would be interesting to see the whitening matrix produced for the 4th tetrode down in the grouped data. I have no idea why so much noise is introduced!

When you estimate the whitening on all channels together, the noise correlations that exist between tetrodes enter the estimation. Maybe the transform can then spend its effort reducing the noise correlations between tetrodes and doesn't need to aggressively zero the spikes. Alternatively, the scaling will be better normalised across tetrodes, so maybe this is why it looks less weird. There are still definitely some strange effects, such as units appearing to completely disappear.

In sum, I am not sure whitening is a good idea for this data. Because of the setup, the neurons are already quite isolated without whitening. The correlated noise across tetrodes doesn't look that strong (although it is hard to see by eye at this scale). I would go ahead and sort with and without whitening and see how interpretable the results look; I would imagine they will be better without whitening. It is also worth exploring which sorters suit this dataset best. I think the use of Neuropixels / high-density probes is in many ways built into Kilosort's design and its defaults, and I wonder how much this is the case for other sorters (e.g. herdingspikes). It might even make sense to sort per tetrode (see the sketch below). @chrishalcrow recently mentioned herdingspikes is very fast and should be able to advise.
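If you want to try per-tetrode sorting, something like this should work (a sketch; parameter names as in recent SpikeInterface versions, and the sorter is just an example):

import spikeinterface.full as si

# Runs the sorter independently on each tetrode, using the "group"
# channel property, and aggregates the outputs into one sorting.
sorting = si.run_sorter_by_property(
    sorter_name="mountainsort5",
    recording=recording,
    grouping_property="group",
    folder="sorting_by_tetrode",
)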

It would also be interesting to hear @yger's thoughts on this, as he has been working on improving the stability of covariance estimation during whitening.

@chrishalcrow (Collaborator) commented

Yes, if you can get herdingspikes to install (https://github.com/mhhennig/HS2), it's really, really fast for sorting. So it's good for this kind of analysis, where you try out lots of different preprocessing pipelines.

Another note: someone advised me that whitening can be good for sorting, but you then want the unwhitened recording for extracting waveforms etc. So you might want to try that out, doing something like:

import spikeinterface.full as si

unwhitened_recording = ...  # your preprocessed, but not whitened, recording
whitened_recording = si.whiten(unwhitened_recording)

# sort on the whitened data, but extract waveforms etc. from the unwhitened data
sorting = si.run_sorter(sorter_name="...", recording=whitened_recording)
sorting_analyzer = si.create_sorting_analyzer(sorting=sorting, recording=unwhitened_recording)

This is second-hand advice, so don't take it too seriously, but it might be worth a try!

@jonpedros commented Oct 24, 2024

Thank you very much for the thorough responses @JoeZiminski and @chrishalcrow!

MountainSort takes ages to run right now, so it's good to have something quick to run on different iterations of the preprocessed data. I will also try getting the waveforms from the unwhitened data.

I'm currently always sorting by tetrode, except when I tried KS2, which really didn't seem tuned to that kind of data at all. MountainSort seemed to give the best results. My reason for insisting on whitening a bit longer (even though filtering and CMR seem to take care of the noise) before declaring it inadequate for my dataset was, first, the recommendation to use it without caveats before sorting with MountainSort 5.

Second, I had issues with MS5 finding only what seemed to be a noise cluster on one of the tetrodes, even though I could isolate likely units from that single cluster with manual sorting and could see likely waveforms in the raw data (it's the second tetrode from above, by the way). So I was wondering if that might be a result of not whitening the data. Checking the extracted waveforms showed that the cluster seemed to result from extracting almost every oscillation in the traces, which obviously drowns real spikes in noise.

I will also try plotting whitened data separately for each tetrode, although that might not "solve" it for some of the tetrodes.

In my understanding, a partial approach would be to simply z-score the data so that it has unit variance, so that the standard-deviation-based parameters used by many sorters make more sense. What do you think? Something like the sketch below is what I have in mind.
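For example (assuming the zscore preprocessing step works the way I think it does):

import spikeinterface.full as si

# Scale each channel to zero mean / unit variance without mixing channels,
# so thresholds expressed in standard deviations behave consistently.
rec_z = si.zscore(rec)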

At any rate, @JoeZiminski would you like to check my dataset if you have time? I've uploaded my data to figshare here and adapted my script so that (I'm hoping) it will easily run as is.

@JoeZiminski (Collaborator, Author) commented

Hey @jonpedros, thanks for your message. Sorry, I don't have a chance to respond in full at the moment, but I wanted to mention that, looking at your shared data, if you are on a version > 0.100.8 you might be affected by #3505. If so, as a workaround, casting with recording = si.astype(recording, np.float32) before running whitening should fix it.

@jonpedros commented Oct 30, 2024

@JoeZiminski Thank you for catching that issue! Casting the dtype completely solved the strange outputs, and I also seem to have gotten better sorting results overall with whitening. With the dtype issue fixed, whitening with local covariance versus simply grouping by tetrode also returns the same results, which is what I expected initially. Running it on all the traces with global covariance yields almost identical results, at least assessing visually.

@JoeZiminski (Collaborator, Author) commented

Hey @jonpedros, great! If it's not too much hassle, out of interest, could you re-make the plots above with the fixed whitening? I am curious to see how it looks when everything is working.

@jonpedros commented Nov 19, 2024

Sure! Sorry for the delay.

If I may ask a related question: I noticed the principal_components extension has a whiten parameter, "If True, waveforms are pre-whitened" (default True). Should this be turned off when postprocessing whitened waveforms?

Raw traces: [image: Picture1]
Bandpassed traces (300-6000 Hz): [image: Picture2]
CMR re-referenced traces: [image: Picture3]
Whitening with global covariance: [image: Picture4]

@JoeZiminski (Collaborator, Author) commented

Very cool, yes, that looks much more as expected. Thanks @jonpedros!

I was also curious about this argument, which I saw for the first time the other day. I would expect the principal components to be better computed from the raw, rather than whitened, waveforms when computing quality metrics. I guess a benefit of computing them whitened is that it better matches the conditions during sorting (where everything is white), but I wouldn't have thought that is very important. @alejoe91 what do you think?
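For anyone who wants to compare, the parameter can be toggled when the extension is computed (a sketch on an existing SortingAnalyzer with waveforms already computed; recomputing overwrites the previous result):

# PCs fitted on the raw waveforms
sorting_analyzer.compute("principal_components", whiten=False)

# PCs fitted on pre-whitened waveforms (the default)
sorting_analyzer.compute("principal_components", whiten=True)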
