A detailed guide on preprocessing #2996
This would be great! Let's keep issues here that are about preprocessing.
Thanks for initiating this! I have not yet been able to help with the documentation, but I would like to point out another preprocessing step that confuses me. I learned about preprocessing from the latest SpikeInterface_Tutorial.ipynb, where Saving SpikeInterface objects is introduced after Preprocessing, so I followed those steps for my analysis. However, I read @alejoe91's 2023 paper about compressing ephys data. In the paper, they discussed the benefits (e.g. boosting the compression ratio) and drawbacks (e.g. making the compression lossy) of applying preprocessing steps (e.g. a band-pass filter) before compressing the raw data. So, is it better to compress and save the raw data directly and only do preprocessing when applying spike sorting to the data?
Hmm, good question @chiyu1203. I would be interested in hearing other perspectives, but I think it depends a little on your own preference, in terms of whether to store the 'true' raw data forever or a slightly pre-processed version. If you are using NP probes, I guess you would want to do…
For long-term storage and data sharing, such as on Dandi, you want the data to be as raw as possible. You don't know how future users would like to use your data, and pre-processing steps might fall out of fashion, as @JoeZiminski indicates. For working projects you want to keep both, for the sake of computational efficiency, but be sure to keep the raw data. Then, when you feel that your dataset is in such a state that you want to share it with someone other than yourself (publishing or depositing in an archive, for example), you go back to step 1. Note that this is only valid for data whose processed versions are very large. For results that are really computing-heavy or hard to reproduce, it is a good idea to share the most direct input that supports a hypothesis (spike times, for example).
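To make the "archive raw, preprocess lazily" workflow above concrete, here is a minimal sketch using SpikeInterface. The reader (`read_openephys`), the paths, and the save arguments are placeholders and assumptions (they depend on your acquisition system and SpikeInterface version); the point is just that preprocessing is lazy and the raw, losslessly compressed copy is the one you archive and share.

```python
import spikeinterface.full as si

# 1. Archive the raw recording as-is, losslessly compressed (e.g. to Zarr).
#    Reader and paths are hypothetical placeholders.
raw_recording = si.read_openephys("path/to/raw_session")
raw_recording.save(folder="archive/raw_session.zarr", format="zarr")

# 2. For the working project, build a lazy preprocessing chain; nothing is
#    written to disk until you explicitly save or run a sorter.
preprocessed = si.bandpass_filter(raw_recording, freq_min=300, freq_max=6000)
preprocessed = si.common_reference(preprocessed, reference="global", operator="median")

# 3. Optionally cache the preprocessed version for speed, but keep the raw
#    archive as the copy you publish or upload (e.g. to an archive like Dandi).
preprocessed.save(folder="scratch/preprocessed_session")
```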
(Note) there is no way to skip the highpass filter (300 Hz corner frequency) in KS4.
I'm still well up for this, just needing to prioritise some other things ahead of it. We can use this issue as a dumping ground for potentially confusing things about preprocessing pipelines until it's ready.
@JoeZiminski Was wondering if I could get your input on this matter. Please tell me if it's not within the scope of this issue. Thanks!
Hi @jonpedros, thanks for this, it's definitely within the scope of this issue! This is a great question, and it's very interesting to see these plots. For reference, here is a similar plot from a Neuropixels recording showing the effect of whitening. Please also see this (rushed, will revise) notebook illustrating some of the points below.

Could I just check your setup? I am not that familiar with multi-tetrode recordings. So you have 6 tetrodes implanted in the brain; how far apart are they? Could you post the code you are running on the data? I agree that filtering will not be affected by grouped vs. non-grouped analysis (it is applied per individual channel). Do the EMG artefacts persist in the data after filtering + CMR, or has that pretty much taken care of them? The raw data looks pretty good to me in terms of noise.

Just in case the background is useful: the intuition of the whitening step is to first estimate the covariance between all channels (storing this in a covariance matrix), then apply a transform that removes these correlations. In this case, each tetrode has a lot of shared signal (spikes + noise), but not much signal seems to be shared between tetrodes, at least assessing visually. It looks like in this case, when only a few, highly correlated signals are included, the whitening is removing a lot of signal and noise from channels in fairly unpredictable ways. You can play around with the second part of the attached notebook to try and emulate this. For example, in the grouped whitening for the first tetrode, the signal across the four channels is basically all isolated to two non-neighbouring channels; I'm not sure how useful this would be for the sorting. It will also be worth taking a look at the whitening matrices, as maybe in some cases issues in the estimation lead to the values becoming very small.

That being said, whitening heavily rescales the data, and if you whiten per tetrode and then plot all tetrodes together, the signal on one tetrode's channels might just look very small in comparison to those on other tetrodes. If you blow up the scaling, maybe they look a little more normal, with spikes nicely isolated to certain channels within the tetrode. It would be interesting to see the whitening matrix produced for the 4th tetrode down in the grouped data; I have no idea why so much noise is introduced! When you estimate the whitening all together, the noise correlations that exist between tetrodes are introduced into the estimation. Maybe, because it accounts for that covariance structure, the whitening can reduce the noise correlations between tetrodes without needing to aggressively zero the spikes. Alternatively, the scaling will be better normalised across tetrodes, so maybe this is why it looks less weird. There are still definitely some strange effects, such as units appearing to completely disappear.

In sum, I am not sure whitening is a good idea for this data. Because of the setup, the neurons in the data without whitening are already quite isolated, and the correlated noise across tetrodes doesn't look that strong (although it is hard to see by eye at this scale). I would go ahead and sort with and without whitening and see how interpretable the results look; I would imagine they would be better without whitening. It is also worth exploring which sorters suit this dataset best; I think the use of Neuropixels / high-density probes is in many ways built into Kilosort's design and its defaults. I wonder how much this is the case for other sorters (e.g. HerdingSpikes). It might even make sense to sort per tetrode.

@chrishalcrow recently mentioned HerdingSpikes is very fast and should be able to advise. It would also be interesting to hear @yger's thoughts on this, as he has been working on improving the stability of covariance estimation during whitening.
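As an illustration of the covariance intuition above, here is a toy NumPy sketch of ZCA-style whitening on fake tetrode-like data. It is not SpikeInterface's exact implementation, just a way to see how strongly correlated channels get decorrelated and rescaled, and what the whitening matrix looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 20_000, 4

# Fake "tetrode" data: one shared signal plus independent noise per channel,
# so the channels are highly correlated (as within a real tetrode).
shared = rng.normal(size=(n_samples, 1))
traces = shared + 0.5 * rng.normal(size=(n_samples, n_channels))

# Estimate the channel covariance, then build W = C^(-1/2) by eigendecomposition.
cov = np.cov(traces, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
whitening_matrix = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-10)) @ eigvecs.T

whitened = traces @ whitening_matrix

# The whitened covariance is ~identity: channels are decorrelated, unit variance.
print(np.cov(whitened, rowvar=False).round(2))
print(whitening_matrix.round(2))  # inspect how aggressively channels are mixed
```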
Yes, if you can get herdingspikes to install (https://github.com/mhhennig/HS2) it's really, really fast for sorting. So it's good for this kind of analysis where you try out lots of different preprocessing pipelines.

Another note: someone advised me that whitening can be good for sorting, but you then want the unwhitened recording for extracting waveforms etc. So you might want to try that out, doing something like:

```python
import spikeinterface.full as si

unwhitened_recording = ...  # your preprocessed (but not whitened) recording
whitened_recording = si.whiten(unwhitened_recording)

# Sort on the whitened recording, but attach the unwhitened one to the analyzer
# so waveforms are extracted from unwhitened data.
sorting = si.run_sorter(sorter_name="...", recording=whitened_recording)
sorting_analyzer = si.create_sorting_analyzer(sorting=sorting, recording=unwhitened_recording)
```

This is second-hand advice, so don't take it too seriously, but it might be worth a try!
Thank you very much for the thorough responses @JoeZiminski and @chrishalcrow! MountainSort takes ages to run right now, so it's good to have something quick to run with different iterations of preprocessed data. I will also try getting the waveforms from unwhitened data.

I currently always try sorting by tetrode, except when I tried KS2, which really didn't seem tuned at all to that kind of data. MountainSort seemed to give the best results. My reason for insisting on trying whitening a bit (even though filtering and CMR seem to take care of the noise) before considering it inadequate for my dataset was, first, the recommendation to use it with no caveats before sorting with MountainSort 5. Also, I had issues with MS5 finding only what seemed to be a noise cluster in one of the tetrodes, even though I could isolate likely units from that single cluster with manual sorting, and even though I could see likely waveforms in the raw data (it's the data from the second tetrode above, by the way). So I was wondering if that might be a result of not whitening the data. Checking the extracted waveforms showed that the cluster seemed to be the result of extracting almost every oscillation from the traces, which obviously drowns real spikes in noise.

I will also try plotting the whitened data separately for each tetrode, although that might not "solve" it for some of the tetrodes. In my understanding, a partial approach would be to simply z-score the data so that it has variance 1, so that the standard-deviation-based parameters used by many sorters make more sense. What do you think? At any rate, @JoeZiminski would you like to check my dataset if you have time? I've uploaded my data to figshare here and adapted my script so that (I'm hoping) it will easily run as is.
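For reference, the z-scoring idea mentioned above can be tried directly in SpikeInterface. A minimal sketch, assuming `preprocessed` is the filtered + CMR recording from the shared script (a hypothetical name here); unlike whitening, z-scoring rescales each channel to roughly unit variance without decorrelating channels.

```python
import spikeinterface.full as si

# Rescale each channel to ~unit variance so std-based sorter thresholds
# behave consistently across channels; no cross-channel decorrelation.
zscored = si.zscore(preprocessed)

# Sorter name shown as an example; adjust to whichever sorter you are testing.
sorting = si.run_sorter(sorter_name="mountainsort5", recording=zscored)
```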
Hey @jonpedros, thanks for your message. Sorry, I don't have a chance to respond in full ATM, but I wanted to mention, looking at your shared data, if you are on version >
@JoeZiminski Thank you for catching that issue! Casting the dtype completely solved the strange outputs, and I also seem to have got better sorting results overall with whitening. With the dtype issue fixed, running whitening with local covariance versus simply grouped by tetrode also returns the same results, which is what I expected initially. Running it on all the traces with global covariance yields almost identical results, at least assessing visually.
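For anyone hitting the same problem, a minimal sketch of the dtype fix, again assuming `preprocessed` is an integer-dtype recording (hypothetical name); exact keyword names may differ between SpikeInterface versions.

```python
import spikeinterface.full as si

# Cast integer traces to float before whitening so the covariance estimation
# and rescaling are not distorted by integer arithmetic.
recording_float = si.astype(preprocessed, dtype="float32")
whitened = si.whiten(recording_float)
```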
Hey @jonpedros, great! If it's not too much hassle, out of interest could you re-make the plots above with the fixed whitening? I am curious to see how it looks when everything is working.
Very cool, yes that looks much more as expected. Thanks @jonpedros! I was also curious about this argument, which I saw for the first time the other day. I would expect that the principal components would be better computed from the raw, rather than whitened, waveforms when computing quality metrics. I guess a benefit of computing them whitened is that it better matches the conditions during sorting (where everything is white), but I wouldn't have thought that is very important. @alejoe91 what do you think?
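A hedged sketch of how that would look in practice, reusing the hypothetical `sorting` and `unwhitened_recording` names from the earlier snippet: the unwhitened recording is attached to the SortingAnalyzer, so waveforms, principal components, and quality metrics are computed on unwhitened data even though the sorting itself came from the whitened traces. Extension names follow the current SortingAnalyzer API and may change between versions.

```python
import spikeinterface.full as si

# Waveform-based extensions are computed from the recording attached here.
sorting_analyzer = si.create_sorting_analyzer(sorting=sorting, recording=unwhitened_recording)
sorting_analyzer.compute(
    ["random_spikes", "waveforms", "templates", "principal_components", "quality_metrics"]
)
```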
For ephys pipelines there are a number of quite complex preprocessing steps that need to be done in a specific order. Also, there are some nuances to some preprocessing steps, particularly around multi-shank probes and bad channel detection. For example, CAR, IBL bad channel detection*, the highpass spatial filter, etc. make assumptions about channel location that are broken in the multi-shank context. Bad channel detection should not be done after CAR, etc. (a rough pipeline sketch illustrating this ordering follows below).
In general I think these preprocessing steps are all quite mysterious if you are not familiar with them. I've been meaning for ages to write a detailed guide on preprocessing steps (what and why for each step) but have never gotten around to it, and speaking to people at the conference I think it would be useful. So I'll post this issue to get the ball rolling and invite contributions, maybe opening a PR next week!
*BTW I just realised this yesterday for IBL bad channel detection, maybe we should add an assert / warning?
*also whitening, potentially.
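A hedged sketch of the ordering concerns above: detect and remove bad channels before any referencing step, and apply channel-location-dependent steps (CAR, etc.) per shank/group rather than across the whole multi-shank probe. Function names are from SpikeInterface; exact arguments and defaults may vary by version, so treat this as an illustration rather than a recommended pipeline.

```python
import spikeinterface.full as si

# Filter first; `raw_recording` is a placeholder for your loaded recording.
recording = si.bandpass_filter(raw_recording, freq_min=300, freq_max=6000)

# 1. Bad channel detection (IBL method) on the filtered, un-referenced data,
#    i.e. before CAR, as discussed above.
bad_channel_ids, channel_labels = si.detect_bad_channels(recording)
recording = recording.remove_channels(bad_channel_ids)

# 2. Split by shank (the "group" property) so spatial assumptions hold, then
#    re-reference each shank separately and re-aggregate into one recording.
recordings_by_shank = recording.split_by("group")
referenced = [
    si.common_reference(rec, reference="global", operator="median")
    for rec in recordings_by_shank.values()
]
recording = si.aggregate_channels(referenced)
```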