
Slow running speed #767

Open
Tingchen-G opened this issue Aug 26, 2024 · 13 comments

Comments

@Tingchen-G

Tingchen-G commented Aug 26, 2024

Hi!

We are using kilosort for 32-channel recordings that are 10-15 hours long, and processing is taking a really long time, so I'm hoping to ask for some advice on this issue.

  1. We have 16 shanks, each with 32 channels. Currently I'm using a loop to run kilosort on each shank separately. Some shanks took 3-4 hours, but a few shanks took 9-10 hours. I noticed that kilosort takes longer and longer to run as it is looped. Any idea why this might be the case?

  2. We are planning to upgrade our GPU. I read on the Kilosort Hardware Recommendation page that for longer recordings, "this situation typically requires more RAM, like 32 or 64 GB". May I check if this is referring to GPU or system memory? Also, since our current memory is sufficient to handle our data, do you think increasing memory, either in the system or GPU, would reduce runtime?

Thank you!

@RobertoDF
Contributor

RobertoDF commented Aug 26, 2024

Interesting, this might be related to SpikeInterface/spikeinterface#3332. I have also noticed that running kilosort in a loop sometimes causes odd behavior.

@jacobpennington
Collaborator

As for the loop question, are you noticing that it takes longer on the third and fourth loops as well, or just longer on the second loop like the issue linked in @RobertoDF's comment? If you're assigning the sorting results like:

for i in some_list:
    results = run_kilosort(...)

Then the variables in results will be kept in memory until the next loop iteration completes (or longer, if you're storing the results in a list, for example), which will slow down sorting somewhat since that memory won't be available in the meantime. Most of those variables aren't too big, but the memory for tF can add up quickly for recordings with a lot of spikes. One way to release them between iterations is sketched below.
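
A minimal sketch of that cleanup, assuming your per-shank loop looks roughly like the one above (shank_files and save_results are hypothetical placeholders; gc.collect and torch.cuda.empty_cache are standard Python/PyTorch calls, not a documented Kilosort workflow):

import gc
import torch
from kilosort import run_kilosort

for shank_file in shank_files:  # hypothetical list of per-shank binary files
    results = run_kilosort(settings=settings, probe=probe, filename=shank_file)
    save_results(results)       # hypothetical: persist outputs to disk before discarding
    del results                 # drop the reference so Python can reclaim the memory
    gc.collect()                # collect the large arrays immediately
    torch.cuda.empty_cache()    # return cached GPU memory to the driver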

For the "taking a long time" part, I can't really say much without some information about what hardware you're using. For reference, a Neuropixels recording 2-3 hours long on SSD is expected to take 2-3 hours to sort with a 8-12GB GeForce 3000 or 4000 series card, an i7 or better processor from the last few generations, and at least 32GB of system memory. A 32-channel recording should take less time; however, differences in hardware or spike counts could account for some of the gap.

Is there a reason you're sorting the shanks separately instead of all at once?

@Tingchen-G
Author

Thank you for your response! Yes, the sorting takes longer on the second loop iteration, just like in the issue linked in @RobertoDF's comment. But at the end of every iteration I have del ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes, which I thought would clear the memory?

I am sorting the shanks separately because our recordings are very long, so I am worried that sorting all shanks together would lead to a "CUDA out of memory" error.

And finally, just to clarify: on the Kilosort Hardware Recommendation page, does "this situation typically requires more RAM, like 32 or 64 GB" refer to system memory?

Thank you!

@jacobpennington
Collaborator

Yes, that is referring to system memory. I'll look into the looping issue. I would also recommend trying to sort everything together, and only sorting separately if you run into errors, since sorting all at once should speed things up quite a bit; a rough sketch is below.
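
An untested sketch of what that could look like, assuming kcoords is used to label which shank each channel belongs to; the per-shank geometry below is a placeholder, so substitute your real coordinates:

import numpy as np
from kilosort import run_kilosort

n_shanks, chans_per_shank = 16, 32
n_chan = n_shanks * chans_per_shank  # 512 channels total

# Placeholder per-shank layout: two columns 30 um apart; swap in your real coords.
shank_xc = np.tile([0.0, 30.0], chans_per_shank // 2)
shank_yc = np.repeat(np.arange(chans_per_shank // 2) * 30.0, 2)

probe = {
    'chanMap': np.arange(n_chan),
    # offset each shank in x so channels on different shanks never overlap
    'xc': np.concatenate([shank_xc + s * 250.0 for s in range(n_shanks)]),
    'yc': np.tile(shank_yc, n_shanks),
    'kcoords': np.repeat(np.arange(n_shanks), chans_per_shank),  # shank label per channel
    'n_chan': n_chan,
}

settings = {'n_chan_bin': n_chan}  # plus whatever settings you normally use
results = run_kilosort(settings=settings, probe=probe,
                       filename='all_shanks.bin')  # hypothetical combined binary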

As for the sorting taking too long: can you please share some information about the hardware you're using? Specifically: graphics card, processor, amount of GPU and system memory, and whether you're sorting on an SSD or HDD?

@Tingchen-G
Author

Tingchen-G commented Aug 30, 2024

I see, I'll try sorting them all together. Regarding hardware, we're using a GeForce GTX 1080 Ti GPU with 11 GB of memory and an Intel i7-9700 processor with 48 GB of system memory, and we are sorting on an SSD.

@Tingchen-G
Author

Also, I noticed that the final clustering step takes the longest. For a shank that took 11.5 hours to run, 13,844,472 spikes were extracted for first clustering, but 43,478,695 spikes were extracted for final clustering. Could it be that too many spikes are extracted for final clustering? I'm using the defaults of 9 and 8 for Th_universal and Th_learned, set as shown below.
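
(For reference, this is how I'm passing them, i.e. just the defaults through the settings dict:)

settings = {
    'n_chan_bin': 32,      # channels in the binary file for one shank
    'Th_universal': 9,     # default detection threshold, universal templates
    'Th_learned': 8,       # default detection threshold, learned templates
}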

@jacobpennington
Collaborator

One other thing to check: can you make note of how many spikes were detected for each shank? I just want to make sure it's not a case where you happened to sort the shanks with more spikes later in the loop, which would of course take longer.

Another thing you can try is increasing the cluster_downsampling parameter, which would speed up the clustering steps; with that many spikes, you don't need to use as many of them for some of the clustering operations. For example:
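
A sketch of what I mean (the value 50 is just an illustration, tune it to your data, and note the exact default may differ by version; the file path is hypothetical):

from kilosort import run_kilosort

settings = {
    'n_chan_bin': 32,
    'cluster_downsampling': 50,  # use fewer spikes for some clustering
                                 # operations (larger value = more downsampling)
}
ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes = \
    run_kilosort(settings=settings, probe=probe, filename='shank_01.bin')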

@Tingchen-G
Author

Sorry for the late reply! Here are the spike counts for each shank:

Shank 1: 23,946,723
Shank 2: 26,824,833
Shank 3: 40,672,509
Shank 4: 43,187,385
Shank 5: 32,859,009
Shank 6: 30,946,386
Shank 7: 26,166,955
Shank 8: 17,119,952
Shank 9: 5,001,869
Shank 10: 8,773,221
Shank 11: 22,833,448
Shank 12: 20,865,463
Shank 13: 22,793,711
Shank 14: 30,212,405
Shank 15: 27,891,315
Shank 16: 19,776,232
The spike counts vary significantly between shanks. I suspect the loop is causing the slow runtime, because I've noticed that when a shank takes too long, stopping the loop, restarting the Anaconda Prompt and kilosort, and running a new loop from that same shank onward makes it run much faster. A way to automate that restart is sketched below.

I'll definitely try increasing the cluster_downsampling parameter! Thanks!
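
One workaround I'm considering is sorting each shank in its own Python process, so all memory is returned to the OS between shanks (untested sketch; sort_one_shank.py is a hypothetical script that sorts the shank index given on the command line, then exits):

import subprocess
import sys

for shank in range(16):
    # each sort runs in a fresh interpreter and releases everything on exit
    subprocess.run([sys.executable, 'sort_one_shank.py', str(shank)], check=True)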

@jacobpennington
Collaborator

Thanks, still looking into this. Would it be possible for you to share the binary file and probe information for one of the shanks so that I can benchmark the memory usage in a loop? Any of the shanks with 20 million or more spikes should work. We don't have long-duration datasets like that available, so it would help me debug this issue and some related ones.

@Tingchen-G
Author

Tingchen-G commented Oct 11, 2024

Hi!

Sorry for the delay. Sure, we can share the files. May I ask how to share the binary file? The compressed file is still too big to share on GitHub. Here is the probe information:

import numpy as np

chanMap = np.arange(32)
kcoords = np.zeros(32)
n_chan = 32

# Two columns of 16 contacts, offset by 30 um in x
xc_1_3 = np.ones(16) * 6.2
xc_2_4 = np.ones(16) * 6.2 + 30
xc = np.array([val for pair in zip(xc_1_3, xc_2_4) for val in pair])

# Columns staggered by 15 um vertically, with 30 um pitch within a column
yc_2_4 = np.array([15 + 6.2 + 30 * i for i in range(16)])
yc_1_3 = np.array([6.2 + 30 * k for k in range(16)])
yc = np.array([val for pair in zip(yc_1_3, yc_2_4) for val in pair])

probe = {
    'chanMap': chanMap,
    'xc': xc,
    'yc': yc,
    'kcoords': kcoords,
    'n_chan': n_chan
}
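
(Side note: if I'm not mistaken, this dict can be passed directly as the probe argument to run_kilosort, or saved for reuse with kilosort.io.save_probe, e.g.:)

from kilosort.io import save_probe
save_probe(probe, 'probe_32chan.json')  # hypothetical output path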

Thank you!

@jacobpennington
Collaborator

jacobpennington commented Oct 11, 2024 via email


Tingchen-G commented Oct 13, 2024

Hi,

I am now running kilosort on a new set of data of similar size, and the issue seems to be solved! Each shank now takes around 2 hours, which is quite reasonable given our data size. I am now using kilosort 4.0.18 and have added these lines to the end of the loop:

    # truncate the log file so it doesn't keep growing across iterations
    with open('kilosort.log', 'w') as f:
        pass

    # drop references to all sorting outputs so the memory can be reclaimed
    del ops, st, clu, tF, Wall, similar_templates, is_ref, est_contam_rate, kept_spikes
    del camps, contam_pct, templates, chan_best, amplitudes, firing_rates, dshift

Thank you for your help!
