Improve BIDSLayout performance on very large datasets #285
Are there hidden files in these datasets?
Not sure which version you're using, but excluding derivatives could help. In the latest version (which is not yet a stable release), derivatives are no longer indexed by default.
There are no hidden files or a derivatives directory. I'm just installing from PyPI.
Also, would it be possible to create the BIDSLayout object with only specific subjects/sessions included, similar to (but the inverse of) the exclude argument?
Yes, try installing master—though be aware that there are some API-breaking changes (but you probably want to get ahead of those anyway, as the current PyPI release is out of sync with the BIDS Derivatives RC). That might fix your problem without any further effort, as derivatives are no longer indexed by default in master. There's no way to limit to only certain subjects/sessions at the moment. I think this came up before and we decided it wasn't worth the effort, but I could be swayed if there's enough demand for it (or if I get a PR). |
Oh, sorry—missed where you said there are no derivatives. I'm having some reading difficulties today. The current implementation isn't heavily optimized (and I don't think 0.7 fixes this), so it may be that it's just slow if you have a particularly large dataset. But the fact that load time seems to be supralinear in the number of subjects is kind of concerning. Do you mind doing some profiling (cProfile is fine) and pasting the results here? I'm curious to see what's eating up those cycles... |
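For reference, one generic way to capture and inspect such a profile with the standard library (the dataset path is a placeholder, and the import path may differ between pybids versions):

```python
import cProfile
import pstats

from bids.grabbids import BIDSLayout  # newer releases expose `from bids import BIDSLayout`

profiler = cProfile.Profile()
profiler.enable()
layout = BIDSLayout('/path/to/bids/dataset')  # placeholder path
profiler.disable()

# Show the 20 most expensive calls, sorted by cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```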
No problem - I'll kick off the script in a few minutes and let you know. Thanks! |
I ran this on the medium-sized dataset; here are the cProfile results:
As a short-term fix, it looks like you could use the exclude argument with a regex to filter out the things you don't want. I'm not sure if this will work at the directory level (which would speed things up even more), but it's worth a try. Mind giving that a try on the large dataset, excluding all but one subject? FYI, this is sort of a "hidden" feature right now.
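Roughly, something like the sketch below, assuming the exclude argument accepts regex patterns (the path, the pattern, and whether it wants a string or a list are all guesses here, not confirmed API details):

```python
from bids.grabbids import BIDSLayout  # newer releases expose `from bids import BIDSLayout`

# Illustrative pattern: exclude any path whose subject label is not '01'.
# Note that a label sharing the prefix (e.g. sub-010) would also slip through;
# tighten the pattern if your labels overlap like that.
exclude_pattern = r'sub-(?!01)'

layout = BIDSLayout('/path/to/bids/dataset', exclude=[exclude_pattern])
```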
OK, so you're saying I should try something like a regex that excludes every subject directory except the one I want to keep?
Sure! I trust your regex skillz.
Heyo - so I verified that the regex picked up only what I want (https://regex101.com/r/uOKFHL/1), but the load time didn't change at any scale. Any other ideas?
But when you do that, the pattern gets matched against paths inside the directory you provide, so the regex needs to account for that. Maybe try a version adjusted along those lines.
Ah, thanks for fixing the regex! It's because it's looking inside the provided directory. I'll use this for the time being, and as I play around I'll let you know if I notice any peculiarities. Thanks! 👍
Awesome. I'm going to reopen this issue (with a more general name) because I think this is an all-too-common scenario, and coming up with said regex is obviously not that intuitive. I think officially supporting the exclusion (or selection) of subjects at the BIDSLayout level would be worthwhile.

And like Tal said, the supralinear time increase with dataset size is worrisome...
Looks like most of the time is being eaten up by the checks that determine whether each path is a directory.
FWIW -- if that is hitting the same directories over and over again, it might be worth creating/using a simple little "memoizer" for those checks.
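A minimal sketch of that kind of memoizer, just wrapping os.path.isdir with a cache (the function name is made up for illustration; this is not actual grabbit internals):

```python
import os
from functools import lru_cache  # Python 3; a plain dict cache would do on Python 2


@lru_cache(maxsize=None)
def cached_isdir(path):
    # Repeated queries for the same path return the cached answer
    # instead of hitting the filesystem again.
    return os.path.isdir(path)
```

Anywhere the indexer checks the same directory repeatedly, swapping in something like this avoids the redundant stat() calls.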
It might also be faster to assume that things that should be directories are, catch exceptions when trying to open/stat something underneath them, and perform the explicit check only when that fails.
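A rough sketch of that "assume it's a directory, handle the exception" idea (purely illustrative, not pybids/grabbit code):

```python
import os


def try_listdir(path):
    """Assume `path` is a directory; if it isn't, listing it raises and we fall back."""
    try:
        return os.listdir(path)
    except OSError:
        # Not a directory (or unreadable); only in this rarer case would an
        # explicit isdir()/stat() check still be needed.
        return None
```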
@yarikoptic good idea. I'll have to take a closer look to see where those calls are coming from. I'm pretty sure we do need to determine whether each path is a file or directory, because different validation hooks get triggered depending on which it is.
Update: using the branch for that fix, here's the new profile:
By my read that shaved off ~4s/45s. Or is that a different dataset than your last profile? |
Different dataset - sorry; the baseline timings are in the initial comment.
Nice. |
Sweet, thanks Greg! |
Hey - I was trying to use pybids in a pipeline of mine, but found that it takes a very long time to create a layout on large datasets. I tested this using the public NKI-RS dataset from FCP-INDI in BIDS format, and ran it on subsets of the dataset of various sizes: 2 subjects, 90, and the full thing, 963.

My script is the following:

The output, in the form of {n subs}:{time elapsed in seconds}, is:

With this being the first step of my pipeline, and since I'm almost always specifying a --participant_label or --session_label for a bunch of these launched in parallel, it would be great if it didn't take upwards of 10 minutes per task to find the data. I'd still like to have the BIDSLayout as part of the pipeline itself so I can grab various pieces of metadata as I need them.

Any ideas why this gets so slow as sample size increases, or places you suggest I could make a PR to speed things up (likely in grabbit)?
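For reference, a rough sketch of the kind of timing harness described above (the dataset paths, subset layout, and import path are all assumptions, not the original script):

```python
import time

from bids.grabbids import BIDSLayout  # newer releases expose `from bids import BIDSLayout`

# Each entry points at a BIDS-formatted subset of NKI-RS with that many subjects.
subsets = {
    2: '/data/nki_rs_2subs',
    90: '/data/nki_rs_90subs',
    963: '/data/nki_rs_full',
}

for n_subs, path in sorted(subsets.items()):
    start = time.time()
    layout = BIDSLayout(path)
    elapsed = time.time() - start
    print('{}:{:.1f}'.format(n_subs, elapsed))  # prints {n subs}:{time elapsed in seconds}
```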