BIDSLayout performance on very large datasets, continued #609
Hi! 👋 I'm working on a very large BIDS dataset (about 40k participants). Our analysis script is choking on a call to `BIDSLayout`.

The issue is similar to #285. The author of that issue, @gkiar, is a coworker of mine, and he's been coaching me through an approach to massaging the data so that the processing pipeline can move forward.

As a workaround, Greg has suggested analyzing the data in chunks. We would take the first 1000 candidates (as an example) and create a `BIDSLayout` object for just these candidates by supplying an appropriate regex to the `exclude` parameter. We would do this for each batch of 1000 and then glue the `BIDSLayout` objects together manually within the script.

I'm looking for some guidance on how to combine `BIDSLayout` objects. I assume I can't just do `BIDSLayoutN = BIDSLayout1 + BIDSLayout2 + ...`. What elements of this class would need to be concatenated in order to get a combined, working `BIDSLayout` object?

Also, if we're way off track with this approach, any other suggestions as to how to work with `pybids` and `BIDSLayout` on a dataset of this scale would be greatly appreciated. :)

Thanks in advance!
Comments
You can't really concatenate `BIDSLayout` objects.

When you say the script chokes, do you mean that the initial `BIDSLayout` call takes too long, or crashes?

As I say in #521, I'm sure optimization is possible, but it is likely to require some effort. From the profiling @gkiar and others have done, it's clear that nearly all of the time is being spent at the OS level, and not in Python itself. The main culprit is that indexing has to stat and read an enormous number of files on disk.

At the end of the day, with millions of files, some non-trivial fraction of which need to be read in and stored in the DB (all the JSON files), I doubt you're ever going to have something fast enough to do the indexing at run-time. The best-case scenario is that one-time indexing followed by saving the DB to disk (as suggested in #521) solves your problem.
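For reference, a minimal sketch of that one-time-indexing approach, assuming a recent pybids release (recent versions expose a `database_path` argument for persisting the index; the example paths are hypothetical):

```python
from bids import BIDSLayout

# First run: index the dataset once and persist the index database
# at `database_path` so later runs can reuse it.
layout = BIDSLayout('/data/bids', database_path='/data/bids_index')

# Subsequent runs: pybids finds the saved database and loads it
# instead of re-walking and re-reading the whole filesystem tree.
layout = BIDSLayout('/data/bids', database_path='/data/bids_index')
print(len(layout.get()))  # queries run against the cached index
```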
Sorry for the vague language. So far I haven't had a successful run creating a `BIDSLayout` for that dataset. The longest I've left it running was about 10 hours, after which I was disconnected from the server and the script execution was cancelled (for an unknown reason that's probably unrelated). We only need to run the pipeline once; the goal is to insert the path of every imaging file within the BIDS dataset into another database. I'm on Python 3.6.9 and the latest version of pybids, so it sounds like my only option here is to let the `BIDSLayout` run for longer and cross my fingers.
Probably also worth running with `index_metadata=False`.
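For reference, that would look something like this (path hypothetical); skipping metadata indexing avoids opening every JSON sidecar during the initial crawl, at the cost of metadata queries on the resulting layout:

```python
from bids import BIDSLayout

# Index filenames and entities only; don't read JSON sidecar contents.
layout = BIDSLayout('/data/bids', index_metadata=False)
```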
A side topic that triggered me to initiate the issue: con/fscacher#1
Thanks @tyarkoni, I'll add that argument. I appreciate the help.
@johnsaigle do you have a sense of how many files we're talking about, in total (or even just per subject)? That might help me build an intuition about how long this should take... 10 hours seems unreasonably long even for 40k subjects, so I'm wondering if there's a separate bottleneck (e.g., inserting records into the DB).
I'm not exactly sure how many files. I just looked at a few directories arbitrarily, and they each seem to have two imaging (NIfTI) files and two accompanying JSON files. The script never reaches the part where it connects to the destination database. Here's where the script is getting stuck:

```python
def load_bids_data(self):
    """
    Loads the BIDS study using the BIDSLayout function (part of the pybids
    package) and returns the object.

    :return: bids structure
    """
    if self.verbose:
        print('Loading the BIDS dataset with BIDS layout library...\n')

    bids_config = os.environ['LORIS_MRI'] + "/python/lib/bids.json"
    exclude_arr = ['/code/', '/sourcedata/', '/log/', '.git/']
    bids_layout = BIDSLayout(root=self.bids_dir, config=bids_config, ignore=exclude_arr)

    if self.verbose:
        print('\t=> BIDS dataset loaded with BIDS layout\n')

    return bids_layout
```

The script prints the first message (`Loading the BIDS dataset with BIDS layout library...`) but never the second. The BIDS directory is remotely hosted via sshfs, so the RTT could be a potential bottleneck.
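One way to check the RTT hypothesis independently of pybids would be to time raw `stat` calls over the mount; a rough sketch (the mount point and sample size are hypothetical):

```python
import os
import time

MOUNT = '/mnt/sshfs/bids'  # hypothetical sshfs mount point
LIMIT = 1000               # sample size; keep small over a slow link

start = time.time()
count = 0
for root, dirs, files in os.walk(MOUNT):
    for name in files:
        os.stat(os.path.join(root, name))
        count += 1
        if count >= LIMIT:
            break
    if count >= LIMIT:
        break

elapsed = time.time() - start
print(f'{count} stat calls in {elapsed:.1f}s '
      f'({1000 * elapsed / max(count, 1):.1f} ms per call)')
```

Multiplying the per-call latency by the total file count gives a floor on indexing time over the mount, before any JSON files are even read.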
What's in the custom BIDS config (`bids.json`) file?
Oh, heh... I missed the last sentence on first read. Yes, if you're trying to index remotely over ssh, that seems very likely to slow things down to the point where they're unworkable. I don't know much about SSHFS, but I'd be pretty surprised if it didn't impose all kinds of bottlenecks (at minimum, you're probably having to transfer all the JSON files and directory listings; but depending on the implementation, you might actually be inadvertently transferring the image files too).
Yeah, I don't know too much about sshfs either 😅 Unfortunately there's no way around it for now. I'll run the script again overnight, and I'll try using the default config file as well as disabling metadata indexing.
Here are the contents:

```json
{
  "name": "bids",
  "entities": [
    {
      "name": "subject",
      "pattern": "[/\\\\]+sub-([a-zA-Z0-9]+)",
      "directory": "{subject}"
    },
    {
      "name": "session",
      "pattern": "[_/\\\\]+ses-([a-zA-Z0-9]+)",
      "mandatory": false,
      "directory": "{subject}{session}"
    },
    {
      "name": "task",
      "pattern": "[_/\\\\]+task-([a-zA-Z0-9]+)"
    },
    {
      "name": "acquisition",
      "pattern": "[_/\\\\]+acq-([a-zA-Z0-9]+)"
    },
    {
      "name": "ce",
      "pattern": "[_/\\\\]+ce-([a-zA-Z0-9]+)"
    },
    {
      "name": "reconstruction",
      "pattern": "[_/\\\\]+rec-([a-zA-Z0-9]+)"
    },
    {
      "name": "dir",
      "pattern": "[_/\\\\]+dir-([a-zA-Z0-9]+)"
    },
    {
      "name": "run",
      "pattern": "[_/\\\\]+run-0*(\\d+)",
      "dtype": "int"
    },
    {
      "name": "proc",
      "pattern": "[_/\\\\]+proc-([a-zA-Z0-9]+)"
    },
    {
      "name": "modality",
      "pattern": "[_/\\\\]+mod-([a-zA-Z0-9]+)"
    },
    {
      "name": "echo",
      "pattern": "[_/\\\\]+echo-([0-9]+)\\_bold."
    },
    {
      "name": "recording",
      "pattern": "[_/\\\\]+recording-([a-zA-Z0-9]+)"
    },
    {
      "name": "suffix",
      "pattern": "[._]*([a-zA-Z0-9]*?)\\.[^/\\\\]+$"
    },
    {
      "name": "scans",
      "pattern": "(.*\\_scans.tsv)$"
    },
    {
      "name": "fmap",
      "pattern": "(phasediff|magnitude[1-2]|phase[1-2]|fieldmap|epi)\\.nii"
    },
    {
      "name": "datatype",
      "pattern": "[/\\\\]+(func|anat|fmap|dwi|meg|eeg)[/\\\\]+"
    },
    {
      "name": "extension",
      "pattern": "[._]*[a-zA-Z0-9]*?\\.([^/\\\\]+)$"
    }
  ],
  "default_path_patterns": [
    "sub-{subject}[/ses-{session}]/anat/sub-{subject}[_ses-{session}][_acq-{acquisition}][_ce-{contrast}][_rec-{reconstruction}]_{suffix<T1w|T2w|T1rho|T1map|T2map|T2star|FLAIR|FLASH|PDmap|PD|PDT2|inplaneT[12]|angio>}.nii.gz",
    "sub-{subject}[/ses-{session}]/anat/sub-{subject}[_ses-{session}][_acq-{acquisition}][_ce-{contrast}][_rec-{reconstruction}][_mod-{modality}]_{suffix<defacemask>}.nii.gz",
    "sub-{subject}[/ses-{session}]/func/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}]_{suffix<bold>}.nii.gz",
    "sub-{subject}[/ses-{session}]/dwi/sub-{subject}[_ses-{session}][_acq-{acquisition}]_{suffix<dwi>}.{extension<bval|bvec|json|nii\\.gz|nii>|nii\\.gz}",
    "sub-{subject}[/ses-{session}]/fmap/sub-{subject}[_ses-{session}][_acq-{acquisition}][_dir-{direction}][_run-{run}]_{fmap<phasediff|magnitude[1-2]|phase[1-2]|fieldmap|epi>}.nii.gz",
    "sub-{subject}[/ses-{session}]/[{datatype<func|meg|eeg>|func}/]sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}][_recording-{recording}]_{suffix<events>}.{extension<tsv>|tsv}",
    "sub-{subject}[/ses-{session}]/func/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}][_recording-{recording}]_{suffix<physio|stim>}.{extension<tsv\\.gz|json>}",
    "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_run-{run}][_proc-{proc}]_meg.{extension|json}",
    "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_run-{run}][_proc-{proc}]_{suffix<channels>}.{extension<tsv>|tsv}",
    "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}]_{suffix<coordsystem>}.json",
    "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}]_{suffix<photo>}.jpg"
  ]
}
```
At a glance, I don't see obvious differences from the default BIDS config, so I don't know if you need that argument. But it shouldn't affect much either way. I'd suggest trying to initialize a `BIDSLayout` on a small subset of the subjects first, to see whether indexing completes quickly at that scale.
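A possible sketch of such a smoke test: per the pybids docs, the `ignore` parameter also accepts compiled regular expressions matched against each path, so a pattern can exclude every subject outside a small range. The regex below assumes zero-padded four-digit subject labels, which is purely an assumption about the dataset:

```python
import re
from bids import BIDSLayout

# Ignore any path containing a sub-<label> other than sub-0001..sub-0010.
# Assumes zero-padded four-digit labels; adjust for the real naming scheme.
keep_ten = re.compile(r'sub-(?!(?:000[1-9]|0010)(?![0-9]))')
layout = BIDSLayout('/data/bids', ignore=[keep_ten])
print(layout.get_subjects())
```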
`BIDSLayout` completed very quickly for 10 subjects. There was barely an instant between the debug messages I described above.
Interesting. I'm not sure what's going on, then. If you explicitly time a short run (maybe 100 subjects), and extrapolate to what it would take to do 40k, that might give an indication of whether you just haven't waited long enough, or if there's some supralinear scaling that kicks in at some point. If it's the latter, I'll try to look into it as time allows, though it may be a while, as this will likely require some work.
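A rough sketch of that timing experiment (the subject-selection regex and counts are illustrative, following the same four-digit-label assumption as above):

```python
import re
import time
from bids import BIDSLayout

N_SUBSET = 100    # subjects actually indexed in the trial run
N_TOTAL = 40000   # approximate size of the full dataset

start = time.time()
# Keep only subjects whose zero-padded label starts with '00'
# (roughly 100 subjects under a four-digit labeling scheme).
layout = BIDSLayout('/data/bids', ignore=[re.compile(r'sub-(?!00)')])
elapsed = time.time() - start

# Linear extrapolation; if larger subsets scale much worse than this
# estimate, that points to the supralinear behavior mentioned above.
print(f'{N_SUBSET} subjects in {elapsed:.1f}s; '
      f'linear estimate for {N_TOTAL}: '
      f'{elapsed * N_TOTAL / N_SUBSET / 3600:.2f} h')
```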
I'll give that a try if tonight's run fails. Thanks again for the help.
I ran our pipeline processing script again with a different dataset that has about 200 participants.