
BIDSLayout performance on very large datasets, continued #609

Closed · johnsaigle opened this issue May 12, 2020 · 16 comments · Fixed by #647

@johnsaigle

Hi! 👋 I'm working on a very large BIDS dataset (about 40k participants). Our analysis script is choking on a call to BIDSLayout.

The issue is similar to #285. The author of that issue, @gkiar, is a coworker of mine, and he's been coaching me through an approach to massaging the data so that the processing pipeline can move forward.

As a workaround, Greg has suggested analyzing the data in chunks. We would take the first 1000 candidates (as an example) and create a BIDSLayout object for just these candidates by supplying an appropriate regex to the exclude parameter. We would do this for each batch of 1000 and then glue each BIDSLayout object together manually within the script.

I'm looking for some guidance on how to combine BIDSLayout objects. I assume I can't just do something like BIDSLayoutN = BIDSLayout1 + BIDSLayout2 + ... and have it work.
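
For illustration, here's roughly the chunked indexing I have in mind (the paths, batch size, and exclusion regexes are all made up):

    import os
    import re

    from bids import BIDSLayout

    bids_dir = '/data/bids'   # hypothetical dataset root
    batch_size = 1000

    subjects = sorted(d for d in os.listdir(bids_dir) if d.startswith('sub-'))

    layouts = []
    for i in range(0, len(subjects), batch_size):
        batch = set(subjects[i:i + batch_size])
        # Exclude every sub-* directory that isn't in the current batch.
        others = [re.escape(s) for s in subjects if s not in batch]
        ignore = [re.compile('|'.join(others))] if others else []
        layouts.append(BIDSLayout(bids_dir, ignore=ignore))

    # ...and this is the open question: how to combine `layouts` into a
    # single working BIDSLayout object.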

What elements of this class would need to be concatenated in order to get a combined, working BIDSLayout object?

Also - if we're way off track with this approach, any other suggestions as to how to work with pybids and BIDSLayout with a dataset of this scale would be greatly appreciated. :)

Thanks in advance!

@yarikoptic
Collaborator

> is choking

Takes too long, or crashes?

@tyarkoni
Collaborator

You can't really concatenate BIDSLayout objects, and there's no use case for that... I think the fact that you're asking about this is a sign that there are deeper problems elsewhere (i.e., with scalability). So, yeah, we need to figure out how to make BIDSLayout work better with massive datasets.

When you say the script chokes, do you mean that the initial BIDSLayout indexing never completes, or that it just takes too long to be practical? If the latter, see #521 for some tips that might solve your problem. You can save the DB, so you only have to live through the indexing once. And if you can live without metadata indexing, things should get much faster.

As I say in #521, I'm sure optimization is possible, but it's likely to require some effort. From the profiling @gkiar and others have done, it's clear that nearly all of the time is being spent at the OS level, not in Python itself. The main culprit is that os.walk() calls stat() on every file, and if you have 40k subjects, you probably have several hundred thousand directories that need to be scanned this way. If you're not already on 3.6+, you can try upgrading Python and see if that helps, as I think os.walk() is more efficient in recent versions. But beyond that, to speed things up we'd need to replace os.walk() with some other file-scanning strategy.
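
For illustration, the scandir-based strategy looks roughly like the sketch below; DirEntry objects can usually answer is_dir()/is_file() from information the directory listing itself returned, with no per-entry stat() round-trip. (os.walk() has itself been scandir-based since Python 3.5, so this is just the general idea, not pybids internals.)

    import os

    def fast_walk(root):
        """Yield every file path under root, minimizing stat() calls."""
        with os.scandir(root) as entries:
            for entry in entries:
                # is_dir() is typically answered from data the directory
                # listing already returned, avoiding a stat() round-trip
                # per entry (a big win on network filesystems).
                if entry.is_dir(follow_symlinks=False):
                    yield from fast_walk(entry.path)
                else:
                    yield entry.path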

At the end of the day, with millions of files, some non-trivial fraction of which need to be read in and stored in the DB (all the JSON files), I doubt you're ever going to have something fast enough to do the indexing at run-time. The best-case scenario is that one-time indexing, followed by saving the DB to disk (as suggested in #521), solves your problem.
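
To make that concrete, the pattern would look roughly like this sketch (note: the exact keyword argument has changed across pybids versions; older releases use database_file, newer ones database_path; the paths are placeholders):

    from bids import BIDSLayout

    # One-time indexing run: persist the index so later runs can skip
    # the filesystem walk. Keyword name varies by pybids version.
    layout = BIDSLayout('/data/bids', database_path='/data/bids_index')

    # Subsequent runs detect the existing on-disk index and load it
    # instead of re-indexing the whole tree.
    layout = BIDSLayout('/data/bids', database_path='/data/bids_index')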

@johnsaigle
Author

Sorry for the vague language. So far I haven't had a successful run creating a BIDSLayout for that dataset. The longest I've left it running was about 10 hours, after which I was disconnected from the server and the script execution was cancelled (for an unknown reason that's probably unrelated).

We only need to run the pipeline once; the goal is to insert every path to an imaging file within the BIDS dataset into another database.

I'm on Python 3.6.9 and the latest version of pybids, so it sounds like my only option here is to let the BIDSLayout run for longer and cross my fingers.

@tyarkoni
Collaborator

Probably also worth running with index_metadata=False... that might help quite a bit. Of course you might want the metadata accessible, in which case you'll need to re-run with it turned back on, but at least you'll have some assurance that indexing can complete.
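
Something like this, for example (the path is hypothetical):

    from bids import BIDSLayout

    # Skip parsing the JSON sidecars during indexing; metadata queries
    # won't work until the layout is rebuilt with indexing enabled,
    # but the initial walk should complete much faster.
    layout = BIDSLayout('/data/bids', index_metadata=False)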

@yarikoptic
Collaborator

Side topic: this thread triggered me to open con/fscacher#1.

@johnsaigle
Author

Thanks @tyarkoni, I'll add that argument. I appreciate the help.

@tyarkoni
Collaborator

@johnsaigle do you have a sense of how many files we're talking about, in total (or even just per subject)? That might help me build an intuition about how long this should take... 10 hours seems unreasonably long even for 40k subjects, so I'm wondering if there's a separate bottleneck (e.g., inserting records into the DB).

@johnsaigle
Author

I'm not exactly sure how many files there are. I spot-checked a few directories, and each seems to have two imaging (NIfTI) files and two accompanying JSON files.

The script never reaches the part where it connects to the destination database. Here's where it gets stuck:

    def load_bids_data(self):
        """
        Load the BIDS study using the BIDSLayout class (part of the pybids
        package) and return the resulting object.

        :return: BIDSLayout object for the study
        """
        if self.verbose:
            print('Loading the BIDS dataset with BIDS layout library...\n')

        bids_config = os.environ['LORIS_MRI'] + "/python/lib/bids.json"
        exclude_arr = ['/code/', '/sourcedata/', '/log/', '.git/']
        bids_layout = BIDSLayout(root=self.bids_dir, config=bids_config, ignore=exclude_arr)

        if self.verbose:
            print('\t=> BIDS dataset loaded with BIDS layout\n')

        return bids_layout

The script prints "Loading the BIDS dataset with BIDS layout library..." but never reaches "=> BIDS dataset loaded with BIDS layout". It's only after this point that the script connects to the destination database.

The BIDS directory is mounted remotely via sshfs, so the network round-trip time could be a bottleneck.

@tyarkoni
Collaborator

What's in the custom BIDS config (bids.json)? There's no obvious reason why that would slow things down, unless it contains malformed regular expressions that accidentally multiply the number of detected entities. Do you mind pasting the contents of that file here? Alternatively (or in addition), you could try running with the default config (i.e., just remove the config argument) and see if that changes anything.
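
As a quick sanity check, you can also run one of the config's entity patterns against a sample path by hand and inspect what it captures; for example (illustrative snippet):

    import re

    # The subject pattern from a typical BIDS config, tested against a
    # sample relative path. It should capture exactly one label.
    pattern = r"[/\\]+sub-([a-zA-Z0-9]+)"
    m = re.search(pattern, "/sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz")
    print(m.group(1) if m else "no match")   # -> "01"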

@tyarkoni
Collaborator

Oh, heh... I missed the last sentence on first read. Yes, if you're trying to index remotely over ssh, that seems very likely to slow things down to the point where they're unworkable. I don't know much about SSHFS, but I'd be pretty surprised if it doesn't impose all kinds of bottlenecks (at minimum, you're probably having to transfer all the JSON files and directory listings; but depending on the implementation, you might actually be inadvertently transferring the image files too).

@johnsaigle
Author

Yeah, I don't know too much about sshfs either. 😅 Unfortunately, there's no way around it for now.

I'll run the script again overnight, and I'll try using the default config file as well as disabling metadata indexing.

bids.json comes from the repo of the pipeline I'm using, LORIS-MRI.

Here are its contents:

{
    "name": "bids",
    "entities": [
        {
            "name": "subject",
            "pattern": "[/\\\\]+sub-([a-zA-Z0-9]+)",
            "directory": "{subject}"
        },
        {
            "name": "session",
            "pattern": "[_/\\\\]+ses-([a-zA-Z0-9]+)",
            "mandatory": false,
            "directory": "{subject}{session}"
        },
        {
            "name": "task",
            "pattern": "[_/\\\\]+task-([a-zA-Z0-9]+)"
        },
        {
            "name": "acquisition",
            "pattern": "[_/\\\\]+acq-([a-zA-Z0-9]+)"
        },
        {
            "name": "ce",
            "pattern": "[_/\\\\]+ce-([a-zA-Z0-9]+)"
        },
        {
            "name": "reconstruction",
            "pattern": "[_/\\\\]+rec-([a-zA-Z0-9]+)"
        },
        {
            "name": "dir",
            "pattern": "[_/\\\\]+dir-([a-zA-Z0-9]+)"
        },
        {
            "name": "run",
            "pattern": "[_/\\\\]+run-0*(\\d+)",
            "dtype": "int"
        },
        {
            "name": "proc",
            "pattern": "[_/\\\\]+proc-([a-zA-Z0-9]+)"
        },
        {
            "name": "modality",
            "pattern": "[_/\\\\]+mod-([a-zA-Z0-9]+)"
        },
        {
            "name": "echo",
            "pattern": "[_/\\\\]+echo-([0-9]+)\\_bold."
        },
        {
            "name": "recording",
            "pattern": "[_/\\\\]+recording-([a-zA-Z0-9]+)"
        },
        {
            "name": "suffix",
            "pattern": "[._]*([a-zA-Z0-9]*?)\\.[^/\\\\]+$"
        },
        {
            "name": "scans",
            "pattern": "(.*\\_scans.tsv)$"
        },
        {
            "name": "fmap",
            "pattern": "(phasediff|magnitude[1-2]|phase[1-2]|fieldmap|epi)\\.nii"
        },
        {
            "name": "datatype",
            "pattern": "[/\\\\]+(func|anat|fmap|dwi|meg|eeg)[/\\\\]+"
        },
        {
            "name": "extension",
            "pattern": "[._]*[a-zA-Z0-9]*?\\.([^/\\\\]+)$"
        }
    ],

    "default_path_patterns": [
        "sub-{subject}[/ses-{session}]/anat/sub-{subject}[_ses-{session}][_acq-{acquisition}][_ce-{contrast}][_rec-{reconstruction}]_{suffix<T1w|T2w|T1rho|T1map|T2map|T2star|FLAIR|FLASH|PDmap|PD|PDT2|inplaneT[12]|angio>}.nii.gz",
        "sub-{subject}[/ses-{session}]/anat/sub-{subject}[_ses-{session}][_acq-{acquisition}][_ce-{contrast}][_rec-{reconstruction}][_mod-{modality}]_{suffix<defacemask>}.nii.gz",
        "sub-{subject}[/ses-{session}]/func/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}]_{suffix<bold>}.nii.gz",
        "sub-{subject}[/ses-{session}]/dwi/sub-{subject}[_ses-{session}][_acq-{acquisition}]_{suffix<dwi>}.{extension<bval|bvec|json|nii\\.gz|nii>|nii\\.gz}",
        "sub-{subject}[/ses-{session}]/fmap/sub-{subject}[_ses-{session}][_acq-{acquisition}][_dir-{direction}][_run-{run}]_{fmap<phasediff|magnitude[1-2]|phase[1-2]|fieldmap|epi>}.nii.gz",
        "sub-{subject}[/ses-{session}]/[{datatype<func|meg|eeg>|func}/]sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}][_recording-{recording}]_{suffix<events>}.{extension<tsv>|tsv}",
        "sub-{subject}[/ses-{session}]/func/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}][_recording-{recording}]_{suffix<physio|stim>}.{extension<tsv\\.gz|json}",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_run-{run}][_proc-{proc}]_meg.{extension|json}",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_run-{run}][_proc-{proc}]_{suffix<channels>}.{extension<tsv>|tsv}",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}]_{suffix<coordsystem>}.json",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}]_{suffix<photo>}.jpg"
    ]
}

@tyarkoni
Collaborator

At a glance, I don't see obvious differences from the default BIDS config, so you may not need that argument at all. Either way, it shouldn't affect much.

I'd suggest trying to initialize a BIDSLayout with maybe 10 subjects and seeing how long that takes. If it's longer than a couple of seconds, you're looking at serious (and likely intractable) overhead from sshfs.

@johnsaigle
Author

BIDSLayout completed very quickly for 10 subjects. There was barely an instant between the debug messages I described above.

@tyarkoni
Collaborator

Interesting. I'm not sure what's going on, then. If you explicitly time a short run (maybe 100 subjects) and extrapolate to what it would take to do 40k, that might indicate whether you just haven't waited long enough, or whether some supralinear scaling kicks in at some point. If it's the latter, I'll try to look into it as time allows, though it may be a while, as this will likely require some work.
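
For example, something like this quick-and-dirty timing (the paths and the exclusion trick are illustrative, and the extrapolation assumes roughly linear scaling):

    import os
    import re
    import time

    from bids import BIDSLayout

    bids_dir = '/data/bids'   # hypothetical path
    n = 100

    # Index only the first n subjects by excluding all the others.
    subjects = sorted(d for d in os.listdir(bids_dir) if d.startswith('sub-'))
    rest = [re.escape(s) for s in subjects[n:]]
    ignore = [re.compile('|'.join(rest))] if rest else []

    start = time.time()
    BIDSLayout(bids_dir, ignore=ignore, index_metadata=False)
    elapsed = time.time() - start

    print('%d subjects: %.1fs; naive 40k extrapolation: %.1fh'
          % (n, elapsed, elapsed * 40000 / n / 3600))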

@johnsaigle
Author

I'll give that a try if tonight's run fails. Thanks again for the help.

@johnsaigle
Author

I ran our pipeline processing script again with a different dataset that has about 200 participants (i.e., sub-* folders). The call to BIDSLayout above completed quickly. I wasn't measuring precisely, but it definitely took less than a minute to complete. 👍 This was over sshfs, as before.
