
BIDSLayout performance on very large datasets, continued #609

Closed · johnsaigle opened this issue May 12, 2020 · 16 comments · Fixed by #647

@johnsaigle

Hi! 👋 I'm working on a very large BIDS dataset (about 40k participants). Our analysis script is choking on a call to BIDSLayout.

The issue is similar to #285. The author of that issue, @gkiar, is a coworker of mine, and he's been coaching me through an approach to massaging the data so that the processing pipeline can move forward.

As a workaround, Greg has suggested analyzing the data in chunks. We would take the first 1000 candidates (as an example) and create a BIDSLayout object for just these candidates by supplying an appropriate regex to the exclude parameter. We would do this for each batch of 1000 and then glue each BIDSLayout object together manually within the script.

I'm looking for some guidance on how to combine BIDSLayout objects. I assume I can't just do something like BIDSLayoutN = BIDSLayout1 + BIDSLayout2 + ... and have it work.
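
For illustration, here's roughly the chunked indexing I have in mind (the paths, batch size, and exclusion regexes are all made up):

    import os
    import re

    from bids import BIDSLayout

    bids_dir = '/data/bids'   # hypothetical dataset root
    batch_size = 1000

    subjects = sorted(d for d in os.listdir(bids_dir) if d.startswith('sub-'))

    layouts = []
    for i in range(0, len(subjects), batch_size):
        batch = set(subjects[i:i + batch_size])
        # Exclude every sub-* directory that isn't in the current batch.
        others = [re.escape(s) for s in subjects if s not in batch]
        ignore = [re.compile('|'.join(others))] if others else []
        layouts.append(BIDSLayout(bids_dir, ignore=ignore))

    # ...and this is the open question: how to combine `layouts` into a
    # single working BIDSLayout object.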

What elements of this class would need to be concatenated in order to get a combined, working BIDSLayout object?

Also - if we're way off track with this approach, any other suggestions as to how to work with pybids and BIDSLayout with a dataset of this scale would be greatly appreciated. :)

Thanks in advance!

@yarikoptic
Collaborator

> is choking

Takes too long, or crashes?

@tyarkoni
Collaborator

You can't really concatenate BIDSLayout objects, and there's no use case for that... I think the fact that you're asking about this is a sign that there are deeper problems elsewhere (i.e., with scalability). So, yeah, we need to figure out how to make BIDSLayout work better with massive datasets.

When you say the script chokes, do you mean that the initial BIDSLayout indexing never completes, or that it just takes too long to be practical? If the latter, see #521 for some tips that might solve your problem. You can save the DB, so you only have to live through the indexing once. And if you can live without metadata indexing, things should get much faster.

As I say in #521, I'm sure optimization is possible, but it's likely to require some effort. From the profiling @gkiar and others have done, it's clear that nearly all of the time is being spent at the OS level, not in Python itself. The main culprit is that os.walk() calls stat() on every file, and if you have 40k subjects, you probably have several hundred thousand directories that need to be scanned this way. If you're not already on 3.6+, you can try upgrading Python and see if that helps, as I think os.walk() is more efficient in recent versions. But beyond that, to speed things up we'd need to replace os.walk() with some other file-scanning strategy.
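
For illustration, the scandir-based strategy looks roughly like the sketch below; DirEntry objects can usually answer is_dir()/is_file() from information the directory listing itself returned, with no per-entry stat() round-trip. (os.walk() has itself been scandir-based since Python 3.5, so this is just the general idea, not pybids internals.)

    import os

    def fast_walk(root):
        """Yield every file path under root, minimizing stat() calls."""
        with os.scandir(root) as entries:
            for entry in entries:
                # is_dir() is typically answered from data the directory
                # listing already returned, avoiding a stat() round-trip
                # per entry (a big win on network filesystems).
                if entry.is_dir(follow_symlinks=False):
                    yield from fast_walk(entry.path)
                else:
                    yield entry.path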

At the end of the day, with millions of files, some non-trivial fraction of which need to be read in and stored in the DB (all the JSON files), I doubt you're ever going to have something fast enough to do the indexing at run-time. The best-case scenario is that one-time indexing, followed by saving the DB to disk (as suggested in #521), solves your problem.
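
To make that concrete, the pattern would look roughly like this sketch (note: the exact keyword argument has changed across pybids versions; older releases use database_file, newer ones database_path; the paths are placeholders):

    from bids import BIDSLayout

    # One-time indexing run: persist the index so later runs can skip
    # the filesystem walk. Keyword name varies by pybids version.
    layout = BIDSLayout('/data/bids', database_path='/data/bids_index')

    # Subsequent runs detect the existing on-disk index and load it
    # instead of re-indexing the whole tree.
    layout = BIDSLayout('/data/bids', database_path='/data/bids_index')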

@johnsaigle
Author

Sorry for the vague language. So far I haven't had a successful run creating a BIDSLayout for that dataset. The longest I've left it running was about 10 hours, after which I was disconnected from the server and the script execution was cancelled (for an unknown reason that's probably unrelated).

We only need to run the pipeline once; the goal is to insert every path to an imaging file within the BIDS dataset into another database.

I'm on Python 3.6.9 and the latest version of pybids, so it sounds like my only option here is to let the BIDSLayout run for longer and cross my fingers.

@tyarkoni
Collaborator

Probably also worth running with index_metadata=False... that might help quite a bit. Of course you might want the metadata accessible, in which case you'll need to re-run with it turned back on, but at least you'll have some assurance that indexing can complete.
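
Something like this, for example (the path is hypothetical):

    from bids import BIDSLayout

    # Skip parsing the JSON sidecars during indexing; metadata queries
    # won't work until the layout is rebuilt with indexing enabled,
    # but the initial walk should complete much faster.
    layout = BIDSLayout('/data/bids', index_metadata=False)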

@yarikoptic
Collaborator

Side topic: this thread triggered me to open con/fscacher#1.

@johnsaigle
Author

Thanks @tyarkoni, I'll add that argument. I appreciate the help.

@tyarkoni
Collaborator

@johnsaigle do you have a sense of how many files we're talking about, in total (or even just per subject)? That might help me build an intuition about how long this should take... 10 hours seems unreasonably long even for 40k subjects, so I'm wondering if there's a separate bottleneck (e.g., inserting records into the DB).

@johnsaigle
Author

I'm not exactly sure how many files there are. I spot-checked a few directories, and each seems to have two imaging (NIfTI) files and two accompanying JSON files.

The script never reaches the part where it connects to the destination database. Here's where it gets stuck:

    def load_bids_data(self):
        """
        Load the BIDS study using the BIDSLayout class (part of the pybids
        package) and return the resulting object.

        :return: BIDSLayout object for the study
        """
        if self.verbose:
            print('Loading the BIDS dataset with BIDS layout library...\n')

        bids_config = os.environ['LORIS_MRI'] + "/python/lib/bids.json"
        exclude_arr = ['/code/', '/sourcedata/', '/log/', '.git/']
        bids_layout = BIDSLayout(root=self.bids_dir, config=bids_config, ignore=exclude_arr)

        if self.verbose:
            print('\t=> BIDS dataset loaded with BIDS layout\n')

        return bids_layout

The script prints "Loading the BIDS dataset with BIDS layout library..." but never reaches "=> BIDS dataset loaded with BIDS layout". It's only after this point that the script connects to the destination database.

The BIDS directory is mounted remotely via sshfs, so the network round-trip time could be a bottleneck.

@tyarkoni
Collaborator

What's in the custom BIDS config (bids.json)? There's no obvious reason why that would slow things down, unless it contains malformed regular expressions that accidentally multiply the number of detected entities. Do you mind pasting the contents of that file here? Alternatively (or in addition), you could try running with the default config (i.e., just remove the config argument) and see if that changes anything.
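
As a quick sanity check, you can also run one of the config's entity patterns against a sample path by hand and inspect what it captures; for example (illustrative snippet):

    import re

    # The subject pattern from a typical BIDS config, tested against a
    # sample relative path. It should capture exactly one label.
    pattern = r"[/\\]+sub-([a-zA-Z0-9]+)"
    m = re.search(pattern, "/sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz")
    print(m.group(1) if m else "no match")   # -> "01"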

@tyarkoni
Collaborator

Oh, heh... I missed the last sentence on first read. Yes, if you're trying to index remotely over ssh, that seems very likely to slow things down to the point where they're unworkable. I don't know much about SSHFS, but I'd be pretty surprised if it doesn't impose all kinds of bottlenecks (at minimum, you're probably having to transfer all the JSON files and directory listings; but depending on the implementation, you might actually be inadvertently transferring the image files too).

@johnsaigle
Author

Yeah, I don't know too much about sshfs either. 😅 Unfortunately, there's no way around it for now.

I'll run the script again overnight, and I'll try using the default config file as well as disabling metadata indexing.

bids.json comes from the repo of the pipeline I'm using, LORIS-MRI.

Here are its contents:

{
    "name": "bids",
    "entities": [
        {
            "name": "subject",
            "pattern": "[/\\\\]+sub-([a-zA-Z0-9]+)",
            "directory": "{subject}"
        },
        {
            "name": "session",
            "pattern": "[_/\\\\]+ses-([a-zA-Z0-9]+)",
            "mandatory": false,
            "directory": "{subject}{session}"
        },
        {
            "name": "task",
            "pattern": "[_/\\\\]+task-([a-zA-Z0-9]+)"
        },
        {
            "name": "acquisition",
            "pattern": "[_/\\\\]+acq-([a-zA-Z0-9]+)"
        },
        {
            "name": "ce",
            "pattern": "[_/\\\\]+ce-([a-zA-Z0-9]+)"
        },
        {
            "name": "reconstruction",
            "pattern": "[_/\\\\]+rec-([a-zA-Z0-9]+)"
        },
        {
            "name": "dir",
            "pattern": "[_/\\\\]+dir-([a-zA-Z0-9]+)"
        },
        {
            "name": "run",
            "pattern": "[_/\\\\]+run-0*(\\d+)",
            "dtype": "int"
        },
        {
            "name": "proc",
            "pattern": "[_/\\\\]+proc-([a-zA-Z0-9]+)"
        },
        {
            "name": "modality",
            "pattern": "[_/\\\\]+mod-([a-zA-Z0-9]+)"
        },
        {
            "name": "echo",
            "pattern": "[_/\\\\]+echo-([0-9]+)\\_bold."
        },
        {
            "name": "recording",
            "pattern": "[_/\\\\]+recording-([a-zA-Z0-9]+)"
        },
        {
            "name": "suffix",
            "pattern": "[._]*([a-zA-Z0-9]*?)\\.[^/\\\\]+$"
        },
        {
            "name": "scans",
            "pattern": "(.*\\_scans.tsv)$"
        },
        {
            "name": "fmap",
            "pattern": "(phasediff|magnitude[1-2]|phase[1-2]|fieldmap|epi)\\.nii"
        },
        {
            "name": "datatype",
            "pattern": "[/\\\\]+(func|anat|fmap|dwi|meg|eeg)[/\\\\]+"
        },
        {
            "name": "extension",
            "pattern": "[._]*[a-zA-Z0-9]*?\\.([^/\\\\]+)$"
        }
    ],

    "default_path_patterns": [
        "sub-{subject}[/ses-{session}]/anat/sub-{subject}[_ses-{session}][_acq-{acquisition}][_ce-{contrast}][_rec-{reconstruction}]_{suffix<T1w|T2w|T1rho|T1map|T2map|T2star|FLAIR|FLASH|PDmap|PD|PDT2|inplaneT[12]|angio>}.nii.gz",
        "sub-{subject}[/ses-{session}]/anat/sub-{subject}[_ses-{session}][_acq-{acquisition}][_ce-{contrast}][_rec-{reconstruction}][_mod-{modality}]_{suffix<defacemask>}.nii.gz",
        "sub-{subject}[/ses-{session}]/func/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}]_{suffix<bold>}.nii.gz",
        "sub-{subject}[/ses-{session}]/dwi/sub-{subject}[_ses-{session}][_acq-{acquisition}]_{suffix<dwi>}.{extension<bval|bvec|json|nii\\.gz|nii>|nii\\.gz}",
        "sub-{subject}[/ses-{session}]/fmap/sub-{subject}[_ses-{session}][_acq-{acquisition}][_dir-{direction}][_run-{run}]_{fmap<phasediff|magnitude[1-2]|phase[1-2]|fieldmap|epi>}.nii.gz",
        "sub-{subject}[/ses-{session}]/[{datatype<func|meg|eeg>|func}/]sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}][_recording-{recording}]_{suffix<events>}.{extension<tsv>|tsv}",
        "sub-{subject}[/ses-{session}]/func/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_rec-{reconstruction}][_run-{run}][_echo-{echo}][_recording-{recording}]_{suffix<physio|stim>}.{extension<tsv\\.gz|json}",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_run-{run}][_proc-{proc}]_meg.{extension|json}",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}][_run-{run}][_proc-{proc}]_{suffix<channels>}.{extension<tsv>|tsv}",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}]_{suffix<coordsystem>}.json",
        "sub-{subject}[/ses-{session}]/meg/sub-{subject}[_ses-{session}]_task-{task}[_acq-{acquisition}]_{suffix<photo>}.jpg"
    ]
}

@tyarkoni
Collaborator

At a glance, I don't see obvious differences from the default BIDS config, so you may not need that argument at all. Either way, it shouldn't affect much.

I'd suggest trying to initialize a BIDSLayout with maybe 10 subjects and seeing how long that takes. If it's longer than a couple of seconds, you're looking at serious (and likely intractable) overhead from sshfs.

@johnsaigle
Author

BIDSLayout completed very quickly for 10 subjects. There was barely an instant between the debug messages I described above.

@tyarkoni
Collaborator

Interesting. I'm not sure what's going on, then. If you explicitly time a short run (maybe 100 subjects) and extrapolate to what it would take to do 40k, that might indicate whether you just haven't waited long enough, or whether some supralinear scaling kicks in at some point. If it's the latter, I'll try to look into it as time allows, though it may be a while, as this will likely require some work.
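
For example, something like this quick-and-dirty timing (the paths and the exclusion trick are illustrative, and the extrapolation assumes roughly linear scaling):

    import os
    import re
    import time

    from bids import BIDSLayout

    bids_dir = '/data/bids'   # hypothetical path
    n = 100

    # Index only the first n subjects by excluding all the others.
    subjects = sorted(d for d in os.listdir(bids_dir) if d.startswith('sub-'))
    rest = [re.escape(s) for s in subjects[n:]]
    ignore = [re.compile('|'.join(rest))] if rest else []

    start = time.time()
    BIDSLayout(bids_dir, ignore=ignore, index_metadata=False)
    elapsed = time.time() - start

    print('%d subjects: %.1fs; naive 40k extrapolation: %.1fh'
          % (n, elapsed, elapsed * 40000 / n / 3600))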

@johnsaigle
Author

I'll give that a try if tonight's run fails. Thanks again for the help.

@johnsaigle
Author

I ran our pipeline processing script again with a different dataset that has about 200 participants (i.e., sub-* folders). The call to BIDSLayout above completed quickly. I wasn't measuring precisely, but it definitely took less than a minute to complete. 👍 This was over sshfs, as before.
