Inconsistent splits naming between `Raw` and `Epochs` when using `BIDS` schema #11870

dmalt · 2023-08-08T13:23:48Z

Description of the problem

Not really a bug, more of an inconsistent behaviour.
Saving large Raw and Epochs data with split_naming='bids' uses different naming schema.

When I save large Raw, I get suffixes split-01 and split-02, etc. but when saving Epochs, the first split is saved without suffix, while the second one has split-01. See the code snippet below for details. All asserts pass for me.

Is there a reason for such difference?

Steps to reproduce

import mne
from pathlib import Path

sample_data_folder = mne.datasets.sample.data_path()
sample_data_raw_file = sample_data_folder / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_data_raw_file, verbose=False).crop(tmax=60)
ev = mne.find_events(raw)
epochs = mne.Epochs(raw, ev, tmin=-0.3, tmax=0.7)

raw.save("raw_meg.fif", split_naming="bids", split_size="50MB")
epochs.save("epochs_epo.fif", split_naming="bids", split_size="50MB")

# all these asserts pass
assert not Path("raw_meg.fif").is_file()
assert Path("raw_split-01_meg.fif").is_file()
assert Path("raw_split-02_meg.fif").is_file()

assert Path("epochs_epo.fif").is_file()  # this should fail though
assert Path("epochs_split-01_epo.fif").is_file()
assert not Path("epochs_split-02_epo.fif").is_file()  # this should fail as well

Link to data

No response

Expected results

These two asserts should fail

assert Path("epochs_epo.fif").is_file()
assert not Path("epochs_split-02_epo.fif").is_file()

Actual results

All asserts pass

Additional information

Platform macOS-13.4.1-arm64-arm-64bit
Python 3.11.3 (main, May 15 2023, 18:01:31) [Clang 14.0.6 ]
Executable /Users/dmalt/Applications/miniconda3/envs/mne_cli_tools/bin/python3
CPU arm (8 cores)
Memory Unavailable (requires "psutil" package)
Core
├☑ mne 1.4.2
├☑ numpy 1.25.2 (OpenBLAS 0.3.23.dev with 8 threads)
├☑ scipy 1.9.3
├☑ matplotlib 3.7.2Installed osx event loop hook.
(backend=MacOSX)
├☑ pooch 1.7.0
└☑ jinja2 3.1.2

Numerical (optional)
├☑ sklearn 1.3.0
├☑ pandas 1.5.3
└☐ unavailable numba, nibabel, nilearn, dipy, openmeeg, cupy

Visualization (optional)
└☐ unavailable pyvista, pyvistaqt, ipyvtklink, vtk, qtpy, ipympl, pyqtgraph, mne-qt-browser

Ecosystem (optional)
└☐ unavailable mne-bids, mne-nirs, mne-features, mne-connectivity, mne-icalabel, mne-bids-pipeline

The text was updated successfully, but these errors were encountered:

larsoner · 2023-08-08T13:25:21Z

Seems like a bug to me. @sappelhoff WDYT?

sappelhoff · 2023-08-08T14:56:18Z

Thanks for the report! I agree, this looks like a bug.

Furthermore, a minor point: the indexes (01, 02, ... ) do not really need to be prefixed with a 0 if they never exceed 9 -- and I guess we can know that before writing the files and adjust the behavior accordingly.

dmalt · 2023-08-08T16:26:41Z

Thanks for such a quick response from both of you!

Frankly, now that I think of it, I like the behaviour of Epochs better than that of Raw, although I suspect it was a conscious design choice in case of Raw.

Picture this. You have to write a data processing pipeline for a new experiment for which only part of the data were recorded. So far all the recorded data are under 2GB, so you don't have to worry about splits. Then the new data arrive and for one subject the file size is above 2GB, so when you save the intermediate preprocessed copy, the saved file will have the split- suffix. Now all the other scripts that are trying to access this file are broken and should be changed. If the intermediate steps don't reduce the file size, like filtering, the result for those steps will also have the split- suffix after saving, and so on. The situation gets even worse if you're using some sort of text-based templating framework for the filenames (like hydra). This actually happened to me once and it was really annoying to fix.

One way to avoid this would be to use the old "neuromag" split naming, but in this case I think BIDS checkers start complaining.

Instead if the main split didn't change its name, like for Epochs now, everything would have worked the same way.
I'm not sure though, what BIDS specification has to say about this matter and what was the original thinking process behind the splits behaviour for Raw as it is now.

What do you guys think?

sappelhoff · 2023-08-09T07:33:45Z

Instead if the main split didn't change its name, like for Epochs now, everything would have worked the same way.
I'm not sure though, what BIDS specification has to say about this matter and what was the original thinking process behind the splits behaviour for Raw as it is now.

The BIDS specification says when a file is split, all parts of that file need to contain the split suffix -- so the behavior as it is for raw currently is the BIDS conformant behavior.

It is RECOMMENDED that .fif files with multiple parts use the split- entity to indicate each part.

https://bids-specification.readthedocs.io/en/latest/glossary.html#split-entities

I understand where you are coming from, but I think you'd have to build in a little logic into your reading code to deal with this problem, and/or rely on BIDS reading software like mne-bids. Although admittedly, we do have open issues about split files there, see:

[BUG] unintuitive behavior with split files mne-bids#1064

dmalt · 2023-08-09T16:21:14Z

The BIDS specification says when a file is split, all parts of that file need to contain the split suffix -- so the behavior as it is for raw currently is the BIDS conformant behavior.

Ok, I see, Epochs.save needs to change then. I can try making a PR .

...and/or rely on BIDS reading software like mne-bids

I don't think mne-bids supports splits reading and writing with the same template at the moment. I would partly solve the problem. For example, this fails:

from pathlib import Path
from mne_bids import BIDSPath, read_raw_bids
from mne.io import read_raw
from mne.datasets import sample

root_path = Path("./tmp")
subj_path = root_path / "sub-01" / "meg"
subj_path.mkdir(parents=True)

raw_fpath = sample.data_path() / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = read_raw(raw_fpath)
raw.crop(0, 10)

bp = BIDSPath(subject="01", root=root_path, suffix="meg", extension=".fif", datatype="meg")
raw.save(bp.fpath, split_size="10MB", split_naming="bids")
# produces tmp/sub-01/meg/sub-01_split-01_meg.fif, tmp/sub-01/meg/sub-01_split-02_meg.fif

loaded_raw = read_raw_bids(bp)  # fails with
# FileNotFoundError: File does not exist:
# tmp/sub-01/meg/sub-01_meg.fif
# Did you mean one of:
# sub-01_split-02_meg.fif
# sub-01_split-01_meg.fif
# instead of:
# sub-01_meg.fif

Something like this should work though, both with and without splits

loaded_raw = read_raw_bids(sorted(bp.match(), key=str)[0])

(sorted(..., key=str) is required, since we want to pick the first split, but BIDSPaths are not comparable)

mne-tools/mne-bids#1064

Splits are a source of never ending joy, it seems :)
Now that I remember, I actually had a splits-related issue with mne-bids two years ago:
mne-tools/mne-bids#712

drammock · 2023-08-09T16:26:47Z

ouch. it's really bad that raw.save(my_bidspath) generates filenames that can't be read by read_raw_bids(my_bidspath).

dmalt · 2023-08-09T17:02:50Z

BTW, is it possible to modify saving raw files in .fif.gz so all the splits live inside the same archive to avoid having splits altogether?
I think "parse, don't validate" rule applies here: it might be better to convert to a format that doesn't use splits once than to think about this corner case each time. Just a hypothetical thought :)

EDIT.
I guess, gunzip doesn't allow merging the splits together. It would have to be a tarball, so something like .fif.tar.gz instead. But still worth a thought.

agramfort · 2023-08-12T13:44:57Z

BTW, is it possible to modify saving raw files in .fif.gz so all the splits live inside the same archive to avoid this altogether?

it's not what .gz (gzip) does. It only compresses one file in mne (to make it smaller).

I think "parse, not validate" rule applies here: it might be better to convert to a format that doesn't use splits once than to think about this corner case each time. Just a thought.

it's a trade off. Using .fif guarantees that all measurement info / metadata are not lost for MEG data. I don't know any other format that guarantees this and that already has readers for most programming languages and will likely be readable in > 10 years. I would say we need to fix the way we handle split files to make it painless even if it involves a bit of magic.

…

Message ID: ***@***.***>

dmalt · 2023-08-12T17:11:39Z

it's not what .gz (gzip) does. It only compresses one file in mne (to make it smaller).

I see. Somehow I thought gzip was like zip archives.

I'm definitely not suggesting moving away from .fif. Merely having an option to wrap it into archive as a potential solution to splits. You already do the compression/decompression in io functions, when filename ends in .fif.gz. The same way you could wrap/unwrap split files (or a single file) in a tarball, when the extension is .fif.tar.gz or .fif.tar. It's not gonna solve the splits problem completely, because you can't really drop the support for the vanilla .fif, so the splits bugs will reappear from time to time. But if tarballs are advertised in the docs as a recommended way of dealing with splits, a lot of these bugs will go away for the end user. .fif splits are not meant to be treated as separate files: any time you move them apart or even rename them, everything breaks. It won't happen if they are inside an archive.

I realise there are issues with this approach but still thought it was worth suggesting. Don't know. Maybe it's all nonsense and will only complicate things further.

agramfort · 2023-08-13T08:41:35Z

I'm definitely not suggesting moving away from .fif. Merely having an option to wrap it into archive as a potential solution to splits. You already do the compression/decompression in io functions, when filename ends in .fif.gz. The same way you could wrap/unwrap split files (or a single file) in a tarball, when the extension is .fif.tar.gz or .fif.tar. It's not gonna solve the splits problem completely, because you can't really drop the support for the vanilla .fif, so the splits bugs will reappear from time to time. But if tarballs are advertised in the docs as a recommended way of dealing with splits, a lot of these bugs will go away for the end user. .fif splits are not meant to be treated as separate files: any time you move them apart or even rename them, everything breaks. It won't happen if they are inside an archive.

the issue with tarball idea is that you would need to decompress the data in memory to be able to start loading data. So I fear it will be heavy both in computation and memory.

…

Message ID: ***@***.***>

dmalt · 2023-08-16T13:31:45Z

Yes, that's true, this will break preload=False since there's no way to know where the second split starts in a tar archive without unarchiving the whole thing. Maybe with zip it's possible, since it has index. I'm not sure.

dmalt added the BUG label Aug 8, 2023

dmalt mentioned this issue Aug 12, 2023

BUG: Fix epoch splits naming #11876

Merged

dmalt mentioned this issue Aug 14, 2023

Refactor test_epochs.py::test_split_saving (1 out of 2) #11880

Merged

dmalt mentioned this issue Aug 16, 2023

Refactor test_epochs.py::test_split_saving (2 out of 2) #11884

Merged

larsoner closed this as completed in #11884 Aug 16, 2023

larsoner mentioned this issue Aug 18, 2023

BUG: read_raw_bids and raw.save round-trip with splits #11900

Open

dmalt mentioned this issue Aug 25, 2023

Proof of concept: add support for zipfile.Path when loading raw data #11924

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent splits naming between `Raw` and `Epochs` when using `BIDS` schema #11870

Inconsistent splits naming between `Raw` and `Epochs` when using `BIDS` schema #11870

dmalt commented Aug 8, 2023 •

edited

Loading

larsoner commented Aug 8, 2023

sappelhoff commented Aug 8, 2023

dmalt commented Aug 8, 2023

sappelhoff commented Aug 9, 2023

dmalt commented Aug 9, 2023 •

edited

Loading

drammock commented Aug 9, 2023

dmalt commented Aug 9, 2023 •

edited

Loading

agramfort commented Aug 12, 2023 via email

dmalt commented Aug 12, 2023

agramfort commented Aug 13, 2023 via email

dmalt commented Aug 16, 2023

Inconsistent splits naming between Raw and Epochs when using BIDS schema #11870

Inconsistent splits naming between Raw and Epochs when using BIDS schema #11870

Comments

dmalt commented Aug 8, 2023 • edited Loading

Description of the problem

Steps to reproduce

Link to data

Expected results

Actual results

Additional information

larsoner commented Aug 8, 2023

sappelhoff commented Aug 8, 2023

dmalt commented Aug 8, 2023

sappelhoff commented Aug 9, 2023

dmalt commented Aug 9, 2023 • edited Loading

drammock commented Aug 9, 2023

dmalt commented Aug 9, 2023 • edited Loading

agramfort commented Aug 12, 2023 via email

dmalt commented Aug 12, 2023

agramfort commented Aug 13, 2023 via email

dmalt commented Aug 16, 2023

Inconsistent splits naming between `Raw` and `Epochs` when using `BIDS` schema #11870

Inconsistent splits naming between `Raw` and `Epochs` when using `BIDS` schema #11870

dmalt commented Aug 8, 2023 •

edited

Loading

dmalt commented Aug 9, 2023 •

edited

Loading

dmalt commented Aug 9, 2023 •

edited

Loading