[ENH] Accept Parquet (<entities>_<suffix>.parquet) as alternative to .tsv and .tsv.gz formats #1792
Comments
Well put. I would just add that the new PR to include HDF5 and/or Zarr (#1614), plus your comments about how this could be extended to handle tabular data (https://github.com/bids-standard/bids-specification/pull/1614/files#r1499517851), ameliorates most of my concerns. I still like Parquet here, but I also feel that my concerns could be mostly addressed by using HDF5 and/or Zarr in the way you propose there.
I am +1 on the proposal and +1 on how Chris describes it (I totally agree with that "Well put" by @bendichter). Just a nuance:
The problems with TSV in BEP 020 are more about the not-very-explicit-but-not-implicit enforcement in BIDS that TSV.GZ files MUST encode only continuous, regularly sampled, and single-epoch data. This could easily be worked around by:
I would imagine that this issue is orthogonal to the actual data format.
I agree with @oesteban that an optional timestamps column would be helpful, though I think that's a separable issue from the file type discussion. Maybe we could discuss it in a new issue?
Your idea
BIDS has generally followed the convention of adopting human-readable or widely-adopted standards for its files. At 1.0, we used .tsv for all tabular files except physiological and stimulus recordings, which use a headerless .tsv.gz format. In 1.9, we added a headerless motion.tsv file, which is quite large. The eye-tracking BEP (#1128) is underway and is having to cope with some limitations in the TSV options.
As of 2024, the Apache Parquet format has seen over a decade of development. The format specification is open, and there is a project (Arrow) that includes native libraries or bindings for Python, MATLAB, R, Julia, Java, JavaScript and C, among others.
For data that do not benefit from human readability (TSV files > ~1k lines), Parquet offers advantages such as typed columns, chunked compression, and no round-trips between floating point and ASCII decimal representations.
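
To make the round-trip point concrete, here is a minimal sketch using only the Python standard library. It does not use Parquet itself: `struct` stands in for any binary float64 storage, and the six-digit text precision is an arbitrary assumption for illustration.

```python
import csv
import io
import struct

samples = [0.1 + 0.2, 1.0 / 3.0, 2.0 ** -40]

# TSV path: values become decimal text with a fixed number of digits
# (six here, an arbitrary choice for this sketch) and come back as strings.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
for x in samples:
    writer.writerow([f"{x:.6f}"])
reader = csv.reader(io.StringIO(buf.getvalue()), delimiter="\t")
tsv_strings = [row[0] for row in reader]
tsv_values = [float(s) for s in tsv_strings]  # the reader must know the type

# Binary path: each value is stored as an IEEE 754 float64,
# so no decimal conversion takes place.
packed = struct.pack(f"<{len(samples)}d", *samples)
binary_values = list(struct.unpack(f"<{len(samples)}d", packed))

print(tsv_values == samples)     # False
print(binary_values == samples)  # True
```

A TSV writer can avoid the loss by emitting enough digits (Python's `repr` does), at the cost of longer text, and the reader still has to infer or be told each column's type.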
I propose the following:
- .parquet files anywhere that a TSV or TSV-GZ file is currently permitted.
- .tsv for high-level metadata tables, such as participants.tsv, *_sessions.tsv and *_scans.tsv as well as *_channels.tsv, *_electrodes.tsv and similar metadata files.
This is pulled out of #197, which is about N-dimensional data. I am excerpting the relevant recent posts here:
@satra (#197 (comment))
@effigies (#197 (comment))
@bendichter (#197 (comment))
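
As a rough sketch of how the two-part proposal could be expressed as a check, assuming a hypothetical helper `may_use_parquet` (not part of any BIDS tool): the pattern list contains only the files named in the proposal, and "similar metadata files" is left open there, so the list is illustrative rather than complete.

```python
from fnmatch import fnmatch

# Files named in the proposal as staying TSV. Illustrative, not exhaustive.
METADATA_PATTERNS = (
    "participants.tsv",
    "*_sessions.tsv",
    "*_scans.tsv",
    "*_channels.tsv",
    "*_electrodes.tsv",
)


def may_use_parquet(filename: str) -> bool:
    """Return True if a file currently stored as TSV or TSV-GZ could
    alternatively be stored as .parquet under the proposal."""
    if not filename.endswith((".tsv", ".tsv.gz")):
        return False  # the proposal only concerns tabular files
    return not any(fnmatch(filename, p) for p in METADATA_PATTERNS)


print(may_use_parquet("sub-01_task-rest_physio.tsv.gz"))  # True
print(may_use_parquet("participants.tsv"))                # False
```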