Evaluate ancpbids as a successor to bids.layout #831

adelavega · 2022-04-01T17:27:47Z

The ancpbids project has made formidable progress in implementing a BIDSLayout-like API, with marked performance improvements, using a different underlying implementation with a custom query language.

Given limited resources to maintain key community-led BIDS infrastructure, it is important to compare pybids to ancp-bids, and evaluate the possibility of combining efforts in order to prevent a fragmentation of the ecosystem.

The text was updated successfully, but these errors were encountered:

adelavega · 2022-04-01T17:29:33Z

ancpBIDS:

instead of SQL, we have a custom query language without any external dependencies

the data structure used to query against is an in-memory graph representation of the BIDS dataset

the schema definitions are used to generate code that can assist in developing pipelines using professional IDEs (like PyCharm or VS Code) AND to drive the validation

most concepts are implemented as plug-ins

each BIDS schema version is represented in its own Python module which allows to dynamically choose the schema according to the BIDSVersion field in the dataset (at the moment, 1.7.0 and 1.7.1 are available)

most important aspects of BIDSLayout are covered, no modality specific functions to keep the API clean

Additional note:
There is an exemplary analysis pipeline which demonstrates how ancpBIDS can be used:
https://github.com/ANCPLabOldenburg/ancp-bids-apps/blob/main/ancpbidsapps/analysis/nilearn_first_level.py

Originally posted by @erdalkaraca in #818 (comment)

adelavega · 2022-04-01T17:33:05Z

The main current limitations of pybids:

Poor performance on large datasets
We should determine is to what extent SQL is the cause of the performance problems in pybids, or if its more related to meta-data ingestion and poorly optimized supporting code.
Maintainability
@tsalo raised the concern that the SQL underpinnings of pybids are difficult for people to contribute to.
We need to evaluate if ancpbids is any easier to contribute to for new-ish contributors. If anything, SQL is something more ppl have familiarity with as its used in many other places. A developers guide for pybids would also be useful.
Extensibility
pybids does not use BIDS Schema (yet). It is unclear to what extent decisions have been hardcoded into pybids.
See: Define scope of project pybids-light#1
Modularity
pybids has become somewhat bloated dependency wise. What are the heavy dependencies that we would like to get rid of from a "core" or "light" version of pyBIDS, and to what extent they are a problem?
Can this be solved with extras in Python packaging, or is it better to fully modularize each sub-module into own pypi package / repo. My main concern here is that sometimes splitting things up into different code bases also causes its own set of headaches. cc: @yarikoptic
See: Proposal: bids.ext namespace for extension packages #817

adelavega · 2022-04-01T17:55:08Z

An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to do in order to more rigorously evaluate them (also in terms of performance)

erdalkaraca · 2022-04-01T18:32:31Z

I initially analyzed the runtime performance behavior of pybids: most of the time was used to initialize the SQL database using SQLAlchemy. Note that SQLAlchemy as a object relational manager is rather used in business applications providing significant convenience mechanisms to simplify communication with a database when the business domain is rather complex. For pybids, it would have been better to not use SQLAlchemy as a mediation layer as there are only a handful of domain entities (BIDSFile, Entity, FileAssociation, etc.). I.e. directly communicating with the underlying SQLite database without SQLAlchemy (using the db-api package) would already give significant performance benefits.

erdalkaraca · 2022-04-01T18:44:45Z

I think that was the initial intention behind a "pybids lite" which mutated into ancpBIDS.

There is already a very simple benchmark (as unit tests) that we could further elaborate on:
https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/manual/test_benchmark.py

An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to do in order to more rigorously evaluate them (also in terms of performance)

erdalkaraca · 2022-04-01T18:58:06Z

I just ran the "benchmark" unit test. Note the performance difference (4 sec vs. 1 min):

adelavega · 2022-04-01T19:44:15Z

That's great. Question: is ancp bids indexing meta-data by default? I think it would be useful to try pybids in those unit tests with index_metadata=False to evaluate the impact of meta-data.

erdalkaraca · 2022-04-01T19:51:11Z

ancpbids scans the file system once and builds up the graph in-memory. As the amount of meta-data is very low compared to the imaging data, meta-data is part of the graph as additional leaf nodes. As memory access is fast, there is no need to index meta-data.
For pybids, I just tried with index_metadata=False and execution time decreased to 53 sec.

If you have a more specific test case, let me know, I am happy to try out.

adelavega · 2022-04-01T19:55:26Z

Another idea is to try to run the main unit tests in pybids w/ ancp bids. Obviously many will fail simply because the APIs are not identical, but it would be useful to compare differences.

erdalkaraca · 2022-04-01T20:02:08Z

There are several unit tests which use BIDSLayout as their querying interface, for example:
https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/auto/test_query.py

I may refactor/extract some use cases into the benchmark to make it more exhaustive.
I filed an new issue to track this idea:
ANCPLabOldenburg/ancp-bids#49

adelavega · 2022-04-01T22:15:01Z

@erdalkaraca I'm trying to reconcile your performance numbers w a previous investigation into performance:

see:

Basically, we concluded that most time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused how you manage to optimize on this so much. It's possible we didn't profile this correctly and made the wrong conclusion.

cc: @gkiar do you remember where your profiling results were?

Note: some of this profiling is pretty old, so its possible pybids pre SQL was EVEN slower

I also think its interesting to note that @tyarkoni noted he moved to SQL for maintainability, which goes against @tsalo's experience.

Not sure I have the right answer here but just trying to pin down the difference in approaches.

gkiar · 2022-04-01T22:19:20Z

My profiling results are in #285 ; good luck!

adelavega · 2022-04-01T22:35:39Z

@erdalkaraca what's the best way to contact you? we're having a bids event soon that i'd like to invite you to participate in to discuss this further. if you want, you can reach me at the email listed on my profile.

erdalkaraca · 2022-04-01T23:04:48Z

On our lab server (using network storage) the dataset used in the benchmark took 6.5 mins to load/index using pybids. Of the 6.5 mins I could pinpoint 50% execution time to sqlalchemy interaction, the rest mostly being file system operations like os.walk as you mentioned.
On a lab dataset with lots of derivatives, a lab member reported that pybids took more than an hour to load/index the dataset and about 2 min when re-using the database. With ancpbids it took only 13 sec.

@erdalkaraca I'm trying to reconcile your performance numbers w a previous investigation into performance:

see:

BIDSLayout performance on very large datasets, continued #609

pybids 0.9.4 issue with slowness #521

Basically, we concluded that most time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused how you manage to optimize on this so much. It's possible we didn't profile this correctly and made the wrong conclusion.

cc: @gkiar do you remember where your profiling results were?

Note: some of this profiling is pretty old, so its possible pybids pre SQL was EVEN slower

I also think its interesting to note that @tyarkoni noted he moved to SQL for maintainability, which goes against @tsalo's experience.

Not sure I have the right answer here but just trying to pin down the difference in approaches.

erdalkaraca · 2022-04-01T23:19:46Z

@gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?

you would have to from ancpbids import BIDSLayout after pip install ancpbids

gkiar · 2022-04-05T18:15:09Z

@gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?

you would have to from ancpbids import BIDSLayout after pip install ancpbids

Happily! I'm a bit swamped and don't have the reference environment anymore, so it'll have to wait a bit, but consider it added to my plate 🙂 If you attend the BIDS meeting @adelavega mentioned, we can also sync up more about this, then.

gkiar · 2022-05-02T21:09:18Z

Updated profiling...

Tests

In [1]: def ancp_test(pth):
    ...:     import time
    ...:     from ancpbids import BIDSLayout
    ...:     start = time.time()
    ...:     bl = BIDSLayout(pth)
    ...:     dur = time.time() - start
    ...:     return bl, dur
    ...: 

In [2]: def pybids_test(pth):
    ...:     import time
    ...:     from bids.layout import BIDSLayout
    ...:     start = time.time()
    ...:     bl = BIDSLayout(pth)
    ...:     dur = time.time() - start
    ...:     return bl, dur
    ...:

Results

Nsubs	Library	Time
492	`pybids`	348.6 s
492	`ancpbids`	7.7 s
1334	`pybids`	1215.4 s
1334	`ancpbids`	17.3 s

Next Step

Verify that metadata indexing is of similar quality
Evaluate similarity in API (in particular, for metadata extraction)
Identify source of super-linear slowdown in pybids

@erdalkaraca this is looking great! I'll keep poking at it this week. Will you be online?

erdalkaraca · 2022-05-02T21:21:26Z

Thanks a lot for posting the results, Greg @gkiar

Yes, I will be online for the zoom/discord call this week to talk about the implementation.

adelavega mentioned this issue Apr 1, 2022

Use BIDS schema to "drive" pybids #818

Open

adelavega added the question label Apr 1, 2022

tsalo changed the title ~~Evaluate ancpbids as a sucesssor to bids.layout~~ Evaluate ancpbids as a successor to bids.layout Apr 1, 2022

gkiar mentioned this issue Apr 28, 2022

🐛 Need more flexible output/source directory ingress FCP-INDI/C-PAC#1723

Open

4 tasks

adelavega added the layout label May 4, 2022

gkiar mentioned this issue May 4, 2022

[WIP] Replace BIDSLayout backend with ancpbids #851

Closed

2 tasks

erdalkaraca mentioned this issue Jun 11, 2022

Pre-submission enquiry openjournals/joss#1090

Closed

effigies mentioned this issue Nov 18, 2022

Lightweight BIDS Layouts for all brainhackorg/global2022#87

Open

2 tasks

alyssadai mentioned this issue Mar 21, 2023

See if we can speed things up with ancpBIDS neurobagel/bagel-cli#8

Closed

3 tasks

htwangtw mentioned this issue Apr 18, 2023

Relax requirement for input BIDS data structure for first level GLM nilearn/nilearn#3567

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate ancpbids as a successor to bids.layout #831

Evaluate ancpbids as a successor to bids.layout #831

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

adelavega commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

adelavega commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

adelavega commented Apr 1, 2022 •

edited

Loading

gkiar commented Apr 1, 2022

adelavega commented Apr 1, 2022 •

edited

Loading

erdalkaraca commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

gkiar commented Apr 5, 2022

gkiar commented May 2, 2022

erdalkaraca commented May 2, 2022

Evaluate ancpbids as a successor to bids.layout #831

Evaluate ancpbids as a successor to bids.layout #831

Comments

adelavega commented Apr 1, 2022 • edited Loading

adelavega commented Apr 1, 2022 • edited Loading

adelavega commented Apr 1, 2022 • edited Loading

adelavega commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

adelavega commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

adelavega commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

adelavega commented Apr 1, 2022 • edited Loading

gkiar commented Apr 1, 2022

adelavega commented Apr 1, 2022 • edited Loading

erdalkaraca commented Apr 1, 2022

erdalkaraca commented Apr 1, 2022

gkiar commented Apr 5, 2022

gkiar commented May 2, 2022

Tests

Results

Next Step

erdalkaraca commented May 2, 2022

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022 •

edited

Loading

adelavega commented Apr 1, 2022 •

edited

Loading