Evaluate ancpbids as a successor to bids.layout #831

Open

adelavega opened this issue Apr 1, 2022 · 18 comments

@adelavega
Collaborator

adelavega commented Apr 1, 2022

The ancpbids project has made formidable progress in implementing a BIDSLayout-like API, with marked performance improvements, using a different underlying implementation with a custom query language.

Given limited resources to maintain key community-led BIDS infrastructure, it is important to compare pybids to ancp-bids, and evaluate the possibility of combining efforts in order to prevent a fragmentation of the ecosystem.

@adelavega
Collaborator Author

adelavega commented Apr 1, 2022

ancpBIDS:

  • instead of SQL, we have a custom query language without any external dependencies
  • the data structure used to query against is an in-memory graph representation of the BIDS dataset
  • the schema definitions are used to generate code that can assist in developing pipelines using professional IDEs (like PyCharm or VS Code) AND to drive the validation
  • most concepts are implemented as plug-ins
  • each BIDS schema version is represented in its own Python module, which allows the schema to be chosen dynamically according to the BIDSVersion field in the dataset (at the moment, 1.7.0 and 1.7.1 are available)
  • the most important aspects of BIDSLayout are covered, with no modality-specific functions, to keep the API clean (see the sketch below)
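
For concreteness, here is a minimal usage sketch of that facade. It assumes ancpbids mirrors the pybids-style get() call, as the points above suggest; the dataset path is a placeholder.

from ancpbids import BIDSLayout

# One file-system scan builds the in-memory graph; no SQL database involved.
layout = BIDSLayout("/data/ds000001")

# All BOLD NIfTI files for subject 01, queried pybids-style (assumed API).
bold_files = layout.get(subject="01", suffix="bold", extension=".nii.gz",
                        return_type="filename")
print(bold_files)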

Additional note:
There is an example analysis pipeline that demonstrates how ancpBIDS can be used:
https://github.com/ANCPLabOldenburg/ancp-bids-apps/blob/main/ancpbidsapps/analysis/nilearn_first_level.py

Originally posted by @erdalkaraca in #818 (comment)

@adelavega
Collaborator Author

adelavega commented Apr 1, 2022

The main current limitations of pybids:

  • Poor performance on large datasets
    We should determine to what extent SQL is the cause of the performance problems in pybids, or whether it's more related to meta-data ingestion and poorly optimized supporting code.

  • Maintainability
    @tsalo raised the concern that the SQL underpinnings of pybids are difficult for people to contribute to.
    We need to evaluate whether ancpbids is any easier for new-ish contributors to work on. If anything, SQL is something more people are familiar with, since it is used in many other places. A developers' guide for pybids would also be useful.

  • Extensibility
    pybids does not use BIDS Schema (yet). It is unclear to what extent decisions have been hardcoded into pybids.
    See: Define scope of project pybids-light#1

  • Modularity
    pybids has become somewhat bloated dependency-wise. Which heavy dependencies would we like to drop from a "core" or "light" version of pyBIDS, and to what extent are they actually a problem?
    Can this be solved with extras in Python packaging, or is it better to fully modularize each sub-module into its own PyPI package/repo? My main concern here is that splitting things up into different code bases sometimes causes its own set of headaches (a packaging sketch follows this list). cc: @yarikoptic
    See: Proposal: bids.ext namespace for extension packages #817
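
On the extras question in the Modularity item, a minimal sketch of how optional-dependency groups could split off the heavy pieces. This assumes a setuptools build; the group name and dependency lists are illustrative, not pybids' actual packaging.

from setuptools import setup, find_packages

setup(
    name="pybids",
    packages=find_packages(),
    # Core stays lightweight (names illustrative).
    install_requires=["sqlalchemy"],
    # Heavy optional pieces live in extras and are only installed on request.
    extras_require={
        "modeling": ["numpy", "pandas", "nibabel"],
    },
)

Users would then opt in explicitly, e.g. pip install "pybids[modeling]", while a plain pip install pybids stays light.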

@adelavega
Collaborator Author

An informative exercise may be to use ancpbids instead of BIDSLayout in other modules of pybids, such as variables and modeling. This would highlight differences in functionality between the two packages. We could also create a suite of benchmarks that either package should be able to run, in order to evaluate them more rigorously, also in terms of performance; a rough sketch follows.
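
A rough sketch of what such a benchmark could look like, assuming both packages expose a pybids-style BIDSLayout with get(); the dataset path is a placeholder.

import time

def bench(layout_cls, path):
    # Time layout construction plus one representative query.
    start = time.time()
    layout = layout_cls(path)
    files = layout.get(suffix="bold", return_type="filename")
    return time.time() - start, len(files)

if __name__ == "__main__":
    from bids import BIDSLayout as PybidsLayout
    from ancpbids import BIDSLayout as AncpLayout

    dataset = "/data/ds000001"  # placeholder
    for name, cls in [("pybids", PybidsLayout), ("ancpbids", AncpLayout)]:
        seconds, n_files = bench(cls, dataset)
        print(f"{name}: {n_files} bold files in {seconds:.1f} s")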

@erdalkaraca

I initially analyzed the runtime performance behavior of pybids: most of the time was spent initializing the SQL database through SQLAlchemy. Note that SQLAlchemy, as an object-relational mapper, is geared toward business applications, where its convenience mechanisms for talking to a database pay off because the business domain is complex. For pybids, which has only a handful of domain entities (BIDSFile, Entity, FileAssociation, etc.), it would have been better not to use SQLAlchemy as a mediation layer. Communicating directly with the underlying SQLite database through Python's standard DB-API (the built-in sqlite3 module) would already give significant performance benefits.
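
As a minimal sketch of that direct DB-API route (the table schema and column names are illustrative, not pybids' actual schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, subject TEXT, suffix TEXT)")

records = [
    ("sub-01/func/sub-01_task-rest_bold.nii.gz", "01", "bold"),
    ("sub-02/func/sub-02_task-rest_bold.nii.gz", "02", "bold"),
]
# executemany inserts rows in bulk, with no per-row ORM object construction.
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", records)
conn.commit()

print(conn.execute("SELECT path FROM files WHERE subject = '01'").fetchall())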

@erdalkaraca

I think that was the initial intention behind a "pybids lite", which mutated into ancpBIDS.

There is already a very simple benchmark (as unit tests) that we could further elaborate on:
https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/manual/test_benchmark.py


@tsalo changed the title from "Evaluate ancpbids as a sucesssor to bids.layout" to "Evaluate ancpbids as a successor to bids.layout" on Apr 1, 2022
@erdalkaraca

I just ran the "benchmark" unit test. Note the performance difference (4 sec vs. 1 min):

[screenshot: benchmark output showing ancpbids completing in ~4 s vs. pybids in ~1 min]

@adelavega
Collaborator Author

That's great. Question: is ancpbids indexing meta-data by default? I think it would be useful to try pybids in those unit tests with index_metadata=False to evaluate the impact of meta-data, e.g.:
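
from bids import BIDSLayout

# Skips sidecar-JSON ingestion, isolating file indexing from metadata cost.
layout = BIDSLayout("/data/ds000001", index_metadata=False)  # placeholder path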

@erdalkaraca

ancpbids scans the file system once and builds up the graph in memory. As the amount of meta-data is very small compared to the imaging data, meta-data is simply part of the graph, stored as additional leaf nodes. Since memory access is fast, there is no need to index meta-data.
For pybids, I just tried with index_metadata=False and execution time decreased to 53 sec.
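
A toy illustration of the graph idea above (these are not ancpbids' actual classes; the names are made up for the sketch):

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    metadata: dict = field(default_factory=dict)   # sidecar JSON, if any
    children: list = field(default_factory=list)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

# The whole dataset is one in-memory tree; meta-data sits on leaf nodes.
root = Node("ds000001", children=[
    Node("sub-01", children=[
        Node("sub-01_task-rest_bold.nii.gz", metadata={"RepetitionTime": 2.0}),
    ]),
])

# A query is then plain object traversal, with no separate index:
print([n.name for n in root.walk() if n.metadata.get("RepetitionTime") == 2.0])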

If you have a more specific test case, let me know; I am happy to try it out.

@adelavega
Collaborator Author

Another idea is to try to run the main unit tests of pybids with ancpbids. Obviously many will fail simply because the APIs are not identical, but it would be useful to compare the differences.

@erdalkaraca

There are several unit tests which use BIDSLayout as their querying interface, for example:
https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/tests/auto/test_query.py

I may refactor/extract some use cases into the benchmark to make it more exhaustive.
I filed a new issue to track this idea:
ANCPLabOldenburg/ancp-bids#49

@adelavega
Collaborator Author

adelavega commented Apr 1, 2022

@erdalkaraca I'm trying to reconcile your performance numbers with a previous investigation into performance:

see:

Basically, we concluded that most of the time was actually spent in os.walk and not in SQLAlchemy. Hence, I'm a bit confused about how you managed to optimize this so much. It's possible we didn't profile it correctly and drew the wrong conclusion.

cc: @gkiar do you remember where your profiling results were?

Note: some of this profiling is pretty old, so it's possible that pybids pre-SQL was even slower.

I also think it's interesting to note that @tyarkoni said he moved to SQL for maintainability, which goes against @tsalo's experience.

Not sure I have the right answer here; just trying to pin down the difference in approaches.

@gkiar
Contributor

gkiar commented Apr 1, 2022

My profiling results are in #285; good luck!

@adelavega
Collaborator Author

adelavega commented Apr 1, 2022

@erdalkaraca What's the best way to contact you? We're having a BIDS event soon that I'd like to invite you to participate in, to discuss this further. If you want, you can reach me at the email listed on my profile.

@erdalkaraca

On our lab server (using network storage), the dataset used in the benchmark took 6.5 min to load/index with pybids. Of those 6.5 min, I could pinpoint 50% of the execution time to SQLAlchemy interaction, the rest mostly being file-system operations like os.walk, as you mentioned (a minimal profiling recipe follows).
On a lab dataset with lots of derivatives, a lab member reported that pybids took more than an hour to load/index the dataset and about 2 min when re-using the database. With ancpbids it took only 13 sec.
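
For anyone who wants to reproduce this kind of breakdown, a minimal recipe using only the standard-library profiler (the dataset path is a placeholder; requires Python 3.8+ for the context-manager form):

import cProfile
import pstats

from bids import BIDSLayout

with cProfile.Profile() as profiler:
    BIDSLayout("/data/large_dataset")

# Rank by cumulative time to see whether sqlalchemy or os.walk dominates.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)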


@erdalkaraca

@gkiar out of curiosity: could you use ancpbids to load the dataset you mentioned in #285?

You would just have to `from ancpbids import BIDSLayout` after `pip install ancpbids`.

@gkiar
Contributor

gkiar commented Apr 5, 2022

Happily! I'm a bit swamped and don't have the reference environment anymore, so it'll have to wait a bit, but consider it added to my plate 🙂 If you attend the BIDS meeting @adelavega mentioned, we can also sync up more about this then.

@gkiar
Contributor

gkiar commented May 2, 2022

Updated profiling...

Tests

def ancp_test(pth):
    """Time how long ancpbids takes to build a layout for the dataset at pth."""
    import time
    from ancpbids import BIDSLayout
    start = time.time()
    bl = BIDSLayout(pth)
    dur = time.time() - start
    return bl, dur


def pybids_test(pth):
    """Time how long pybids takes to build a layout for the dataset at pth."""
    import time
    from bids.layout import BIDSLayout
    start = time.time()
    bl = BIDSLayout(pth)
    dur = time.time() - start
    return bl, dur

Results

Nsubs  Library   Time
  492  pybids     348.6 s
  492  ancpbids     7.7 s
 1334  pybids    1215.4 s
 1334  ancpbids    17.3 s

Next Step

  • Verify that metadata indexing is of similar quality (a parity spot-check sketch follows this list)
  • Evaluate similarity in API (in particular, for metadata extraction)
  • Identify the source of the super-linear slowdown in pybids
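
For the first two items, a quick parity spot-check sketch. It assumes ancpbids exposes a pybids-style get_metadata(), which this thread has not yet confirmed; the dataset path and target file are placeholders.

from bids.layout import BIDSLayout as PybidsLayout
from ancpbids import BIDSLayout as AncpLayout

dataset = "/data/ds000001"                            # placeholder
target = "sub-01/func/sub-01_task-rest_bold.nii.gz"   # illustrative file

pb = PybidsLayout(dataset)
ab = AncpLayout(dataset)

# Compare the merged sidecar metadata each layout resolves for the same file.
print(pb.get_metadata(target) == ab.get_metadata(target))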

@erdalkaraca this is looking great! I'll keep poking at it this week. Will you be online?

@erdalkaraca

Thanks a lot for posting the results, Greg @gkiar

Yes, I will be online for the zoom/discord call this week to talk about the implementation.
