
pybids 0.9.4 issue with slowness #521

Closed
DESm1th opened this issue Oct 22, 2019 · 2 comments

DESm1th commented Oct 22, 2019

Hello!

I'm trying to use pybids with a large-ish dataset. I installed what seems to be the newest version (0.9.4) and found it to be extremely slow (it took ~40 minutes to construct the layout alone). A coworker who uses pybids insisted it shouldn't be as bad as I was experiencing, and after a bit of investigation we discovered she was using a much older version (pybids 0.7.0). I installed that version and the layout for the same dataset suddenly took less than two minutes to construct.

I decided to profile both versions and found that most of the time in the new version is spent on sqlite3-related instructions, which the older version doesn't seem to use (see the attached 'layout_x.x.x.txt' files). Do you know why there would be such an extreme performance difference between the two versions? I would imagine the sqlite3 indexing is meant to speed things up, but while layout creation is nearly instant once the database has been made, retrieving any data is still about ten times slower than with version 0.7.0 (see the two 'subject_x.x.x.txt' cProfile files I've attached for layout.get(subject='XXXXX')). I'd like to use the newest version if possible. Is there anything I can do to turn off the sqlite3 usage in the new version, or otherwise speed it up?

More details, just in case it helps:

  • OS == Ubuntu 16.04
  • sqlite3 version == 3.11.0-1ubuntu1.2
  • number of subjects == 475
  • number of files == 14,869
  • All subject data is on an NFS mounted file system
  • When using the newer version I had pybids make the database file on the local hard drive, not on NFS

layout_0.9.0.txt
layout_0.7.0.txt
subject_0.9.0.txt
subject_0.7.0.txt

@tyarkoni (Collaborator)

There's definitely plenty of optimization that could be done on the indexing side of things, but the main reason for moving to a DB was actually robustness and code maintainability—adding features that involve merging/joining data was becoming a nightmare, and is now much more straightforward. Performance is secondary.

For what it's worth, it's unlikely that the DB-backed version will ever be as fast as 0.7... there's considerable overhead associated with the SQLAlchemy ORM layer (in addition to the SQLite DB itself).

One thing to note is that the 0.7 series didn't index metadata by default, whereas the present version does. So it's possible that a lot of the difference is coming from that. If you don't need to, e.g., search files by metadata keys, you can initialize the BIDSLayout with index_metadata=False, and that may help quite a bit.
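A minimal sketch of that suggestion, assuming pybids >= 0.9 (where `BIDSLayout` accepts `index_metadata`); the dataset path in the usage note is a placeholder:

```python
def make_layout(root, index_metadata=False):
    """Build a BIDSLayout, skipping metadata indexing by default.

    Skipping metadata indexing avoids reading every JSON sidecar
    at index time, which can substantially speed up construction.
    Assumes pybids >= 0.9.
    """
    from bids import BIDSLayout  # imported lazily so this sketch loads without pybids
    return BIDSLayout(root, index_metadata=index_metadata)

# Hypothetical usage (the path is a placeholder):
#   layout = make_layout("/data/my_bids_dataset")
#   layout.get(subject="01")
```

The trade-off, as noted above, is that you can no longer query files by metadata keys.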

Note also that you can .save() a BIDSLayout and load from the saved file by passing database_file at initialization. So while it may take a long time to initialize the first time you index, you at least don't need to reindex every time you run some code.
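That index-once, reload-later pattern might look something like this (a sketch; `.save()` and the `database_file` initialization argument are as described above, and the paths are placeholders):

```python
import os

def build_or_load(root, db_file, rebuild=False):
    """Index a BIDS dataset once, then reload the saved index on later runs.

    Assumes pybids 0.9.x: BIDSLayout.save() persists the SQLite index,
    and database_file= loads it back without re-walking the dataset.
    """
    from bids import BIDSLayout  # imported lazily so this sketch loads without pybids
    if rebuild or not os.path.exists(db_file):
        layout = BIDSLayout(root)   # slow first pass: walks the whole dataset
        layout.save(db_file)        # persist the index for next time
    else:
        layout = BIDSLayout(root, database_file=db_file)  # near-instant reload
    return layout

# Hypothetical usage (paths are placeholders):
#   layout = build_or_load("/data/my_bids_dataset", "/local/scratch/layout.db")
```

Keeping the database file on a local disk rather than NFS, as the reporter did, is also sensible here.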

@tyarkoni (Collaborator)

Closing this for now, as a major refactor of the indexing code is unlikely to happen any time soon, and the tips above plus #523 should help some.
