[RFC] Enabling Cython-based PDB parser backend for speed improvements #139

Open
a-r-j opened this issue Aug 29, 2023 · 4 comments

Comments

@a-r-j
Contributor

a-r-j commented Aug 29, 2023

Describe the workflow you want to enable

Currently, the pure-Python PDB parsing in BioPandas is quite slow, certainly too slow for high-throughput structural bioinformatics or ML.

Describe your proposed solution

I have written a Cython-based implementation (CPDB) which is considerably faster, and I would like to set it as the default parsing backend. As it stands, I believe this to be one of the fastest (if not the fastest) PDB parsers available for Python.

[Screenshot: performance comparison]

However, given BioPandas' widespread usage, I am unclear if distributing this with a Cython component will lead to dependency problems for users.
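
One common way to limit that risk is to treat the compiled parser as optional and fall back to the pure-Python path when it cannot be imported. A minimal sketch of that pattern follows. The module name `fast_pdb_backend` and its `parse` attribute are hypothetical placeholders, not the actual CPDB or BioPandas API, and the fallback parser is only a stand-in.

```python
import importlib
import importlib.util


def pure_python_parse(lines):
    """Stand-in fallback parser: keep only coordinate records."""
    return [ln for ln in lines if ln.startswith(("ATOM", "HETATM"))]


def select_backend(module_name="fast_pdb_backend"):
    """Return (backend_name, parse_fn), preferring a compiled module if present.

    `module_name` and its `parse` attribute are hypothetical; substitute
    whatever the compiled extension actually exposes.
    """
    if importlib.util.find_spec(module_name) is not None:
        module = importlib.import_module(module_name)
        return "compiled", module.parse
    return "pure-python", pure_python_parse


backend_name, parse = select_backend()
print(f"Using the {backend_name} backend")
```

With this layout a failed or missing build degrades to the slower parser instead of breaking the install, at the cost of users possibly not noticing that they are on the slow path.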

Describe alternatives you've considered, if relevant

Speeding up the passage of time

Additional context

@rasbt
Member

rasbt commented Aug 29, 2023

@a-r-j This is super cool.

Btw, perhaps we don't need to worry about extra dependencies here, because NumPy already uses Cython (https://github.com/numpy/numpy/blob/main/build_requirements.txt), pandas is built on NumPy, and BioPandas is built on pandas :P

@a-r-j
Contributor Author

a-r-j commented Aug 29, 2023

That's a good point! I was mostly concerned about the potential for build problems (mostly as cpdb is my first time working with Cython). I'll make a PR tonight and push a dev release so we can collect some feedback.

@Ruibin-Liu
Contributor

One difference in the comparison is that your Cython implementation only reads ATOM, HETATM, and ENDMDL lines, while BioPandas reads all of them. It would be interesting to compare the performance if all lines are read (no need to parse them like BioPandas?).

@a-r-j
Contributor Author

a-r-j commented Aug 30, 2023

@Ruibin-Liu Hmm, that's a really great point. I could add a read_header arg to cpdb. In any case, I wouldn't have thought it would make a huge difference to speed; in terms of line count, PDB files are mostly coordinates.
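
The line-count claim is easy to check for any given file. A rough sketch, using only the standard library, that tallies record types from the first six columns of each line and reports the share that are ATOM or HETATM records. The function names and the toy `SAMPLE` text are made up for illustration; real files would be read from disk.

```python
from collections import Counter

COORD_RECORDS = {"ATOM", "HETATM"}


def record_counts(pdb_text):
    """Count lines per record type (columns 1-6 of each non-blank line)."""
    return Counter(
        line[:6].strip() for line in pdb_text.splitlines() if line.strip()
    )


def coordinate_fraction(pdb_text):
    """Fraction of non-blank lines that are ATOM or HETATM records."""
    counts = record_counts(pdb_text)
    total = sum(counts.values())
    coord = sum(counts[name] for name in COORD_RECORDS)
    return coord / total if total else 0.0


# Toy input for illustration only; not taken from a real structure.
SAMPLE = """\
HEADER    TOY EXAMPLE
REMARK    made-up lines for illustration only
ATOM      1  N   ALA A   1      11.104  13.207   2.100  1.00 20.00           N
ATOM      2  CA  ALA A   1      12.560  13.300   2.300  1.00 20.00           C
HETATM    3  O   HOH A   2      15.000  10.000   5.000  1.00 30.00           O
END
"""

print(record_counts(SAMPLE))
print(coordinate_fraction(SAMPLE))
```

Running this over a representative set of files would show how much of the input a header-aware parser has to handle beyond the coordinate records.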
