Address Pandas Dataframe Merge and Other Performance Issues #199

Closed
d33bs opened this issue May 21, 2022 · 3 comments


@d33bs

d33bs commented May 21, 2022

Pandas DataFrame merge (and likely other) performance during various runtime operations may hinder or completely stall progress. This issue is dedicated to addressing Pandas merge and other performance challenges, including solutions which may not involve Pandas or which migrate away from it.

Issues which may be related or tied to this:

@d33bs

d33bs commented Jun 1, 2022

Hi @gwaygenomics, just wanted to provide a quick update on this issue based on some findings from yesterday. I ran some tests using some of the work from #198 to see just how much we could reduce memory consumption. In addition to modifying the dataset to include NULLs in the float columns and using ConnectorX for the SQLite read, I found a few performance tweaks that decreased memory consumption:

  1. As-is performance with SQLite 'nan's:
  2. ConnectorX reads for load_compartment and with SQLite 'nan's replaced by NULL:
    • Duration: ~6 minutes (down ~22 minutes from as-is)
    • Peak Memory Consumption: 42.4 GB (down 22.3 GB from as-is)
    • Link to flamegraph
  3. Same as "2." and with the above "performance tweaks" added.
    • Duration: ~5 minutes (down ~23 minutes from as-is)
    • Peak Memory Consumption: 32.7 GB (down 32 GB from as-is)
    • Link to flamegraph
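As a rough illustration of the 'nan'-to-NULL change mentioned above, here is a minimal sketch using only Python's built-in `sqlite3` module. The table name `cells`, the column names, and the helper `nan_text_to_null` are hypothetical and made up for this example; the actual change in #198 may well have been done differently.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (ObjectNumber INTEGER, AreaShape_Area FLOAT)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?)",
    # The second row stores the text 'nan' in a float column.
    [(1, 10.5), (2, "nan"), (3, 7.25)],
)


def nan_text_to_null(conn, table, column):
    """Replace text 'nan' values in one column with SQL NULL.

    Identifiers are interpolated directly into the statement, so this
    should only be used with trusted table and column names.
    Returns the number of rows changed.
    """
    cur = conn.execute(
        f"UPDATE {table} SET {column} = NULL "
        f"WHERE typeof({column}) = 'text' AND lower({column}) = 'nan'"
    )
    conn.commit()
    return cur.rowcount


replaced = nan_text_to_null(conn, "cells", "AreaShape_Area")
values = [
    row[0]
    for row in conn.execute(
        "SELECT AreaShape_Area FROM cells ORDER BY ObjectNumber"
    )
]
print(replaced, values)
```

With NULLs in place of the text values, a reader that maps SQLite NULL to a missing float should be able to keep the column numeric rather than falling back to a mixed or string type, though how much this helps will depend on the reader and the data.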

I'm still working on the merges themselves, as they still seem to be a major source of memory consumption. While exploring performance with another resource profiling tool, Scalene, I found that it flags the Pandas merges as a possible source of memory leakage (based on some assumptions built into that library), but only when using large datasets. I've stored some early results from this tool here. Please note that Scalene and Memray measure performance differently, so they do not show exactly the same results.
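For anyone wanting a quick look at merge memory without setting up Memray or Scalene, here is a minimal sketch using the standard library's `tracemalloc` around a single `DataFrame.merge` call. The helper `peak_memory_mib`, the synthetic data, and the column names are assumptions made for this example and are not taken from the project. `tracemalloc` only sees allocations reported through Python's allocator hooks, so its numbers will not match those from the other tools.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_memory_mib(func):
    """Run func() and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Synthetic stand-ins for two compartment tables (hypothetical columns).
rng = np.random.default_rng(0)
n_rows = 100_000
left = pd.DataFrame(
    {"ObjectNumber": np.arange(n_rows), "feature_a": rng.random(n_rows)}
)
right = pd.DataFrame(
    {"ObjectNumber": np.arange(n_rows), "feature_b": rng.random(n_rows)}
)

merged, peak_mib = peak_memory_mib(
    lambda: left.merge(right, on="ObjectNumber", how="inner")
)
print(f"merged shape: {merged.shape}, peak traced memory: {peak_mib:.1f} MiB")
```

Running the same helper at a few different row counts gives a rough sense of how peak memory scales with input size, which may help narrow down whether a given merge is the bottleneck.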

@gwaybio

gwaybio commented Aug 16, 2022

@d33bs - does #219 solve this issue? Can we close?

@d33bs

d33bs commented Aug 19, 2022

Hi @gwaybio - I feel good about closing this issue given #219's solution. If there are other known areas where Pandas DataFrame merges are causing performance issues we could address those more specifically as they come up.

gwaybio closed this as completed Aug 19, 2022