parse_gctx.py performance improvements #12
cc @shababo
Yes, this is exactly what we observed: roughly 2m30s to parse the methylation data when saved as. There are a few axes to consider when optimizing methylation data loading:
Feel free to add to this list. For now we decided to punt on this because the current RRBS dataset is relatively small, so the maximum difference between the optimal and suboptimal approaches is smaller. If we want to address this, we can certainly design a set of experiments and run them. Open to it. I'm sure @Armandpl and I could make this happen.
We're going to be writing out
That's definitely helpful. I think that also supports the "let's come back to this later" path. It's likely we'll want to run these tests at some point, though.
I'm not sure if this is the same or a different issue from what @shababo brought up the other week, but parsing the GCTX file takes ~8x longer than loading the same data via `pd.read_table` when loading subsets of the data, and is ~2x slower when loading the full matrices. It's unclear whether the compression is the same on these files. I hadn't noticed this previously, as I typically was loading once and often loading only methylation via the `*.tsv.gz` files.

Tagging @bsiranosian @ANaka for visibility and to bring the discussion into GitHub.
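For anyone wanting to reproduce this comparison, here is a minimal, self-contained benchmarking sketch. The file names, the synthetic matrix, and the bare-bones HDF5 layout are illustrative stand-ins, not the project's actual data or parser; it only relies on the fact that GCTX files are HDF5 containers (with the data matrix stored under `0/DATA/0/matrix`), so reading a row subset via `h5py` touches only the needed chunks, whereas `pd.read_table` on a `*.tsv.gz` file must decompress and parse the whole table:

```python
import time

import h5py
import numpy as np
import pandas as pd

# Synthetic stand-in for the methylation matrix (hypothetical sizes).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.random((1000, 50)),
    index=[f"row{i}" for i in range(1000)],
    columns=[f"col{j}" for j in range(50)],
)

# Write the same matrix both ways: gzipped TSV and a minimal HDF5
# layout mimicking GCTX's 0/DATA/0/matrix dataset.
data.to_csv("matrix.tsv.gz", sep="\t", compression="gzip")
with h5py.File("matrix.h5", "w") as f:
    f.create_dataset("0/DATA/0/matrix", data=data.values, chunks=(100, 50))

def time_it(fn):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0

# TSV path: the whole file is decompressed and parsed even for a subset.
tsv, t_tsv = time_it(lambda: pd.read_table("matrix.tsv.gz", index_col=0))

def load_hdf5_subset():
    # HDF5 path: slicing reads only the chunks covering the first 100 rows.
    with h5py.File("matrix.h5", "r") as f:
        return f["0/DATA/0/matrix"][:100, :]

subset, t_h5 = time_it(load_hdf5_subset)
print(f"tsv.gz full load: {t_tsv:.4f}s, HDF5 subset load: {t_h5:.4f}s")
```

Scaling the synthetic matrix up (and varying the subset size and HDF5 chunk shape) would be one way to run the experiments discussed above without touching the real dataset.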