
parse_gctx.py performance improvements #12

Open
DavidTingley opened this issue May 23, 2022 · 4 comments

@DavidTingley

I'm not sure if this is the same issue as, or a different one from, what @shababo brought up the other week. But parsing the GCTX file takes ~8x longer than loading the same data via pd.read_table when loading subsets of the data, and it is ~2x slower when loading the full matrices. It's unclear whether the compression is the same on these files. I hadn't noticed this previously because I was typically loading only once, and often loading only the methylation via the *.tsv.gz files.

[image: screenshot of the load-time comparison]

Tagging @bsiranosian @ANaka for visibility and to bring the discussion into GitHub.
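For reference, the comparison being described is roughly the following (a sketch that assumes the GCTX file is read with cmapPy's parse_gctx and uses hypothetical file paths; it is not the exact benchmark behind the screenshot):

```python
import time

import pandas as pd
from cmapPy.pandasGEXpress import parse_gctx  # assumption: the GCTX parser in question is cmapPy's

GCTX_PATH = "methylation.gctx"   # hypothetical paths
TSV_PATH = "methylation.tsv.gz"

# Full-matrix load through the GCTX parser (returns a GCToo with .data_df).
t0 = time.perf_counter()
gctoo = parse_gctx.parse(GCTX_PATH)
print(f"parse_gctx full load: {time.perf_counter() - t0:.1f}s  shape={gctoo.data_df.shape}")

# Same matrix loaded from the tsv.gz via pandas.
t0 = time.perf_counter()
df = pd.read_table(TSV_PATH, index_col=0)
print(f"pd.read_table full load: {time.perf_counter() - t0:.1f}s  shape={df.shape}")

# Subset load (first 10 samples) through the GCTX parser.
t0 = time.perf_counter()
subset = parse_gctx.parse(GCTX_PATH, cid=list(gctoo.data_df.columns[:10]))
print(f"parse_gctx 10-column subset: {time.perf_counter() - t0:.1f}s")
```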

DavidTingley self-assigned this May 23, 2022
@ANaka

ANaka commented May 23, 2022

cc @shababo

@shababo

shababo commented May 23, 2022

Yes, this is exactly what we observed - roughly 2m30s to parse the methylation when saved as gctx. We were able to bring this down to roughly 7 seconds - IIRC - by loading the methylation matrix from an uncompressed h5 file. That said, this might still be slower than other options, e.g. loading from npy. There is also a newer file format called Zarr that is potentially faster than HDF5 for Python.
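For context, loading the full matrix from an uncompressed HDF5 file is roughly this with h5py (a sketch; the path and dataset names here are assumptions, not the actual file schema):

```python
import h5py
import pandas as pd

H5_PATH = "methylation_uncompressed.h5"  # hypothetical path and dataset names

with h5py.File(H5_PATH, "r") as f:
    matrix = f["matrix"][:]                           # full methylation matrix into memory
    row_ids = [r.decode() for r in f["row_ids"][:]]   # CpG/probe identifiers (assuming bytes storage)
    col_ids = [c.decode() for c in f["col_ids"][:]]   # sample identifiers

meth = pd.DataFrame(matrix, index=row_ids, columns=col_ids)
print(meth.shape)
```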

There are a few axes to consider when optimizing methylation data loading:

  • are we loading all of the data or a subset at each load
  • multi-threading, GPU, etc.
  • which file format we choose
  • how we associate metadata
  • whether we need to perform operations over columns/rows, and how that relates to how the data is stored

Feel free to add to this list.

For now we decided to punt on this because the current RRBS dataset is relatively small, so the maximum difference between optimal and suboptimal loading is smaller. If we want to address this, we can certainly design a set of experiments and run them. Open to it. I'm sure @Armandpl and I could make this happen.
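If we do design those experiments, a minimal timing harness could look something like this (a sketch; the paths and the two loaders shown are placeholders for whatever formats and subset patterns we decide to compare):

```python
import time
from pathlib import Path
from statistics import median

import numpy as np
import pandas as pd


def benchmark(loader, n_repeats=3):
    """Run a zero-argument loader a few times and return the median wall-clock seconds."""
    times = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        loader()
        times.append(time.perf_counter() - t0)
    return median(times)


# Hypothetical paths; each entry is one point on the format / access-pattern grid above.
TSV_PATH = Path("methylation.tsv.gz")
NPY_DIR = Path("per_sample_npy")

loaders = {
    "tsv_full_matrix": lambda: pd.read_table(TSV_PATH, index_col=0),
    "npy_10_samples": lambda: [np.load(p) for p in sorted(NPY_DIR.glob("*.npy"))[:10]],
}

for name, loader in loaders.items():
    print(f"{name}: {benchmark(loader):.2f}s")
```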

@bsiranosian

We're going to be writing out npy files for all columns (samples) of the matrix after the next iteration of @tiffanyshi's filtering script, so those will be available as individual files shortly. It doesn't solve all the issues, but it does give you more flexibility.
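For anyone picking those up, reading a per-sample column back is just a plain np.load (a sketch that assumes one .npy file per sample named by sample ID, which is a guess at the layout):

```python
from pathlib import Path

import numpy as np
import pandas as pd

NPY_DIR = Path("per_sample_npy")        # hypothetical output directory of the filtering script
sample_ids = ["sample_A", "sample_B"]   # hypothetical sample IDs

# Pull in only the samples we need and assemble them into a DataFrame.
subset = pd.DataFrame({s: np.load(NPY_DIR / f"{s}.npy") for s in sample_ids})
print(subset.shape)
```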

@shababo

shababo commented May 23, 2022

That's definitely helpful. I think that also supports the "let's come back to this later" path. It's likely we'll want to run these tests at some point though.
