
parse_gctx.py performance improvements #12

Open
DavidTingley opened this issue May 23, 2022 · 4 comments

@DavidTingley

I'm not sure if this is the same issue as, or a different one from, what @shababo brought up the other week. But parsing the GCTX file takes ~8x longer than loading the same data via pd.read_table when loading subsets of the data, and it is ~2x slower when loading the full matrices. It's unclear whether the compression is the same on these files. I hadn't noticed this previously because I was typically loading only once, and often loading only the methylation via the *.tsv.gz files.

[image: screenshot of the load-time comparison]

Tagging @bsiranosian @ANaka for visibility and to bring the discussion into GitHub.
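For reference, the comparison being described is roughly the following (a sketch that assumes the GCTX file is read with cmapPy's parse_gctx and uses hypothetical file paths; it is not the exact benchmark behind the screenshot):

```python
import time

import pandas as pd
from cmapPy.pandasGEXpress import parse_gctx  # assumption: the GCTX parser in question is cmapPy's

GCTX_PATH = "methylation.gctx"   # hypothetical paths
TSV_PATH = "methylation.tsv.gz"

# Full-matrix load through the GCTX parser (returns a GCToo with .data_df).
t0 = time.perf_counter()
gctoo = parse_gctx.parse(GCTX_PATH)
print(f"parse_gctx full load: {time.perf_counter() - t0:.1f}s  shape={gctoo.data_df.shape}")

# Same matrix loaded from the tsv.gz via pandas.
t0 = time.perf_counter()
df = pd.read_table(TSV_PATH, index_col=0)
print(f"pd.read_table full load: {time.perf_counter() - t0:.1f}s  shape={df.shape}")

# Subset load (first 10 samples) through the GCTX parser.
t0 = time.perf_counter()
subset = parse_gctx.parse(GCTX_PATH, cid=list(gctoo.data_df.columns[:10]))
print(f"parse_gctx 10-column subset: {time.perf_counter() - t0:.1f}s")
```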

DavidTingley self-assigned this May 23, 2022
@ANaka

ANaka commented May 23, 2022

cc @shababo

@shababo

shababo commented May 23, 2022

Yes, this is exactly what we observed - roughly 2m30s to parse the methylation when saved as gctx. We were able to bring this down to roughly 7 seconds - IIRC - by loading the methylation matrix from an uncompressed h5 file. That said, this might still be slower than other options, e.g. loading from npy. There is also a newer file format called Zarr that is potentially faster than HDF5 for Python.
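For context, loading the full matrix from an uncompressed HDF5 file is roughly this with h5py (a sketch; the path and dataset names here are assumptions, not the actual file schema):

```python
import h5py
import pandas as pd

H5_PATH = "methylation_uncompressed.h5"  # hypothetical path and dataset names

with h5py.File(H5_PATH, "r") as f:
    matrix = f["matrix"][:]                           # full methylation matrix into memory
    row_ids = [r.decode() for r in f["row_ids"][:]]   # CpG/probe identifiers (assuming bytes storage)
    col_ids = [c.decode() for c in f["col_ids"][:]]   # sample identifiers

meth = pd.DataFrame(matrix, index=row_ids, columns=col_ids)
print(meth.shape)
```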

There are a few axes to consider when optimizing methylation data loading:

  • are we loading all of the data or a subset at each load
  • multi-threading, GPU, etc.
  • which file format we choose
  • how we associate metadata
  • whether we need to perform operations over columns/rows, and how that relates to how the data is stored

Feel free to add to this list.

For now we decided to punt on this because the current RRBS dataset is relatively small, so the maximum difference between optimal and suboptimal loading is smaller. If we want to address this, we can certainly design a set of experiments and run them. Open to it. I'm sure @Armandpl and I could make this happen.
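If we do design those experiments, a minimal timing harness could look something like this (a sketch; the paths and the two loaders shown are placeholders for whatever formats and subset patterns we decide to compare):

```python
import time
from pathlib import Path
from statistics import median

import numpy as np
import pandas as pd


def benchmark(loader, n_repeats=3):
    """Run a zero-argument loader a few times and return the median wall-clock seconds."""
    times = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        loader()
        times.append(time.perf_counter() - t0)
    return median(times)


# Hypothetical paths; each entry is one point on the format / access-pattern grid above.
TSV_PATH = Path("methylation.tsv.gz")
NPY_DIR = Path("per_sample_npy")

loaders = {
    "tsv_full_matrix": lambda: pd.read_table(TSV_PATH, index_col=0),
    "npy_10_samples": lambda: [np.load(p) for p in sorted(NPY_DIR.glob("*.npy"))[:10]],
}

for name, loader in loaders.items():
    print(f"{name}: {benchmark(loader):.2f}s")
```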

@bsiranosian

We're going to be writing out npy files for all columns (samples) of the matrix after the next iteration of @tiffanyshi's filtering script, so those will be available as individual files shortly. It doesn't solve all the issues, but it does give you more flexibility.
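For anyone picking those up, reading a per-sample column back is just a plain np.load (a sketch that assumes one .npy file per sample named by sample ID, which is a guess at the layout):

```python
from pathlib import Path

import numpy as np
import pandas as pd

NPY_DIR = Path("per_sample_npy")        # hypothetical output directory of the filtering script
sample_ids = ["sample_A", "sample_B"]   # hypothetical sample IDs

# Pull in only the samples we need and assemble them into a DataFrame.
subset = pd.DataFrame({s: np.load(NPY_DIR / f"{s}.npy") for s in sample_ids})
print(subset.shape)
```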

@shababo

shababo commented May 23, 2022

That's definitely helpful. I think that also supports the "let's come back to this later" path. It's likely we'll want to run these tests at some point though.
