
[WIP] IO refactor #167

Merged
merged 60 commits into from
Sep 5, 2019
Conversation

ivirshup
Member

@ivirshup ivirshup commented Jun 20, 2019

This is a major reworking of the IO code, meant to make it easier to modify and to see how each element gets written to a file, by factoring each element's IO out. This also makes it easier to read one object at a time from an h5ad file. Additionally, this PR allows reading and writing dataframes to locations other than obs and var.

TODO:

  • Chunked zarr reading/ writing
  • Get zarr reading and writing working
  • Cleanup old code
  • H5AD dataframes as columnar
  • Tests for reading legacy files

Fixes #200
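The element-wise layout this PR moves toward can be sketched roughly as follows (a hedged illustration using plain h5py; `read_elem` and the file layout here are assumptions for demonstration, not anndata's actual API):

```python
import h5py
import numpy as np

# Build a tiny h5ad-like file: each element lives at its own path,
# so any one element can be read without touching the others.
with h5py.File("demo.h5ad", "w") as f:
    f["X"] = np.ones((3, 2))
    grp = f.create_group("obs")
    grp["cell_id"] = np.array([b"a", b"b", b"c"])

def read_elem(path, key):
    """Read a single element (dataset or group) from the file."""
    with h5py.File(path, "r") as f:
        item = f[key]
        if isinstance(item, h5py.Dataset):
            return item[()]
        return {k: v[()] for k, v in item.items()}

x = read_elem("demo.h5ad", "X")      # loads only "X"
obs = read_elem("demo.h5ad", "obs")  # loads only the "obs" group
```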

@Koncopd
Member

Koncopd commented Jun 20, 2019

Hi, the important thing is to use read_direct where possible, it is really important for performance.
Like here or here.

https://github.com/Koncopd/anndata-scanpy-benchmarks/blob/master/memory_issue_huge.ipynb
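For context, `read_direct` fills a preallocated buffer in place rather than allocating a new array, and `source_sel` lets you pull just a slice; a minimal sketch (file name here is illustrative):

```python
import h5py
import numpy as np

with h5py.File("direct_demo.h5", "w") as f:
    f["x"] = np.arange(100.0).reshape(10, 10)

with h5py.File("direct_demo.h5", "r") as f:
    dset = f["x"]
    # Preallocate the destination, then read the first five rows into it
    buf = np.empty((5, 10), dtype=dset.dtype)
    dset.read_direct(buf, source_sel=np.s_[0:5, :])
```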

@ivirshup
Member Author

the important thing is to use read_direct where possible, it is really important for performance.

Is this only for cases where we would have an array to overwrite anyway, or others as well? I don't see a change in performance for this simple case:

Simple hdf5 benchmark:

```python
import h5py
import numpy as np

with h5py.File("tmp.h5", "w") as f:
    f["x"] = np.random.random_sample((10000, 10000))

def read_direct(d):
    a = np.empty(d.shape, dtype=d.dtype)
    d.read_direct(a)

def read_parens(d):
    a = d[()]

def read_colon(d):
    a = d[:]

# Benchmark:
f = h5py.File("tmp.h5", "r")
d = f["x"]

# Time (IPython %timeit):

%timeit read_colon(d)
# 474 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit read_parens(d)
# 467 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit read_direct(d)
# 461 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Memory (%memit from memory_profiler; all measures fluctuated by around 20 MiB):

%memit read_colon(d)
# peak memory: 1576.33 MiB, increment: 740.04 MiB

%memit read_parens(d)
# peak memory: 1558.64 MiB, increment: 722.46 MiB

%memit read_direct(d)
# peak memory: 1583.15 MiB, increment: 746.96 MiB
```

@Koncopd
Member

Koncopd commented Jun 21, 2019

Yeah, for small examples it doesn't show any difference; I also ran into this when I tried to fix the memory issue.
Maybe you remember: scverse/scanpy#146
For really big datasets it does make a difference.

However, now I feel a bit unsure about it... The data in your benchmark should be big enough to show the difference. And if there is no difference, why did it work so well in the previous version of read_h5ad (it actually solved the memory issue)? I think we should just test this on the same huge datasets when you finish your reading function.

@ivirshup
Member Author

ivirshup commented Jun 23, 2019

I think the h5ad IO code is pretty close to done. Here are some benchmark results I've just gotten using the benchmark suite I'm working on (ivirshup/anndata-benchmarks), comparing IO speed and memory usage between the current commit on this branch and master:

Benchmark results:

```
$ asv compare c20ed18 18c730f8f

       before           after         ratio
     [c20ed187]       [18c730f8]
     <0.6.22rc1>       <io_refactor>
            6.07M            6.07M     1.00  readwrite.H5ADReadSuite.mem_readfull_object('pbmc3k_raw.h5ad')
            2.32M            2.32M     1.00  readwrite.H5ADReadSuite.mem_readfull_object('10x_pbmc68k_reduced.h5ad')
-            124M            95.5M     0.77  readwrite.H5ADReadSuite.peakmem_read_full('pbmc3k_raw.h5ad')
              77M            72.9M     0.95  readwrite.H5ADReadSuite.peakmem_read_full('10x_pbmc68k_reduced.h5ad')
-         166±3ms       77.2±0.9ms     0.46  readwrite.H5ADReadSuite.time_read_full('pbmc3k_raw.h5ad')
-         117±1ms         61.1±2ms     0.52  readwrite.H5ADReadSuite.time_read_full('10x_pbmc68k_reduced.h5ad')
  1.1476142424679256  1.2287729571466595     1.07  readwrite.H5ADReadSuite.track_read_full_memratio('pbmc3k_raw.h5ad')
  1.0346641166626822  1.0333333333333334     1.00  readwrite.H5ADReadSuite.track_read_full_memratio('10x_pbmc68k_reduced.h5ad')
-            125M             104M     0.83  readwrite.H5ADWriteSuite.peakmem_write_compressed('pbmc3k_raw.h5ad')
            79.3M            76.8M     0.97  readwrite.H5ADWriteSuite.peakmem_write_compressed('10x_pbmc68k_reduced.h5ad')
-            125M             104M     0.83  readwrite.H5ADWriteSuite.peakmem_write_full('pbmc3k_raw.h5ad')
            80.6M            74.5M     0.92  readwrite.H5ADWriteSuite.peakmem_write_full('10x_pbmc68k_reduced.h5ad')
         693±20ms         680±10ms     0.98  readwrite.H5ADWriteSuite.time_write_compressed('pbmc3k_raw.h5ad')
          139±2ms          136±2ms     0.97  readwrite.H5ADWriteSuite.time_write_compressed('10x_pbmc68k_reduced.h5ad')
         406±10ms          363±8ms    ~0.89  readwrite.H5ADWriteSuite.time_write_full('pbmc3k_raw.h5ad')
-        61.5±2ms         55.3±1ms     0.90  readwrite.H5ADWriteSuite.time_write_full('10x_pbmc68k_reduced.h5ad')
-         20.1875          13.0625     0.65  readwrite.H5ADWriteSuite.track_peakmem_write_compressed('pbmc3k_raw.h5ad')
+      3.30859375       4.91015625     1.48  readwrite.H5ADWriteSuite.track_peakmem_write_compressed('10x_pbmc68k_reduced.h5ad')
-      20.6953125         13.84375     0.67  readwrite.H5ADWriteSuite.track_peakmem_write_full('pbmc3k_raw.h5ad')
+        3.296875       4.08203125     1.24  readwrite.H5ADWriteSuite.track_peakmem_write_full('10x_pbmc68k_reduced.h5ad')
```
The + and - indicate whether a significant increase or decrease (respectively) in measured value occurred.

I still need to get set up on a better machine for benchmarks than my laptop, but I think these results would carry over. By the way, I'd appreciate any input you have on the choice of things to benchmark!

@Koncopd
Member

Koncopd commented Jun 23, 2019

If it is almost ready, I can check this on the same huge datasets.

@falexwolf
Member

This makes everything so much better! What a major improvement again! Awesome

Benchmarks: Great! And thank you @Koncopd, for pointing out the large-data memory issues!

@ivirshup ivirshup mentioned this pull request Jun 27, 2019
ivirshup added 16 commits July 29, 2019 13:45
* This removes multipledispatch as a dependency
* This looks kinda ugly
* Ideally, this would all be handled within an object, just not there yet
* Partial writing could also fit this layout
Getting this done has made some code much uglier. Consider alternatives.
Now categoricals with null values won't cause segfaults on write (I'm pretty sure), and will be read back as null. In case this fails, or if I have missed some key assumption about how categoricals work, here's an alternative implementation of `make_h5_cat_dtype`:

```python
import h5py
import numpy as np

def make_h5_cat_dtype(s):
    """Create an hdf5 enum dtype from a pandas categorical array."""
    codes = s.codes
    if -1 in codes:  # Handle null values
        codes = codes.copy()  # Might be a problem with dtype size here
        codes[codes == -1] = np.max(codes) + 1
    return h5py.special_dtype(enum=(
        s.codes.dtype,
        dict(zip(s.categories, np.unique(codes)))
    ))
```
There were a couple of things going on here. The main issue was that writing from a backed object to a new file wouldn't work if the backed matrix was sparse. This was due to (I think) some logic in writing, and (definitely) backed sparse matrices being copied when the whole matrix is sliced (i.e. `mtx[:]`).
* Addressing some comments from @flying-sheep from scverse#167 around 6566a09
* Allow comparison between backed and in memory views
* Make `raw.X` always 2d
* Make backed raw work better
* Test Raw more, especially in regards to subsetting and backed mode
* Noted need to implement views for `Raw`
@ivirshup
Member Author

Thanks for the review!

It's a lot, so I know a complete look over would be too much; there are just a few things I wanted some input on, or to make sure someone else looked at.

  • I would like to have more legacy files present for testing against. Should we use git-lfs to manage these?
  • Should I use the hdf5 enum dtype for categoricals, or just handle them the same way I do it for zarr, where the categories are stored in the attributes? This is mainly a question of internal consistency vs consistency with the storage format.
  • Dataframes on disk changed the most. Could you give that code a look over? Any cases you think might not work?
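The zarr-style layout mentioned in the second point can be sketched like this (a hedged illustration; the dataset and attribute names are assumptions, not anndata's exact on-disk format):

```python
import h5py
import pandas as pd

cat = pd.Categorical(["a", "b", "a", None])

# Store the integer codes as the dataset and the category labels
# as an attribute on it; code -1 marks a null value.
with h5py.File("cats.h5", "w") as f:
    dset = f.create_dataset("col", data=cat.codes)
    dset.attrs["categories"] = [str(c) for c in cat.categories]

with h5py.File("cats.h5", "r") as f:
    codes = f["col"][()]
    raw = f["col"].attrs["categories"]
    cats = [c.decode() if isinstance(c, bytes) else c for c in raw]

# -1 codes round-trip back to null
restored = pd.Categorical.from_codes(codes, cats)
```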

@ivirshup
Member Author

ivirshup commented Sep 1, 2019

@flying-sheep, do you think you'd be able to review those three parts sometime soon?

@flying-sheep
Member

flying-sheep commented Sep 3, 2019

Hi! I’ve been thesis writing / on holiday, but back now 😄

  • I’m for Git LFS if we can detect its absence (will there be dummy files?), and we skip those tests when it isn’t installed.
  • I’d say we should go native enum: Looks like it’s easy to do with h5py, and I don’t see much reason to stay similar to zarr, as there’s already enough difference.
  • On it!

@flying-sheep flying-sheep left a comment
Great stuff! The dataframe handling is certainly a big improvement to the stringly typed stuff we had before.

Review comments on:

  • anndata/core/anndata.py
  • anndata/readwrite/h5ad.py (outdated)
  • anndata/readwrite/utils.py
flying-sheep and others added 6 commits September 3, 2019 12:28
Remove usage of the hdf5 enum type. This makes the code a little simpler by removing some helper functions, and should make it easier to share code between zarr and hdf5 in the future. I'm also pretty sure what we're doing now is close to equivalent.
@flying-sheep
Member

I would like to remove the anndata.h5py file, since I'm pretty sure we can handle sparse matrices without also having our own File, Group, and Dataset. Since we already have to deal with indexing ourselves, I think it'll be easier to share code between backends (like zarr) as well. Do have any thoughts on this?

Sounds good. And idk if we can get rid of anndata.h5py so easily; don't we have to fix the code so `__getitem__` on a file or group gives us a `SparseDataset`?

@ivirshup
Member Author

ivirshup commented Sep 5, 2019

Thesis writing and holiday? Shouldn't those be opposites?

I’m for Git LFS if we can detect its absence (will there be dummy files?), and we skip those tests when it isn’t installed.

I think these might be important enough tests that they should always run; why would you want them to be optional? That said, it should be easy to check whether a file looks like a stub.
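Checking for a stub is indeed easy: an un-fetched Git LFS file is a small text pointer starting with a fixed header line, so a test fixture can detect it up front. A minimal sketch:

```python
from pathlib import Path

# Git LFS pointer files start with this exact header line
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_stub(path):
    """Return True if `path` looks like an un-fetched LFS pointer."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER

# Example with a fake pointer file
Path("fake.h5ad").write_bytes(LFS_HEADER + b"\noid sha256:abc\nsize 123\n")
stub = is_lfs_stub("fake.h5ad")
```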

I’d say we should go native enum: Looks like it’s easy to do with h5py, and I don’t see much reason to stay similar to zarr, as there’s already enough difference.

Haha, I only saw the comments on the code when I was making changes. I ended up representing it the same way for zarr and hdf5. It wasn't actually that easy to do with h5py: statements I thought would be equivalent caused segfaults, plus it's not well documented.

@ivirshup
Member Author

ivirshup commented Sep 5, 2019

Sounds good. And idk if we can get rid of anndata.h5py so easily; don't we have to fix the code so `__getitem__` on a file or group gives us a `SparseDataset`?

We already handle the file directly in the X accessor, so I don't think this will add much complexity. I'm probably not going to make a PR just to remove that module, but I think it'll come about as I try to make our data management more flexible.


Update:

Turns out SparseDataset is totally independent of the custom File, Group, and Dataset definitions (I commented those out and SparseDataset still works). I don't think this will be too hard.
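For reference, this works because a sparse matrix on disk is just a group of plain arrays (`data`, `indices`, `indptr`), which ordinary h5py access can read back; a hedged sketch (the group layout and names here are illustrative):

```python
import h5py
from scipy import sparse

mtx = sparse.random(20, 10, density=0.2, format="csr")

# Write the CSR components as plain datasets in a group
with h5py.File("sparse_demo.h5", "w") as f:
    g = f.create_group("X")
    g.attrs["shape"] = mtx.shape
    g["data"] = mtx.data
    g["indices"] = mtx.indices
    g["indptr"] = mtx.indptr

# Read it back with nothing but h5py and scipy
with h5py.File("sparse_demo.h5", "r") as f:
    g = f["X"]
    restored = sparse.csr_matrix(
        (g["data"][()], g["indices"][()], g["indptr"][()]),
        shape=tuple(g.attrs["shape"]),
    )
```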

@flying-sheep
Member

I ended up representing it the same way for zarr and hdf5. It wasn't actually that easy to do with h5py. Statements I thought would be identical caused segfaults plus it's not well documented.

Sounds like a good reason for going native enum after all: Subtle differences are harder to maintain than two separate, straightforward solutions, right?

Your call. Otherwise LGTM!

@ivirshup
Member Author

ivirshup commented Sep 5, 2019

Sounds like a good reason for going native enum after all:

Oh, sorry, I wasn't clear before. I meant that using h5py's enum caused segfaults and is poorly documented.

@flying-sheep
Member

Ah! OK, then let’s get this merged finally 😄

@flying-sheep flying-sheep merged commit a9b8b03 into scverse:master Sep 5, 2019
@flying-sheep
Member

Congrats and thank you for this huge effort!

@flying-sheep
Member

I slimmed down the sub-commit messages in a9b8b03 a bit; please tell me if I missed something.

@ivirshup
Member Author

Oh, I had totally missed that you rebased the commits here.

I don't think squashing this much together makes sense for a PR of this size. For instance, I was just trying to blame some lines in h5sparse that I had commented ambiguously, but now it's hard to tell when that code was written, or which commit message was meant to go with it.

I'm all for keeping a cleaner git history for the repo, and it'd make sense to squash commits which modified the same code. In this case, though, a bunch of commits to different parts of the code were squashed together, which makes it very difficult to tell which part of the commit message relates to the code I blamed. I also think the author should be the one to clean up the history, unless there's a very thorough review of all parts of the code: they're in the best place to know what documentation (in the form of commit messages) should be kept, and what code it relates to.

@flying-sheep
Member

flying-sheep commented Sep 23, 2019

You’re right, I should probably have waited for you to rebase this into something clean.

It would have been hard though: a lot of the fixes weren't in an order where they could be squashed into their preceding commits, so you'd have had to reorder a lot, with potential for merge conflicts during the rebase 😨

I’m sorry!

Successfully merging this pull request may close these issues.

reading categorical with only one category