
[WIP] IO refactor #167

Merged
merged 60 commits into from
Sep 5, 2019
Conversation

ivirshup
Member

@ivirshup ivirshup commented Jun 20, 2019

This is a major reworking of the IO code, meant to make it easier to modify and to see how each element gets written to a file, by factoring each element's IO out. This also makes it easier to read one object at a time from an h5ad file. Additionally, this PR allows reading and writing dataframes to locations other than obs and var.

TODO:

  • Chunked zarr reading/ writing
  • Get zarr reading and writing working
  • Cleanup old code
  • H5AD dataframes as columnar
  • Tests for reading legacy files

Fixes #200
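The element-wise layout this PR moves toward can be sketched roughly as follows (a hedged illustration using plain h5py; `read_elem` and the file layout here are assumptions for demonstration, not anndata's actual API):

```python
import h5py
import numpy as np

# Build a tiny h5ad-like file: each element lives at its own path,
# so any one element can be read without touching the others.
with h5py.File("demo.h5ad", "w") as f:
    f["X"] = np.ones((3, 2))
    grp = f.create_group("obs")
    grp["cell_id"] = np.array([b"a", b"b", b"c"])

def read_elem(path, key):
    """Read a single element (dataset or group) from the file."""
    with h5py.File(path, "r") as f:
        item = f[key]
        if isinstance(item, h5py.Dataset):
            return item[()]
        return {k: v[()] for k, v in item.items()}

x = read_elem("demo.h5ad", "X")      # loads only "X"
obs = read_elem("demo.h5ad", "obs")  # loads only the "obs" group
```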

@Koncopd
Member

Koncopd commented Jun 20, 2019

Hi, the important thing is to use read_direct where possible, it is really important for performance.
Like here or here.

https://github.com/Koncopd/anndata-scanpy-benchmarks/blob/master/memory_issue_huge.ipynb
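For context, `read_direct` fills a preallocated buffer in place rather than allocating a new array, and `source_sel` lets you pull just a slice; a minimal sketch (file name here is illustrative):

```python
import h5py
import numpy as np

with h5py.File("direct_demo.h5", "w") as f:
    f["x"] = np.arange(100.0).reshape(10, 10)

with h5py.File("direct_demo.h5", "r") as f:
    dset = f["x"]
    # Preallocate the destination, then read the first five rows into it
    buf = np.empty((5, 10), dtype=dset.dtype)
    dset.read_direct(buf, source_sel=np.s_[0:5, :])
```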

@ivirshup
Member Author

the important thing is to use read_direct where possible, it is really important for performance.

Is this only for cases where we would have an array to overwrite anyway, or others as well? I don't see a change in performance for this simple case:

Simple hdf5 benchmark:

```python
import h5py
import numpy as np

with h5py.File("tmp.h5", "w") as f:
    f["x"] = np.random.random_sample((10000, 10000))

def read_direct(d):
    a = np.empty(d.shape, dtype=d.dtype)
    d.read_direct(a)

def read_parens(d):
    a = d[()]

def read_colon(d):
    a = d[:]

# Benchmark:
f = h5py.File("tmp.h5", "r")
d = f["x"]

# Time (IPython %timeit):

%timeit read_colon(d)
# 474 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit read_parens(d)
# 467 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit read_direct(d)
# 461 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Memory (%memit from memory_profiler; all measures fluctuated by around 20 MiB):

%memit read_colon(d)
# peak memory: 1576.33 MiB, increment: 740.04 MiB

%memit read_parens(d)
# peak memory: 1558.64 MiB, increment: 722.46 MiB

%memit read_direct(d)
# peak memory: 1583.15 MiB, increment: 746.96 MiB
```

@Koncopd
Member

Koncopd commented Jun 21, 2019

Yeah, for small examples it doesn't show any difference; I also ran into this when I tried to fix the memory issue.
Maybe you remember: scverse/scanpy#146
For really big datasets it does make a difference.

However, now I feel a bit unsure about it... The data in your benchmark should be big enough to show the difference. And if there is no difference, why did it work so well in the previous version of read_h5ad (it actually solved the memory issue)? I think we should just test this on the same huge datasets when you finish your reading function.

@ivirshup
Member Author

ivirshup commented Jun 23, 2019

I think the h5ad IO code is pretty close to done. Here are some benchmark results I've just gotten using the benchmark suite I'm working on (ivirshup/anndata-benchmarks), comparing IO speed and memory usage between the current commit on this branch and master:

Benchmark results:

```
$ asv compare c20ed18 18c730f8f

       before           after         ratio
     [c20ed187]       [18c730f8]
     <0.6.22rc1>       <io_refactor>
            6.07M            6.07M     1.00  readwrite.H5ADReadSuite.mem_readfull_object('pbmc3k_raw.h5ad')
            2.32M            2.32M     1.00  readwrite.H5ADReadSuite.mem_readfull_object('10x_pbmc68k_reduced.h5ad')
-            124M            95.5M     0.77  readwrite.H5ADReadSuite.peakmem_read_full('pbmc3k_raw.h5ad')
              77M            72.9M     0.95  readwrite.H5ADReadSuite.peakmem_read_full('10x_pbmc68k_reduced.h5ad')
-         166±3ms       77.2±0.9ms     0.46  readwrite.H5ADReadSuite.time_read_full('pbmc3k_raw.h5ad')
-         117±1ms         61.1±2ms     0.52  readwrite.H5ADReadSuite.time_read_full('10x_pbmc68k_reduced.h5ad')
  1.1476142424679256  1.2287729571466595     1.07  readwrite.H5ADReadSuite.track_read_full_memratio('pbmc3k_raw.h5ad')
  1.0346641166626822  1.0333333333333334     1.00  readwrite.H5ADReadSuite.track_read_full_memratio('10x_pbmc68k_reduced.h5ad')
-            125M             104M     0.83  readwrite.H5ADWriteSuite.peakmem_write_compressed('pbmc3k_raw.h5ad')
            79.3M            76.8M     0.97  readwrite.H5ADWriteSuite.peakmem_write_compressed('10x_pbmc68k_reduced.h5ad')
-            125M             104M     0.83  readwrite.H5ADWriteSuite.peakmem_write_full('pbmc3k_raw.h5ad')
            80.6M            74.5M     0.92  readwrite.H5ADWriteSuite.peakmem_write_full('10x_pbmc68k_reduced.h5ad')
         693±20ms         680±10ms     0.98  readwrite.H5ADWriteSuite.time_write_compressed('pbmc3k_raw.h5ad')
          139±2ms          136±2ms     0.97  readwrite.H5ADWriteSuite.time_write_compressed('10x_pbmc68k_reduced.h5ad')
         406±10ms          363±8ms    ~0.89  readwrite.H5ADWriteSuite.time_write_full('pbmc3k_raw.h5ad')
-        61.5±2ms         55.3±1ms     0.90  readwrite.H5ADWriteSuite.time_write_full('10x_pbmc68k_reduced.h5ad')
-         20.1875          13.0625     0.65  readwrite.H5ADWriteSuite.track_peakmem_write_compressed('pbmc3k_raw.h5ad')
+      3.30859375       4.91015625     1.48  readwrite.H5ADWriteSuite.track_peakmem_write_compressed('10x_pbmc68k_reduced.h5ad')
-      20.6953125         13.84375     0.67  readwrite.H5ADWriteSuite.track_peakmem_write_full('pbmc3k_raw.h5ad')
+        3.296875       4.08203125     1.24  readwrite.H5ADWriteSuite.track_peakmem_write_full('10x_pbmc68k_reduced.h5ad')
```
The + and - indicate whether a significant increase or decrease (respectively) in measured value occurred.

I still need to get set up on a better machine for benchmarks than my laptop, but I think these results would carry over. By the way, I'd appreciate any input you have on the choice of things to benchmark!

@Koncopd
Member

Koncopd commented Jun 23, 2019

If it is almost ready, I can check this on the same huge datasets.

@falexwolf
Member

This makes everything so much better! What a major improvement again! Awesome

Benchmarks: Great! And thank you @Koncopd, for pointing out the large-data memory issues!

@ivirshup ivirshup mentioned this pull request Jun 27, 2019
ivirshup added 16 commits July 29, 2019 13:45
* This removes multipledispatch as a dependency
* This looks kinda ugly
* Ideally, this would all be handled within an object, just not there yet
* Partial writing could also fit this layout
Getting this done has made some code much uglier. Consider alternatives.
Now categoricals with null values won't cause segfaults on write (I'm pretty sure), and will be read back as null. In case this fails, or if I have missed some key assumption about how categoricals work, here's an alternative implementation of `make_h5_cat_dtype`:

```python
import h5py
import numpy as np

def make_h5_cat_dtype(s):
    """Create an hdf5 enum dtype from a pandas categorical array."""
    codes = s.codes
    if -1 in codes:  # Handle null values
        codes = codes.copy()  # Might be a problem with dtype size here
        codes[codes == -1] = np.max(codes) + 1
    return h5py.special_dtype(enum=(
        s.codes.dtype,
        dict(zip(s.categories, np.unique(codes)))
    ))
```
There were a couple of things going on here. The main issue was that writing from a backed object to a new file wouldn't work if the backed matrix was sparse. This was due to (I think) some logic in writing, and (definitely) backed sparse matrices being copied when the whole matrix is sliced (i.e. `mtx[:]`).
* Addressing some comments from @flying-sheep from scverse#167 around 6566a09
* Allow comparison between backed and in memory views
* Make `raw.X` always 2d
* Make backed raw work better
* Test Raw more, especially in regards to subsetting and backed mode
* Noted need to implement views for `Raw`
@ivirshup
Member Author

Thanks for the review!

It's a lot, so I know a complete look over would be too much; there are just a few things I wanted some input on, or to make sure someone else looked at.

  • I would like to have more legacy files present for testing against. Should we use git-lfs to manage these?
  • Should I use the hdf5 enum dtype for categoricals, or just handle them the same way I do it for zarr, where the categories are stored in the attributes? This is mainly a question of internal consistency vs consistency with the storage format.
  • Dataframes on disk changed the most. Could you give that code a look over? Any cases you think might not work?
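The zarr-style layout mentioned in the second point can be sketched like this (a hedged illustration; the dataset and attribute names are assumptions, not anndata's exact on-disk format):

```python
import h5py
import pandas as pd

cat = pd.Categorical(["a", "b", "a", None])

# Store the integer codes as the dataset and the category labels
# as an attribute on it; code -1 marks a null value.
with h5py.File("cats.h5", "w") as f:
    dset = f.create_dataset("col", data=cat.codes)
    dset.attrs["categories"] = [str(c) for c in cat.categories]

with h5py.File("cats.h5", "r") as f:
    codes = f["col"][()]
    raw = f["col"].attrs["categories"]
    cats = [c.decode() if isinstance(c, bytes) else c for c in raw]

# -1 codes round-trip back to null
restored = pd.Categorical.from_codes(codes, cats)
```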

@ivirshup
Member Author

ivirshup commented Sep 1, 2019

@flying-sheep, do you think you'd be able to review those three parts sometime soon?

@flying-sheep
Member

flying-sheep commented Sep 3, 2019

Hi! I’ve been thesis writing / on holiday, but back now 😄

  • I’m for Git LFS if we can detect its absence (will there be dummy files?), and we skip those tests when it isn’t installed.
  • I’d say we should go native enum: Looks like it’s easy to do with h5py, and I don’t see much reason to stay similar to zarr, as there’s already enough difference.
  • On it!

@flying-sheep flying-sheep left a comment
Great stuff! The dataframe handling is certainly a big improvement to the stringly typed stuff we had before.

Review comments on:

  • anndata/core/anndata.py
  • anndata/readwrite/h5ad.py (outdated)
  • anndata/readwrite/utils.py
flying-sheep and others added 6 commits September 3, 2019 12:28
Remove usage of the hdf5 enum type. This makes the code a little simpler by removing some helper functions, and should make it easier to share code between zarr and hdf5 in the future. I'm also pretty sure what we're doing now is close to equivalent.
@flying-sheep
Member

I would like to remove the anndata.h5py file, since I'm pretty sure we can handle sparse matrices without also having our own File, Group, and Dataset. Since we already have to deal with indexing ourselves, I think it'll be easier to share code between backends (like zarr) as well. Do have any thoughts on this?

Sounds good. And idk if we can get rid of anndata.h5py so easily; don't we have to fix the code so `__getitem__` on a file or group gives us a `SparseDataset`?

@ivirshup
Member Author

ivirshup commented Sep 5, 2019

Thesis writing and holiday? Shouldn't those be opposites?

I’m for Git LFS if we can detect its absence (will there be dummy files?), and we skip those tests when it isn’t installed.

I think these might be important enough tests that they should always run; why would you want them to be optional? That said, it should be easy to check whether a file looks like a stub.
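Checking for a stub is indeed easy: an un-fetched Git LFS file is a small text pointer starting with a fixed header line, so a test fixture can detect it up front. A minimal sketch:

```python
from pathlib import Path

# Git LFS pointer files start with this exact header line
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_stub(path):
    """Return True if `path` looks like an un-fetched LFS pointer."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER

# Example with a fake pointer file
Path("fake.h5ad").write_bytes(LFS_HEADER + b"\noid sha256:abc\nsize 123\n")
stub = is_lfs_stub("fake.h5ad")
```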

I’d say we should go native enum: Looks like it’s easy to do with h5py, and I don’t see much reason to stay similar to zarr, as there’s already enough difference.

Haha, I only saw the comments on the code when I was making changes. I ended up representing it the same way for zarr and hdf5. It wasn't actually that easy to do with h5py: statements I thought would be equivalent caused segfaults, plus it's not well documented.

@ivirshup
Member Author

ivirshup commented Sep 5, 2019

Sounds good. And idk if we can get rid of anndata.h5py so easily; don't we have to fix the code so `__getitem__` on a file or group gives us a `SparseDataset`?

We already handle the file directly in the X accessor, so I don't think this will add much complexity. I'm probably not going to make a PR just to remove that module, but I think it'll come about as I try to make our data management more flexible.


Update:

Turns out SparseDataset is totally independent of the custom File, Group, and Dataset definitions (I commented those out and SparseDataset still works). I don't think this will be too hard.
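For reference, this works because a sparse matrix on disk is just a group of plain arrays (`data`, `indices`, `indptr`), which ordinary h5py access can read back; a hedged sketch (the group layout and names here are illustrative):

```python
import h5py
from scipy import sparse

mtx = sparse.random(20, 10, density=0.2, format="csr")

# Write the CSR components as plain datasets in a group
with h5py.File("sparse_demo.h5", "w") as f:
    g = f.create_group("X")
    g.attrs["shape"] = mtx.shape
    g["data"] = mtx.data
    g["indices"] = mtx.indices
    g["indptr"] = mtx.indptr

# Read it back with nothing but h5py and scipy
with h5py.File("sparse_demo.h5", "r") as f:
    g = f["X"]
    restored = sparse.csr_matrix(
        (g["data"][()], g["indices"][()], g["indptr"][()]),
        shape=tuple(g.attrs["shape"]),
    )
```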

@flying-sheep
Member

I ended up representing it the same way for zarr and hdf5. It wasn't actually that easy to do with h5py. Statements I thought would be identical caused segfaults plus it's not well documented.

Sounds like a good reason for going native enum after all: Subtle differences are harder to maintain than two separate, straightforward solutions, right?

Your call. Otherwise LGTM!

@ivirshup
Member Author

ivirshup commented Sep 5, 2019

Sounds like a good reason for going native enum after all:

Oh, sorry, I wasn't clear before. I meant that using h5py's enum caused segfaults and is poorly documented.

@flying-sheep
Member

Ah! OK, then let’s get this merged finally 😄

@flying-sheep flying-sheep merged commit a9b8b03 into scverse:master Sep 5, 2019
@flying-sheep
Member

Congrats and thank you for this huge effort!

@flying-sheep
Member

I slimmed down the sub-commit messages in a9b8b03 a bit; please tell me if I missed something.

@ivirshup
Member Author

Oh, I had totally missed that you rebased the commits here.

I don't think squashing this much together makes sense for a PR of this size. For instance, I was just trying to blame some lines in h5sparse that I had commented ambiguously, but now it's hard to tell when that code was written, or which commit message was meant to go with it.

I'm all for keeping a cleaner git history for the repo, and it'd make sense to squash commits which modified the same code. In this case, though, a bunch of commits to different parts of the code were squashed together, which makes it very difficult to tell which part of the commit message relates to the code I blamed. I also think the author should be the one to clean up the history, unless there's a very thorough review of all parts of the code: they're in the best place to know what documentation (in the form of commit messages) should be kept, and what code it relates to.

@flying-sheep
Member

flying-sheep commented Sep 23, 2019

You’re right, I should probably have waited for you to rebase this into something clean.

It would have been hard though: a lot of the fixes weren't in an order where they could be squashed into their preceding commits, so you'd have had to reorder a lot, with potential for merge conflicts during the rebase 😨

I’m sorry!

Successfully merging this pull request may close these issues.

reading categorical with only one category