[WIP] IO refactor #167
Conversation
Is this only for cases where we would have an array to overwrite anyway, or others as well? I don't see a change in performance for this simple case:

Simple hdf5 benchmark:

```python
import h5py
import numpy as np

with h5py.File("tmp.h5") as f:
    f["x"] = np.random.random_sample((10000, 10000))

def read_direct(d):
    a = np.empty(d.shape, dtype=d.dtype)
    d.read_direct(a)

def read_parens(d):
    a = d[()]

def read_colon(d):
    a = d[:]

# Benchmark:
f = h5py.File("tmp.h5", "r")
d = f["x"]

# Time:
%timeit read_colon(d)
# 474 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit read_parens(d)
# 467 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit read_direct(d)
# 461 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Memory (all measures fluctuated by around 20 MiB):
%memit read_colon(d)
# peak memory: 1576.33 MiB, increment: 740.04 MiB
%memit read_parens(d)
# peak memory: 1558.64 MiB, increment: 722.46 MiB
%memit read_direct(d)
# peak memory: 1583.15 MiB, increment: 746.96 MiB
```
Yeah, for small examples it doesn't show any difference; I also ran into this when I tried to fix the memory issue. However, now I feel a bit unsure about it… The data in your benchmark should be big enough to show the difference. And if there is no difference, why did it work so well in the previous version of …
I think the h5ad io code is pretty close to done. Here are some benchmark results I've just gotten using the benchmark suite I'm working on (ivirshup/anndata-benchmarks), comparing io speed and memory usage from the current commit on this branch to master:

Benchmark results

I still need to get set up on a better machine for benchmarks than my laptop, but I think these results would carry over. By the way, I'd appreciate any input you had on the choices of things to benchmark!
If it is almost ready, I can check this on the same huge datasets.
This makes everything so much better! What a major improvement again! Awesome! Benchmarks: great! And thank you @Koncopd for pointing out the large-data memory issues!
* This removes multipledispatch as a dependency
* This looks kinda ugly
* Ideally, this would all be handled within an object, just not there yet
* Partial writing could also fit this layout
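Since `multipledispatch` is gone, I assume dispatch is now keyed on each element's type by hand. A minimal sketch of what such a registry could look like (names and structure here are hypothetical, not the PR's actual code):

```python
import h5py
import numpy as np

# Hypothetical writer registry: maps a Python type to the function
# that knows how to write that type into an HDF5 group.
H5AD_WRITERS = {}

def register_writer(typ):
    def decorator(func):
        H5AD_WRITERS[typ] = func
        return func
    return decorator

@register_writer(np.ndarray)
def write_array(group: h5py.Group, key: str, value: np.ndarray) -> None:
    group.create_dataset(key, data=value)

def write_elem(group: h5py.Group, key: str, value) -> None:
    # Look the writer up by exact type and delegate to it.
    try:
        writer = H5AD_WRITERS[type(value)]
    except KeyError:
        raise TypeError(f"No writer registered for {type(value)!r}")
    writer(group, key, value)
```

Wrapping the registry and `write_elem` in a class would be the "handled within an object" variant the note above mentions.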
Getting this done has made some code much uglier. Consider alternatives.
Now categoricals with null values won't cause segfaults on write (I'm pretty sure), and will be read back as null. In case this fails, or if I have missed some key assumption about how categoricals work, here's an alternative implementation for `make_h5_cat_dtype`:

```python
def make_h5_cat_dtype(s):
    """This creates a hdf5 enum dtype based on a pandas categorical array."""
    codes = s.codes
    if -1 in codes:  # Handle null values
        codes = codes.copy()  # Might be a problem with dtype size here
        codes[codes == -1] = np.max(codes) + 1
    return h5py.special_dtype(enum=(
        s.codes.dtype,
        dict(zip(s.categories, np.unique(codes)))
    ))
```
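For reference, a sketch of how such an enum dtype could be exercised end to end (a minimal example under my own assumptions; `write_categorical` and the round-trip below are not from the PR, and it relies on `make_h5_cat_dtype` as defined above):

```python
import h5py
import pandas as pd

def write_categorical(f: h5py.File, key: str, s: pd.Categorical) -> None:
    # Write the integer codes; the enum dtype carries the label mapping.
    f.create_dataset(key, data=s.codes, dtype=make_h5_cat_dtype(s))

s = pd.Categorical(["a", "b", "a"])
with h5py.File("tmp.h5", "w") as f:
    write_categorical(f, "cat", s)

with h5py.File("tmp.h5", "r") as f:
    dset = f["cat"]
    mapping = h5py.check_dtype(enum=dset.dtype)  # e.g. {"a": 0, "b": 1}
    inverse = {v: k for k, v in mapping.items()}
    labels = [inverse[c] for c in dset[...]]     # ["a", "b", "a"]
```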
There were a couple of things going on here. The main issue was that if we tried to write from a backed object to a new file, it wouldn't work if the backed matrix was sparse. This was due to (I think) some logic in writing and (definitely) backed sparse matrices attempting to be copied when the whole matrix is sliced (i.e. `mtx[:]`).
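A hypothetical illustration of the fix direction (not the PR's actual code): copy the backed CSR arrays dataset-by-dataset instead of going through `mtx[:]`, which materializes the whole matrix in memory first.

```python
import h5py

def copy_backed_sparse(src: h5py.Group, dst: h5py.Group) -> None:
    # Copy the three CSR component arrays directly; HDF5 streams the
    # data without building a scipy matrix in memory.
    for key in ("data", "indices", "indptr"):
        src.copy(src[key], dst, name=key)
    # Attribute names follow the h5sparse layout; an assumption here.
    for attr in ("h5sparse_format", "h5sparse_shape"):
        if attr in src.attrs:
            dst.attrs[attr] = src.attrs[attr]
```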
* Addressing some comments from @flyingsheep from scverse#167 around 6566a09
* Allow comparison between backed and in-memory views
* Make `raw.X` always 2d
* Make backed raw work better
* Test `Raw` more, especially in regards to subsetting and backed mode
* Noted need to implement views for `Raw`
Thanks for the review! It's a lot, so I know a complete look-over would be too much; there are just a few things I wanted some input on, or to make sure someone else looked at.
@flying-sheep, do you think you'd be able to review those three parts sometime soon?
Hi! I’ve been thesis writing / on holiday, but back now 😄
Great stuff! The dataframe handling is certainly a big improvement over the stringly typed stuff we had before.
Remove usage of the hdf5 enum type. This makes the code a little simpler by removing some helper functions, and should make it easier to share code between zarr and hdf5 in the future. I'm also pretty sure what we're doing now is close to equivalent.
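My reading of the enum-free representation (a sketch under that assumption, not the PR's exact code): store the integer codes as the dataset and keep the labels alongside it, which maps onto zarr just as well as hdf5.

```python
import h5py
import numpy as np
import pandas as pd

def write_categorical(group: h5py.Group, key: str, s: pd.Categorical) -> None:
    dset = group.create_dataset(key, data=s.codes.astype(np.int32))
    # Labels live in an attribute next to the codes.
    dset.attrs["categories"] = [str(c) for c in s.categories]

def read_categorical(group: h5py.Group, key: str) -> pd.Categorical:
    dset = group[key]
    # from_codes treats -1 codes as missing values.
    return pd.Categorical.from_codes(
        dset[...], categories=[c for c in dset.attrs["categories"]]
    )
```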
Sounds good. And idk if we can get rid of …
Thesis writing and holiday? Shouldn't those be opposites?
I think these tests might be important enough that they should always run. Why would you want them to be optional? That said, it should be easy to see if the file looks like a stub.
Haha, I only saw the comments on the code when I was making changes. I ended up representing it the same way for zarr and hdf5. It wasn't actually that easy to do with …
We already handle the file directly in the …

Update: Turns out …
Sounds like a good reason for going with the native enum after all: subtle differences are harder to maintain than two separate, straightforward solutions, right? Your call. Otherwise LGTM!
Oh, sorry, I wasn't clear before. I meant to say using …
Ah! OK, then let’s get this merged finally 😄
Congrats and thank you for this huge effort!
I slimmed down the sub-commit messages in a9b8b03 a bit; please tell me if I missed something.
Oh, I had totally missed that you rebased the commits here. I don't think squashing this much together makes sense for a PR of this size. For instance, I was just trying to blame some lines in h5sparse that I commented ambiguously, but now it's hard to tell when this code was written, or what commit message was intended to go with it.

I'm all for attempting to keep a cleaner git history for the repo. It'd make sense to squash commits which modified the same code. In this case a bunch of commits to different parts of the code were squashed together, which makes it very difficult to tell which part of the commit message relates to the code I blamed.

I also think the author should be the one to clean the history, unless there's a very thorough review of all parts of the code. They're in the best place to know what documentation (in the form of commit messages) should be kept, and what code it's related to.
You’re right, I should probably have waited for you to rebase this into something clean. It would have been hard, though: a lot of the fixes weren't necessarily in an order where they could be squashed into preceding commits, so you'd have had to reorder a lot, with potential for merge conflicts during the rebase 😨 I'm sorry!
This is a major reworking of the IO which is meant to make it easier to modify, and to figure out how each element gets written to a file, by factoring the per-element readers and writers out. This makes it easier to read one object at a time from an h5ad file. Additionally, this PR allows reading and writing of dataframes to locations other than `obs` and `var`.

TODO:

Fixes #200
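To make the "one object at a time" point concrete, a rough sketch with plain h5py (the element path below is a made-up example, not the PR's API):

```python
import h5py
import pandas as pd

# Pull a single dataframe element out of an .h5ad file, leaving X and
# everything else untouched on disk.
with h5py.File("adata.h5ad", "r") as f:
    grp = f["uns/results_df"]  # assumed location of a stored dataframe
    df = pd.DataFrame({col: grp[col][...] for col in grp.keys()})
```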