insert method #47
I have been thinking about this a bit. If the insert is a full chunk's worth of data at a chunk boundary (say, 1000 elements with chunk size 1000), then all the chunks after it, which are just shifted by 1000, can be reused. But I think we could do the same thing with inserts that do not align to the chunk size, if those chunks are stored contiguously in the raw dataset.
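A minimal sketch of the aligned case, assuming a simple virtual-to-raw chunk map (the names here are hypothetical, not versioned-hdf5's actual data structures):

```python
# Sketch: chunk reuse for an insert that is aligned to the chunk size.
# `chunk_map` maps a virtual chunk index to a raw chunk id; all names
# here are hypothetical, not the actual versioned-hdf5 data structures.

CHUNK_SIZE = 1000

def insert_aligned(chunk_map, insert_chunk, new_raw_chunk):
    """Insert one full chunk at virtual chunk index `insert_chunk`.

    Chunks before the insertion point are untouched; chunks after it
    point at the same raw chunks as before, just shifted by one chunk
    (1000 elements), so no raw data needs to be rewritten.
    """
    new_map = {}
    for i, raw in chunk_map.items():
        new_map[i if i < insert_chunk else i + 1] = raw
    new_map[insert_chunk] = new_raw_chunk
    return new_map

# Example: a 5000-element dataset (5 chunks), inserting a new chunk at
# position 2000 (virtual chunk index 2). Raw chunks r2-r4 are reused.
old = {0: "r0", 1: "r1", 2: "r2", 3: "r3", 4: "r4"}
print(insert_aligned(old, 2, "r5"))
# {0: 'r0', 1: 'r1', 3: 'r2', 4: 'r3', 5: 'r4', 2: 'r5'}
```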
So if I understand correctly, the idea is to insert full chunks, which may not be used fully. So the chunks in the raw dataset would no longer all correspond to full chunks of data.

I think we might be able to support this. Now that we have ndindex, it should be easier to do some of the slice calculations on offset chunks like this.

But it still opens a question. After doing many inserts like this, there will be a lot of empty blocks in the raw data. This is particularly exacerbated for multidimensional datasets. Currently, for a multidimensional dataset any "edge" chunk can be nonfull: for instance, with a chunk size of, say, (100, 100), every chunk along the trailing edge of each dimension can be partially empty, not just the single last chunk. Offset inserts would add partially empty chunks in the interior as well.

But perhaps with compression enabled, this is not a real issue? I'm unsure how compression works in HDF5, so we would need to test this.

There's also the question of performance. The time to read/write a slice of a dataset corresponds to the number of chunks that the slice overlaps. Presently this is easy to predict, because you can just look at the chunk size and the slice: for example, with chunk size 1000 you know that a[0:1000] touches exactly one chunk. Once chunks can sit at an offset in the raw dataset, the same slice may straddle two raw chunks.
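A minimal sketch of the offset slice arithmetic, in plain Python (in versioned-hdf5 this kind of calculation would presumably be done with ndindex; the function below is hypothetical):

```python
# Sketch: counting which raw chunks a virtual slice touches when the
# virtual data sits at an offset inside the raw dataset.

def raw_chunks_touched(start, stop, chunk_size, offset):
    """Raw chunk indices overlapped by the virtual slice [start, stop),
    assuming the virtual data begins `offset` elements into the raw
    dataset (offset == 0 is today's aligned layout)."""
    first = (start + offset) // chunk_size
    last = (stop - 1 + offset) // chunk_size
    return list(range(first, last + 1))

# Aligned: a[0:1000] touches exactly one chunk, predictable from the
# chunk size and the slice alone.
print(raw_chunks_touched(0, 1000, 1000, offset=0))    # [0]

# Offset by 500: the same slice now straddles two raw chunks.
print(raw_chunks_touched(0, 1000, 1000, offset=500))  # [0, 1]
```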
I think the virtual chunks could still be of full size (except for the last). This relies on the fact that, in this case, the raw chunks happen to be contiguous because of the original write. PS: I was only thinking about one dimension; I don't think this generalizes to more dimensions.
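A sketch of that one-dimensional case, under the assumption of a single unaligned insert at the front (names and layout illustrative only, not the actual versioned-hdf5 virtual dataset mapping):

```python
# Sketch: after an unaligned 1-D insert of L elements at the front, a
# full-size virtual chunk starting at virtual position v is backed by
# the contiguous raw slice [v - L, v - L + C). Contiguity holds only
# because the original write laid the raw chunks out back to back.

C = 1000  # chunk size
L = 500   # inserted elements (not a multiple of C)

def raw_source(v):
    """Contiguous raw slice backing the full virtual chunk [v, v + C),
    for chunks entirely after the inserted region."""
    return (v - L, v - L + C)

# Virtual chunk [1000, 2000) reads raw [500, 1500): one contiguous
# span, but it straddles raw chunks [0, 1000) and [1000, 2000).
print(raw_source(1000))  # (500, 1500)
```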
So with that, things would no longer be chunk aligned in the raw dataset. I think we could do this, but we originally made the data chunks match the HDF5 chunks so that we would get the best performance. If we implement something like this, we would need to see how it affects performance.
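One way to measure it would be a micro-benchmark along these lines, timing a chunk-aligned read against an offset read of the same length, since the offset read has to touch two HDF5 chunks instead of one (file name and sizes are arbitrary):

```python
# Sketch of the suggested performance check, using plain h5py.
import timeit

import h5py
import numpy as np

with h5py.File("bench.h5", "w") as f:
    f.create_dataset("x", data=np.arange(10_000_000), chunks=(1000,))

with h5py.File("bench.h5", "r") as f:
    d = f["x"]
    aligned = timeit.timeit(lambda: d[0:1000], number=1000)
    offset = timeit.timeit(lambda: d[500:1500], number=1000)
    print(f"aligned: {aligned:.3f}s, offset: {offset:.3f}s")
```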
It could be useful to have an insert method on the dataset objects, since we can do inserts more efficiently than the naive way, by only reading in the chunks that are actually going to change.
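A rough sketch of the difference, using a plain in-memory list of chunks rather than the real dataset objects (all names here are hypothetical):

```python
# Sketch: chunk-aware insert vs. the naive approach. The naive way
# reads the whole dataset, inserts, and writes everything back; the
# chunk-aware way leaves chunks before the insertion point untouched
# and only re-chunks the data from the insertion point onward.
import numpy as np

CHUNK = 1000

def insert_naive(data, pos, values):
    # Touches every chunk: whole dataset read, shifted, and rewritten.
    return np.insert(data, pos, values)

def insert_chunkwise(chunks, pos, values):
    """`chunks` is a list of 1-D arrays of length CHUNK (the last may
    be shorter). Chunks strictly before the one containing `pos` are
    reused as-is; only the tail is read in and re-chunked."""
    first = pos // CHUNK
    untouched = chunks[:first]
    tail = np.concatenate(chunks[first:])
    tail = np.insert(tail, pos - first * CHUNK, values)
    rechunked = [tail[i:i + CHUNK] for i in range(0, len(tail), CHUNK)]
    return untouched + rechunked

data = np.arange(5000)
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
out = insert_chunkwise(chunks, 3500, np.array([-1, -2]))
assert np.array_equal(np.concatenate(out),
                      insert_naive(data, 3500, [-1, -2]))
```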