Filling large DelayedArray with iid standard normals #89

Closed
ekernf01 opened this issue Apr 3, 2021 · 5 comments

Comments


ekernf01 commented Apr 3, 2021

Hi DelayedArray devs, how would you fill a 1000 by 100,000 DelayedArray or HDF5Array with iid standard Normal draws? Here's what I have tried.

  • Sample it all at once and pass it to the constructor. This runs out of memory.
  • Fill one column at a time. The tree gets really big, and the C stack error comes up ("Error: C stack usage ... is too close to the limit").
  • Fill one column at a time, but periodically call simplify. Seems to make no difference to the too-big tree.
  • Fill one column at a time, but periodically call realize with HDF5Array backend. This works but seems slow.

Thanks in advance for considering this request. The package is awesome -- easy to use and valuable.

Contributor

hpages commented Apr 3, 2021

Hi @ekernf01 ,

You want to write your own arbitrary data to an HDF5 file. This doesn't need to involve DelayedArray objects and can be easily achieved with plain use of the rhdf5 package. However, the DelayedArray/HDF5Array framework provides RealizationSink objects to make this more convenient, and to abstract away the details of the particular backend being used (e.g. HDF5 file or TileDB). This helps make the code simpler, easier to understand, and portable across backends.

See ?write_block in the DelayedArray package for more information. I think that the first example (USING THE "RealizationSink API": EXAMPLE 1) in the examples section does something close to what you are trying to achieve, so hopefully it will be easy to adapt to your particular use case.

H.
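
For readers following along, here is a minimal sketch of the block-wise approach described above. It is loosely modeled on the RealizationSink example in ?write_block, not copied from it. It assumes a reasonably recent DelayedArray/HDF5Array in which `HDF5RealizationSink()`, `defaultSinkAutoGrid()`, and `write_block()` are available; the helper name `fill_with_normals` and the `block.length` value are illustrative choices, not part of either package.

```r
library(HDF5Array)  # also attaches DelayedArray

# Hypothetical helper: fill an on-disk HDF5 dataset with iid N(0, 1) draws,
# one block at a time, so only one block is held in memory at any moment.
fill_with_normals <- function(dims, block.length = 1e6) {
    sink <- HDF5RealizationSink(as.integer(dims), type = "double")
    # Grid of viewports covering the sink; each viewport is one block to write.
    grid <- defaultSinkAutoGrid(sink, block.length = block.length)
    for (bid in seq_along(grid)) {
        viewport <- grid[[bid]]
        block <- array(rnorm(prod(dim(viewport))), dim = dim(viewport))
        sink <- write_block(sink, viewport, block)
    }
    close(sink)
    as(sink, "DelayedArray")  # HDF5-backed DelayedArray, no delayed-op tree
}

# Small demonstration; the dimensions in the question would be c(1000, 100000).
A <- fill_with_normals(c(100, 2000))
dim(A)
```

Because each block is written straight to disk, no chain of delayed operations accumulates, which is what caused the C stack error in the column-by-column attempt.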

Author

ekernf01 commented Apr 4, 2021

Thanks, I adapted that RealizationSink example and it works really well. If I want to just use rhdf5 in the future, can the DelayedArray package read any hdf5 file, or are there certain expectations that have to be met? I don't know much about hdf5 yet, so please forgive me if it's an ignorant question, but I'm asking because in the past I have had some trouble writing hdf5 files with one scRNA package and then trying to read them with another.

Contributor

hpages commented Apr 4, 2021

The DelayedArray package implements all the backend agnostic stuff used by DelayedArray objects in general so is not geared specifically towards hdf5 datasets.

The HDF5Array() constructor in the HDF5Array package should be able to read most hdf5 datasets. There are no particular expectations to be met. However, performance of the HDF5Array object will depend a lot on some important parameters, such as chunk geometry, compression level, and storage type, that control how the dataset is physically stored on disk. All these parameters need to be decided ahead of time when the dataset is written to disk. The chunk geometry is probably the most important one, and the best geometry will ultimately depend on the typical access pattern of your downstream analysis.

These parameters are documented in ?writeHDF5Array in the HDF5Array package. The HDF5RealizationSink() constructor has the same arguments as the writeHDF5Array() function.
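
As a rough illustration of the parameters mentioned above, the sketch below writes a small matrix with an explicit chunk geometry and compression level, then reopens the dataset with the `HDF5Array()` constructor. The dataset name `"normals"`, the chunk shape, and the compression level are arbitrary choices made for the example, not recommendations; see ?writeHDF5Array for the authoritative argument list.

```r
library(HDF5Array)

# Small in-memory matrix standing in for real data.
m <- matrix(rnorm(100 * 2000), nrow = 100)

path <- tempfile(fileext = ".h5")

# Chunks spanning all rows and 10 columns: an assumption that downstream code
# mostly reads whole columns. Pick a different shape for row-wise access.
A <- writeHDF5Array(m, filepath = path, name = "normals",
                    chunkdim = c(100L, 10L), level = 3L)

chunkdim(A)  # chunk geometry as stored on disk

# The same dataset can be reopened later from the file path and dataset name.
B <- HDF5Array(path, "normals")
dim(B)
```

The chunk geometry is fixed once the dataset is written, so changing it later means writing a new dataset.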

Author

ekernf01 commented Apr 4, 2021

Thanks! After a little trouble with locking, I got that to work too now. I will close the issue.

ekernf01 closed this as completed Apr 4, 2021

Contributor

LTLA commented Apr 18, 2021

FWIW, the original question can also be answered with:

library(DelayedRandomArray) # see https://github.com/LTLA/DelayedRandomArray
randnorm <- RandomNormArray(c(1000, 100000)) 

library(HDF5Array)
writeHDF5Array(randnorm, file="foo.h5", path="bar") # generates a pretty large file; not very compressible.

Takes about 20 seconds for me.
