Reading data along chunked dimension does not scale linearly with amount of data #116

Open
ali-ramadhan opened this issue Feb 19, 2020 · 3 comments

@ali-ramadhan

Super cool work on integrating DiskArrays.jl with NetCDF.jl! Looking forward to ditching xarray in favor of a pure Julia solution.

@visr helped me get up and running, but we noticed that reading 2x as much data takes ~4x longer, whereas I expected the time to scale linearly. Unfortunately, I need to read data along the dimension with chunk size 1...

julia> using NetCDF

julia> ds = NetCDF.open("/home/alir/cnhlab004/bsose_i122/bsose_i122_2013to2017_1day_Theta.nc", "THETA")
Disk Array with size 2160 x 588 x 52 x 1826

julia> NetCDF.getchunksize(ds)
(2160, 588, 19, 1)

julia> @time ds[100, 200, :, 300]
  0.012066 seconds (48 allocations: 2.500 KiB)

julia> @time ds[100, 200, :, 320:330]
  0.010111 seconds (55 allocations: 4.750 KiB)

julia> @time ds[100, 200, :, 300:400]
  5.256234 seconds (56 allocations: 23.016 KiB)

julia> @time ds[100, 200, :, 600:800]
 19.074392 seconds (56 allocations: 43.328 KiB)
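
To put the timings above in context, here is a rough sketch that counts how many storage chunks each request touches, using only the chunk shape printed by `NetCDF.getchunksize`. The helper `chunks_touched` is hypothetical (it is not part of NetCDF.jl or DiskArrays.jl), and the sketch assumes a read must fetch every chunk that overlaps the requested index ranges.

```julia
# Hypothetical helper: number of chunks overlapped by a rectangular request.
# `ranges` holds one index range per dimension, `chunksize` the chunk shape.
function chunks_touched(ranges, chunksize)
    n = 1
    for (r, c) in zip(ranges, chunksize)
        first_chunk = (first(r) - 1) ÷ c   # zero-based index of the first chunk hit
        last_chunk  = (last(r) - 1) ÷ c    # zero-based index of the last chunk hit
        n *= last_chunk - first_chunk + 1
    end
    return n
end

chunksize = (2160, 588, 19, 1)   # as reported by NetCDF.getchunksize(ds)

# The four reads from the session above, with `:` written out as 1:52.
requests = [
    (100:100, 200:200, 1:52, 300:300),
    (100:100, 200:200, 1:52, 320:330),
    (100:100, 200:200, 1:52, 300:400),
    (100:100, 200:200, 1:52, 600:800),
]

for r in requests
    nchunks = chunks_touched(r, chunksize)
    nvalues = prod(length.(r))
    println(r[4], ": ", nvalues, " values from ", nchunks, " chunks")
end
```

Under that assumption the four reads touch 3, 33, 303 and 603 chunks, each holding 2160 × 588 × 19 = 24,131,520 elements. The chunk count grows linearly with the number of time steps (303 to 603), while the measured time grows by about 3.6x (5.26 s to 19.07 s), so the chunk count alone does not account for the slowdown.
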
@visr
Member

visr commented Feb 20, 2020

It's great to have an example of such a large NetCDF file. At the moment I cannot tell whether this time is spent in the NetCDF C library or in the Julia wrapper code, though running the slower calls under a profiler should give that information.
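
A minimal sketch of such a profiling run, using Julia's `Profile` standard library. The function `slow_read` is a stand-in workload so that the snippet runs without the dataset; in the real session it would be replaced by the slow indexing call, e.g. `ds[100, 200, :, 300:400]`.

```julia
using Profile

# Stand-in for the slow read (an assumption for illustration only).
# In the real session, profile `ds[100, 200, :, 300:400]` instead.
slow_read() = sum(sqrt(abs(sin(i))) for i in 1:10^7)

slow_read()        # run once first so that compilation is not profiled
Profile.clear()    # discard any samples collected earlier
@profile slow_read()

# Flat listing sorted by sample count; C = true also shows frames from C code,
# which is what would separate library time from Julia wrapper time.
Profile.print(format = :flat, sortedby = :count, C = true)
```

If most samples land in C frames, the time is likely spent inside the NetCDF C library; if they land in Julia frames, the wrapper code is the more likely culprit.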

@meggart
Member

meggart commented Feb 20, 2020

I agree with @visr that it is hard to say where the time is spent. Please also note that the NetCDF C library does some internal caching, so I guess your 3rd call was profiting from the previous reads. I have found these kinds of problems very difficult to debug. Ideally you would restart your Julia session after every data access to make sure NetCDF has not cached anything, but then your timings include precompilation...
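
One possible way around that trade-off, sketched here as an untested assumption: run each measurement in a fresh Julia process, and inside that process first read a small slice far from the target region so that the indexing code is compiled before the timed read. The helper `fresh_timing_cmd` and its arguments are hypothetical; only `NetCDF.open(path, varname)` and the indexing syntax come from the session in the first comment.

```julia
# Hypothetical helper: build a command that times one read in a fresh process.
# `warmup` and `target` are index expressions given as strings.
function fresh_timing_cmd(path, varname, warmup, target)
    script = """
    using NetCDF
    ds = NetCDF.open($(repr(path)), $(repr(varname)))
    ds[$warmup];         # compile the indexing path on a slice away from the target
    @time ds[$target];   # timed read; the target range was not read earlier in this process
    """
    return `$(Base.julia_cmd()) -e $script`
end

# Placeholder file name; substitute the real path before running.
cmd = fresh_timing_cmd("file.nc", "THETA",
                       "100, 200, :, 1:2",       # warm-up far from the target range
                       "100, 200, :, 300:400")   # the read being measured

# run(cmd)   # uncomment to launch the fresh process (needs NetCDF.jl and the file)
```

The warm-up uses a range rather than a single index so that it has the same argument types as the timed call. Note that a fresh process does not clear the operating system's file cache, so repeated runs may still be faster than the first one.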

@bjarthur

I cannot reproduce this with my dataset, which is of similar size but has only three dimensions. @ali-ramadhan, is this still a problem for you?
