
Improve performances in fetching data #28

Merged
gmaze merged 110 commits into master from parallel-requests on Oct 6, 2020

Conversation

@gmaze gmaze (Member) commented Jun 23, 2020

Add an asynchronous open_mfdataset to the httpstore, and much more ...

Closes #27 #16 #51
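For readers unfamiliar with the idea, here is a minimal thread-based sketch of what a concurrent open_mfdataset on an HTTP store could look like. This is not the PR's actual implementation: the function body and names are illustrative, and it assumes every URL returns a netCDF payload.

import concurrent.futures
import io

import fsspec
import xarray as xr

def open_mfdataset(urls, max_workers=8):
    # Download several URLs concurrently, then concatenate the decoded datasets
    fs = fsspec.filesystem("http")

    def fetch(url):
        data = fs.cat_file(url)                   # raw bytes from the server
        return xr.open_dataset(io.BytesIO(data))  # decode netCDF in memory

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        datasets = list(pool.map(fetch, urls))
    return xr.concat(datasets, dim="N_POINTS")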

@gmaze gmaze added the performance and internals (Internal machinery) labels Jun 23, 2020
@gmaze gmaze self-assigned this Jun 23, 2020
@gmaze gmaze marked this pull request as draft June 24, 2020 21:03
@gmaze gmaze (Member, Author) commented Jul 1, 2020

Here are some benchmarking results for the data fetching time using the erddap data source and a large box. The box limits vary so as to increase the size of the dataset to fetch.

The code below was run for a 0-50 m deep box over 2018 alone, and then for 0-50 m, 0-100 m, 0-200 m and 0-300 m deep boxes over 2018/2019. The horizontal domain is 40 deg in longitude by 20 deg in latitude. I use small random variations of the longitude boundaries to prevent the erddap server from caching requests, and the expert mode to limit the amount of post-processing on the client side. Note the new fetcher options parallel=True, chunks='auto' and box_maxsize=[10, 10, 50]. Each request is run n=5 times to get an ensemble of measurements and more stable metrics.

import time
import numpy as np
from argopy import DataFetcher as ArgoDataFetcher

par_bench = []
for run in range(5):
    start_time = time.time()
    large_box = [-70 + np.random.random_sample(1)[0], -30 + np.random.random_sample(1)[0],
                 20, 40, 0, 200, '2018-01-01', '2020-01-01']
    fetcher = ArgoDataFetcher(mode='expert', parallel=True, chunks='auto',
                              box_maxsize=[10, 10, 50]).region(large_box)
    ds = fetcher.to_xarray()
    par_bench.append({'ETIM': time.time() - start_time,       # elapsed time (s)
                      'NPTS': np.max(ds['N_POINTS'].values),  # points fetched
                      'MBYT': ds.nbytes / 1e6,                # dataset size (MB)
                      'CHUNKS': fetcher.fetcher.chunks, 'NREQ': len(fetcher.fetcher.urls)})
    print("Run #%i" % run, par_bench[-1])

PS: the request fails for the 0-400 m and 0-800 m deep boxes.

Results:
[benchmark figure: dev-pr28-erddap-chunks-bench-results-02]
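Judging from the option name, box_maxsize presumably caps the size of each sub-box along longitude, latitude and depth, with each sub-box then fetched by its own request. A minimal splitter along those lines (an illustrative sketch, not the PR's actual code):

import numpy as np

def split_box(box, maxsize=(10, 10, 50)):
    # box = [lon_min, lon_max, lat_min, lat_max, dep_min, dep_max, t_start, t_end]
    lon0, lon1, lat0, lat1, dep0, dep1, t0, t1 = box
    return [[lo, min(lo + maxsize[0], lon1),
             la, min(la + maxsize[1], lat1),
             de, min(de + maxsize[2], dep1), t0, t1]
            for lo in np.arange(lon0, lon1, maxsize[0])
            for la in np.arange(lat0, lat1, maxsize[1])
            for de in np.arange(dep0, dep1, maxsize[2])]

With the benchmark's 40 x 20 deg, 0-200 m box and maxsize=[10, 10, 50], this would yield 4 x 2 x 4 = 32 sub-boxes, i.e. 32 requests that can be fetched in parallel.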

gmaze added 13 commits July 2, 2020 00:12
Can switch between thread and process pools, with support for a dask client; also add a progress bar
Modules: rename fsspec_wrappers.py to filesystems.py
Unit tests: make it clearer whether we are testing the facade or each individual fetcher
Fix protocol options not being passed properly when using the caching system
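The thread/process/dask commit suggests an executor switch along these lines (a hypothetical sketch with illustrative names, not the actual argopy internals):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def run_parallel(fn, items, parallel="thread", max_workers=8):
    # Apply fn to every item using the requested execution backend
    if parallel in ("thread", "process"):
        Pool = ThreadPoolExecutor if parallel == "thread" else ProcessPoolExecutor
        with Pool(max_workers=max_workers) as pool:
            return list(pool.map(fn, items))
    if parallel == "dask":
        from dask.distributed import Client
        client = Client(processes=False)  # or attach to an existing cluster
        return client.gather(client.map(fn, list(items)))
    raise ValueError("parallel must be 'thread', 'process' or 'dask'")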
@quai20 quai20 (Member) left a comment


I'll continue testing, but the code is OK to me, and based on the docs everything works fine.

@gmaze gmaze (Member, Author) commented Oct 6, 2020

OK, thanks @quai20 for the review. Do you think this is ready for a merge?

@quai20 quai20 (Member) commented Oct 6, 2020

OK to me!

@gmaze gmaze changed the title from "Improve performances in fetching data online" to "Improve performances in fetching data" Oct 6, 2020
@gmaze gmaze merged commit 14f747c into master Oct 6, 2020
@gmaze gmaze deleted the parallel-requests branch December 9, 2020 10:03
Labels: performance, internals (Internal machinery)
Projects: none
Participants: 2 (gmaze, quai20)