
Improve performances in fetching data #28

Merged
gmaze merged 110 commits into master from parallel-requests on Oct 6, 2020

Conversation

@gmaze gmaze (Member) commented Jun 23, 2020

Add an asynchronous open_mfdataset to the httpstore, and much more ...

Closes #27 #16 #51
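For readers unfamiliar with the idea, here is a minimal thread-based sketch of what a concurrent open_mfdataset on an HTTP store could look like. This is not the PR's actual implementation: the function body and names are illustrative, and it assumes every URL returns a netCDF payload.

import concurrent.futures
import io

import fsspec
import xarray as xr

def open_mfdataset(urls, max_workers=8):
    # Download several URLs concurrently, then concatenate the decoded datasets
    fs = fsspec.filesystem("http")

    def fetch(url):
        data = fs.cat_file(url)                   # raw bytes from the server
        return xr.open_dataset(io.BytesIO(data))  # decode netCDF in memory

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        datasets = list(pool.map(fetch, urls))
    return xr.concat(datasets, dim="N_POINTS")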

@gmaze gmaze added the performance and internals (Internal machinery) labels Jun 23, 2020
@gmaze gmaze self-assigned this Jun 23, 2020
@gmaze gmaze marked this pull request as draft June 24, 2020 21:03
@gmaze gmaze (Member, Author) commented Jul 1, 2020

Here are some benchmarking results for the data fetching time using the erddap data source and a large box. The box limits vary so as to increase the size of the dataset to fetch.

The code below was run for a 0-50 m deep box over 2018 alone, and then for 0-50 m, 0-100 m, 0-200 m and 0-300 m deep boxes over 2018/2019. The horizontal domain is 40 deg in longitude by 20 deg in latitude. I use small random variations of the longitude boundaries to prevent the erddap server from caching requests, and the expert mode to limit the amount of post-processing on the client side. Note the new fetcher options parallel=True, chunks='auto' and box_maxsize=[10, 10, 50]. Each request is run n=5 times to get an ensemble of measurements and more stable metrics.

import time
import numpy as np
from argopy import DataFetcher as ArgoDataFetcher

par_bench = []
for run in range(5):
    start_time = time.time()
    large_box = [-70 + np.random.random_sample(1)[0], -30 + np.random.random_sample(1)[0],
                 20, 40, 0, 200, '2018-01-01', '2020-01-01']
    fetcher = ArgoDataFetcher(mode='expert', parallel=True, chunks='auto',
                              box_maxsize=[10, 10, 50]).region(large_box)
    ds = fetcher.to_xarray()
    par_bench.append({'ETIM': time.time() - start_time,       # elapsed time (s)
                      'NPTS': np.max(ds['N_POINTS'].values),  # points fetched
                      'MBYT': ds.nbytes / 1e6,                # dataset size (MB)
                      'CHUNKS': fetcher.fetcher.chunks, 'NREQ': len(fetcher.fetcher.urls)})
    print("Run #%i" % run, par_bench[-1])

PS: the request fails for the 0-400 m and 0-800 m deep boxes.

Results:
[benchmark figure: dev-pr28-erddap-chunks-bench-results-02]
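Judging from the option name, box_maxsize presumably caps the size of each sub-box along longitude, latitude and depth, with each sub-box then fetched by its own request. A minimal splitter along those lines (an illustrative sketch, not the PR's actual code):

import numpy as np

def split_box(box, maxsize=(10, 10, 50)):
    # box = [lon_min, lon_max, lat_min, lat_max, dep_min, dep_max, t_start, t_end]
    lon0, lon1, lat0, lat1, dep0, dep1, t0, t1 = box
    return [[lo, min(lo + maxsize[0], lon1),
             la, min(la + maxsize[1], lat1),
             de, min(de + maxsize[2], dep1), t0, t1]
            for lo in np.arange(lon0, lon1, maxsize[0])
            for la in np.arange(lat0, lat1, maxsize[1])
            for de in np.arange(dep0, dep1, maxsize[2])]

With the benchmark's 40 x 20 deg, 0-200 m box and maxsize=[10, 10, 50], this would yield 4 x 2 x 4 = 32 sub-boxes, i.e. 32 requests that can be fetched in parallel.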

gmaze added 13 commits July 2, 2020 00:12
Can switch between thread and process pools, with support for a dask client; also add a progress bar
Modules: rename fsspec_wrappers.py to filesystems.py
Unit tests: make it clearer whether we are testing the facade or each individual fetcher
Fix protocol options not being passed properly when using the caching system
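The thread/process/dask commit suggests an executor switch along these lines (a hypothetical sketch with illustrative names, not the actual argopy internals):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def run_parallel(fn, items, parallel="thread", max_workers=8):
    # Apply fn to every item using the requested execution backend
    if parallel in ("thread", "process"):
        Pool = ThreadPoolExecutor if parallel == "thread" else ProcessPoolExecutor
        with Pool(max_workers=max_workers) as pool:
            return list(pool.map(fn, items))
    if parallel == "dask":
        from dask.distributed import Client
        client = Client(processes=False)  # or attach to an existing cluster
        return client.gather(client.map(fn, list(items)))
    raise ValueError("parallel must be 'thread', 'process' or 'dask'")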
@quai20 quai20 (Member) left a comment


I'll continue testing, but the code is OK to me, and based on the docs everything works fine.

@gmaze gmaze (Member, Author) commented Oct 6, 2020

OK, thanks @quai20 for the review. Do you think this is ready for a merge?

@quai20 quai20 (Member) commented Oct 6, 2020

OK to me!

@gmaze gmaze changed the title from "Improve performances in fetching data online" to "Improve performances in fetching data" Oct 6, 2020
@gmaze gmaze merged commit 14f747c into master Oct 6, 2020
@gmaze gmaze deleted the parallel-requests branch December 9, 2020 10:03
Labels: performance, internals (Internal machinery)
Projects: none
Participants: 2 (gmaze, quai20)