
Remove chunks that do not intersect target area in gradient search resampling #282

Merged: 44 commits into pytroll:master, May 29, 2020

Conversation

@pnuu (Member) commented May 14, 2020

This PR adds filtering to gradient search resampling that discards source data chunks that do not intersect the target area.

The gradient search is now an optional mode, and requires shapely to be installed.
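For illustration, the filtering boils down to a polygon intersection test with shapely. A minimal sketch of the idea (not the PR's actual implementation; the function and argument names are invented, and the corner coordinates are assumed to already be in a common coordinate system):

from shapely.geometry import Polygon

def chunk_intersects_target(chunk_corners_x, chunk_corners_y,
                            target_corners_x, target_corners_y):
    """Return True if a source chunk's footprint overlaps the target area."""
    chunk_poly = Polygon(list(zip(chunk_corners_x, chunk_corners_y)))
    target_poly = Polygon(list(zip(target_corners_x, target_corners_y)))
    return chunk_poly.intersects(target_poly)

# Source chunks for which this returns False are discarded before resampling.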

With this PR the following work with gradient search resampling:

  • resampling of datasets with different resolutions

  • resampling of datasets that previously failed with IndexError: Bad index because chunks had incompatible shapes

  • resampling of data with AreaDefinition

  • resampling of data with SwathDefinition with "smooth" coordinates (= not VIIRS, MODIS, point data)

  • Tests added

  • Tests passed

  • Passes git diff origin/master **/*py | flake8 --diff


@pnuu (Member, Author) commented May 14, 2020

Some timing results so far for the following script:

import glob
import os

# Chunk size and number of workers/threads; set before importing satpy
os.environ['PYTROLL_CHUNK_SIZE'] = '1024'
os.environ['DASK_NUM_WORKERS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

from satpy import Scene

# Load three SEVIRI RGB composites and resample them to the 'eurol' area
fnames = glob.glob('/home/lahtinep/data/satellite/geo/msg/*202004170600*')
glbl = Scene(reader='seviri_l1b_hrit', filenames=fnames)
glbl.load(['natural_color', 'overview', 'red_snow'], generate=False)
lcl = glbl.resample('eurol', resampler='gradient_search', reduce_data=False)
lcl.save_datasets(base_dir='/tmp')

  • master branch: 31.1 s (1 core) / 23.1 s (2 cores) / 20.6 s (4 cores)
  • this PR: 23.6 s / 16.7 s / 14.7 s

This image shows how the data was reduced before resampling (black = target area, blue = chunks used in processing, red = discarded chunks):
[Image: geos_crop_boundaries_fixed_covers_seviri]

There are still some problems with ABI data. Although the chunk filtering works ...
[Image: geos_crop_boundaries_fixed_covers]

... the image still doesn't have full coverage:
[Image: airmass_20170727_090038]

@pnuu (Member, Author) commented May 14, 2020

The problem with ABI was my assumption of how the chunks are ordered for processing. The stepping goes first in the Y direction, then in X; I had it reversed. It works with SEVIRI because there is only one chunk in the X direction.
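For reference, not code from this PR: one quick way to sanity-check such an ordering assumption is to look at the delayed-block grid of a dask array, whose shape is (chunks along y, chunks along x):

import dask.array as da

arr = da.zeros((4, 6), chunks=(2, 2))
blocks = arr.to_delayed()
print(blocks.shape)  # (2, 3): 2 chunks in the y direction, 3 in the x direction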

@djhoese (Member) commented May 14, 2020

> The problem with ABI was my assumption of how the chunks are ordered for processing. The stepping goes first in the Y direction, then in X; I had it reversed. It works with SEVIRI because there is only one chunk in the X direction.

So now after your last fix the ABI image is correct?

@pnuu (Member, Author) commented May 14, 2020

The last commit fixed the chunk collection for ABI, so here are some timing results for ABI L1b (reduced-resolution data via EUMETCast) for three RGBs (['airmass', 'ash', 'night_fog']), resampled to my test area 'conus_5km' (area definition shown below):

  • master branch: 30.8 s (1 core) / 24.5 s (2 cores) / 22.5 s (4 cores)
  • this PR: 11.7 s / 9.8 s / 9.0 s

The six chunks that were used in processing are shown in blue in the image above, surrounding the black target area.

conus_5km:
  description: Continental USA, 5 km resolution
  projection:
    proj: laea
    a: 6370997.0
    b: 6370997.0
    lat_0: 50.0
    lon_0: -100.0
  shape:
    height: 669
    width: 1027
  area_extent:
    lower_left_xy: [-2864809.2037627744, -2830762.316377244]
    upper_right_xy: [2268528.5715566175, 514757.94726551295]
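
For reference, the same area can be built programmatically; a minimal sketch using pyresample's create_area_def (not part of this PR):

from pyresample import create_area_def

conus_5km = create_area_def(
    'conus_5km',
    {'proj': 'laea', 'a': 6370997.0, 'b': 6370997.0, 'lat_0': 50.0, 'lon_0': -100.0},
    shape=(669, 1027),  # (height, width)
    area_extent=[-2864809.2037627744, -2830762.316377244,
                 2268528.5715566175, 514757.94726551295],
    description='Continental USA, 5 km resolution',
)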

@pnuu (Member, Author) commented May 14, 2020

> So now after your last fix the ABI image is correct?

Yes:
[Image: airmass_20170727_090038]

@pnuu (Member, Author) commented May 14, 2020

According to dask.diagnostics.ResourceProfiler, the memory usage dropped from ~3.3 GB (master branch) to ~1.4 GB when using 4 Dask workers for the ABI case.
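For reference, a minimal sketch of how such memory profiles can be collected (not the exact script used here; glbl is a Scene loaded as in the timing script above, and the area name is just the test area from this thread):

from dask.diagnostics import ResourceProfiler

with ResourceProfiler(dt=0.5) as rprof:
    lcl = glbl.resample('conus_5km', resampler='gradient_search')
    lcl.save_datasets(base_dir='/tmp')

# rprof.results holds (time, memory in MB, CPU %) samples
peak_gb = max(sample.mem for sample in rprof.results) / 1024.0
print('Peak memory use: %.1f GB' % peak_gb)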

@pnuu (Member, Author) commented May 15, 2020

With this latest commit, the timings are

  • ABI: 11.3 s (1 core), 9.4 s (2 cores), 11.3 s (4 cores)
  • SEVIRI: 24.3 s (1 core), 16.6 s (2 cores), 14.7 s (4 cores)

For ABI there's a slight improvement even with this rather small target area, and with a larger target area (= more chunks) the gain would be larger. There's zero gain for SEVIRI, as all of the used source chunks cover all of the target chunks.

@pnuu (Member, Author) commented May 15, 2020

Another test with a 1 km CONUS area (3345 x 5135 pixels) and ABI:

  • master: 5 m 11.5 s
  • filter only source chunks: 1 m 7.3 s
  • filter also target chunks (the latest commit): 1 m 9.4 s
  • filter also target chunks, cache polygons: 1 m 7.5 s

Pretty significant jump from the master branch, I'd say.

To get the most out of the target chunk filtering, we'd need to pre-compute all the boundary polygons, but I didn't find a way to pass them via da.blockwise() to the worker function. Alternatively, if there were a way to pass some information about the chunk location, it would be possible to access the polygons from each worker.
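One possible direction, not used in this PR: dask's map_blocks (as opposed to blockwise) can pass the chunk location to the worker function via a block_info keyword argument, which could be used to look up pre-computed polygons. A minimal sketch:

import dask.array as da

def process(block, block_info=None):
    if block_info:
        # block_info[0]['chunk-location'] is e.g. (0, 1) for the chunk in
        # block-row 0, block-column 1 of the first input array
        print(block_info[0]['chunk-location'], block_info[0]['array-location'])
    return block

arr = da.zeros((4, 4), chunks=(2, 2))
arr.map_blocks(process, dtype=arr.dtype).compute()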

@pnuu (Member, Author) commented May 19, 2020

After the last major restructuring (pairing up the input and output chunks for processing), the 1 km CONUS case takes 1 m 13 s to process and uses 10 GB of memory. The previous iteration used only 2.5 GB, so next I need to look at reducing the memory usage.

@pnuu (Member, Author) commented May 19, 2020

The memory consumption issue is now solved. I was doing a lot of unnecessary padding, stacking and da.nanmax()'ing. With the latest commit the ABI case takes 40.0 s to process with a memory use of 2.5 GB. Compared to the 5+ minutes and ~5 GB of the original version in the master branch, I'm quite happy.

@pnuu pnuu marked this pull request as ready for review May 22, 2020 10:46
@pnuu pnuu requested a review from mraspaud May 22, 2020 10:47
@pnuu pnuu self-assigned this May 22, 2020
@pnuu (Member, Author) commented May 22, 2020

AppVeyor doesn't seem to be using conda to install the packages, so I don't know how to fix the Linux build error.

@pnuu (Member, Author) commented May 25, 2020

I copied some fixes from #247 so that the arrays passed to the Cython code can be read-only (const), which makes distributed processing work more reliably. This also required the C source code to be re-generated, which will probably cause a huge merge conflict.
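To illustrate the read-only part (my own example, not code from #247): distributed workers can hand out non-writeable numpy arrays, which plain (non-const) Cython typed memoryviews refuse to accept:

import numpy as np

data = np.ones((2, 2), dtype=np.float64)
data.setflags(write=False)     # similar to an array received from a distributed worker
print(data.flags.writeable)    # False
# A Cython signature declared as e.g. `const double[:, :] data` accepts this array,
# while a non-const `double[:, :] data` memoryview raises an error for read-only input.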

@pnuu pnuu mentioned this pull request May 26, 2020
@mraspaud (Member) left a comment


LGTM, thanks for adding the pre-checks! I have two small inline comments that need to be fixed.

setup.cfg (outdated)
@@ -1,5 +1,5 @@
 [bdist_rpm]
-requires=python3-numpy pykdtree python3-numexpr pyproj python3-configobj
+requires=python3-numpy pykdtree python3-numexpr pyproj python3-configobj shapely
@mraspaud (Member):

this should be python3-shapely in RHEL8

@pnuu (Member, Author):

Updated.

return dst_poly

def get_chunk_mappings(self):
""""""
@mraspaud (Member):

You can do better :)

@pnuu (Member, Author):

Ooops.

@pnuu (Member, Author):

Docstring added.

@codecov (bot) commented May 29, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@719d8ff).
The diff coverage is 98.30%.


@@            Coverage Diff            @@
##             master     #282   +/-   ##
=========================================
  Coverage          ?   92.55%           
=========================================
  Files             ?       43           
  Lines             ?     9018           
  Branches          ?        0           
=========================================
  Hits              ?     8347           
  Misses            ?      671           
  Partials          ?        0           
Impacted Files Coverage Δ
pyresample/test/test_gradient.py 97.93% <97.79%> (ø)
pyresample/gradient/__init__.py 97.92% <99.00%> (ø)
pyresample/test/test_image.py 100.00% <0.00%> (ø)
pyresample/geo_filter.py 100.00% <0.00%> (ø)
pyresample/_spatial_mp.py 83.56% <0.00%> (ø)
pyresample/grid.py 88.23% <0.00%> (ø)
pyresample/test/test_bucket.py 100.00% <0.00%> (ø)
pyresample/geometry.py 82.87% <0.00%> (ø)
pyresample/utils/rasterio.py 51.78% <0.00%> (ø)
pyresample/utils/proj4.py 86.95% <0.00%> (ø)
... and 35 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 719d8ff...d998c73.

@mraspaud mraspaud merged commit 1c2b794 into pytroll:master May 29, 2020
@pnuu pnuu deleted the gradient-skip-chunks branch May 29, 2020 07:59