Skip to content

Serverless high-resolution aerial imagery map tiles from Cloud-Optimized GeoTIFFs for the U.S.

License

Notifications You must be signed in to change notification settings

kylebarron/naip-cogeo-mosaic

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

naip-cogeo-mosaic

Serverless high-resolution NAIP map tiles from Cloud-Optimized GeoTIFFs for the lower 48 U.S. states.

Interactive example

60cm-resolution NAIP imagery of the Grand Canyon from 2017.

Overview

The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. All NAIP imagery between 2011 and 2018 is stored in an AWS S3 public dataset, and crucially the naip-visualization bucket stores images in Cloud-Optimized GeoTIFF (COG) format. Because this data format supports streaming portions of the image at a time, it enables serving a basemap of imagery on demand without needing to preprocess and store any imagery.

This repository is designed to create MosaicJSON files representing collections of NAIP imagery that can be used with titiler to serve map tiles on demand.

Using

If you'd like to get running quickly, you can use a built mosaicJSON file in the filled/ folder and skip down to "Deploy".

Otherwise, the following describes how to create a custom mosaicJSON file from specified years of NAIP imagery available on AWS.

Install

Clone the repository and install Python dependencies.

git clone https://github.com/kylebarron/naip-cogeo-mosaic
cd naip-cogeo-mosaic
conda env create -f environment.yml
source activate naip-cogeo-mosaic

If you prefer using pip, you can run

pip install awscli click pygeos 'rio-cogeo>=2.0' 'cogeo-mosaic>=3.0a5'

Select TIF URLs

This section outlines methods for selecting files that represent a country-wide mosaic of NAIP imagery, which can then be put into a MosaicJSON file for serving.

Download manifest.txt. This file has a listing of all files stored on the naip-visualization bucket.

aws s3 cp s3://naip-visualization/manifest.txt ./manifest.txt --request-payer

In the NAIP program, different states are photographed in different years, with a target of imaging all states within every 3 years. Here's an interactive map of when each state was photographed, (though it doesn't appear to include 2018 yet; this graphic shows which states were photographed in 2018).

All (lower 48) states were photographed between 2011-2013, and again in 2014-2015. All states except Maine were photographed in 2016-2017. All states except Oregon were photographed in 2017-2018.

Therefore, I'll generate four MosaicJSONs, with each spanning a range of 2011-2013, 2014-2015, 2015-2017, and 2016-2018. For the last two, I include an extra start year just for Maine/Oregon, but set each to use the latest available imagery, so only Maine takes imagery from 2015 and only Oregon takes imagery from 2016, respectively.

The following code block selects imagery for each time span. You can run python code/naip.py --help for a full description of available options.

python code/naip.py manifest \
    -s 2011 \
    -e 2013 \
    --select-method last \
    manifest.txt \
    | sed -e 's|^|s3://naip-visualization/|' \
    > urls_2011_2013.txt
python code/naip.py manifest \
    -s 2014 \
    -e 2015 \
    --select-method last \
    manifest.txt \
    | sed -e 's|^|s3://naip-visualization/|' \
    > urls_2014_2015.txt
python code/naip.py manifest \
    -s 2015 \
    -e 2017 \
    --select-method last \
    manifest.txt \
    | sed -e 's|^|s3://naip-visualization/|' \
    > urls_2015_2017.txt
python code/naip.py manifest \
    -s 2016 \
    -e 2018 \
    --select-method last \
    manifest.txt \
    | sed -e 's|^|s3://naip-visualization/|' \
    > urls_2016_2018.txt

Each output file includes one filename for each quad identifier, deduplicated across years.

> head -n 5 urls_2016_2018.txt
s3://naip-visualization/al/2017/100cm/rgb/30085/m_3008501_ne_16_1_20171018.tif
s3://naip-visualization/al/2017/100cm/rgb/30085/m_3008501_nw_16_1_20171006.tif
s3://naip-visualization/al/2017/100cm/rgb/30085/m_3008502_ne_16_1_20170909.tif
s3://naip-visualization/al/2017/100cm/rgb/30085/m_3008502_nw_16_1_20170909.tif
s3://naip-visualization/al/2017/100cm/rgb/30085/m_3008503_ne_16_1_20171017.tif

Additionally, files along state borders are deduplicated. Often, cells on state borders are duplicated across years. For example, this image is duplicated in both Texas's and Louisiana's datasets:

tx/2012/100cm/rgb/29093/m_2909302_ne_15_1_20120522.tif
la/2013/100cm/rgb/29093/m_2909302_ne_15_1_20130702.tif

As you can tell by the cell and name, these are the same position across different years. I deduplicate these to reduce load on the lambda function parsing the mosaicJSON.

To visualize the quads covered by a list of urls, you can visualize its footprint. For example, to get the mosaic footprint of Rhode Island from the 2011-2013 mosaic:

export AWS_REQUEST_PAYER="requester"
cat urls_2011_2013.txt \
    | grep "^s3://naip-visualization/ri/" \
    | cogeo-mosaic footprint - > footprint.geojson

And inspect it with kepler.gl:

kepler footprint.geojson

Here you can see that the tiles to be used in the mosaic of Rhode Island don't include the state's border. That's because the Python script to parse the manifest deduplicates tiles on the border when they're include in both states. If you looked at the footprint of Connecticut, you'd see the missing tiles on the border.

Total number of files

> wc -l urls_2011_2013.txt
  213197 urls_2011_2013.txt

Create MosaicJSON

NAIP imagery tiffs are in a requester pays bucket. In order to access them, you need to set the AWS_REQUEST_PAYER environment variable:

export AWS_REQUEST_PAYER="requester"

I also found that on an AWS EC2 instance; cogeo-mosaic create was failing while it was working on my local computer. In general, if cogeo-mosaic create isn't working for some URL; you should run rio info <URL> and see what the error is, since cogeo-mosaic uses rasterio internally, but doesn't currently print rasterio errors to stdout. In my case, I had to set the certificates path (see cogeotiff/rio-tiler#19, mapbox/rasterio#942).

export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

I don't know how much data cogeo-mosaic create downloads (it only requests the GeoTIFF headers of each file), but it might be wise to run the mosaic creation on an AWS EC2 instance in the us-west-2 region (the same region where the NAIP imagery is located), so that you don't have to pay for egress bandwidth on the requests. I found that creating the mosaic took about 1.5GB of memory; it finished in about 7 hours per mosaic on a t2.small instance.

Then create the MosaicJSON file. GET requests are priced at $0.0004 per 1000 requests, so creating the MosaicJSON should cost 0.0004 * (200000 / 1000) = 0.08. 8 cents!

cat urls_2011_2013.txt \
    | cogeo-mosaic create - \
    > naip_2011_2013_mosaic.json
cat urls_2014_2015.txt \
    | cogeo-mosaic create - \
    > naip_2014_2015_mosaic.json
cat urls_2015_2017.txt \
    | cogeo-mosaic create - \
    > naip_2015_2017_mosaic.json
cat urls_2016_2018.txt \
    | cogeo-mosaic create - \
    > naip_2016_2018_mosaic.json

Fill in missing quadkeys

Some of these years have small missing areas. For example, in some years parts of Montana weren't photographed. fill_mosaic_holes.py is a simple script to fill mosaic quadkeys across years.

This following tells the script to look at all the mosaics data/*.json and create new filled scripts output to the filled/ folder.

python code/fill_mosaic_holes.py -o filled data/naip_201*.json

Note that this fills in entire quadkeys that are missing in one year but that exist in another. However, if a year is missing some areas, there will be quadkeys that exist but only have partial data. So without more effort there can still be some small holes in the data. See issue #8.

Deploy

The older cogeo-mosaic-tiler is being deprecated in favor of the newer, more stable titiler. Refer to titiler's documentation for deployment instructions. Note that since NAIP images are in a requester-pays bucket, you'll need to set AWS_REQUEST_PAYER="requester" in the environment.

Upload MosaicJSON files

In order for titiler to create your tiles on demand, it needs to access a MosaicJSON file, which you need to host somewhere accessible by titiler (preferably in the same AWS region).

Generally the simplest method is uploading the JSON file to S3. However since these files are so large (~64MB uncompressed), I found that it was taking 2.5s to load and parse the JSON.

As of v3, cogeo-mosaic (and thus also titiler) support alternate backends, such as DynamoDB. DynamoDB is a serverless database that makes loading the MosaicJSON fast, because the tiler only needs one or two reads, which each take around 10ms (as long as the DynamoDB table is in the same region as the tiler).

For full backend docs, see cogeo-mosaic's documentation.

DynamoDB

If you wish to connect the tiler to one or more DynamoDB tables, you need to deploy with

sls deploy --bucket your-mosaic-bucket --aws-account-id your-aws-account-id

You can find your AWS account ID with

aws sts get-caller-identity

To upload a MosaicJSON to DynamoDB, run:

pip install -U cogeo-mosaic>=3.0a5
cogeo-mosaic upload --url 'dynamodb://{aws_region}/{table_name}' mosaic.json

That uploads the MosaicJSON to the given table in the specified region, creating the table if necessary.

Proxy your endpoint with Cloudflare

I like to proxy through Cloudflare to take advantage of their free caching. You can read my blog post here to see how to do that.

Low Zoom Overviews

The NAIP imagery hosted on AWS in the naip-visualization bucket has full-resolution imagery plus 5 levels of internal overviews within each GeoTIFF. That means that it's fast to read image data for 6 or 7 zoom levels, but will slow down as you zoom out, since you'll necessarily need to read and combine many images, and perform downsampling on the fly.

The native zoom range for source NAIP COG imagery is roughly 12-18. That means that for one zoom 6 tile, you'd have to combine imagery for 4^6 = 4,096 COGs on the fly.

A way to solve this issue is to pregenerate lower zoom overviews given a mosaic. This removes some flexibility, as you have to choose upfront how to combine the higher-resolution images, but enables performant serving of lower-resolution imagery.

If you don't want to generate your own overviews, pregenerated overviews are available in a requester pays bucket of mine, and overview mosaics are available in the filled/ directory. My overview COGs are available at s3://rp.kylebarron.dev/cog/naip/deflate/.

This section outlines how to create these overviews, which are themselves COGs. After creating the overviews, you'll have fast low-zoom imagery, as you can see in this screenshot of Colorado.

Note, this code is experimental.

I want to have a seamless downsampled map. To do this I'll create overview COGs, and then create another mosaic from these downsampled images. For zooms 12 and up, the map will fetch images using the full-resolution mosaic; for zooms 6-12, the map will use the lower-resolution mosaic.

To make this mosaic, I'll create a new overview COG for each zoom 6 mercator tile. Then within a zoom 6 tile, only one source COG will need to be read.

First, split the large, U.S.-wide mosaic into mosaics for each zoom 6 quadkey, using a script in code/overviews.py.

python code/overviews.py split-mosaic \
    -z 6 \
    -o overview-mosaics \
    --prefix naip_2011_2013_ \
    filled/naip_2011_2013_mosaic.json
python code/overviews.py split-mosaic \
    -z 6 \
    -o overview-mosaics \
    --prefix naip_2014_2015_ \
    filled/naip_2014_2015_mosaic.json
python code/overviews.py split-mosaic \
    -z 6 \
    -o overview-mosaics \
    --prefix naip_2015_2017_ \
    filled/naip_2015_2017_mosaic.json
python code/overviews.py split-mosaic \
    -z 6 \
    -o overview-mosaics \
    --prefix naip_2016_2018_ \
    filled/naip_2016_2018_mosaic.json

This creates about 50 temporary mosaics for each country-wide mosaic, and just over 200 total. Each of these mosaics defines the source input imagery necessary to create one overview COG.

Then we need a loop that will create an overview image for each of the above temporary mosaics. This step uses the overview code in cogeo-mosaic.

I highly recommend to do this step on an EC2 instance in the same region as the NAIP data (us-west-2). Since NAIP data is in a requester pays bucket, if you do this step elsewhere, you'll have to pay egress fees for intermediate imagery taken out of the region. Additionally, this code is not yet memory- optimized, so you may need an instance with a good amount of memory.

# NAIP imagery is in a requester pays bucket
export AWS_REQUEST_PAYER="requester"
# Necessary on an EC2 instance
export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
# Necessary to prevent expensive, unnecessary S3 LIST requests!
export GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
mkdir out && cd out
python ../code/overviews.py create-overviews \
    `# Number of processes to use in parallel` \
    -j 2 \
    `# Input directory of temporary mosaics created above` \
    `# Output dir is current directory` \
    ../overview-mosaics

The output images should be in COG already, but to make sure I'll use rio-cogeo to convert them before final storage on S3. You should use a blocksize and overview-blocksize that matches the size of imagery you use in your website.

mkdir ../cog_deflate/
for file in $(ls *.tif); do
    rio cogeo create \
        --blocksize 256 \
        --overview-blocksize 256 \
        $file ../cog_deflate/$file;
done

Then upload your output images to an S3 bucket of yours, and finally, create new overview mosaics from these images. For full information, see its documentation (code/overviews.py create-overview-mosaic --help). As input you should pass a list of S3 urls to your overview COG files you just created. Don't rename the filenames before passing to the script.