Using Rasterio's new linux wheels #942

sgillies · 2016-12-08T11:43:18Z

For all of the pre-1.0 releases, we're making binary wheels for Linux that include all the C libraries Rasterio needs. This is a post about how to use them.

These wheels are not intended for production use on the internet, but should be perfectly adequate for integration testing of Python software that requires Rasterio. They might even be useful for developing prototype services.

The GDAL library included in these wheels is only lightly provisioned with format drivers. The JPEG2000 driver based on Jasper is the only non-default driver. There are no proprietary drivers.

My example: installing Rasterio wheels on Ubuntu 14.04 and performing the little extra configuration needed to access AWS Public Datasets like Landsat on AWS.

Installation

I'm going to use a container based on the Ubuntu 14.04 image in Docker Hub as a host. It has Python 3 installed, but pip, the program we're going to use to install Rasterio, is not installed. Rather than install the python3-pip apt package (possibly requiring apt-get update) and drag in a mess of other dependencies, let's get pip via wget.

$ apt-get install wget
$ wget https://bootstrap.pypa.io/get-pip.py
$ python3 get-pip.py

Rasterio has a host of extra Python dependencies, thus it's always a good idea to install Rasterio applications in a dedicated environment. Create and activate one with virtualenv.

$ pip install virtualenv
$ virtualenv -p python3 venv
$ source venv/bin/activate

Now install Rasterio into the environment using pip, also requesting the optional "s3" set of extra dependencies (boto3 and more).

(venv)$ pip install --pre "rasterio[s3]>=1.0a4"

This fetches the rasterio-1.0a4-cp34-cp34m-manylinux1_x86_64.whl file from the Python Package Index and extracts it into the environment's site-packages directory. A peek into site-packages reveals the included C libraries.

(venv)$ ls -l venv/lib/python3.4/site-packages/rasterio/.libs/
total 122004
-rwxr-xr-x 1 root root  3659864 Dec  8 09:29 libcurl-96d9b940.so.4.4.0
-rwxr-xr-x 1 root root 94185184 Dec  8 09:29 libgdal-03eecd3b.so.20.1.2
-rwxr-xr-x 1 root root 22032320 Dec  8 09:29 libgeos-3-fc05f4c1.5.0.so
-rwxr-xr-x 1 root root  1499128 Dec  8 09:29 libgeos_c-09576097.so.1.9.0
-rwxr-xr-x 1 root root  1428600 Dec  8 09:29 libjasper-fb9de72f.so.1.0.0
-rwxr-xr-x 1 root root    43712 Dec  8 09:29 libjson-c-ca0558d5.so.2.0.1
-rwxr-xr-x 1 root root  2074320 Dec  8 09:29 libproj-18c59ecd.so.12.0.0

Yes, the libs are big. The wheels are heavy. I'm working on it, I promise.

Start a Python interpreter and import rasterio as a last check.

(venv)$ python
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rasterio
>>> rasterio.__gdal_version__
'2.1.2'

Configuration

Rasterio includes a program named "rio" and its "info" sub-command provides many of the same features as the venerable "gdalinfo" program. Before you can use it to query datasets on S3, you need to do a little extra system configuration.

First, set language and locale environment variables so rio will run properly with Python 3.

(venv)$ export LC_ALL=C.UTF-8
(venv)$ export LANG=C.UTF-8

Next, specify where to find the SSL certs on your host. Rasterio's libcurl, which is built on CentOS, expects /etc/pki/tls/certs/ca-bundle.crt. Ubuntu's are in a different location.

(venv)$ export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
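
If the same code has to run on both Ubuntu-style and CentOS-style hosts, the bundle location can be probed from Python instead of being hard-coded. This is only a sketch and not part of Rasterio: the helper name find_ca_bundle is made up for illustration, and the candidate list holds just the two paths mentioned above.

```python
import os

# Candidate CA bundle locations: the Ubuntu path and the CentOS path
# mentioned in this post. Extend the tuple for other distributions.
CANDIDATE_CA_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",  # Debian / Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",    # CentOS / RHEL
)


def find_ca_bundle(candidates=CANDIDATE_CA_BUNDLES):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    bundle = find_ca_bundle()
    if bundle is not None:
        # Keep any value the user has already exported.
        os.environ.setdefault("CURL_CA_BUNDLE", bundle)
    print(os.environ.get("CURL_CA_BUNDLE"))
```

I'm assuming here that the variable is in the process environment before the first HTTPS request is made. Exporting it in the shell, as above, remains the simplest route.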

Finally, set up AWS credentials. Rasterio uses boto3 to deal with credentials and these can be configured following the directions in the AWS CLI guide.

(venv)$ mkdir ~/.aws
(venv)$ cat << EOF > ~/.aws/credentials
> [default]
> aws_access_key_id = AWS_ACCESS_KEY_ID
> aws_secret_access_key = AWS_SECRET_ACCESS_KEY
> EOF
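
A quick way to confirm that the file parses and that the default profile carries both keys is to read it back with Python's configparser. This is a minimal sketch: missing_credential_keys is a hypothetical helper, not a boto3 or Rasterio function, and it checks only that the keys are present, not that they are valid.

```python
import configparser
import os


def missing_credential_keys(path, profile="default"):
    """Return the required keys that are absent from the given profile."""
    required = ("aws_access_key_id", "aws_secret_access_key")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently treated as empty
    if not parser.has_section(profile):
        return list(required)
    return [key for key in required if not parser.has_option(profile, key)]


if __name__ == "__main__":
    credentials_path = os.path.expanduser("~/.aws/credentials")
    # An empty list means both keys were found in the [default] profile.
    print(missing_credential_keys(credentials_path))
```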

Running rio-info

Pass an s3-prefixed object identifier, the same kind you would use with the AWS CLI, to rio info along with the --indent 2 option to get pretty-printed JSON.

(venv)$ rio info --indent 2 s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF

{
  "blockxsize": 512,
  "blockysize": 512,
  "bounds": [
    381885.0,
    2279085.0,
    610515.0,
    2512815.0
  ],
  "colorinterp": [
    "grey"
  ],
  "compress": "deflate",
  "count": 1,
  "crs": "EPSG:32645",
  "descriptions": [
    null
  ],
  "driver": "GTiff",
  "dtype": "uint16",
  "height": 7791,
  "indexes": [
    1
  ],
  "interleave": "band",
  "lnglat": [
    86.96327090815723,
    21.666821827007773
  ],
  "mask_flags": [
    [
      "all_valid"
    ]
  ],
  "nodata": null,
  "res": [
    30.0,
    30.0
  ],
  "shape": [
    7791,
    7621
  ],
  "tiled": true,
  "transform": [
    30.0,
    0.0,
    381885.0,
    0.0,
    -30.0,
    2512815.0,
    0.0,
    0.0,
    1.0
  ],
  "units": [
    null
  ],
  "width": 7621
}
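
The fields in this output are related to one another: for a north-up raster, the bounds follow from the transform, width, and height. The sketch below checks that against a subset of the JSON above using only the standard library. The function name bounds_from_transform is my own, and the code assumes an unrotated raster.

```python
import json

# A subset of the `rio info` output shown above.
INFO_JSON = """
{
  "bounds": [381885.0, 2279085.0, 610515.0, 2512815.0],
  "height": 7791,
  "transform": [30.0, 0.0, 381885.0, 0.0, -30.0, 2512815.0, 0.0, 0.0, 1.0],
  "width": 7621
}
"""


def bounds_from_transform(transform, width, height):
    """Compute (left, bottom, right, top) for a north-up raster.

    `transform` is the row-major affine matrix (a, b, c, d, e, f, ...)
    that maps (col, row) to (x, y). The rotation terms b and d are
    assumed to be zero, as they are in the output above.
    """
    a, b, c, d, e, f = transform[:6]
    if b != 0 or d != 0:
        raise ValueError("rotated rasters are not handled by this sketch")
    left, top = c, f
    right = c + a * width
    bottom = f + e * height
    return (left, bottom, right, top)


info = json.loads(INFO_JSON)
computed = bounds_from_transform(info["transform"], info["width"], info["height"])
print(computed == tuple(info["bounds"]))  # True
```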

Efficient metadata queries

Access to S3 GeoTIFF metadata is very efficient. Thanks to GDAL's support for HTTP range requests, Rasterio only needs to download 0.03% of the dataset's bytes in order to query its metadata. Turn up the verbosity of rio-info and ask for extra curl logging to see the individual HTTP requests.

(venv)$ CPL_CURL_VERBOSE=1 rio -vv info s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF 2>&1 > /dev/null | grep '< '
< HTTP/1.1 400 Bad Request
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Thu, 08 Dec 2016 09:53:18 GMT
< Connection: close
< Server: AmazonS3
<
< HTTP/1.1 200 OK
< Date: Thu, 08 Dec 2016 09:53:21 GMT
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Server: AmazonS3
<
< HTTP/1.1 206 Partial Content
< Date: Thu, 08 Dec 2016 09:53:21 GMT
< Last-Modified: Sat, 14 Mar 2015 23:20:01 GMT
< ETag: "f08bdf1e626bf0039746c102fbd2c2b8"
< Accept-Ranges: bytes
< Content-Range: bytes 0-16383/51099231
< Content-Type: image/tiff
< Content-Length: 16384
< Server: AmazonS3
<

The HTTP/1.1 400 Bad Request is in response to probing of the object's folder that GDAL does by default. In a future version of GDAL the probing can be disabled.
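
The 0.03% figure comes straight from the Content-Range header of the 206 response. Here is a small sketch of that arithmetic. The two function names are made up for this example, and only the simple "bytes FIRST-LAST/TOTAL" form seen in the log is handled.

```python
import re


def parse_content_range(value):
    """Parse 'bytes FIRST-LAST/TOTAL' into (first, last, total) integers."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip())
    if match is None:
        raise ValueError("unsupported Content-Range value: %r" % value)
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


def fraction_transferred(value):
    """Return the share of the object covered by one ranged response."""
    first, last, total = parse_content_range(value)
    # Byte ranges are inclusive at both ends, hence the + 1.
    return (last - first + 1) / total


share = fraction_transferred("bytes 0-16383/51099231")
print("{:.2%}".format(share))  # 0.03%
```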

Efficient partial data queries

Because the Landsat GeoTIFFs are tiled, subsets of them can be queried for a fraction of the cost of downloading the entire dataset. I'm going to use Rasterio's dataset inspector, rio-insp, to demonstrate. Knowing that the GeoTIFF is tiled and that the tiles are 512 x 512 pixels, I'm going to request a subset corresponding to a single tile in the middle of the raster.

(venv)$ CPL_CURL_VERBOSE=1 rio -vv insp s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
Rasterio 1.0a4 Interactive Inspector (Python 3.4.3)
Type "src.meta", "src.read(1)", or "help(src)" for more information.
>>> from rasterio.windows import Window
>>> src.read(window=Window(2048, 2048, 512, 512))

Here are the request details printed to stderr:

> GET /L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF HTTP/1.1
Host: landsat-pds.s3.amazonaws.com
Range: bytes=12189696-12550143
Accept: */*

< HTTP/1.1 206 Partial Content
< Date: Fri, 09 Dec 2016 10:09:59 GMT
< Last-Modified: Sat, 14 Mar 2015 23:20:01 GMT
< ETag: "f08bdf1e626bf0039746c102fbd2c2b8"
< Accept-Ranges: bytes
< Content-Range: bytes 12189696-12550143/51099231
< Content-Type: image/tiff
< Content-Length: 360448
< Server: AmazonS3
<

And here is the abbreviated representation of the 512 x 512 array in the Python console:

array([[[10311, 10249, 10306, ..., 10736, 10637, 10468],
        [10320, 10262, 10231, ..., 10834, 10682, 10461],
        [10225, 10287, 10305, ..., 10742, 10660, 10516],
        ...,
        [10055, 10072, 10042, ..., 10509, 10555, 10548],
        [10034, 10055, 10042, ..., 10566, 10529, 10563],
        [10005,  9996, 10030, ..., 10592, 10549, 10551]]], dtype=uint16)

Only about 0.7% of the dataset's bytes have to be read in order to get that subset. If I ask for the tile in the upper left corner, which happens to be all zeros and has been compressed to nearly nothing, there's no additional HTTP request: all the data for that tile was already picked up in the initial 16 KB request and cached by GDAL.

>>> src.read(window=Window(0, 0, 512, 512))
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint16)
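
Both windows above were chosen to line up with the 512 x 512 tile grid. The sketch below shows the arithmetic for working out which tiles a window overlaps. The function blocks_touched is hypothetical, not a Rasterio API. It says nothing about how GDAL actually schedules or merges its HTTP requests, only how many tiles would have to be fetched and decoded.

```python
def blocks_touched(col_off, row_off, width, height, block_size=512):
    """List (block_row, block_col) indexes of the tiles a window overlaps.

    Assumes square tiles of `block_size` pixels anchored at the raster's
    upper left corner and a window with positive width and height.
    """
    first_col = col_off // block_size
    last_col = (col_off + width - 1) // block_size
    first_row = row_off // block_size
    last_row = (row_off + height - 1) // block_size
    return [
        (row, col)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# The two windows read above each coincide with exactly one tile.
print(blocks_touched(2048, 2048, 512, 512))  # [(4, 4)]
print(blocks_touched(0, 0, 512, 512))        # [(0, 0)]

# Shifting the window by a single pixel makes it overlap four tiles.
print(len(blocks_touched(2049, 2049, 512, 512)))  # 4
```

A window that is not aligned to the grid still reads correctly, but more tiles have to be transferred to satisfy it.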

That's it for examples in this post. There's more spacewalking to be done with other datasets and other formats. I'll leave that up to you.

Feedback is very welcome

Are these useful to you? Can they be more useful with a modest amount of effort? Please let us know.

Thanks for reading!


sgillies · 2016-12-09T10:33:27Z

I made edits to the post a few minutes ago. A reader pointed out to me in an email that Rasterio shouldn't be making so many requests for a tile. Indeed: I'd misused Rasterio's Window() constructor, and after fixing my usage I find that partial data access is even more efficient than I'd initially reported and correct in comparison to gdal_translate results.

MelnykAndriy · 2017-10-06T15:03:19Z

I need similar functionality but with google cloud. Is it possible?

sgillies · 2017-10-06T18:04:39Z

@MelnykAndriy search the repo for "google cloud": https://github.com/mapbox/rasterio/search?q=google+cloud&type=Issues&utf8=%E2%9C%93.

svHatch · 2017-12-12T21:49:22Z

Is there a possibility of using "/vsizip/" as well as S3 to query metadata from a large zip compressed geotiff on S3?

geowurster · 2017-12-12T22:10:09Z

@svHatch At the very least you should be able to use a GDAL connection string like /vsizip/vsis3/bucket/path/to/file right now, but in the future I think you will be able to do: zip+s3://bucket/path/to/file (#1190).

yellowcap · 2018-01-17T16:52:46Z

I have tried to use this with Sentinel-2 images from the sentinel-s2-l1c AWS Public Dataset bucket. The Sentinel images are stored in the JPEG2000 format and internally tiled in blocks of 1024x1024 pixels. The windowed partial data query works fine as long as I only request data from "within" one internal tile. If I request a block that spans multiple tiles, the routine gives an error I cannot interpret.

Using

(venv)$ CPL_CURL_VERBOSE=1 rio -vv insp s3://sentinel-s2-l1c/tiles/29/S/ND/2017/11/16/0/B03.jp2

The following works, but seems to be less efficient, as it makes many more requests than in the TIF file example above.

>>> from rasterio.windows import Window
>>> src.read(window=Window(1024, 1024, 512, 512))

... (lots of output)

array([[[ 817,  779,  940, ...,  781,  669,  720],
        [ 811,  797,  966, ...,  930,  695,  707],
        [ 859,  894,  971, ..., 1161,  927,  806],
        ...,
        [ 759,  772,  763, ...,  886,  844,  728],
        [ 751,  747,  725, ...,  847,  825,  745],
        [ 723,  678,  683, ..., 1022,  938,  806]]], dtype=uint16)

The following fails

>>> src.read(window=Window(1000, 1000, 512, 512))
DEBUG:rasterio._io:Output nodata value read from file: None
DEBUG:rasterio._io:Output nodata values: [None]
DEBUG:rasterio._io:Jump straight to _read()
DEBUG:rasterio._io:Window: Window(col_off=1000, row_off=1000, width=512, height=512)
DEBUG:rasterio._io:IO window xoff=1000.0 yoff=1000.0 width=512.0 height=512.0
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "rasterio/_io.pyx", line 330, in rasterio._io.DatasetReaderBase.read
  File "rasterio/_io.pyx", line 591, in rasterio._io.DatasetReaderBase._read
OSError: Read or write failed

So I guess I have two questions: why is the partial querying doing many more requests than in the TIF example, and why am I getting the above errors?

My GDAL version

>>> import rasterio
>>> rasterio.__gdal_version__
'2.2.2'

sgillies · 2018-01-17T17:19:54Z

@yellowcap I have also noticed poor performance with the same JP2 files. I think they're not optimized for remote access with GDAL like the Landsat PDS GeoTIFFs are.

Can you make a new ticket for the OSError issue above? That looks like a bug to me.

yellowcap · 2018-01-17T17:38:13Z

Thanks for the info @sgillies regarding performance. Any chance the Sentinel-2 data access can be optimized in the future through a software update without changes in the files? Or is that related to the files and can not be worked around? Opened separate ticket for error as requested.

sgillies added the devlog label Dec 8, 2016

sgillies mentioned this issue Dec 9, 2016

Errors opening TIF files over S3 with rasterio #938

Closed

sgillies mentioned this issue Feb 3, 2017

Accessing datasets located in buffers using MemoryFile and ZipMemoryFile #977

Open

sgillies added this to the 1.0 milestone Feb 13, 2017

jbeezley mentioned this issue Mar 15, 2017

Support reading raster data from s3 and http OpenGeoscience/geonotebook#108

Open

yellowcap mentioned this issue Jan 17, 2018

Error when reading pixel block from internally tiled JPEG2000 file when reading over tile boundaries. #1247

Closed

sgillies mentioned this issue Mar 7, 2018

CURL error when trying to read raster from S3 bucket #1289

Closed

sgillies closed this as completed in b621d92 Jun 20, 2018

sgillies reopened this Jun 20, 2018

sgillies removed this from the 1.0 milestone Jun 21, 2018

sgillies mentioned this issue Jul 29, 2019

Fails to read GTiff inside remote TAR #1736

Closed

kylebarron mentioned this issue Mar 2, 2020

Create mosaic from S3 requester-pays tiffs developmentseed/cogeo-mosaic#35

Closed

guidorice pushed a commit to developmentseed/covid-wb-api that referenced this issue Jul 6, 2020

Add CURL_CA_BUNDLE for rasterio.

c343dc0

- rasterio on debian/ubuntu requires this curl config rasterio/rasterio#942

rbavery mentioned this issue Jan 8, 2022

remove gdal dependency and rely on rasterio+rasterio’s gdal binaries to simplify the code base and make installation easier CosmiQ/solaris#439

Open

rasterio locked as resolved and limited conversation to collaborators Jul 7, 2022

snowman2 closed this as completed Mar 16, 2023
