Using Rasterio's new linux wheels #942

Closed
sgillies opened this issue Dec 8, 2016 · 8 comments

@sgillies
Member

sgillies commented Dec 8, 2016

For all of the pre-1.0 releases, we're making binary wheels for Linux that include all the C libraries Rasterio needs. This post explains how to use them.

These wheels are not intended for production use on the internet, but should be perfectly adequate for integration testing of Python software that requires Rasterio. They might even be useful for developing prototype services.

The GDAL library included in these wheels is only lightly provisioned with format drivers. The JPEG2000 driver based on Jasper is the only non-default driver. There are no proprietary drivers.

My example: installing Rasterio wheels on Ubuntu 14.04 and performing the little extra configuration needed to access AWS Public Datasets like Landsat on AWS.


Installation

I'm going to use a container based on the Ubuntu 14.04 image in Docker Hub as a host. It has Python 3 installed, but pip, the program we're going to use to install Rasterio, is not installed. Rather than install the python3-pip apt package (possibly requiring apt-get update) and drag in a mess of other dependencies, let's get pip via wget.

$ apt-get install wget
$ wget https://bootstrap.pypa.io/get-pip.py
$ python3 get-pip.py

Rasterio has a host of extra Python dependencies, thus it's always a good idea to install Rasterio applications in a dedicated environment. Create and activate one with virtualenv.

$ pip install virtualenv
$ virtualenv -p python3 venv
$ source venv/bin/activate

Now install Rasterio into the environment using pip, also requesting the optional "s3" set of extra dependencies (boto3 and more).

(venv)$ pip install --pre "rasterio[s3]>=1.0a4"

This fetches the rasterio-1.0a4-cp34-cp34m-manylinux1_x86_64.whl file from the Python Package Index and extracts it into the environment's site-packages directory. A peek into site-packages reveals the included C libraries.

(venv)$ ls -l venv/lib/python3.4/site-packages/rasterio/.libs/
total 122004
-rwxr-xr-x 1 root root  3659864 Dec  8 09:29 libcurl-96d9b940.so.4.4.0
-rwxr-xr-x 1 root root 94185184 Dec  8 09:29 libgdal-03eecd3b.so.20.1.2
-rwxr-xr-x 1 root root 22032320 Dec  8 09:29 libgeos-3-fc05f4c1.5.0.so
-rwxr-xr-x 1 root root  1499128 Dec  8 09:29 libgeos_c-09576097.so.1.9.0
-rwxr-xr-x 1 root root  1428600 Dec  8 09:29 libjasper-fb9de72f.so.1.0.0
-rwxr-xr-x 1 root root    43712 Dec  8 09:29 libjson-c-ca0558d5.so.2.0.1
-rwxr-xr-x 1 root root  2074320 Dec  8 09:29 libproj-18c59ecd.so.12.0.0

Yes, the libs are big. The wheels are heavy. I'm working on it, I promise.

Start a Python interpreter and import rasterio as a last check.

(venv)$ python
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rasterio
>>> rasterio.__gdal_version__
'2.1.2'

Configuration

Rasterio includes a program named "rio" and its "info" sub-command provides many of the same features as the venerable "gdalinfo" program. Before you can use it to query datasets on S3, you need to do a little extra system configuration.

First, set language and locale environment variables so rio will run properly with Python 3.

(venv)$ export LC_ALL=C.UTF-8
(venv)$ export LANG=C.UTF-8

Next, specify where to find the SSL certs on your host. Rasterio's libcurl, which is built on CentOS, expects /etc/pki/tls/certs/ca-bundle.crt. Ubuntu's are in a different location.

(venv)$ export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
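
If the same code has to run on more than one distribution, the bundle location can be probed at start-up instead of being hard-coded. Here is a minimal sketch, assuming only the two paths mentioned above. The helper names find_ca_bundle and configure_curl_ca_bundle are hypothetical and not part of Rasterio. The variable has to be set before GDAL makes its first HTTPS request.

```python
import os

# Candidate locations: Debian/Ubuntu first, then the CentOS path
# that the wheel's libcurl expects by default.
CANDIDATES = (
    "/etc/ssl/certs/ca-certificates.crt",
    "/etc/pki/tls/certs/ca-bundle.crt",
)


def find_ca_bundle(candidates=CANDIDATES):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_curl_ca_bundle(environ=os.environ, candidates=CANDIDATES):
    """Set CURL_CA_BUNDLE unless it is already set; return the value in effect."""
    if "CURL_CA_BUNDLE" not in environ:
        found = find_ca_bundle(candidates)
        if found is not None:
            environ["CURL_CA_BUNDLE"] = found
    return environ.get("CURL_CA_BUNDLE")
```

Calling configure_curl_ca_bundle() before importing rasterio leaves an explicit user setting untouched and only fills in the variable when it is missing.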

Finally, set up AWS credentials. Rasterio uses boto3 to deal with credentials and these can be configured following the directions in the AWS CLI guide.

(venv)$ mkdir ~/.aws
(venv)$ cat << EOF > ~/.aws/credentials
> [default]
> aws_access_key_id = AWS_ACCESS_KEY_ID
> aws_secret_access_key = AWS_SECRET_ACCESS_KEY
> EOF

Running rio-info

Pass rio info an s3:// object identifier, the same kind you would use with the AWS CLI, along with the --indent 2 option to get pretty-printed JSON.

(venv)$ rio info --indent 2 s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
{
  "blockxsize": 512,
  "blockysize": 512,
  "bounds": [
    381885.0,
    2279085.0,
    610515.0,
    2512815.0
  ],
  "colorinterp": [
    "grey"
  ],
  "compress": "deflate",
  "count": 1,
  "crs": "EPSG:32645",
  "descriptions": [
    null
  ],
  "driver": "GTiff",
  "dtype": "uint16",
  "height": 7791,
  "indexes": [
    1
  ],
  "interleave": "band",
  "lnglat": [
    86.96327090815723,
    21.666821827007773
  ],
  "mask_flags": [
    [
      "all_valid"
    ]
  ],
  "nodata": null,
  "res": [
    30.0,
    30.0
  ],
  "shape": [
    7791,
    7621
  ],
  "tiled": true,
  "transform": [
    30.0,
    0.0,
    381885.0,
    0.0,
    -30.0,
    2512815.0,
    0.0,
    0.0,
    1.0
  ],
  "units": [
    null
  ],
  "width": 7621
}
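
The "bounds" in this output follow directly from "transform", "width", and "height". As a sanity check, here is a minimal sketch in plain Python (no Rasterio needed), assuming a north-up raster with no rotation terms and the row-major 3 x 3 ordering shown in the JSON above. The function name bounds_from_transform is made up for illustration.

```python
def bounds_from_transform(transform, width, height):
    """Return (left, bottom, right, top) for a north-up raster.

    `transform` is the nine-element, row-major affine matrix printed by
    rio info: [a, b, c, d, e, f, 0, 0, 1], where
    x = a*col + b*row + c and y = d*col + e*row + f.
    """
    a, b, c, d, e, f = transform[:6]
    if b != 0 or d != 0:
        raise ValueError("rotated rasters are not handled by this sketch")
    left, top = c, f
    right = c + a * width      # x of the right edge of the last column
    bottom = f + e * height    # e is negative for north-up rasters
    return (left, bottom, right, top)


transform = [30.0, 0.0, 381885.0, 0.0, -30.0, 2512815.0, 0.0, 0.0, 1.0]
print(bounds_from_transform(transform, width=7621, height=7791))
# (381885.0, 2279085.0, 610515.0, 2512815.0)
```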

Efficient metadata queries

Access to S3 GeoTIFF metadata is very efficient. Thanks to GDAL's support for HTTP range requests, Rasterio only needs to download about 0.03% of the dataset's bytes (16,384 of 51,099,231) in order to query its metadata. Turn up the verbosity of rio-info and ask for extra curl logging to see the individual HTTP requests.

(venv)$ CPL_CURL_VERBOSE=1 rio -vv info s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF 2>&1 > /dev/null | grep '< '
< HTTP/1.1 400 Bad Request
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Thu, 08 Dec 2016 09:53:18 GMT
< Connection: close
< Server: AmazonS3
<
< HTTP/1.1 200 OK
< Date: Thu, 08 Dec 2016 09:53:21 GMT
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Server: AmazonS3
<
< HTTP/1.1 206 Partial Content
< Date: Thu, 08 Dec 2016 09:53:21 GMT
< Last-Modified: Sat, 14 Mar 2015 23:20:01 GMT
< ETag: "f08bdf1e626bf0039746c102fbd2c2b8"
< Accept-Ranges: bytes
< Content-Range: bytes 0-16383/51099231
< Content-Type: image/tiff
< Content-Length: 16384
< Server: AmazonS3
<

The HTTP/1.1 400 Bad Request is in response to probing of the object's folder that GDAL does by default. In a future version of GDAL the probing can be disabled.
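
The 0.03% figure can be checked against the Content-Range header in the 206 response. Here is a minimal sketch, assuming a single-range header of the form "bytes first-last/total". The helper fraction_fetched is hypothetical.

```python
import re

_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+)$")


def fraction_fetched(content_range):
    """Return bytes_in_range / total_size for a 'bytes first-last/total' value."""
    match = _CONTENT_RANGE.match(content_range.strip())
    if match is None:
        raise ValueError("unsupported Content-Range: %r" % content_range)
    first, last, total = (int(g) for g in match.groups())
    # Byte ranges are inclusive on both ends.
    return (last - first + 1) / total


print("{:.2%}".format(fraction_fetched("bytes 0-16383/51099231")))
# 0.03%
```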

Efficient partial data queries

Because the Landsat GeoTIFFs are tiled, subsets of them can be queried for a fraction of the cost of downloading the entire dataset. I'm going to use Rasterio's dataset inspector, rio-insp, to demonstrate. Knowing that the GeoTIFF is tiled and that the tiles are 512 x 512 pixels, I'm going to request a subset corresponding to a single tile in the middle of the raster.

(venv)$ CPL_CURL_VERBOSE=1 rio -vv insp s3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF
Rasterio 1.0a4 Interactive Inspector (Python 3.4.3)
Type "src.meta", "src.read(1)", or "help(src)" for more information.
>>> from rasterio.windows import Window
>>> src.read(window=Window(2048, 2048, 512, 512))

Here are the request details printed to stderr:

> GET /L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF HTTP/1.1
Host: landsat-pds.s3.amazonaws.com
Range: bytes=12189696-12550143
Accept: */*

< HTTP/1.1 206 Partial Content
< Date: Fri, 09 Dec 2016 10:09:59 GMT
< Last-Modified: Sat, 14 Mar 2015 23:20:01 GMT
< ETag: "f08bdf1e626bf0039746c102fbd2c2b8"
< Accept-Ranges: bytes
< Content-Range: bytes 12189696-12550143/51099231
< Content-Type: image/tiff
< Content-Length: 360448
< Server: AmazonS3
<

And here is the abbreviated representation of the 512 x 512 array in the Python console:

array([[[10311, 10249, 10306, ..., 10736, 10637, 10468],
        [10320, 10262, 10231, ..., 10834, 10682, 10461],
        [10225, 10287, 10305, ..., 10742, 10660, 10516],
        ...,
        [10055, 10072, 10042, ..., 10509, 10555, 10548],
        [10034, 10055, 10042, ..., 10566, 10529, 10563],
        [10005,  9996, 10030, ..., 10592, 10549, 10551]]], dtype=uint16)

Only about 0.7% of the dataset's bytes (360,448 of 51,099,231) have to be read in order to get that subset. If I ask for the tile in the upper left corner, which happens to be all zeros and has been compressed to nearly nothing, there's no additional HTTP request: all the data for that tile was already picked up in the initial 16 KB request and cached by GDAL.

>>> src.read(window=Window(0, 0, 512, 512))
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint16)
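
Both windows above line up with the 512 x 512 block grid, which is why each read touches a single tile. Here is a minimal sketch of how to list the blocks a window overlaps, in plain Python so it runs without Rasterio. The helper blocks_for_window is hypothetical, and its arguments follow the (col_off, row_off, width, height) order of Window.

```python
def blocks_for_window(col_off, row_off, width, height,
                      blockxsize=512, blockysize=512):
    """Return the (block_row, block_col) indexes that a pixel window overlaps."""
    if width <= 0 or height <= 0:
        return []
    first_col = col_off // blockxsize
    last_col = (col_off + width - 1) // blockxsize
    first_row = row_off // blockysize
    last_row = (row_off + height - 1) // blockysize
    return [
        (row, col)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


print(blocks_for_window(2048, 2048, 512, 512))  # aligned window
# [(4, 4)]
print(blocks_for_window(2000, 2000, 512, 512))  # unaligned window
# [(3, 3), (3, 4), (4, 3), (4, 4)]
```

A window that straddles block boundaries generally means every overlapped block has to be fetched and decoded, so block-aligned windows should be the cheapest to read over HTTP.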

That's it for examples in this post. There's more spacewalking to be done with other datasets and other formats. I'll leave that up to you.


See also

The manylinux project is the one that we're closely following to learn how to build these wheels.

The wheel building infrastructure is here: https://github.com/sgillies/frs-wheel-builds.

Feedback is very welcome

Are these useful to you? Can they be more useful with a modest amount of effort? Please let us know.

Thanks for reading!

@sgillies sgillies added the devlog label Dec 8, 2016
@sgillies
Member Author

sgillies commented Dec 9, 2016

I made edits to the post a few minutes ago. A reader pointed out to me in an email that Rasterio shouldn't be making so many requests for a tile. Indeed: I'd misused Rasterio's Window() constructor, and after fixing my usage I find that partial data access is even more efficient than I'd initially reported and correct in comparison to gdal_translate results.

@MelnykAndriy

I need similar functionality but with google cloud. Is it possible?

@sgillies
Member Author

sgillies commented Oct 6, 2017

@svHatch

svHatch commented Dec 12, 2017

Is there a possibility of using "/vsizip/" as well as S3 to query metadata from a large zip compressed geotiff on S3?

@geowurster
Contributor

@svHatch At the very least you should be able to use a GDAL connection string like /vsizip/vsis3/bucket/path/to/file right now, but in the future I think you will be able to do: zip+s3://bucket/path/to/file (#1190).

@yellowcap

I have tried to use this with Sentinel-2 images from the sentinel-s2-l1c AWS Public Dataset bucket. The Sentinel images are stored in the JPEG2000 format and are internally tiled in blocks of 1024x1024 pixels. The windowed partial data query works fine, as long as I only request data from "within" one internal tile. If I request a block that spans multiple tiles, the routine gives an error I cannot interpret.

Using

(venv)$ CPL_CURL_VERBOSE=1 rio -vv insp s3://sentinel-s2-l1c/tiles/29/S/ND/2017/11/16/0/B03.jp2

The following works, but seems to be less efficient, as it does a lot more requests than in the TIF file example above.

>>> from rasterio.windows import Window
>>> src.read(window=Window(1024, 1024, 512, 512))

... (lots of output)

array([[[ 817,  779,  940, ...,  781,  669,  720],
        [ 811,  797,  966, ...,  930,  695,  707],
        [ 859,  894,  971, ..., 1161,  927,  806],
        ...,
        [ 759,  772,  763, ...,  886,  844,  728],
        [ 751,  747,  725, ...,  847,  825,  745],
        [ 723,  678,  683, ..., 1022,  938,  806]]], dtype=uint16)

The following fails

>>> src.read(window=Window(1000, 1000, 512, 512))
DEBUG:rasterio._io:Output nodata value read from file: None
DEBUG:rasterio._io:Output nodata values: [None]
DEBUG:rasterio._io:Jump straight to _read()
DEBUG:rasterio._io:Window: Window(col_off=1000, row_off=1000, width=512, height=512)
DEBUG:rasterio._io:IO window xoff=1000.0 yoff=1000.0 width=512.0 height=512.0
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "rasterio/_io.pyx", line 330, in rasterio._io.DatasetReaderBase.read
  File "rasterio/_io.pyx", line 591, in rasterio._io.DatasetReaderBase._read
OSError: Read or write failed

So I guess I have two questions: why is the partial querying doing many more requests than in the TIF example, and why am I getting the above errors?

My GDAL version

>>> import rasterio
>>> rasterio.__gdal_version__
'2.2.2'

@sgillies
Member Author

@yellowcap I have also noticed poor performance with the same JP2 files. I think they're not optimized for remote access with GDAL like the Landsat PDS GeoTIFFs are.

Can you make a new ticket for the OSError issue above? That looks like a bug to me.

@yellowcap

Thanks for the info regarding performance, @sgillies. Is there any chance that Sentinel-2 data access can be optimized in the future through a software update, without changes to the files? Or is the problem inherent to the files, so that it cannot be worked around? I've opened a separate ticket for the error, as requested.
