diff --git a/book/tutorials/cloud-computing/01-cloud-computing.ipynb b/book/tutorials/cloud-computing/01-cloud-computing.ipynb
index 43ed7d4..196897f 100644
--- a/book/tutorials/cloud-computing/01-cloud-computing.ipynb
+++ b/book/tutorials/cloud-computing/01-cloud-computing.ipynb
@@ -5,11 +5,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# What is cloud computing?\n",
+ "# ☁️ What is cloud computing?\n",
"\n",
"
\n",
"\n",
- "**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n",
+ "**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers (CSPs) such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n",
"\n",
"```{image} ./images/AWS_OurDataCenters_Background.jpg\n",
":width: 600px\n",
@@ -20,13 +20,13 @@
"\n",
">Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider like Amazon Web Services (AWS). ([source](https://aws.amazon.com/what-is-cloud-computing/))\n",
"\n",
- "This tutorial will focus on AWS services and terminology, but Google Cloud and Microsoft Azure offer the same services.\n",
+ "This tutorial will focus on AWS services and terminology, as AWS is the cloud service provider used for NASA's Earthdata Cloud (more on that later). But Google Cloud and Microsoft Azure offer the same services. If you're interested, you can read more about the history of AWS on Wikipedia.\n",
"\n",
":::{dropdown} 🏋️ Exercise: How many CPUs and how much memory does your laptop have? And how does that compare with CryoCloud?\n",
":open:\n",
"If you have your laptop available, open the terminal app and use the appropriate commands to determine CPU and memory.\n",
"\n",
- "
A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.
\n", + "A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. It is also crucial to **lazy loading** (see glossary belo). Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.
\n", "\n", "```{image} ./images/hdf5-structure-1.jpg\n", ":width: 600px\n", @@ -52,7 +50,7 @@ "\n", "\n", "\n", - "## When optimizing for the cloud, what structure should be used?\n", + "## How should we structure files for the cloud?\n", "\n", "### A \"moving away from home\" analogy\n", "\n", @@ -66,20 +64,28 @@ "\n", "You can actually make any common geospatial data formats (HDF5/NetCDF, GeoTIFF, LAS (LIDAR Aerial Survey)) \"cloud-optimized\" by:\n", "\n", - "1. Separate metadata from data and store it contiguously so it can be read with one request.\n", + "1. Separate metadata from data and store metadata contiguously so it can be read with one request.\n", "2. Store data in chunks, so the whole file doesn't have to be read to access a portion of the data, and it can be compressed.\n", "3. Make sure the chunks of data are not too small, so more data is fetched with each request.\n", "4. Make sure the chunks are not too large, which means more data has to be transferred and decompression takes longer.\n", "5. Compress these chunks so there is less data to transfer over the network.\n", "\n", - ":::{note} Lazy loading\n", "\n", - "**Separating metadata from data supports lazy loading, which is key to working quickly when data is in the cloud.** Libraries, such as xarray, first read the metadata. They defer actually reading data until it's needed for analysis. When a computation of the data is called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the chunks required. This is also called \"lazy loading\" data. See also [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", + "## Glossary\n", + "\n", + "### latency\n", + "The time between when data is sent to when it is received. [Read more](https://aws.amazon.com/what-is/latency/).\n", + "\n", + "### throughput\n", + "The amount of data that can be transferred over a given time. 
[Read more](https://en.wikipedia.org/wiki/Network\\_throughput).\n", + "\n", + "### lazy loading\n", + "\n", + "🛋️ 🥔 Lazy loading is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", "\n", - ":::\n", "\n", "\n", - ":::{attention} Opening Arguments\n", + ":::{admonition} Opening Arguments\n", "A few arguments used to open the dataset also make a huge difference, namely with how libraries, such as s3fs and h5py, cache chunks.\n", "\n", "For s3fs, use [`cache_type` and `block_size`](https://s3fs.readthedocs.io/en/latest/api.html?highlight=cache_type#s3fs.core.S3File).\n", diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index 995c139..37dcc71 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -10,15 +10,14 @@ "tags": [] }, "source": [ - "# Cloud-Optimized ICESat-2\n", + "# 🧊 Cloud-Optimized ICESat-2\n", "\n", "## Cloud-Optimized vs Cloud-Native\n", "\n", "Recall from [03-cloud-optimized-data-access.ipynb](./03-cloud-optimized-data-access.ipynb) that we can make any HDF5 file cloud-optimized by restructuring the file so that all the metadata is in one place and chunks are \"not too big\" and \"not too small\". 
However, as users of the data, not archivers, we don't control how the file is generated and distributed, so if we're restructuring the data we might want to go with something even better - a **\"cloud-native\"** format.\n", "\n", - ":::{important} Cloud-Native Formats\n", - "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access across multiple files. In other words, it is fast to open a large dataset and access just the parts of it that you need.\n", - ":::\n", + "### Cloud-Native Formats\n", + "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access. Data and metadata are not always stored in the same object or file in order to maximize the amount of data that can be lazily loaded and queried. Some examples of cloud-native formats are [Zarr](https://zarr.dev/) and GeoParquet, which is discussed below.\n", "\n", ":::{warning}\n", "Generating cloud-native formats is non-trivial.\n", @@ -31,11 +30,14 @@ "\n", "## Geoparquet\n", "\n", - "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", + ">Apache Parquet is a powerful column-oriented data format, built from the ground up to as a modern alternative to CSV files. 
GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.\n", + "\n", + "From [https://geoparquet.org/](https://geoparquet.org/)\n", + "\n", + "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for a subset of the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", "\n", - ":::{seealso} Resource on Geoparquet\n", + ":::{seealso} Resources on Geoparquet\n", "* https://guide.cloudnativegeo.org/geoparquet/\n", - "* https://geoparquet.org/\n", ":::\n", "\n", "## Demo" @@ -118,6 +120,15 @@ "m.set_view_state(zoom=2)\n", "m" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The End!\n", + "\n", + "What did you think? Have more questions? Come find me in slack (Aimee Barciauskas) or by email at aimee@ds.io." + ] } ], "metadata": { diff --git a/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png b/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png new file mode 100644 index 0000000..91a0229 Binary files /dev/null and b/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png differ
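To make the column-oriented idea behind (Geo)Parquet concrete, here is a toy, pure-Python sketch (illustrative only — real Parquet adds typed encodings, compression, and row groups, and the field names and values below are made-up samples, not actual ATL08 data):

```python
# Row-oriented records, like lines in a CSV file: reading one variable
# still means scanning every row.
rows = [
    {"lat": 61.2, "lon": -149.9, "h_canopy": 12.5},
    {"lat": 61.3, "lon": -149.8, "h_canopy": 9.1},
    {"lat": 61.4, "lon": -149.7, "h_canopy": 15.0},
]

# Column-oriented layout, like Parquet: each variable is stored
# contiguously, so a query for one variable touches only that column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(columns["h_canopy"])  # -> [12.5, 9.1, 15.0]
```

This contiguous-per-column layout is why a reader can lazily fetch just the byte ranges for the columns (and row groups) a query needs.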