Target Audience: Data providers who want to publish their data products on the AWS Data Exchange service.
AWS Data Exchange is a data marketplace that makes it easy for AWS customers to securely find, subscribe to, and use third-party data in the cloud.
If you are a data provider, AWS Data Exchange (ADX) makes it easy to reach millions of AWS customers migrating to the cloud by removing the need to build and maintain infrastructure for data storage, delivery, billing, and entitlement. ADX also makes it easy for qualified data providers to securely package, license, and deliver data products to millions of AWS customers worldwide.
If you are a data subscriber, you can easily browse the ADX catalog to find hundreds of relevant and up-to-date data products covering a wide range of industries, including financial services, healthcare, life sciences, geospatial, consumer, media & entertainment, and more. Once subscribed to a data product, you can use the ADX API to load data directly into Amazon S3 and then analyze it with a wide variety of AWS analytics and machine learning services.
Rearc is a data provider and one of the ADX launch partners. Products published by Rearc on ADX can be found here.
If you are looking to make your data available to customers via ADX, it will typically involve (1) sourcing the data, (2) transforming the data to make it useful for consumers, (3) creating a dataset on ADX, (4) creating automatic revisions, and (5) publishing the data product.
In this example, we will walk you through taking a free, publicly available open dataset and publishing it as a data product on ADX. This repository contains all the necessary code and automation to be able to source the data, perform data transformation and interact with ADX to publish a new data product.
As an example, we are using COVID-19 - World Confirmed Cases, Deaths, and Testing, a free, publicly available open dataset from Our World in Data. This dataset is updated daily and provides data on confirmed cases, deaths, and testing for the COVID-19 pandemic to date.
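The sourcing and transformation steps can be sketched roughly as follows. This is a minimal illustration, not the contents of source_data.py: the function names are hypothetical, the URL you pass in is whatever location the source publishes its CSV at, and the column filter only stands in for whatever transformation your data actually needs.

```python
from __future__ import annotations

import csv
import io
import urllib.request


def fetch_csv_rows(url: str) -> list[dict[str, str]]:
    """Download a CSV file and return its rows as dicts keyed by the header row."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def keep_columns(rows: list[dict[str, str]], columns: list[str]) -> list[dict[str, str]]:
    """Keep only the named columns (a placeholder for a real transformation)."""
    return [{name: row.get(name, "") for name in columns} for row in rows]


def to_csv_text(rows: list[dict[str, str]], columns: list[str]) -> str:
    """Serialize the transformed rows back to CSV text, ready to upload to S3."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the download, the transformation, and the serialization in separate functions makes each step easy to test locally before it runs in Lambda.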
Clone this repository.
.
├── pre-processing # Data provider side source code files and automation code
│ ├── pre-processing-code # Source code
│ │ ├── lambda_function.py # Code to interact with ADX for creating a dataset and revision
│ │ ├── source_data.py # Code to acquire dataset and perform necessary data transformation
│ │ ├── requirements.txt # Code dependencies
│ ├── pre-processing-cfn.yaml # CloudFormation template to setup data provider automation
├── dataset-description.md # Dataset description that goes on the ADX listing
├── product-description.md # Product description that goes on the ADX listing
├── init.sh # Initialization script to kick-off the automation
├── README.md
└── .gitignore
- Source Files: The actual data provider side source code and automation is stored inside the pre-processing folder. This folder contains Python code to acquire the dataset, perform necessary data transformation and interact with ADX to publish a dataset and a new dataset revision. This folder also stores pre-processing-cfn.yaml, the CloudFormation template that sets up the pre-processing code (Python code) to run in AWS Lambda, with a CloudWatch Scheduled Events rule to trigger the Lambda function periodically.
- Dataset Description: Description of the dataset is stored in the dataset-description.md file.
- Product Description: Description of the data product is stored in the product-description.md file.
- Initialization Script: The init.sh script triggers the entire workflow as explained below.
- Python, pip, jq, AWS CLI v2 and other related developer tools installed and configured on your local developer workstation
- AWS credentials with appropriate permissions to create necessary ADX resources
Once you have the pre-processing code written/updated and tested locally, you can run the init shell script to move the pre-processing code to S3, create a dataset on ADX and create the first dataset revision.
The init script requires the following parameters to be passed:
- Source S3 Bucket: the bucket where the dataset and pre-processing automation code reside. For Rearc datasets, it's rearc-data-provider
- Dataset Name: the S3 prefix where the dataset and pre-processing automation code reside. For this example, we are using covid-19-world-cases-deaths-testing
- Product Name: the product name for the ADX listing. For this example, we are using COVID-19 - World Confirmed Cases, Deaths, and Testing
- Product ID: since ADX does not provide APIs to programmatically create products, this can be left blank for now
- Region: the AWS region where the product will be listed on ADX. For this example, we are using us-east-1
The init script also accepts an optional --profile parameter if you wish to use an alternative set of AWS credentials instead of your default profile.
./init.sh --s3-bucket "rearc-data-provider" --dataset-name "covid-19-world-cases-deaths-testing" --product-name "COVID-19 - World Confirmed Cases, Deaths, and Testing" --product-id "blank_for_initial_run" --region "us-east-1"
If the optional profile parameter is needed, add --profile "rearc-adx-alt" to the command.
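If you drive the init script from Python (for example, from a test harness), the argument list can be assembled as in the sketch below. The helper name build_init_command is hypothetical; the flags are the ones listed above, and --profile is appended only when a profile is supplied.

```python
from __future__ import annotations

import shlex


def build_init_command(
    s3_bucket: str,
    dataset_name: str,
    product_name: str,
    region: str,
    product_id: str = "blank_for_initial_run",
    profile: str | None = None,
) -> list[str]:
    """Return the argument list for ./init.sh, adding --profile only if given."""
    command = [
        "./init.sh",
        "--s3-bucket", s3_bucket,
        "--dataset-name", dataset_name,
        "--product-name", product_name,
        "--product-id", product_id,
        "--region", region,
    ]
    if profile:
        command += ["--profile", profile]
    return command


command = build_init_command(
    s3_bucket="rearc-data-provider",
    dataset_name="covid-19-world-cases-deaths-testing",
    product_name="COVID-19 - World Confirmed Cases, Deaths, and Testing",
    region="us-east-1",
)
# Print a shell-safe rendering of the command; the product name gets quoted.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would execute the script without needing to quote the product name by hand.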
The script does the following:
- Zips up contents within the pre-processing folder
- Copies this pre-processing zip file to S3
- Creates a new dataset on ADX
- Creates the pre-processing CloudFormation stack
- Executes the pre-processing Lambda function which acquires the source dataset, copies it to S3 and creates the first revision on ADX
- Destroys the CloudFormation stack
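For orientation, the dataset and first-revision steps map onto the ADX API roughly as sketched below. This is an illustration under stated assumptions, not the code in lambda_function.py: the function name and the polling loop are hypothetical, and the client argument is assumed to behave like boto3.client("dataexchange"), which exposes create_data_set, create_revision, create_job, start_job, get_job, and update_revision.

```python
import time


def create_first_revision(client, dataset_name, description, bucket, keys, poll_seconds=5):
    """Create a data set, import S3 objects into a new revision, and finalize it.

    `client` is assumed to behave like boto3.client("dataexchange").
    Returns the data set ARN and the revision ID.
    """
    data_set = client.create_data_set(
        AssetType="S3_SNAPSHOT", Name=dataset_name, Description=description
    )
    revision = client.create_revision(
        DataSetId=data_set["Id"], Comment="Initial revision"
    )

    # Import the already-uploaded S3 objects into the new revision.
    job = client.create_job(
        Type="IMPORT_ASSETS_FROM_S3",
        Details={
            "ImportAssetsFromS3": {
                "DataSetId": data_set["Id"],
                "RevisionId": revision["Id"],
                "AssetSources": [{"Bucket": bucket, "Key": key} for key in keys],
            }
        },
    )
    client.start_job(JobId=job["Id"])

    # Poll until the import job finishes; fail loudly on a terminal error state.
    while True:
        state = client.get_job(JobId=job["Id"])["State"]
        if state == "COMPLETED":
            break
        if state in ("ERROR", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"Import job ended in state {state}")
        time.sleep(poll_seconds)

    # A revision must be finalized before it can be published.
    client.update_revision(
        DataSetId=data_set["Id"], RevisionId=revision["Id"], Finalized=True
    )
    return data_set["Arn"], revision["Id"]
```

Taking the client as a parameter keeps the function independent of credentials and region, so it can be exercised locally with a stub before it runs against AWS.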
At this point, the dataset and the first revision are available on ADX for us to create a product. Currently, ADX does not provide APIs to create products; hence we will have to create a product and link it to the dataset manually using the AWS console.
Once the product is created, copy the Product ID from the ADX console and re-run the pre-processing CloudFormation stack (using the AWS console or aws-cli, NOT the init script) while passing all the necessary parameters:
DataSetArn: arn:aws:dataexchange:us-east-1:1234567890:data-sets/12345678909aabbccddffgghh
DataSetName: covid-19-world-cases-deaths-testing
ProductId: prod-xxxxxxxxxxxxxx
Region: us-east-1
S3Bucket: s3_bucket_name
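If you prefer to create the stack from Python rather than the console or aws-cli, the parameters above can be passed as in this sketch. The helper names are hypothetical, cfn_client is assumed to behave like boto3.client("cloudformation"), and the CAPABILITY_IAM acknowledgement is an assumption that the template creates IAM resources such as a Lambda execution role; check pre-processing-cfn.yaml before relying on it.

```python
def to_cfn_parameters(values):
    """Convert a plain dict into the Parameters list that create_stack expects."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


def create_preprocessing_stack(cfn_client, stack_name, template_body, values):
    """Create the pre-processing stack with the given template parameters.

    `cfn_client` is assumed to behave like boto3.client("cloudformation").
    """
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=to_cfn_parameters(values),
        # Assumption: the template creates IAM resources and needs this acknowledgement.
        Capabilities=["CAPABILITY_IAM"],
    )


# Placeholder values copied from the parameter listing above.
stack_parameters = {
    "DataSetArn": "arn:aws:dataexchange:us-east-1:1234567890:data-sets/12345678909aabbccddffgghh",
    "DataSetName": "covid-19-world-cases-deaths-testing",
    "ProductId": "prod-xxxxxxxxxxxxxx",
    "Region": "us-east-1",
    "S3Bucket": "s3_bucket_name",
}
```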
Once the CloudFormation stack is successfully created, the CloudWatch Scheduled Events rule will trigger the pre-processing Lambda function on the schedule set by your cron expression, and the function will automatically create new dataset revisions and publish them to ADX!
- If you find any issues or have enhancements to this process, please open a GitHub issue and we will gladly take a look at it. Better yet, submit a pull request. Any contributions you make are greatly appreciated ❤️.
- If you are looking for specific open datasets currently not available on ADX, please submit a request on our project board here.
- If you have any other questions or feedback, send us an email at data@rearc.io.
Rearc is a cloud, software and services company. We believe that empowering engineers drives innovation. Cloud-native architectures, modern software and data practices, and the ability to safely experiment can enable engineers to realize their full potential. We have partnered with several enterprises and startups to help them achieve agility. Our approach is simple — empower engineers with the best tools possible to make an impact within their industry.