From c60d03d5da9484194e167bc9ddea5c3c7b0f4029 Mon Sep 17 00:00:00 2001 From: Santiago Castro Date: Mon, 17 Apr 2017 01:02:36 -0300 Subject: [PATCH] Fix broken Markdown headings --- README.md | 16 +- samples/DynamoDBExport/readme.md | 6 +- samples/DynamoDBImport/readme.md | 4 +- samples/DynamoDBImportCSV/README.md | 6 +- .../README.md | 10 +- samples/EFSBackup/README.md | 2 +- samples/LoadTsvFilesInS3ToRedshift/README.md | 6 +- samples/RDStoRedshiftSqoop/README.md | 242 +++++++++--------- .../readme.md | 6 +- samples/RedshiftToRDS/readme.md | 8 +- .../readme.md | 6 +- .../readme.md | 8 +- samples/diagnose/README.md | 4 +- samples/hadoop-activity/README.md | 6 +- samples/kinesis/README.md | 6 +- samples/oracle-backup/README.md | 2 +- 16 files changed, 169 insertions(+), 169 deletions(-) diff --git a/README.md b/README.md index 09ba061..9f204e6 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ AWS Data Pipeline is a web service that you can use to automate the movement and # Running the samples -##Setup +## Setup 1 Get the samples by cloning this repository. ```sh $> git clone https://github.com/awslabs/data-pipeline-samples.git @@ -39,11 +39,11 @@ When you are finished experimenting with the examples, deactivate the virtual en $> aws datapipeline create-default-roles ``` -##Run the Hello World sample +## Run the Hello World sample The hello world sample demonstrates a pipeline that creates an EC2 instance and runs `echo Hello World!`. It can be used as a reference template for executing arbitriy shell commands. -###Step 1 +### Step 1 Create the pipelineId by calling the *aws data pipeline create-pipeline* command. We'll use this pipelineId to host the pipeline definition document and ultimately to run and monitor the pipeline. The commands in this section should be called from within the virtual environment that you created above. ```sh @@ -59,7 +59,7 @@ You will receive a pipelineId like this. # +-------------+--------------------------+ ``` -###Step 2 +### Step 2 Upload the helloworld.json sample pipeline definition by calling the *aws datapipeline put-pipeline-definition* command. This will upload and validate your pipeline definition. ```sh @@ -76,7 +76,7 @@ You will receive a validation messages like this # | errored | False | # +-----------+---------+ ``` -###Step 3 +### Step 3 Activate the pipeline by calling the *aws datapipeline activate-pipeline* command. This will cause the pipeline to start running on its defined schedule. ```sh @@ -100,7 +100,7 @@ You will receive status information on the pipeline. # @ShellCommandActivity_HelloWorld_2015-07-19T22:48: 2015-07-19T22:48:34 ``` -##Examine the contents of the sample pipeline definition +## Examine the contents of the sample pipeline definition Let's look at the Hello world example pipeline located at samples/helloworld/helloworld.json. ```json @@ -175,13 +175,13 @@ Let's look at the Hello world example pipeline located at samples/helloworld/hel } ``` -##Check out the other samples +## Check out the other samples This reposity contains a collection of Data Pipeline templates that should help you get started quickly. Browse the content of the /samples folder to discover what samples exist. Also, feel free to submit samples a pull requests. -##Disclaimer +## Disclaimer The samples in this repository are meant to help users get started with Data Pipeline. They may not be sufficient for production environments. Users should carefully inspect samples before running them. 
_Use at your own risk._ diff --git a/samples/DynamoDBExport/readme.md b/samples/DynamoDBExport/readme.md index 03cad5b..cb047d1 100644 --- a/samples/DynamoDBExport/readme.md +++ b/samples/DynamoDBExport/readme.md @@ -1,9 +1,9 @@ -#DynamoDB to CSV export +# DynamoDB to CSV export -##About the sample +## About the sample The pipeline definition is used for exporting DynamoDB data to a CSV format. -##Running the pipeline +## Running the pipeline Example DynamoDB table with keys: customer_id, income, demographics, financial diff --git a/samples/DynamoDBImport/readme.md b/samples/DynamoDBImport/readme.md index cc6d393..d4d8bd7 100644 --- a/samples/DynamoDBImport/readme.md +++ b/samples/DynamoDBImport/readme.md @@ -1,6 +1,6 @@ -#XML to DynamoDB Import +# XML to DynamoDB Import -##Running the sample pipeline +## Running the sample pipeline The JSON definition can either be imported directly in the Console -> Create Pipeline view or uploaded with the AWS Data Pipeline CLI, as sketched below.
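For example, uploading this sample from the CLI could follow the same pattern as the Hello World walkthrough in the top-level README. The definition file name and pipeline name below are placeholders rather than values fixed by the sample:

```sh
# A minimal sketch, assuming the sample definition has been saved locally as
# dynamodb-xml-import.json; substitute the pipelineId returned by create-pipeline.
$> aws datapipeline create-pipeline --name dynamodb_xml_import --unique-id dynamodb_xml_import
$> aws datapipeline put-pipeline-definition --pipeline-definition file://dynamodb-xml-import.json --pipeline-id <pipeline-id>
$> aws datapipeline activate-pipeline --pipeline-id <pipeline-id>
```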
The pipeline definition copies an example XML file from s3://data-pipeline-samples/dynamodbxml/input/serde.xml to the local file system. This step is required to create a temporary XML-backed table in Hive. The Hive script is configured to run against a DynamoDB table with the keys "customer_id, financial, income, demographics". Finally, it imports the data from the temporary XML table into DynamoDB.
The data in the XML file is parsed with the Hive XML SerDe; the parsing functionality is similar to XPath expressions.
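Once the pipeline has run, one way to confirm the import might be to scan a few items from the target table. The table name below is a placeholder; the attributes should match the keys listed above:

```sh
# A minimal sketch, assuming the DynamoDB table used for this sample is named
# customers; adjust --max-items as needed.
$> aws dynamodb scan --table-name customers --max-items 5
```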
diff --git a/samples/DynamoDBImportCSV/README.md b/samples/DynamoDBImportCSV/README.md index 316f89d..ab689c0 100644 --- a/samples/DynamoDBImportCSV/README.md +++ b/samples/DynamoDBImportCSV/README.md @@ -1,9 +1,9 @@ -#DynamoDB to CSV import +# DynamoDB to CSV import -##About the sample +## About the sample The pipeline definition is used to import DynamoDB data to a CSV format. -##Running the pipeline +## Running the pipeline Example DynamoDB table with keys: id diff --git a/samples/DynamoDBToRedshiftConvertDataUsingHive/README.md b/samples/DynamoDBToRedshiftConvertDataUsingHive/README.md index fef9f97..81844cc 100644 --- a/samples/DynamoDBToRedshiftConvertDataUsingHive/README.md +++ b/samples/DynamoDBToRedshiftConvertDataUsingHive/README.md @@ -1,4 +1,4 @@ -#DynamoDBToRedshiftConvertDataUsingHive Sample +# DynamoDBToRedshiftConvertDataUsingHive Sample This sample demonstrates how you can use Data Pipeline's HiveActivity and RedshiftCopyActivity to copy data from a DynamoDB table to a Redshift table while performing data conversion using Hive (for data transformation) and S3 (for staging). This sample was motivated by a use case where one wishes to convert the data type of one column to another data type. In this sample, we will be converting a column from binary to base64 string. To make this sample to work, you must ensure you have the following: @@ -14,7 +14,7 @@ We will use the [Handling Binary Type Attributes Using the AWS SDK for Java Docu The column mappings used in this sample are meant to match the table definition used in the above example. -##Hive queries +## Hive queries The following queries will be used to convert the ExtendedMessage column from binary to base64 string. ```sql # tempHiveTable will receive the data from DynamoDB as-is @@ -38,7 +38,7 @@ INSERT OVERWRITE TABLE s3TempTable SELECT Id,ReplyDateTime,Message,base64(Extend You will need to provide the above information in the "put-pipeline-definition" command below. -##Before running the sample +## Before running the sample To simplify the example, the pipeline uses the following EMR cluster configuration: * Release label: emr-4.4.0 * Master instance type: m3.xlarge @@ -47,7 +47,7 @@ To simplify the example, the pipeline uses the following EMR cluster configurati Please feel free to modify this configuration to suite your needs. -##Running this sample +## Running this sample ```sh $> aws datapipeline create-pipeline --name data_conversion_using_hive --unique-id data_conversion_using_hive @@ -111,7 +111,7 @@ $> aws datapipeline list-runs --pipeline-id df-0554887H4KXKTY59MRJ # @TableBackupActivity_2016-03-31T23:38:34 2016-03-31T23:38:38 ``` -##Related documentation +## Related documentation * [HiveActivity](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html) * [RedshiftCopyActivity](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html) diff --git a/samples/EFSBackup/README.md b/samples/EFSBackup/README.md index 64eeef5..a3a2ac6 100644 --- a/samples/EFSBackup/README.md +++ b/samples/EFSBackup/README.md @@ -1,6 +1,6 @@ # EFSBackup -#####A collection of AWS Data Pipeline templates and scripts used to backup & restore Amazon EFS file systems +##### A collection of AWS Data Pipeline templates and scripts used to backup & restore Amazon EFS file systems If you need to be able to recover from unintended changes or deletions in your Amazon EFS file systems, you'll need to implement a backup solution. 
Once such backup solution is presented in the EFS documentation, and can be found here: http://docs.aws.amazon.com/efs/latest/ug/efs-backup.html. diff --git a/samples/LoadTsvFilesInS3ToRedshift/README.md b/samples/LoadTsvFilesInS3ToRedshift/README.md index 91790e3..31487db 100644 --- a/samples/LoadTsvFilesInS3ToRedshift/README.md +++ b/samples/LoadTsvFilesInS3ToRedshift/README.md @@ -1,9 +1,9 @@ -#Data Pipeline Load Tab Separated Files in S3 to Redshift +# Data Pipeline Load Tab Separated Files in S3 to Redshift -##About the sample +## About the sample This pipeline definition when imported would instruct Redshift to load TSV files under the specified S3 Path into a specified Redshift Table. Table insert mode is OVERWRITE_EXISTING. -##Running this sample +## Running this sample The pipeline requires the following user input point: 1. The S3 folder where the input TSV files are located. diff --git a/samples/RDStoRedshiftSqoop/README.md b/samples/RDStoRedshiftSqoop/README.md index bed4df8..61b912f 100644 --- a/samples/RDStoRedshiftSqoop/README.md +++ b/samples/RDStoRedshiftSqoop/README.md @@ -1,121 +1,121 @@ -# Data Pipeline RDStoRedshiftSqoop Sample - -## Overview - -This sample makes it easy to setup a pipeline that uses [Sqoop](http://sqoop.apache.org/) to move data to from a MySql database hosted in RDS to a Redshift database cluster. S3 is used to stage the data between the databases. - -The project provides scripts for setting up the resources for the pipeline, installing the [data set](http://aws.amazon.com/datasets/6468931156960467), and destroying the resources. The project also provides the [pipeline definition file](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-writing-pipeline-definition.html) which is used to create the pipeline and the AWS CLI commands for creating and executing the pipeline. See the instructions below to get started. - -*Note: Normal AWS charges apply for the resources created by the script. Make sure to run the teardown script as soon as you are done with the sample.* - -## Prerequisites - -You must have the AWS CLI and default IAM roles setup in order to run the sample. Please see the [readme](https://github.com/awslabs/data-pipeline-samples) for the base repository for instructions how to do this. - -You must also provide the S3Path of a S3 bucket with write permissions. See [here](http://docs.aws.amazon.com/AmazonS3/latest/UG/CreatingaBucket.html) for instructions on how to create an S3 bucket. - -Finally, you must install the [Python SDK for AWS](http://boto3.readthedocs.org/en/latest/guide/quickstart.html). -``` -$> pip install boto3 -``` - -## Step 1: Priming this sample - -Run the following commands to run the script. The AWS resources that will be created are a Redshift database, RDS MySQL database, and optionally an S3 bucket. - -The script takes an *optional* parameter for an S3 path for staging data between RDS and Redshift. If you choose to provide your own S3 path, the bucket must be in the same region as what is set for your AWS CLI configuration. In addition, this path cannot be an existing path as Sqoop is expected to create it in order to place the data it extracts from RDS (if the path you provide already exists, the setup process will issue an error message and exit). Finally, please make sure the S3 bucket has a policy that allows data writes to it. - -If the path is not provided, the script will create the S3 bucket for you. 
- -*Setup and teardown scripts are located in the setup directory under the sqoop directory in the samples directory.* -``` -$> cd /data-pipeline-samples/samples/RDStoRedshiftSqoop -$> python setup/Setup.py --s3-path [s3://optional/path/to/s3/location] -``` - - ## Step 2: Run this sample pipeline using the AWS CLI - -```sh - $> aws datapipeline create-pipeline --name rds_to_rs_sqoop_pipeline --unique-id rds_to_rs_sqoop_pipeline -``` - -You receive a pipelineId like this. -```sh - # ----------------------------------------- - # | CreatePipeline | - # +-------------+--------------------------+ - # | pipelineId | | - # +-------------+--------------------------+ -``` - -Now upload the pipeline definition -- make sure the log path is different from the staging path and the staging path is empty -```sh - $> aws datapipeline put-pipeline-definition --pipeline-definition file://RDStoRedshift.json --parameter-values myS3StagingPath= myS3LogsPath= myRedshiftEndpoint= myRdsEndpoint= --pipeline-id -``` - -You receive a validation messages like this -```sh - # ----------------------- - # |PutPipelineDefinition| - # +-----------+---------+ - # | errored | False | - # +-----------+---------+ -``` - -Now activate the pipeline -```sh - $> aws datapipeline activate-pipeline --pipeline-id -``` - -Check the status of your pipeline -``` - >$ aws datapipeline list-runs --pipeline-id -``` - -You will receive status information on the pipeline. -```sh - # Name Scheduled Start Status - # ID Started Ended - #--------------------------------------------------------------------------------------------------- - # 1. ActivityId_6OGtu 2015-07-29T01:06:17 WAITING_ON_DEPENDENCIES - # @ActivityId_6OGtu_2015-07-29T01:06:17 2015-07-29T01:06:20 - # - # 2. ResourceId_z9RNH 2015-07-29T01:06:17 CREATING - # @ResourceId_z9RNH_2015-07-29T01:06:17 2015-07-29T01:06:20 - # - # 3. DataNodeId_7EqZ7 2015-07-29T01:06:17 WAITING_ON_DEPENDENCIES - # @DataNodeId_7EqZ7_2015-07-29T01:06:17 2015-07-29T01:06:22 - # - # 4. DataNodeId_ImmS9 2015-07-29T01:06:17 FINISHED - # @DataNodeId_ImmS9_2015-07-29T01:06:17 2015-07-29T01:06:20 2015-07-29T01:06:21 - # - # 5. ActivityId_wQhxe 2015-07-29T01:06:17 WAITING_FOR_RUNNER - # @ActivityId_wQhxe_2015-07-29T01:06:17 2015-07-29T01:06:20 -``` - -Let the pipeline complete, then [connect to the Redshift cluster](http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-to-cluster.html) with a sql client and query your data. - -``` - $> psql "host= user= dbname= port= sslmode=verify-ca sslrootcert=" - $ psql> SELECT * FROM songs; -``` - -## Step 3: IMPORTANT! Tear down this sample - -*Note: The setup script will provide the teardown command with parameters at end of the execution.* - -``` -$> python setup/Teardown.py --rds-instance-id --redshift-cluster-id --s3-path [s3://optional/path/to/s3/bucket/created/by/setup] -``` - -## Disclaimer - -The samples in this repository are meant to help users get started with Data Pipeline. They may not be sufficient for production environments. Users should carefully inspect code samples before running them. - -Use at your own risk. - -Copyright 2011-2013 Amazon.com, Inc. or its affiliates. All Rights Reserved. - -Licensed under the Amazon Software License (the "License"). You may not use this file except in compliance with the License. 
A copy of the License is located at - -http://aws.amazon.com/asl/ +# Data Pipeline RDStoRedshiftSqoop Sample + +## Overview + +This sample makes it easy to setup a pipeline that uses [Sqoop](http://sqoop.apache.org/) to move data to from a MySql database hosted in RDS to a Redshift database cluster. S3 is used to stage the data between the databases. + +The project provides scripts for setting up the resources for the pipeline, installing the [data set](http://aws.amazon.com/datasets/6468931156960467), and destroying the resources. The project also provides the [pipeline definition file](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-writing-pipeline-definition.html) which is used to create the pipeline and the AWS CLI commands for creating and executing the pipeline. See the instructions below to get started. + +*Note: Normal AWS charges apply for the resources created by the script. Make sure to run the teardown script as soon as you are done with the sample.* + +## Prerequisites + +You must have the AWS CLI and default IAM roles setup in order to run the sample. Please see the [readme](https://github.com/awslabs/data-pipeline-samples) for the base repository for instructions how to do this. + +You must also provide the S3Path of a S3 bucket with write permissions. See [here](http://docs.aws.amazon.com/AmazonS3/latest/UG/CreatingaBucket.html) for instructions on how to create an S3 bucket. + +Finally, you must install the [Python SDK for AWS](http://boto3.readthedocs.org/en/latest/guide/quickstart.html). +``` +$> pip install boto3 +``` + +## Step 1: Priming this sample + +Run the following commands to run the script. The AWS resources that will be created are a Redshift database, RDS MySQL database, and optionally an S3 bucket. + +The script takes an *optional* parameter for an S3 path for staging data between RDS and Redshift. If you choose to provide your own S3 path, the bucket must be in the same region as what is set for your AWS CLI configuration. In addition, this path cannot be an existing path as Sqoop is expected to create it in order to place the data it extracts from RDS (if the path you provide already exists, the setup process will issue an error message and exit). Finally, please make sure the S3 bucket has a policy that allows data writes to it. + +If the path is not provided, the script will create the S3 bucket for you. + +*Setup and teardown scripts are located in the setup directory under the sqoop directory in the samples directory.* +``` +$> cd /data-pipeline-samples/samples/RDStoRedshiftSqoop +$> python setup/Setup.py --s3-path [s3://optional/path/to/s3/location] +``` + + ## Step 2: Run this sample pipeline using the AWS CLI + +```sh + $> aws datapipeline create-pipeline --name rds_to_rs_sqoop_pipeline --unique-id rds_to_rs_sqoop_pipeline +``` + +You receive a pipelineId like this. 
+```sh + # ----------------------------------------- + # | CreatePipeline | + # +-------------+--------------------------+ + # | pipelineId | | + # +-------------+--------------------------+ +``` + +Now upload the pipeline definition -- make sure the log path is different from the staging path and the staging path is empty +```sh + $> aws datapipeline put-pipeline-definition --pipeline-definition file://RDStoRedshift.json --parameter-values myS3StagingPath= myS3LogsPath= myRedshiftEndpoint= myRdsEndpoint= --pipeline-id +``` + +You receive a validation messages like this +```sh + # ----------------------- + # |PutPipelineDefinition| + # +-----------+---------+ + # | errored | False | + # +-----------+---------+ +``` + +Now activate the pipeline +```sh + $> aws datapipeline activate-pipeline --pipeline-id +``` + +Check the status of your pipeline +``` + >$ aws datapipeline list-runs --pipeline-id +``` + +You will receive status information on the pipeline. +```sh + # Name Scheduled Start Status + # ID Started Ended + #--------------------------------------------------------------------------------------------------- + # 1. ActivityId_6OGtu 2015-07-29T01:06:17 WAITING_ON_DEPENDENCIES + # @ActivityId_6OGtu_2015-07-29T01:06:17 2015-07-29T01:06:20 + # + # 2. ResourceId_z9RNH 2015-07-29T01:06:17 CREATING + # @ResourceId_z9RNH_2015-07-29T01:06:17 2015-07-29T01:06:20 + # + # 3. DataNodeId_7EqZ7 2015-07-29T01:06:17 WAITING_ON_DEPENDENCIES + # @DataNodeId_7EqZ7_2015-07-29T01:06:17 2015-07-29T01:06:22 + # + # 4. DataNodeId_ImmS9 2015-07-29T01:06:17 FINISHED + # @DataNodeId_ImmS9_2015-07-29T01:06:17 2015-07-29T01:06:20 2015-07-29T01:06:21 + # + # 5. ActivityId_wQhxe 2015-07-29T01:06:17 WAITING_FOR_RUNNER + # @ActivityId_wQhxe_2015-07-29T01:06:17 2015-07-29T01:06:20 +``` + +Let the pipeline complete, then [connect to the Redshift cluster](http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-to-cluster.html) with a sql client and query your data. + +``` + $> psql "host= user= dbname= port= sslmode=verify-ca sslrootcert=" + $ psql> SELECT * FROM songs; +``` + +## Step 3: IMPORTANT! Tear down this sample + +*Note: The setup script will provide the teardown command with parameters at end of the execution.* + +``` +$> python setup/Teardown.py --rds-instance-id --redshift-cluster-id --s3-path [s3://optional/path/to/s3/bucket/created/by/setup] +``` + +## Disclaimer + +The samples in this repository are meant to help users get started with Data Pipeline. They may not be sufficient for production environments. Users should carefully inspect code samples before running them. + +Use at your own risk. + +Copyright 2011-2013 Amazon.com, Inc. or its affiliates. All Rights Reserved. + +Licensed under the Amazon Software License (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at + +http://aws.amazon.com/asl/ diff --git a/samples/RedshiftCopyActivityFromDynamoDBTable/readme.md b/samples/RedshiftCopyActivityFromDynamoDBTable/readme.md index 4b30ae6..c3e992b 100644 --- a/samples/RedshiftCopyActivityFromDynamoDBTable/readme.md +++ b/samples/RedshiftCopyActivityFromDynamoDBTable/readme.md @@ -1,4 +1,4 @@ -#RedshiftCopyActivityFromDynamoDBTable Sample +# RedshiftCopyActivityFromDynamoDBTable Sample This sample demonstrates how you can use Data Pipeline's RedshiftCopyActivity to copy data from a DynamoDB table to a Redshift table. This sample was motivated by a use case that requires the user to provide AWS credentials to access the DynamoDB table. 
It is assumed that the owner of the DynamoDB table has granted the user read access to the table. To make this sample to work, you must ensure you have the following: @@ -12,7 +12,7 @@ This sample demonstrates how you can use Data Pipeline's RedshiftCopyActivity to You will need to provide the above information in the "put-pipeline-definition" command below. -##Running this sample +## Running this sample ```sh $> aws datapipeline create-pipeline --name redshift_copy_from_dynamodb_pipeline --unique-id redshift_copy_from_dynamodb_pipeline @@ -58,6 +58,6 @@ You will need to provide the above information in the "put-pipeline-definition" # @ResourceId_idL0Y_2015-11-06T23:52:04 2015-11-06T23:52:11 ``` -##Related documentation +## Related documentation https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html diff --git a/samples/RedshiftToRDS/readme.md b/samples/RedshiftToRDS/readme.md index 06c599b..ebbbbf7 100644 --- a/samples/RedshiftToRDS/readme.md +++ b/samples/RedshiftToRDS/readme.md @@ -1,4 +1,4 @@ -#Redshift To RDS Without RDS Table create Sample +# Redshift To RDS Without RDS Table create Sample This sample demonstrates how you can use Data Pipeline to copy data from a Redshift table to a RDS table. The template assumes that the table is already created in RDS. To make this sample to work, you must ensure you have the following: @@ -17,7 +17,7 @@ This sample demonstrates how you can use Data Pipeline to copy data from a Redsh You will need to provide the above information in the "put-pipeline-definition" command below. -##Running this sample +## Running this sample ```sh $> aws datapipeline create-pipeline --name redshift_to_rds_without_table_create --unique-id redshift_to_rds_without_table_create @@ -64,7 +64,7 @@ You will need to provide the above information in the "put-pipeline-definition" # @ResourceId_idL0Y_2015-11-06T23:52:04 2015-11-06T23:52:11 ``` -#Redshift To RDS with RDS Table create Sample +# Redshift To RDS with RDS Table create Sample This sample demonstrates how you can use Data Pipeline to copy data from a Redshift table to a RDS table. The template assumes that the table is already created in RDS. To make this sample to work, you must ensure you have the following: @@ -84,7 +84,7 @@ This sample demonstrates how you can use Data Pipeline to copy data from a Redsh You will need to provide the above information in the "put-pipeline-definition" command below. -##Running this sample +## Running this sample ```sh $> aws datapipeline create-pipeline --name redshift_to_rds_with_table_create --unique-id redshift_to_rds_with_table_create diff --git a/samples/S3TsvFilesToRedshiftTablesIfReady/readme.md b/samples/S3TsvFilesToRedshiftTablesIfReady/readme.md index 80b8bbe..3456e62 100644 --- a/samples/S3TsvFilesToRedshiftTablesIfReady/readme.md +++ b/samples/S3TsvFilesToRedshiftTablesIfReady/readme.md @@ -1,9 +1,9 @@ -#Data Pipeline Load Tab Separated Files in S3 to Redshift if file exists +# Data Pipeline Load Tab Separated Files in S3 to Redshift if file exists -##About the sample +## About the sample This pipeline definition when imported would instruct Redshift to load two TSV files from given two S3 location, into two different Redshift Table. Two copy activities are independent, each will start once the input s3 file exists. Table insert mode is OVERWRITE_EXISTING. -##Running this sample +## Running this sample The pipeline requires the following user input point: 1. 
Redshift connection info diff --git a/samples/SparkPiMaximizeResourceAllocation/readme.md b/samples/SparkPiMaximizeResourceAllocation/readme.md index 952755d..0ce78fd 100644 --- a/samples/SparkPiMaximizeResourceAllocation/readme.md +++ b/samples/SparkPiMaximizeResourceAllocation/readme.md @@ -1,12 +1,12 @@ -#EMRActivity SparkPi example with maximizeResourceAllocation +# EMRActivity SparkPi example with maximizeResourceAllocation -##About the sample +## About the sample This Pipeline definition launches an EmrCluster (emr-4.x.x) with [maximizeResourceAllocation](http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#d0e17386) with simple [SparkPi](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala) example in yarn-client mode. Also, it runs on [ONDEMAND](https://aws.amazon.com/about-aws/whats-new/2016/02/now-run-your-aws-data-pipeline-on-demand/) schedule. -##Running this sample +## Running this sample The pipeline requires one input point from the customer: 1. The log folder for the pipeline. -##Result +## Result You can view the output (stdout) under 'Emr Step Logs' under EmrActivity. Pi is roughly 3.141716 \ No newline at end of file diff --git a/samples/diagnose/README.md b/samples/diagnose/README.md index 7acc503..4146835 100644 --- a/samples/diagnose/README.md +++ b/samples/diagnose/README.md @@ -10,7 +10,7 @@ It can be done in two different ways: 2. Using the AWS Data Pipeline Console -###Using the terminal +### Using the terminal 1. Download the diagnostics jar file: https://s3.amazonaws.com/data-pipeline-samples/diagnose-sample/Diagnose.jar 2. Run the following command (The config option takes in the path and file name of your credentials.json file) @@ -20,7 +20,7 @@ NOTE: If you are running it from an AWS CLI that has been configured with your c `$> java -jar /Diagnose.jar` -###Using the AWS Data Pipeline Console +### Using the AWS Data Pipeline Console 1. Download the pipeline definition json file:https://s3.amazonaws.com/data-pipeline-samples/diagnose-sample/diagnose_pipeline.json. 3. Use the AWS Data Pipeline console to create a new pipeline and import the definition from the downloaded json file. diff --git a/samples/hadoop-activity/README.md b/samples/hadoop-activity/README.md index cec9ba7..409a40a 100644 --- a/samples/hadoop-activity/README.md +++ b/samples/hadoop-activity/README.md @@ -1,9 +1,9 @@ -#Hadoop Activity word count example with Fair Scheduler queues +# Hadoop Activity word count example with Fair Scheduler queues -##About the sample +## About the sample This pipeline definition when imported would run a word count splitter program (s3://elasticmapreduce/samples/wordcount/wordSplitter.py) on the public data set s3://elasticmapreduce/samples/wordcount/input/. There are two Hadoop Activities in the definition each of which run the splitter program and output to two s3 different folders with the format <s3Prefix>/scheduledStartTime/queue_(1|2). Each of the activities run a hadoop job on using Hadoop Fair Scheduler which is configured with two queues. -##Running this sample +## Running this sample The pipeline requires three input points from the customer: 1. The s3 prefix folder where the output of the word splitter would be stored. 
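After the pipeline is activated and both Hadoop activities finish, you could check that each Fair Scheduler queue wrote its output under its own folder. The bucket and prefix below are placeholders for whatever output location you configured:

```sh
# A minimal sketch, assuming the word count output prefix is s3://your-bucket/wordcount;
# each activity writes under <prefix>/<scheduledStartTime>/queue_1 or queue_2.
$> aws s3 ls s3://your-bucket/wordcount/ --recursive | grep -E 'queue_(1|2)'
```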
diff --git a/samples/kinesis/README.md b/samples/kinesis/README.md index 39a4b18..8a23bd5 100644 --- a/samples/kinesis/README.md +++ b/samples/kinesis/README.md @@ -6,7 +6,7 @@ This sample sets up a Data Pipeline to run an analysis on a kinesis stream every # Running the sample -##Setting up your resources +## Setting up your resources The setup script will: - create a Kinesis stream named AccessLogStream @@ -17,7 +17,7 @@ The setup script will: ```sh $> setup/setup-script.sh ``` -##Populating your stream +## Populating your stream You can push sample data to your stream by running @@ -25,7 +25,7 @@ You can push sample data to your stream by running $> setup/append-to-stream.sh ``` -##Setting up the pipeline +## Setting up the pipeline The instructions at https://github.com/awslabs/data-pipeline-samples tell you how to create, setup, and activate a pipeline. diff --git a/samples/oracle-backup/README.md b/samples/oracle-backup/README.md index cc03492..2576496 100644 --- a/samples/oracle-backup/README.md +++ b/samples/oracle-backup/README.md @@ -28,7 +28,7 @@ It features usage of parameters and expressions for easy pipeline definition re- `aws datapipeline activate-pipeline --pipeline-id ` -##Parameters +## Parameters myBackupLocation: S3 backup location (i.e. `s3://mybucket/backups/oracle`)
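As with the other samples, this parameter can be supplied when uploading the definition. The definition file name and pipeline id below are placeholders; only myBackupLocation comes from the sample itself:

```sh
# A minimal sketch, assuming the oracle-backup definition is saved locally as
# oracle-backup.json and a pipeline has already been created.
$> aws datapipeline put-pipeline-definition --pipeline-definition file://oracle-backup.json --parameter-values myBackupLocation=s3://mybucket/backups/oracle --pipeline-id <pipeline-id>
$> aws datapipeline activate-pipeline --pipeline-id <pipeline-id>
```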