# GluePipeline1

1. Create the IAM roles:
   - Create a role for CodePipeline.
   - Set the use case as EC2.
   - Attach the following policies: AmazonS3FullAccess, AWSCloudFormationFullAccess, AWSCodeBuildAdminAccess, AWSCodePipeline_FullAccess, IAMFullAccess.
   - Edit the trust policy, replacing the "ec2" service principal with "codepipeline":

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }

becomes:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "codepipeline.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }

   - Create a role for CloudFormation.
   - Set the use case as CloudFormation (or edit the trust policy so the service principal is "cloudformation.amazonaws.com", as above).
   - Attach the following policies: AmazonS3FullAccess, IAMFullAccess, AWSGlueServiceRole, AWSCloudFormationFullAccess.
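The same role setup can also be scripted. Below is a minimal boto3 sketch for the CodePipeline role; the role name is illustrative, and the CloudFormation role would be created the same way with its own trust policy and policy list.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets CodePipeline assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "codepipeline.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# "MyCodePipelineRole" is an illustrative name, not one used elsewhere in this guide.
iam.create_role(
    RoleName="MyCodePipelineRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies listed in step 1.
for policy in [
    "AmazonS3FullAccess",
    "AWSCloudFormationFullAccess",
    "AWSCodeBuildAdminAccess",
    "AWSCodePipeline_FullAccess",
    "IAMFullAccess",
]:
    iam.attach_role_policy(
        RoleName="MyCodePipelineRole",
        PolicyArn=f"arn:aws:iam::aws:policy/{policy}",
    )
```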
2. Create an S3 bucket.

3. Create 4 folders within the S3 bucket: script, temp, input, output.
   - Upload the .csv file into the input folder.
   - Upload the Python script into the script folder, ensuring the bucket name and folder paths in the script are correct (a boto3 sketch of steps 2-3 follows the script below):


```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data directly from S3
source_path = "s3://mycodepipelinebucket13/input/product_data.csv"
df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": [source_path]},
    format_options={"withHeader": True},
)

# Convert to JSON format
json_path = "s3://mycodepipelinebucket13/output/target-data.json"
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": json_path},
    format="json",
)

job.commit()
```
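The bucket, folders, and uploads from steps 2-3 can also be done programmatically. A minimal boto3 sketch, using the bucket name from the script above and assuming local files named product_data.csv and glue_script.py:

```python
import boto3

s3 = boto3.client("s3")
bucket = "mycodepipelinebucket13"  # bucket name used in the Glue script above

# Create the bucket (outside us-east-1, add a CreateBucketConfiguration
# with a LocationConstraint for your region).
s3.create_bucket(Bucket=bucket)

# Create the four "folders" as empty prefixes.
for prefix in ["script/", "temp/", "input/", "output/"]:
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the input data and the Glue script; the local file names are assumptions.
s3.upload_file("product_data.csv", bucket, "input/product_data.csv")
s3.upload_file("glue_script.py", bucket, "script/glue_script.py")
```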

4. Create a GitHub repository.

5. Create a new file in the repository named "name".yml, containing the CloudFormation template:


```yaml
Resources:
  GlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: MyGlueJob
      Role: !GetAtt GlueServiceRole.Arn
      Command:
        Name: glueetl
        ScriptLocation: s3://mycodepipelinebucket13/script/glue_script.py
      DefaultArguments:
        "--TempDir": "s3://mycodepipelinebucket13/temp/"
      GlueVersion: "2.0"
      WorkerType: G.1X
      NumberOfWorkers: 3

  GlueServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "${AWS::StackName}-MyRoleForGlue"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
```
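Before committing, the template can be sanity-checked. A small boto3 sketch, where the local file name glue_job.yml is illustrative:

```python
import boto3

cloudformation = boto3.client("cloudformation")

# "glue_job.yml" is an assumed local copy of the template above.
with open("glue_job.yml") as f:
    template_body = f.read()

# Raises a validation error if the template is not well formed.
response = cloudformation.validate_template(TemplateBody=template_body)
print(response["Parameters"])
```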

6. Create a pipeline in CodePipeline.

7. For the pipeline's service role, select an existing role: the CodePipeline role created in step 1.

8. Select GitHub as the source for the pipeline.

9. Select the repository and branch containing the template file from step 5.

10. Select CloudFormation for the deploy stage.

11. Select "Create or update stack" under "Action mode".

12. Name the new stack.

13. Under "Artifact name", select "SourceArtifact" and enter the name of the .yml file from step 5.

14. Under "Capabilities", select "CAPABILITY_NAMED_IAM".

15. Under "Role name", select the CloudFormation role created in step 1.

16. Create the pipeline; it will run and deploy automatically.
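To confirm the deployment without opening the console, the pipeline state can be queried. A boto3 sketch, assuming the pipeline was named "MyGluePipeline" (use whatever name was given when creating the pipeline in step 6):

```python
import boto3

codepipeline = boto3.client("codepipeline")

# "MyGluePipeline" is an assumed name; substitute the actual pipeline name.
state = codepipeline.get_pipeline_state(name="MyGluePipeline")
for stage in state["stageStates"]:
    print(stage["stageName"], stage.get("latestExecution", {}).get("status"))
```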

17. Go to AWS Glue; the new Glue job should have been created automatically, and it can now be run.
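The deployed job can also be started and monitored from code. A minimal boto3 sketch, using the job name MyGlueJob defined in the template:

```python
import time
import boto3

glue = boto3.client("glue")

# Start the job created by the CloudFormation stack.
run = glue.start_job_run(JobName="MyGlueJob")
run_id = run["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="MyGlueJob", RunId=run_id)["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```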