
[T15.00] Glue Spark: transforming and processing data

Documentation:

Description:

Create an AWS Glue Spark Job that copies go_daily_sales data from the AWS S3 Raw bucket to the AWS S3 Stage bucket.

Files to push to CodeCommit:

  • Script name: fact_go_daily_sales_stage_t10.py
  • CloudFormation template: {user_id}_fact_go_daily_sales_stage_t10.yaml

Configuration:

  1. Path: configurations/fact_go_daily_sales.yaml
  2. Use the raw-to-stage section (a hypothetical shape of the config is sketched after this list)
  3. Read data from [data-source][database]
  4. Write data to [data-target][s3][keys]
  5. Replace {user_id} with your AWS account name in the target path
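
The config file itself is not reproduced in this task. Inferring only from the bracketed key paths above, its shape might look like the hypothetical sketch below; every concrete value is a placeholder, not a value from the course repository:

```yaml
# Hypothetical shape of configurations/fact_go_daily_sales.yaml; only the
# raw-to-stage / data-source / data-target key paths come from the task,
# all concrete values are placeholders.
raw-to-stage:
  data-source:
    database: go_sales_raw                         # assumed database name
  data-target:
    s3:
      keys:
        - stage/{user_id}/fact_go_daily_sales/     # assumed key layout
```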

Glue Parameters:

  1. To reproduce the Glue flow when developing locally, you need to set the parameters as they would be set on the cluster.
  2. You can do this with the sys.argv.append command before calling the getResolvedOptions function from the awsglue folder in your repository (see the sketch after this list).
  3. Parameters that you will need to set:
  • UserId = {user_id}
  • ConfigBucketKey = configurations/{configuration_file_name}.yaml
  • ConfigBucketName = aws-data-engineering-course-{user_id}-assets
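
A minimal sketch of that setup, assuming the standard getResolvedOptions helper from awsglue.utils; the values follow the list above, with user_id as a placeholder:

```python
import sys

# getResolvedOptions is the standard awsglue helper; locally it is imported
# from the awsglue folder vendored in the repository.
from awsglue.utils import getResolvedOptions

# Append the arguments exactly as the Glue cluster would pass them,
# before getResolvedOptions parses sys.argv. Replace user_id with your own.
sys.argv.append("--UserId=user_id")
sys.argv.append("--ConfigBucketKey=configurations/fact_go_daily_sales.yaml")
sys.argv.append("--ConfigBucketName=aws-data-engineering-course-user_id-assets")

args = getResolvedOptions(sys.argv, ["UserId", "ConfigBucketKey", "ConfigBucketName"])
print(args["ConfigBucketName"])  # aws-data-engineering-course-user_id-assets
```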

Local Development:

  1. Ingest the config file from S3 using the commons package in your repository.
  2. Extract data from the Raw S3 bucket and the Stage S3 bucket using DynamicFrame.
  3. Join the Raw data with the Stage data and keep only the new records by ingest_dt.
  4. Drop duplicates.
  5. Add a column 'audit_ts' equal to the current timestamp.
  6. Load the data into the target bucket as Parquet, partitioned by 'ingest_dt' (a hedged end-to-end sketch follows this list).
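
Put together, the flow could look roughly like the sketch below. The S3 paths and the Parquet source format are assumptions (in the real job the paths come from the YAML config ingested with the commons package, whose API is not shown), and the left anti join is one way to implement step 3:

```python
import sys
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["UserId", "ConfigBucketKey", "ConfigBucketName"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder paths; in the task they are resolved from the YAML config
# ingested from S3 with the commons package.
raw_path = "s3://raw-bucket/go_daily_sales/"
stage_path = f"s3://stage-bucket/{args['UserId']}/fact_go_daily_sales/"

def read_s3(path):
    """Extract a DynamicFrame from S3 and convert it to a Spark DataFrame."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path]},
        format="parquet",  # assumed source format
    ).toDF()

raw_df = read_s3(raw_path)
# Assumes the stage path already has data; a first run would need to
# handle the empty-stage case separately.
stage_df = read_s3(stage_path)

# Keep only raw records whose ingest_dt is not yet in stage (left anti join),
# drop duplicates, and stamp the current timestamp as audit_ts.
new_df = (
    raw_df.join(stage_df.select("ingest_dt").distinct(), "ingest_dt", "left_anti")
    .dropDuplicates()
    .withColumn("audit_ts", F.current_timestamp())
)

# Load into the target bucket as Parquet, partitioned by ingest_dt.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(new_df, glue_context, "new_records"),
    connection_type="s3",
    connection_options={"path": stage_path, "partitionKeys": ["ingest_dt"]},
    format="parquet",
)
```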

Amazon Development:

  1. Create a Glue Spark Job on the cluster with the name: {user_id}_fact_go_daily_sales_stage_t10.
  2. Mandatory steps:
    1. Choose Python 3.9 for the Python version
    2. Choose 1/16 DPU for data processing units
    3. Choose 0 retries
    4. Set the timeout to 5 minutes
    5. Set your bucket path for the Script path
    6. Set the parameters
    7. Set the tags Owner=aws-edu-iba-gomel and StudentName={user_id}
  3. Add the options necessary to create the Glue Spark Job, test it, and check its logs in AWS CloudWatch.
  4. Create an AWS CloudFormation template for the Glue Spark Job with the name: {user_id}_fact_go_daily_sales_stage_t10.yaml (a hedged template sketch follows this list).
  5. Test the CloudFormation template:
    1. Deploy (check that your Glue Spark Job has been created via CloudFormation)
    2. Delete your CloudFormation stack
  6. Open a Pull Request as described in the Common Info (Task Workflow) section.
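
For orientation, a hedged sketch of what the template could contain, using the standard AWS::Glue::Job resource. The IAM role ARN, script bucket, and worker settings are placeholders; note that the "1/16 DPU" value above is only accepted for Python Shell jobs, so this Spark (glueetl) sketch uses a minimal worker configuration instead:

```yaml
# Hedged sketch of {user_id}_fact_go_daily_sales_stage_t10.yaml; role ARN
# and script bucket are placeholders, argument values follow the task.
AWSTemplateFormatVersion: "2010-09-09"
Description: Glue Spark Job copying go_daily_sales from the Raw to the Stage bucket

Resources:
  FactGoDailySalesStageJob:
    Type: AWS::Glue::Job
    Properties:
      Name: "{user_id}_fact_go_daily_sales_stage_t10"
      Role: "arn:aws:iam::111111111111:role/glue-job-role"   # placeholder role
      GlueVersion: "3.0"
      Command:
        Name: glueetl
        PythonVersion: "3"
        ScriptLocation: "s3://your-script-bucket/fact_go_daily_sales_stage_t10.py"
      DefaultArguments:
        "--UserId": "{user_id}"
        "--ConfigBucketKey": "configurations/fact_go_daily_sales.yaml"
        "--ConfigBucketName": "aws-data-engineering-course-{user_id}-assets"
      MaxRetries: 0
      Timeout: 5          # minutes
      # The task's "1/16 DPU" applies to pythonshell jobs; a glueetl job
      # needs a worker configuration, shown here at its smallest.
      WorkerType: G.1X
      NumberOfWorkers: 2
      Tags:
        Owner: aws-edu-iba-gomel
        StudentName: "{user_id}"
```

Deploying and deleting the stack for the test steps can be done from the console or with the aws cloudformation deploy and delete-stack CLI commands.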