
[ETL-636] Raw Lambda #138

Merged
merged 5 commits into main from etl-636 on Sep 6, 2024

Conversation

philerooski (Contributor)

Adds resources:

  • Lambda: The raw Lambda, which polls the dispatch-to-raw SQS queue and writes compressed JSON to the raw S3 bucket.
  • IAM Role: An IAM role for the above Lambda.

The raw Lambda compresses the JSON data contained in an export (a zip archive) from the input S3 bucket and uploads it to the raw S3 bucket. It makes heavy use of Python file objects and S3 multipart uploads, so it can download, compress, and upload with a relatively low, fixed memory overhead regardless of the size of the uncompressed JSON. For example, during testing I was able to process a 450 MB (uncompressed) JSON file with a maximum memory usage of 345 MB.
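For reviewers unfamiliar with the pattern, here is a minimal sketch of the consumer side, assuming a generator that yields compressed chunks (all names are illustrative, not the actual implementation). Because only one chunk is ever held in memory at a time, peak usage tracks the part size rather than the object size:

```python
import boto3

s3_client = boto3.client("s3")

def upload_parts(bucket: str, key: str, compressed_parts) -> None:
    """Upload an iterable of byte chunks as one S3 multipart upload."""
    multipart = s3_client.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for part_number, part_data in enumerate(compressed_parts, start=1):
        # each chunk becomes one part; previous chunks are already out of memory
        response = s3_client.upload_part(
            Body=part_data,
            Bucket=bucket,
            Key=key,
            PartNumber=part_number,
            UploadId=multipart["UploadId"],
        )
        parts.append({"ETag": response["ETag"], "PartNumber": part_number})
    s3_client.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=multipart["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```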

@philerooski requested a review from a team as a code owner August 30, 2024 00:59

@@ -3,7 +3,7 @@ template:
parameters:
MessageRetentionPeriod: "1209600"
ReceiveMessageWaitTimeSeconds: "20"
-    VisibilityTimeout: "120"
+    VisibilityTimeout: "900"
Member

Can you speak to the increase here?

Contributor Author (@philerooski)

We don't know yet how long this Lambda could take to run on prod-sized data, especially given the relatively small amount of memory I've allotted, so I've set the Lambda timeout to its maximum (900 seconds). Since the Lambda could take up to 900 seconds to complete, we also need the associated SQS message to reappear in the queue even if the Lambda fails at 899 seconds; hence we set the visibility timeout on the SQS queue to match the Lambda's timeout. AWS actually enforces this server side, so the deployment will fail if we use a smaller value.
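(If I recall correctly, the server-side check fires when the SQS trigger is wired up to the function, not when the queue is created. A hypothetical sketch, with made-up names and ARN:)

```python
import boto3

lambda_client = boto3.client("lambda")

# Fails with an InvalidParameterValueException if the queue's VisibilityTimeout
# is less than the function's Timeout (here, 900 seconds).
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:dispatch-to-raw",  # hypothetical
    FunctionName="raw-lambda",  # hypothetical
)
```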

    - The data type is derived from the first underscore-delimited component of the file basename.
    """
    key_components = key.split("/")
    # input bucket keys are formatted like `{namespace}/{cohort}/{export_basename}`
Member (@thomasyu888)

Just checking, but is it always formatted like this?

Contributor Author (@philerooski)

@thomasyu888 It could change in the future, but that would also mean changing every other stack that tries to read data from the input bucket, so no harm in hard coding this format.

Member

Let's hope it doesn't change in the future.
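To make the format concrete, here is a hypothetical parsing sketch (the function name and the example key are made up for illustration):

```python
def parse_key(key: str) -> dict:
    # input bucket keys are formatted like `{namespace}/{cohort}/{export_basename}`
    namespace, cohort, export_basename = key.split("/")
    # the data type is the first underscore-delimited component of the basename
    data_type = export_basename.split("_")[0]
    return {
        "namespace": namespace,
        "cohort": cohort,
        "data_type": data_type,
    }

# hypothetical example key
parse_key("main/cohort_A/DataTypeName_2024-08-30.zip")
# -> {'namespace': 'main', 'cohort': 'cohort_A', 'data_type': 'DataTypeName'}
```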

# }

Notes:
- Parts must be larger than AWS minimum requirements (typically 5 MB),
Member

Did you need to configure anything to make sure the parts are > 5MB?

Contributor Author (@philerooski)

That's taken care of by yield_compressed_data. The part_threshold is set to 8 MB by default.
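For future readers, a sketch of what a `yield_compressed_data`-style generator might look like; this is my reconstruction from the discussion above, not the code under review. A chunk is only yielded once the compressed buffer reaches `part_threshold`, so every part except the last clears the 5 MB minimum:

```python
import gzip
import io

def yield_compressed_data(fileobj, part_threshold: int = 8 * 1024 * 1024):
    """Yield gzip-compressed chunks of at least `part_threshold` bytes
    (except possibly the final chunk)."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gzip_stream:
        # read the source in small pieces so memory stays bounded
        while chunk := fileobj.read(1024 * 1024):
            gzip_stream.write(chunk)
            if buffer.tell() >= part_threshold:
                yield buffer.getvalue()
                buffer.seek(0)
                buffer.truncate()
    # closing the GzipFile flushed the gzip trailer into the buffer
    yield buffer.getvalue()
```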

@thomasyu888 (Member) left a comment

🔥 LGTM! Going to pre-approve, but not sure if @rxu17 had any last comments?

Comment on lines +300 to +306
with io.BytesIO() as object_stream:
    s3_client.download_fileobj(
        Bucket=sns_message["Bucket"],
        Key=sns_message["Key"],
        Fileobj=object_stream,
    )
    object_stream.seek(0)
Contributor

This is super cool how you can do this
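(Aside: the seek(0) rewinds the in-memory stream so the next reader starts at the beginning. A hypothetical continuation of this hunk, since the member-handling code isn't shown here, might look like:)

```python
import zipfile

# hypothetical continuation inside the same `with io.BytesIO()` block:
with zipfile.ZipFile(object_stream, "r") as export_archive:
    for json_path in export_archive.namelist():
        # each archive member is a JSON file destined for the raw bucket
        ...
```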

@BryanFauble (Contributor) left a comment

LGTM!

@philerooski merged commit a68065b into main on Sep 6, 2024
18 checks passed
@thomasyu888 deleted the etl-636 branch on October 30, 2024 01:02