This is a collection of Step Functions failures from the last large weather event. They exposed holes in our workflow that need to be revisited. Subtasks should be created for these issues and delegated to the team.
Log Write Failure
Issue
Step Functions occasionally fail to write logs to CloudWatch due to internal networking disruptions.
Solution
Add Step Functions state Retry mechanisms to all log-writing states so that the whole pipeline doesn't fail because of rare connectivity issues.
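As a rough sketch of what this could look like, the helper below adds a Retry policy to named states in an Amazon States Language definition held as a Python dict. The function name `add_retry`, the choice of `States.TaskFailed`, and the interval, attempt, and backoff values are all assumptions for illustration; which states actually write logs, and which error names they raise, would need to be confirmed against the real state machine.

```python
import copy

# Hypothetical default policy. The field names are standard Amazon States
# Language Retry fields; the values are illustrative guesses, not tuned numbers.
DEFAULT_RETRY = [
    {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }
]

# State types that accept a Retry field in Amazon States Language.
RETRYABLE_TYPES = {"Task", "Parallel", "Map"}


def add_retry(definition, state_names, retry=None):
    """Return a copy of `definition` with a Retry policy on the named states.

    Only top-level states are handled; states nested inside Parallel or Map
    branches are not visited. States that already define Retry are left alone.
    """
    retry = DEFAULT_RETRY if retry is None else retry
    patched = copy.deepcopy(definition)
    for name in state_names:
        state = patched["States"][name]
        if state.get("Type") not in RETRYABLE_TYPES:
            raise ValueError(f"State {name!r} does not support Retry")
        # Keep any existing policy rather than overwriting it.
        state.setdefault("Retry", copy.deepcopy(retry))
    return patched
```

Retrying with backoff at the state level keeps a transient failure local to the affected state instead of failing the whole execution.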
Example Errors
DB Deadlocks
Issue
Multiple pipelines try to update the same (hand_id, rc_stage_ft) tuples at the same time.
Solution
Add a Step Functions state Retry or Python retry logic for the DeadlockDetected exception.
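For the Python option, a minimal sketch of a retry decorator with exponential backoff and jitter might look like the following. The decorator name `retry_on` and its parameters are hypothetical. It takes the exception type as an argument, so, assuming the pipelines use psycopg2 (not confirmed here), it could be applied as `@retry_on(psycopg2.errors.DeadlockDetected)`.

```python
import random
import time
from functools import wraps


def retry_on(exc_type, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry the wrapped function when it raises `exc_type`.

    Waits base_delay * 2**(attempt - 1) seconds, scaled by random jitter,
    between attempts, and re-raises after `max_attempts` failures.
    `sleep` is injectable so the delay can be skipped in tests.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == max_attempts:
                        raise
                    # Jitter spreads out competing pipelines so they are
                    # less likely to collide again on the next attempt.
                    delay = base_delay * 2 ** (attempt - 1)
                    sleep(delay * random.uniform(0.5, 1.5))
        return wrapper
    return decorator
```

The wrapped function should own its whole transaction (open, update, commit or roll back), since a deadlocked transaction has been aborted by the database and must be restarted from the beginning rather than resumed.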
Example Errors
srf_18hr_max_inundation
ana_past_14day_max_inundation
DB Connection Issue
Issue
Lambda fails to connect to the Ingest RDS instance for multiple database tables in multiple service pipelines. This happens specifically when an rds-viz query accesses rds-ingest via a foreign data wrapper.
Solution
TBD
Example Errors
Python Preprocess - 10GB
Issue
The Lambda timed out on the ppp_mrf_mem1 pipeline due to the number and size of files being downloaded and processed.
Solution
This Lambda needs to be rethought to accommodate very large weather events, possibly by chunking the list of files to be downloaded and running multiple Lambda invocations.
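One way the chunking idea could be sketched is shown below. The function names, the payload keys (`chunk_index`, `chunk_count`, `files`), and the chunk size are assumptions for illustration only; the real payload shape would depend on what the preprocessing Lambda expects.

```python
def chunk_files(files, chunk_size):
    """Split a list of file keys into consecutive chunks of at most `chunk_size`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]


def build_payloads(files, chunk_size):
    """Build one hypothetical invocation payload per chunk of files.

    Each payload carries its index and the total chunk count so a later
    step can tell when every chunk has been processed.
    """
    chunks = chunk_files(files, chunk_size)
    return [
        {"chunk_index": index, "chunk_count": len(chunks), "files": chunk}
        for index, chunk in enumerate(chunks)
    ]
```

Each payload could then be handed to a separate Lambda invocation, for example through a Step Functions Map state, so that no single invocation has to download and process the full file list within its timeout.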