Race between layer and Lambda update #5927

Open
achave11-ucsc opened this issue Feb 5, 2024 · 7 comments
Labels
-- [priority] Low bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost groomed [process] Issue was recently looked at during backlog grooming infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team spike:5 [process] Spike estimate of five points

Comments

@achave11-ucsc
Member

achave11-ucsc commented Feb 5, 2024

The example below is from anvildev, however the same errors occurred on dev and anvilprod.

During the deploy job for the merge commit of PR #5909 (which updated the elasticsearch client from 7.10.1 to 7.17.9), an UnsupportedProductError occurred for both the indexercachehealth and servicecachehealth lambdas.

CloudWatch Insights logs:

[ERROR] UnsupportedProductError: The client noticed that the server is not a supported distribution of Elasticsearch
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1756, in __call__
    return self.handler(event_obj)
  File "/var/task/app.py", line 212, in update_health_cache
    app.health_controller.update_cache()
  File "/var/task/azul/health.py", line 138, in update_cache
    health_object = dict(time=time.time(), health=self._health.as_json_fast())
  File "/var/task/azul/health.py", line 308, in as_json_fast
    return self.as_json(p.key for p in self.fast_properties[self.lambda_name])
  File "/var/task/azul/health.py", line 181, in as_json
    json = {k: getattr(self, k) for k in sorted(keys)}
  File "/var/task/azul/health.py", line 181, in <dictcomp>
    json = {k: getattr(self, k) for k in sorted(keys)}
  File "/var/task/azul/health.py", line 73, in __get__
    return super().__get__(obj, objtype=objtype)
  File "/var/task/azul/caching.py", line 189, in __get__
    value = obj.__dict__[self.fget.__name__] = self.fget(obj)
  File "/var/task/azul/health.py", line 273, in elasticsearch
    'up': ESClientFactory.get().ping(),
  File "/opt/python/elasticsearch/client/utils.py", line 347, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/opt/python/elasticsearch/client/__init__.py", line 280, in ping
    return self.transport.perform_request(
  File "/opt/python/elasticsearch/transport.py", line 421, in perform_request
    _ProductChecker.raise_error(self._verified_elasticsearch)
  File "/opt/python/elasticsearch/transport.py", line 638, in raise_error
    raise UnsupportedProductError(message)
@achave11-ucsc achave11-ucsc added the orange [process] Done by the Azul team label Feb 5, 2024
@achave11-ucsc
Member Author

Assignee to populate description with symptoms.

@dsotirho-ucsc dsotirho-ucsc removed their assignment Feb 5, 2024
@achave11-ucsc
Member Author

Assignee to consider next steps.

@hannes-ucsc
Member

Turns out that the function code and the layer are updated with different API actions:

image

The screenshot shows a spreadsheet that was imported from CloudWatch Insights results and massaged (the @timestamp of trail events is not the event time). The query used was

fields eventTime, @timestamp, eventName, requestParameters.functionName, @message
| filter @message like /ERROR|Task|INIT_START/ or eventSource = 'lambda.amazonaws.com'
| limit 1000

This shows that the errors occurred after the layer was updated with UpdateFunctionConfiguration20150331v2 and before the function code was updated with UpdateFunctionCode20150331v2. During that time, the new ES client library was used by the old code that didn't contain the monkey patch for disabling the server version check in the ES client library.
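The sequence described above can be sketched as a tiny model. This is only an illustration of the reasoning, not Azul code: the `Deployed` class and its field names are hypothetical, and it assumes that the old client in the layer performed no product check while the new one does.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Deployed:
    # Hypothetical model of what a Lambda function runs at a point in time
    code_has_patch: bool  # function code disables the ES server version check
    client_checks: bool   # ES client in the layer performs that check

    def health_check_fails(self) -> bool:
        # The failure mode from this issue: a checking client combined
        # with code that does not yet carry the monkey patch
        return self.client_checks and not self.code_has_patch


before = Deployed(code_has_patch=False, client_checks=False)
# UpdateFunctionConfiguration swaps in the layer with the new client
during = replace(before, client_checks=True)
# UpdateFunctionCode later delivers the code containing the monkey patch
after = replace(during, code_has_patch=True)

for label, state in [('before', before), ('during', during), ('after', after)]:
    print(label, 'fails' if state.health_check_fails() else 'ok')
```

Only the intermediate state fails, which matches the observation that the errors stopped once the function code was updated.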

@hannes-ucsc
Member

hannes-ucsc commented Mar 11, 2024

Spike to try publish. Use the trail to show the API actions used by Terraform. It should include a call to PublishVersion after UpdateFunctionConfiguration and UpdateFunctionCode.

@hannes-ucsc hannes-ucsc added bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts spike:5 [process] Spike estimate of five points labels Mar 11, 2024
@hannes-ucsc hannes-ucsc removed their assignment Mar 11, 2024
@dsotirho-ucsc dsotirho-ucsc self-assigned this Mar 11, 2024
@dsotirho-ucsc dsotirho-ucsc added the - [priority] Medium label Mar 11, 2024
@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Apr 18, 2024

Spike to try publish. Use the trail to show the API actions used by Terraform. It should include a call to PublishVersion after UpdateFunctionConfiguration and UpdateFunctionCode.

The PublishVersion action occurred after UpdateFunctionConfiguration and after (or at the same time as) UpdateFunctionCode.

Index: src/azul/terraform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/terraform.py b/src/azul/terraform.py
--- a/src/azul/terraform.py	(revision 79a96535d237122c76dbec6fd50932f24d2b3cf4)
+++ b/src/azul/terraform.py	(date 1713459301678)
@@ -708,6 +708,7 @@
         for resource in resources['aws_lambda_function'].values():
             assert 'layers' not in resource
             resource['layers'] = ['${aws_lambda_layer_version.dependencies.arn}']
+            resource['publish'] = True
             env = config.es_endpoint_env(
                 es_endpoint=(
                     aws.es_endpoint
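For readers without the surrounding file, the effect of the patch can be sketched in isolation. This is a minimal sketch, assuming the Terraform configuration is available as a plain dictionary; the helper name `add_layer_and_publish` and the sample resource names are hypothetical, while `layers` and `publish` are arguments of Terraform's `aws_lambda_function` resource.

```python
def add_layer_and_publish(resources: dict) -> dict:
    # Hypothetical helper mirroring the loop in the patch above
    for resource in resources.get('aws_lambda_function', {}).values():
        assert 'layers' not in resource
        resource['layers'] = ['${aws_lambda_layer_version.dependencies.arn}']
        # Ask Terraform to publish a new function version on each change
        resource['publish'] = True
    return resources


# Made-up resource names, for illustration only
resources = {
    'aws_lambda_function': {
        'indexer': {'function_name': 'example-indexer'},
        'service': {'function_name': 'example-service'},
    }
}
add_layer_and_publish(resources)
print(resources['aws_lambda_function']['indexer'])
```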
# log-group-names: azul-trail-dev    

fields @timestamp, eventType, eventName, requestParameters.functionName
| filter @message like /PublishVersion|UpdateFunctionConfiguration|UpdateFunctionCode/
| filter userIdentity.arn like /dsotirho/
| filter eventName != 'StartQuery'
| sort @timestamp asc
| limit 1000

@timestamp eventType eventName requestParameters.functionName
2024-04-18 16:45:01.710 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-indexercachehealth
2024-04-18 16:45:41.463 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-aggregate
2024-04-18 16:46:26.276 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-contribute_retry
2024-04-18 16:46:26.277 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel
2024-04-18 16:46:26.277 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel
2024-04-18 16:46:26.278 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel-manifest
2024-04-18 16:46:26.278 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-contribute
2024-04-18 16:46:26.279 AwsApiCall PublishVersion20150331 azul-indexer-daniel
2024-04-18 16:46:26.279 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel
2024-04-18 16:46:26.280 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel-servicecachehealth
2024-04-18 16:47:11.819 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel-servicecachehealth
2024-04-18 16:47:11.820 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-aggregate_retry
2024-04-18 16:47:11.823 AwsApiCall PublishVersion20150331 azul-service-daniel-servicecachehealth
2024-04-18 16:47:51.553 AwsApiCall PublishVersion20150331 azul-indexer-daniel-aggregate_retry
2024-04-18 16:48:36.308 AwsApiCall PublishVersion20150331 azul-service-daniel-manifest
2024-04-18 16:48:36.309 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel-manifest
2024-04-18 16:48:36.315 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-aggregate_retry
2024-04-18 16:48:36.316 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-contribute
2024-04-18 16:48:36.316 AwsApiCall PublishVersion20150331 azul-indexer-daniel-contribute
2024-04-18 16:48:36.320 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-contribute_retry
2024-04-18 16:50:01.636 AwsApiCall PublishVersion20150331 azul-indexer-daniel-aggregate
2024-04-18 16:50:01.637 AwsApiCall PublishVersion20150331 azul-indexer-daniel-contribute_retry
2024-04-18 16:50:01.637 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:25.032 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-aggregate
2024-04-18 16:50:46.957 AwsApiCall PublishVersion20150331 azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:46.957 AwsApiCall PublishVersion20150331 azul-service-daniel
2024-04-18 16:50:46.958 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel
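
The ordering per function is easier to read once the rows are grouped. The following is a rough sketch that groups a few rows copied from the output above by function and lists the API actions in @timestamp order. The helper name `actions_by_function` is made up, and since an earlier comment notes that @timestamp is not the event time for trail events, the resulting order should be read with that caveat.

```python
import re
from collections import defaultdict
from datetime import datetime

# A subset of the rows above, covering the two cache health functions
ROWS = """
2024-04-18 16:45:01.710 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-indexercachehealth
2024-04-18 16:46:26.280 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel-servicecachehealth
2024-04-18 16:47:11.819 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel-servicecachehealth
2024-04-18 16:47:11.823 AwsApiCall PublishVersion20150331 azul-service-daniel-servicecachehealth
2024-04-18 16:50:01.637 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:46.957 AwsApiCall PublishVersion20150331 azul-indexer-daniel-indexercachehealth
"""


def actions_by_function(rows: str) -> dict[str, list[str]]:
    events = []
    for line in rows.strip().splitlines():
        date, time, _event_type, event_name, function = line.split()
        timestamp = datetime.strptime(f'{date} {time}', '%Y-%m-%d %H:%M:%S.%f')
        # Drop the API version suffix, e.g. '20150331v2'
        action = re.sub(r'\d{8}(v\d+)?$', '', event_name)
        # UpdateFunctionCode rows carry the full ARN, the others the bare name
        name = function.rsplit(':', 1)[-1]
        events.append((timestamp, name, action))
    result = defaultdict(list)
    for _, name, action in sorted(events, key=lambda e: e[0]):
        result[name].append(action)
    return dict(result)


result = actions_by_function(ROWS)
for name, actions in result.items():
    print(name, ' -> '.join(actions))
```

For these two functions, PublishVersion is the last of the three actions, although the relative order of UpdateFunctionConfiguration and UpdateFunctionCode differs between them.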

@dsotirho-ucsc dsotirho-ucsc removed their assignment Apr 18, 2024
@achave11-ucsc
Member Author

Assignee to consider next steps.

@hannes-ucsc
Member

hannes-ucsc commented Apr 30, 2024

Assignee to move forward with publish.

@achave11-ucsc achave11-ucsc added -- [priority] Low and removed - [priority] Medium labels Sep 12, 2024
@hannes-ucsc hannes-ucsc added the groomed [process] Issue was recently looked at during backlog grooming label Dec 31, 2024