
STREAM-437 - Add serverless-plugin-aws-alerts as npm dependency #30

Merged
merged 4 commits into from
Apr 12, 2022

Conversation

taro-bh

@taro-bh taro-bh commented Apr 6, 2022

This PR adds the serverless-plugin-aws-alerts plugin as an npm dependency, modifying the existing images.

The need for the new dependency arose from https://github.com/sleepio/flows-cli/pull/33, in the context of https://bighealth.atlassian.net/browse/STREAM-437, where we are adding a baseline configuration for CloudWatch metric alerts.

This implementation may be preferred over #29, if adding the dependency causes no issues in production.

@ipanousis ipanousis left a comment

I can approve code-wise, though please note that

(a) the build is failing and
(b) we should make sure this works together with the https://github.com/sleepio/flows-cli/pull/33 changes prior to merging

@kerr-bighealth kerr-bighealth left a comment

LGTM, though the build in dev is failing.



npm notice New major version of npm available! 7.0.14 -> 8.6.0
npm notice Changelog: <https://github.com/npm/cli/releases/tag/v8.6.0>
npm notice Run `npm install -g npm@8.6.0` to update!
npm notice
npm ERR! code ERR_SOCKET_TIMEOUT
npm ERR! errno ERR_SOCKET_TIMEOUT
npm ERR! request to https://registry.npmjs.org/file-type/-/file-type-4.4.0.tgz failed, reason: Socket timeout

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2022-04-06T21_20_03_170Z-debug.log
The command '/bin/sh -c npm install -g serverless@${SERVERLESS_VERSION} --ignore-scripts spawn-sync' returned a non-zero code: 1

@taro-bh
Author

taro-bh commented Apr 6, 2022

LGTM, though the build in dev is failing.

I do notice that all the builds in dev have been failing with the same error.

Interestingly, building the images locally with the ./build script is successful.

Is this a kind of timeout error that only happens in the dev environment? It's still a concern given that these appear repeatable and deterministic.

I'll try to see if I can get this fixed, but let me know if you have seen this issue before and, if so, how it was resolved then.

@kerr-bighealth

LGTM, though the build in dev is failing.

I do notice that all the builds in dev have been failing with the same error.

Interestingly, building the images locally with the ./build script is successful.

Is this a kind of timeout error that only happens in the dev environment? It's still a concern given that these appear repeatable and deterministic.

I'll try to see if I can get this fixed, but let me know if you have seen this issue before and, if so, how it was resolved then.

I haven't seen this specific issue before, and when I do a wget https://registry.npmjs.org/file-type/-/file-type-4.4.0.tgz locally it works fine. There was a file in the build /root/.npm/_logs/2022-04-06T21_20_03_170Z-debug.log. Maybe try catting the logs in that directory cat /root/.npm/_logs/*.log to get more info.

@ipanousis

ipanousis commented Apr 7, 2022

@kerr-bighealth The failure happens on the master branch as well (link). Is there any networking change that might be affecting this? Is it worth running this build in another environment (QA or Stage) to see if it's dev-specific?

I can't download the artifacts to look at the log you mention; the artifacts link doesn't show any files.

@taro-bh
Author

taro-bh commented Apr 7, 2022

@kerr-bighealth

I haven't seen this specific issue before, and when I do a wget https://registry.npmjs.org/file-type/-/file-type-4.4.0.tgz locally it works fine. There was a file in the build /root/.npm/_logs/2022-04-06T21_20_03_170Z-debug.log. Maybe try catting the logs in that directory cat /root/.npm/_logs/*.log to get more info.

Do you know how to cat logs that are generated during a build but likely not kept in the artifacts? I'm not sure whether the build environment persists after it exits with an error. If you already know how, I'd appreciate a tip for debugging in CodeBuild.

@nmguse-bighealth

I would try doing a build locally to reproduce

@kerr-bighealth

So this is not occurring locally at all? Trying another environment would give us a bit more signal on whether it's a general CodeBuild issue or something specific to dev.

@taro-bh I would start with trying to do a

cat /root/.npm/_logs/*.log

Typically in these scenarios I find myself making single-line edits to the job's scripts in a debug > commit > push loop. It's just how these remote build platforms work.

@taro-bh
Author

taro-bh commented Apr 7, 2022

@kerr-bighealth @nmguse-bighealth

As for building the images locally, it consistently succeeds for me, and I see no network timeout issues (I do see the same deprecation warnings, but no timeout or connectivity errors). That's why I think I need to get onto the build container to investigate environment-specific issues.

@taro-bh I would start with trying to do a

cat /root/.npm/_logs/*.log

Typically in these scenarios I find myself making single-line edits to the job's scripts in a debug > commit > push loop. It's just how these remote build platforms work.

In order to cat the specific npm log, I need to be on the CodeBuild container somehow, right? I don't currently know how to do that. The Session Manager feature appears to let me peek into a build container, but a lack of privileges prevents me from actually using it. (I've tried this in dev and stage so far.) In dev, I get the following Session Manager error message (+/- 3 lines for better context):

[Container] 2022/04/07 18:51:27 Moving to directory /codebuild/output/src968090650/src/github.com/sleepio/docker-serverless
[Container] 2022/04/07 18:51:27 Configuring ssm agent with target id: codebuild:6cd40637-a9c6-472e-ba2c-b46c11c295d0
[Container] 2022/04/07 18:51:27 Successfully updated ssm agent configuration
[Container] 2022/04/07 18:51:30 Session Manager Error: Failed to get controlchannel token, error: CreateControlChannel failed with error: createControlChannel request failed: unexpected response from the service <AccessDeniedException>
[Container] 2022/04/07 18:51:30 Please make sure your image has SSM Agent installed and service role has correct Session Manager permissions
[Container] 2022/04/07 18:51:30 Registering with agent
[Container] 2022/04/07 18:51:30 Phases found in YAML: 3

In stage, I cannot initiate the build if I check the "Enable session connection" checkbox.

EDIT: Potentially related ticket: npm/cli#3078

@kerr-bighealth

@taro-bh Add the cat line into the script in the CodeBuild buildspec.yml itself.
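A minimal sketch of what that debug step could look like; the helper name is hypothetical, and the log path comes from the failing build output above:

```shell
# Hypothetical wrapper to embed in the buildspec.yml install phase: run a
# command and, if it fails, dump npm's debug logs before propagating the
# failure status.
run_with_npm_logs() {
  if ! "$@"; then
    cat /root/.npm/_logs/*.log 2>/dev/null
    return 1
  fi
}

# e.g. wrapping the step that failed above:
#   run_with_npm_logs npm install -g serverless@"${SERVERLESS_VERSION}" --ignore-scripts spawn-sync
```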

@taro-bh
Author

taro-bh commented Apr 8, 2022

UPDATE: It appears we are bitten by an ongoing issue with npm install, in which a socket concurrency problem leads to fairly frequent timeouts in certain, still unidentified environments; the socket timeout can occur for any concurrently downloaded package, and which package fails is not deterministic. The ticket, npm/cli#3078, is marked closed, but the lengthy thread shows that no real solution has landed; the npm team has had trouble reproducing the issue deterministically, so it remains effectively unresolved.

  • To avoid making this a blocker, I've implemented a retry around npm install that uses cached package versions whenever possible, reducing the network round trips that increase the chance of socket timeout errors.
  • Added inline NOTEs and a section in the README to alert other devs to the issue.
  • The build has been run in the dev and stage environments, and both now succeed.

Infinite retries may be in poor taste, but the install usually succeeds on the second attempt, so this may be a good enough workaround until the npm team fixes the root cause.

@ipanousis

@taro-bh So I suppose the new setting you added --fetch-retries 5 didn't do the trick by itself?

Are we sure we want infinite retries instead of something that can fail faster when it is not a socket-timeout type of issue? We would risk obscuring other errors, flooding logs, and long-running builds for other types of failures. Perhaps set a fixed number of retries, e.g. 5?

@taro-bh
Author

taro-bh commented Apr 8, 2022

@taro-bh So I suppose the new setting you added --fetch-retries 5 didn't do the trick by itself?

Are we sure we want infinite retries instead of something that can fail faster when it is not a socket-timeout type of issue? We would risk obscuring other errors, flooding logs, and long-running builds for other types of failures. Perhaps set a fixed number of retries, e.g. 5?

Apparently it happens in a layer we can't control with the --fetch-retries option (or the timeout option, for that matter); see the npm ticket I cited earlier for details.

I've removed the infinite loop and instead made the maximum retry count a configurable parameter.
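For illustration, a bounded retry of this shape; the function and variable names are my own sketch, not necessarily those used in the actual Dockerfile:

```shell
# retry_cmd: run a command, retrying up to MAX_RETRIES times (default 5).
# Sketch of a bounded retry for wrapping `npm install` in the image build;
# names are illustrative and may differ from the real Dockerfile.
retry_cmd() {
  max="${MAX_RETRIES:-5}"
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "command failed after ${max} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying: $*" >&2
    attempt=$((attempt + 1))
  done
}

# Example usage in the build:
#   MAX_RETRIES=5 retry_cmd npm install
```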

Dockerfile Outdated

COPY . /var

RUN cd /var && npm install
# See the note on `npm install` above.


If we need to write all this code twice, could we instead extract it into a Bash script? A multi-line Docker RUN command of this complexity is rather tricky to maintain.
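A sketch of the suggested extraction, with hypothetical script names and paths:

```shell
# Hypothetical Dockerfile wiring after moving the retry loop into a single
# script, so both RUN lines share it instead of duplicating it inline:
#
#   COPY scripts/npm-install-retry.sh /usr/local/bin/npm-install-retry
#   RUN npm-install-retry -g serverless@"${SERVERLESS_VERSION}" --ignore-scripts spawn-sync
#   ...
#   RUN cd /var && npm-install-retry
#
# where npm-install-retry forwards its arguments to `npm install` inside
# the bounded retry loop.
```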

Author

Do we really want to spend that much energy on what is at best a stopgap for an issue that actually needs to be solved upstream? As has been noted, this is not a permanent solution; as soon as the npm package maintainers properly fix the issue, we should simply remove the retry loop.

I'm not sure keeping this blocker alive for what are essentially aesthetic reasons is necessary. We are already pushing all the blocked tickets to the next sprint.


  1. The issue you linked seems closed.
  2. Duplication of logic and maintainability aren't really aesthetic concerns.

However, I'm not the owner of this repo, so I'm just giving my advice here. Since I'm the only one weighing in and the code review approvals are already in place, you can proceed with these changes if you're happy and they're working.

Author

As I mentioned earlier, the npm ticket is marked closed but the issue really hasn't been solved; hence we're implementing a workaround on our side.


Yeah I'm just wondering how "temporary" this workaround really is given a closed Github issue.

Should we perhaps also reference npm/cli#4028? That would help the next person who comes across this code find the information they need to potentially remove the workaround.

Author

I don't think npm/cli#4028 is a related issue at all? Connection hanging is not the issue we have.

I'll update the inline note with more detail about the anticipated lifetime of the retry flow and when and why to remove it. Meanwhile, I'll use this branch to run integration tests in non-production environments for STREAM-437.

@taro-bh
Author

taro-bh commented Apr 12, 2022

The integration tests were performed in dev, and the image with the new package installed works as intended.

4 participants