Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2535 - Crash Loop Back offs Error #2577

Merged
merged 21 commits into from
Dec 12, 2023
Merged

2535 - Crash Loop Back offs Error #2577

merged 21 commits into from
Dec 12, 2023

Conversation

guru-aot
Copy link
Collaborator

@guru-aot guru-aot commented Dec 8, 2023

Note: The branch name is created without a # in the front, to test my build and deployments from my local.

  • Investigate the issue and apply the fix on all our custom Dockerfile (Api, web, workers, queue-consumers, load-test-gateway & db-migration)
  • Most updated image in node has to be deployed

What does EACCES error means in node - https://betterstack.com/community/guides/scaling-nodejs/nodejs-errors/#13-eacces

On Analysing the issue, i was able to understand that this issue happens in certain version of node & npm libraries, some related bugs reported in the years are given below for reference.

The possible solutions given by the people are alwauys these 2

  • give sudo permission and access for the user 1001 for the /.npm directory/ - which is not always right to have a sudo persmission for a user in dockerfile
  • upgrade the node version as it states to a newer version that this issue is not happening - going with this route as we have a newer version of redhat that contains the new version from the one deplyoed already with some security issues.

https://catalog.redhat.com/software/containers/ubi8/nodejs-18/6278e5c078709f5277f26998?architecture=amd64&image=65302e01ec5935b621691d22&container-tabs=packages
image

https://catalog.redhat.com/software/containers/ubi8/nodejs-18/6278e5c078709f5277f26998?architecture=amd64&image=6543c3d67371c4bd3014291a&container-tabs=packages
image

Analyzing the changelogs of npm js, there has been bugs related to cache that has been fixed.
https://docs.npmjs.com/cli/v9/using-npm/changelog#981-2023-07-18
https://docs.npmjs.com/cli/v9/using-npm/changelog#967-2023-05-17

image
npm/cli#6464
The bug reported may not state the same issue we are facing but its related to the cache error that is happening in the version we were using.

So updating the redhat image to the latest version is what was taken as an action to solve this issue.

Note: https://app.zenhub.com/workspaces/student-information-management-system-5fce9df5aa1b45000e937014/issues/gh/bcgov/sims/2453 is also done as part of this PR.

  • Remove variables from Makefile
  • Remove references from docker-build.yml

As suggested by @andrewsignori-aot #2577 (comment) changed the permission of users in the group 0 for the folder ./.npm to have write access

Container before assigning the write permission
image

Container after assigning the write permission
image

Addded only the permissions for the ./.npm folder as in the past month the failed logs show this error happening only in the ./.npm folder.
https://kibana-openshift-logging.apps.silver.devops.gov.bc.ca/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',key:kubernetes.namespace_name,negate:!f,params:(query:'0c27fb-test',type:phrase),type:phrase,value:'0c27fb-test'),query:(match:(kubernetes.namespace_name:(query:'0c27fb-test',type:phrase))))),index:'8841c680-a15b-11eb-a4dc-e5bf19f04239',interval:auto,query:(language:lucene,query:'%22sudo%20chown%20-R%20%22'),sort:!('@timestamp',desc))

image

@guru-aot guru-aot added the Devops Devops label Dec 8, 2023
@guru-aot guru-aot self-assigned this Dec 8, 2023
@guru-aot guru-aot marked this pull request as ready for review December 8, 2023 20:23
@@ -5,28 +5,6 @@ kind: Template
metadata:
name: ${NAME}
objects:
- apiVersion: image.openshift.io/v1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, what was the effect of having it before and not having it now?

Copy link
Collaborator Author

@guru-aot guru-aot Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer your question straight, the configuration was not configured properly and it was not at all used in our project.
More on the tags:
The imagestream is usually used to mention or represent what is the container's base image and the imagestreamtag is used to associate a specific tag version of the image.
These configurations(only imagestreamtag) inturn helped us to override the base image defined in the Dockerfile which was defined at the top of it "FROM".
In our case, for our deployments we used the same base image for (Api, web, workers, queue-consumers, load-test-gateway & db-migration), so there is no need to override the once in the dockerfile, as we run all our builds using Github action. Even in the future if we wanted to specifically change the base image for a particular container, we can change the dockerfile for the particular base image and run the Github action, it should eventually work.

@@ -1,5 +1,5 @@
# Base Image
FROM artifacts.developer.gov.bc.ca/redhat-access-docker-remote/ubi8/nodejs-18:1-71.1697652955
FROM artifacts.developer.gov.bc.ca/redhat-access-docker-remote/ubi8/nodejs-18:1-81
Copy link
Collaborator

@andrewsignori-aot andrewsignori-aot Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I support the node upgrade either way and the research looks great, we are still missing the root cause.
Can you please evaluate if the below assumptions make sense?

1 - Even the Openshift docker image defining a non-root user 1001, the container will be executed with a random user, as the error in the ticket also points out.
image
The same is also supported by the below documentation.
https://docs.openshift.com/container-platform/4.11/openshift_images/create-images.html#use-uid_create-images

2 - Checking the BC git I found at least one entry applying the recommended solution due to a npm 9 issue.
image
Image source: https://github.com/bcgov/common-hosted-form-service/blob/master/Dockerfile

The above would also explain why the error started when we moved from node 16 to 18 (npm 8 to 10).
Should we consider also applying the same?

Copy link
Collaborator

@dheepak-aot dheepak-aot Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see and I agree what @andrewsignori-aot about the OpenShift container running with arbitrarily assigned user.

image

I am 100% on same page that we need to run a fix permission command to have the highest level of certainty that the issue is taken care at it's root cause.

When I look at the openshift, see that write permissions are not present outside the owner for the directory.

image

There is one more thing which I want to share here @guru-aot @andrewsignori-aot @cditcher . It may also be a possible solution or may be not. But I would recommend to try.

Please go through this thread, sclorg/s2i-nodejs-container#396

There is a mention about same error (we are using npm ci)

Following screenshots are the highlights linking to our issue

image

image

Docker s2i example:(May be by this way we make the container to run with 1001 user instead of default arbitrarily assigned one)

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions @andrewsignori-aot and @dheepak-aot , as suggested, i have updated the group permissions for user in group 0 to have write access to the folder./.npm and tested the build and deployments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the further verifications @dheepak-aot 😉

@@ -242,47 +242,47 @@ build-db-migrations:
test -n "$(BUILD_REF)"
test -n "$(DB_MIGRATIONS_BUILD_REF)"
@echo "+\n++ BUILDING DB migrations with tag: $(BUILD_REF)\n+"
@oc -n $(BUILD_NAMESPACE) process -f $(BUILD_TEMPLATE_PATH) -p TAG=$(BUILD_REF) -p SOURCE_REPOSITORY_REF=$(BUILD_REF) -p BASE_IMAGE_NAME="nodejs-18" -p BASE_IMAGE_TAG="1-71.1697652955" -p BASE_IMAGE_REPO="artifacts.developer.gov.bc.ca/redhat-access-docker-remote/ubi8/" -p SOURCE_CONTEXT_DIR=$(SOURCE_CONTEXT_DIR)backend -p DOCKER_FILE_PATH=apps/db-migrations/Dockerfile -p NAME=$(DB_MIGRATIONS_BUILD_REF) | oc -n $(BUILD_NAMESPACE) apply -f -
@oc -n $(BUILD_NAMESPACE) process -f $(BUILD_TEMPLATE_PATH) -p TAG=$(BUILD_REF) -p SOURCE_REPOSITORY_REF=$(BUILD_REF) -p SOURCE_CONTEXT_DIR=$(SOURCE_CONTEXT_DIR)backend -p DOCKER_FILE_PATH=apps/db-migrations/Dockerfile -p NAME=$(DB_MIGRATIONS_BUILD_REF) | oc -n $(BUILD_NAMESPACE) apply -f -
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link

Please retry analysis of this Pull-Request directly on SonarCloud.

Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Copy link

Backend Unit Tests Coverage Report

Totals Coverage
Statements: 19.52% ( 2643 / 13537 )
Methods: 10.34% ( 178 / 1721 )
Lines: 22.17% ( 2276 / 10268 )
Branches: 12.21% ( 189 / 1548 )

Copy link

E2E Workflow Workers Coverage Report

Totals Coverage
Statements: 51.87% ( 360 / 694 )
Methods: 46.15% ( 42 / 91 )
Lines: 56.52% ( 299 / 529 )
Branches: 25.68% ( 19 / 74 )

Copy link

E2E Queue Consumers Coverage Report

Totals Coverage
Statements: 79.43% ( 695 / 875 )
Methods: 73.79% ( 76 / 103 )
Lines: 81.04% ( 607 / 749 )
Branches: 52.17% ( 12 / 23 )

Copy link

E2E SIMS API Coverage Report

Totals Coverage
Statements: 55.14% ( 4090 / 7418 )
Methods: 51.96% ( 503 / 968 )
Lines: 59.84% ( 3314 / 5538 )
Branches: 29.93% ( 273 / 912 )

Copy link
Collaborator

@dheepak-aot dheepak-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes. 👍
Looks good.

Copy link
Contributor

@ann-aot ann-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great research and thanks for the walk-though @guru-aot 👍

Copy link
Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the changes, looks good 👍

Copy link
Collaborator

@andrepestana-aot andrepestana-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for the explanations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Devops Devops
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants