
Code running in Cloud Build gets stuck trying to fetch metadata in v5.1.0 #548

Closed
theneva opened this issue Dec 16, 2022 · 12 comments · Fixed by #562
Labels: priority: p2, type: bug

Comments

theneva commented Dec 16, 2022

Hi! 👋

First things first:

  1. Is this a client library issue or a product issue? Client library issue.
  2. Did someone already solve this? Don't think so.
  3. Do you have a support contract? Nope!

Problem

I run builds on Cloud Build, which does roughly the following:

  1. Build & test the code
  2. Package the artefacts inside a Docker image
  3. Using docker-compose:
    a. Start the newly built Docker image
    b. Start a different image running Cypress with a bunch of UI tests

After upgrading from 5.0.1 to 5.1.0, step 3a started failing: The application launches, but gets immediately stuck when various Google client libraries try fetching data on startup.

This only happens on Cloud Build, and I believe the culprit to be the MAC address check introduced in #528 (though I have not done a thorough investigation). Running the exact same Docker image locally works fine.

I was pretty surprised when this changed, given that my code runs inside its own Docker container – but I suppose my container shares the host's network interface(s), so there's some blurring of the container boundaries there.
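
For reference, here is a rough sketch of the kind of docker-compose setup step 3 describes (service names, images, and ports are illustrative placeholders, not my actual config):

# docker-compose.yml – illustrative only
version: "3"
services:
  app:
    image: my-newly-built-image:latest   # the image produced in step 2
    ports:
      - "3000:3000"
  cypress:
    image: cypress/included:12.3.0       # runs the UI tests against the app
    depends_on:
      - app
    environment:
      - CYPRESS_BASE_URL=http://app:3000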

Environment details

  • OS: Cloud Build in the europe-west1 region (the default regional pool, not a private pool) with options.machineType: E2_HIGHCPU_32
  • Node.js version: v16.19.0 (The Docker image is based on node:16.19.0-alpine3.17)
  • npm version: 8.19.3 (I use Yarn 3.3.0 with the node-modules nodeLinker – either way, I believe this to be irrelevant)
  • gcp-metadata version: 5.1.0 (5.0.1 works)

Steps to reproduce

  1. Inside a Cloud Build build, run a Docker image that tries to fetch something using a Google client library, with no credential config
  2. See it hang as it (presumably) tries to get metadata because it detects that it is on a GCE node

(I believe this can be reproduced by running any code that triggers metadata fetching inside Cloud Build without any configuration that extends or short-circuits the GCE residency detection, but I haven't tested it.)
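
For illustration, a minimal script along those lines might look like this (a hypothetical repro.js, not taken from my build; it just forces an Application Default Credentials lookup, which falls back to the metadata server when no credentials are configured):

// repro.js – hypothetical minimal repro: trigger an ADC lookup with no credentials.
// On gcp-metadata 5.1.0 inside a Cloud Build step, this is where I'd expect the hang.
const {GoogleAuth} = require('google-auth-library');

new GoogleAuth()
  .getApplicationDefault()
  .then(() => console.log('got credentials'))
  .catch(err => console.error('no credentials:', err.message));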

Let me know if you need additional details!

theneva added the priority: p2 and type: bug labels on Dec 16, 2022
bcoe (Contributor) commented Dec 21, 2022

@theneva thank you for the detailed write-up; I've added this to the backlog and we will dig into it soon.

danielbankhead added the status: investigating label on Dec 22, 2022
danielbankhead (Contributor) commented:

Status update: still investigating this issue. Today, many customers rely on this behavior for workloads where the metadata server may take time to spin up; however, I can see how it can be a problem for certain Cloud Build configurations. I'll post another update on this issue within the coming days.

danielbankhead removed the status: investigating label on Feb 14, 2023
danielbankhead (Contributor) commented:

Hey @theneva,

Could you send over reproduction steps and/or a bit more information about your docker-compose config? In a clean directory, adding a Dockerfile with the following content & submitting works fine on my end:

FROM node:16.19.0-alpine3.17

WORKDIR /usr/src/app

RUN npm init -y
RUN npm i @google-cloud/storage
RUN npm ls gcp-metadata
RUN node -p "const {Storage} = require('@google-cloud/storage'); new Storage().getBuckets().then(console.log, console.error)"

CMD [ "node", "-p", "true" ]

Submitted via:

gcloud builds submit --tag gcr.io/`gcloud config get-value project`/test

theneva (Author) commented Feb 15, 2023

Hi @danielbankhead!

Edit: I added a step-by-step guide to run the reproduction at the bottom.

Your example works for me too. It seems like when you just submit a Docker image to build, Cloud Build injects some credentials (presumably the build's service account). I think this prevents gcp-metadata from reaching the code path that triggers my issue.

The issue still occurs for me if I run a Docker container inside a Cloud Build step without giving my container any kind of credentials.

I put together a tiny reproduction at https://github.com/theneva/gcp-metadata-repro. It builds a Docker image with code that closely resembles your example, and then attempts to run that image without any credentials.

I expect the image to log that it couldn't list the buckets because it has no credentials (from the onrejected function), and then exit cleanly (with code 0), letting the build succeed.

The repro doesn't push any images, so you can test it with $ gcloud builds submit . from the root of the repo.

At the initial commit (https://github.com/theneva/gcp-metadata-repro/tree/48856f35e7f7bd9e7ec74642c22b3adfb8755761) the build consistently times out at step 1 (where I run the container).

At the latest commit (https://github.com/theneva/gcp-metadata-repro/tree/762d3b23bdcc5a409990fca467b8d31cd0b0f251) I've pinned gcp-metadata to version 5.0.1 in the Dockerfile. Here, the build consistently completes after ~49 seconds, with the expected result: Error: Could not load the default credentials. Browse to https://… gets logged, and the build succeeds.
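
For reference, the build config amounts to roughly the following (a from-memory sketch, not a verbatim copy of the file in the repo):

# cloudbuild.yaml – rough sketch: build the image, then run it with no credentials
steps:
  - name: gcr.io/cloud-builders/docker
    args: ['build', '-t', 'gcp-metadata-repro', '.']
  - name: gcr.io/cloud-builders/docker
    args: ['run', '--rm', 'gcp-metadata-repro']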

So, to repro:

  1. Clone https://github.com/theneva/gcp-metadata-repro
  2. Move into the repo & verify that building the latest commit succeeds: gcloud builds submit . (should log a message about missing credentials, and the build should succeed)
  3. Check out the previous commit (git checkout HEAD~)
  4. Verify that building fails: gcloud builds submit . (should get stuck on step 1, and the build should time out)

Please let me know if you need anything else from me to look into this!

danielbankhead (Contributor) commented:

Thanks! This repo helps a lot. Adding ENV DEBUG_AUTH=TRUE to the Dockerfile shows the root behaviour that was being relied on in v5.0.1 and earlier:

FetchError: network timeout at: http://169.254.169.254/computeMetadata/v1/instance

which causes google-auth-library to throw Error: Could not load the default credentials.

In the latest releases we removed that timeout, as it can take seconds to minutes for certain GCP workloads to connect to the metadata server. Additionally, the library is detecting that it's running on Google Cloud via the Linux check rather than via its MAC address; this can be verified by checking isGoogleComputeEngineLinux() and isGoogleComputeEngineMACAddress() in the latest version of gcp-metadata.
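
If it helps to see which signal trips inside the container, here is an approximate way to inspect both by hand (a sketch of how I understand the checks, not the library's actual implementation; the BIOS path and MAC prefix are assumptions worth verifying against the gcp-metadata source):

// check-residency.js – approximate sketch, not gcp-metadata's real code.
// Signal 1: on GCE Linux hosts, the DMI BIOS vendor string contains "Google".
// Signal 2: GCE network interfaces use MAC addresses with a 42:01 prefix.
const fs = require('fs');
const os = require('os');

let biosVendor = '(unreadable)';
try {
  biosVendor = fs.readFileSync('/sys/class/dmi/id/bios_vendor', 'utf8').trim();
} catch {}

const macs = Object.values(os.networkInterfaces())
  .flat()
  .map(iface => iface && iface.mac)
  .filter(Boolean);

console.log('bios_vendor:', biosVendor, '-> GCE-Linux-like:', /Google/.test(biosVendor));
console.log('macs:', macs, '-> GCE-MAC-like:', macs.some(mac => mac.startsWith('42:01')));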

I'll work on a few strategies to remedy this for you, hopefully without any config changes.

danielbankhead (Contributor) commented:

Additional note: setting --network=cloudbuild exposes the build's ADC to the underlying container: https://cloud.google.com/build/docs/build-config-file-schema#network

Leaving this here in case you want to pass creds into the build; however, we shouldn't hang when no creds are available.
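
For example, in a build step that runs the container (sketch only; the image name is a placeholder), it would look roughly like:

  - name: gcr.io/cloud-builders/docker
    args: ['run', '--rm', '--network=cloudbuild', 'gcp-metadata-repro']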

theneva (Author) commented Feb 16, 2023

Aha! That makes sense. I guess there isn't too much I can do about it detecting the operating system.

Setting --network=cloudbuild is really neat, thanks for the tip! In this particular case, though, I don't want the container to have access to the metadata server when I run my tests 😄

As for the remedy, I think either of these suggestions would solve my problem (without requiring any configuration or unexpected-to-me behaviour):

  1. Re-add a relatively strict timeout for the metadata server connection (which could be overridden via a config value and/or environment variable)
  2. Treat connection-refused-style errors from the metadata server (where it isn't available on the network at all) differently from other failures, and apply a timeout in that case (sketched below)
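
To sketch the second idea (purely illustrative JavaScript from my side, not how gcp-metadata is actually organised):

// Sketch only: if the first attempt is refused outright (nothing is listening at
// all), bound any further waiting with a short timeout instead of retrying
// indefinitely; every other failure keeps the current patient behaviour.
async function fetchMetadata(doRequest, fastRetryTimeoutMs = 3000) {
  try {
    return await doRequest();
  } catch (err) {
    if (err.code !== 'ECONNREFUSED' && err.code !== 'EHOSTUNREACH') {
      throw err;
    }
    // One bounded retry, then give up quickly.
    return await Promise.race([
      doRequest(),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('metadata server unreachable')), fastRetryTimeoutMs)
      ),
    ]);
  }
}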

I'm sure I don't see the full picture, but those are two things I would expect the library to do instead of the current behaviour 😄

Thanks a lot for checking out this issue – I'm looking forward to the solution!

danielbankhead (Contributor) commented:

Hey @theneva, thanks for your patience; we have alignment internally on this and I'll prepare a PR shortly.

SimenB (Contributor) commented May 23, 2023

Any news here?

danielbankhead (Contributor) commented:

@SimenB we have a few items to tackle before resolving this. I’m estimating a PR should be ready and merged within 1-1.5 months.

danielbankhead (Contributor) commented:

Prepared a PR: #562

We've landed on creating a new environment variable, METADATA_SERVER_DETECTION, to resolve this issue.

While it isn't as seamless as automatically determining the GCE environment (which is very complicated to do across environments), it does present an opportunity to make it 'fail fast' via the METADATA_SERVER_DETECTION: none option.
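
For example, in the repro's setup this could be as simple as adding the variable to the Dockerfile (where you set it is up to you; the image name below is a placeholder):

ENV METADATA_SERVER_DETECTION=none

or passing it when running the container:

docker run --rm -e METADATA_SERVER_DETECTION=none gcp-metadata-repro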

SimenB (Contributor) commented Jun 29, 2023

5.3.0 with METADATA_SERVER_DETECTION=none works for us 👍
