Requests made via aws-api on an EC2 instance never complete (stuck on fetching metadata / token) #267

Open
jumarko opened this issue Feb 13, 2025 · 2 comments
Labels: bug, cannot reproduce

Comments

jumarko commented Feb 13, 2025

I reported this problem on Slack: https://clojurians.slack.com/archives/C09N0H1RB/p1739347946253059

The gist of the problem is an application that hangs because aws-api is not able to fetch metadata (based on examining thread stacks).
This is happening after an upgrade from com.cognitect.aws/api {:mvn/version "0.8.692"} to com.cognitect.aws/api {:mvn/version "0.8.730-beta01"}.

The problem only manifests when the code runs on an EC2 instance, where it uses the default credentials provider.

  • Running it on my laptop (using the profile credentials provider) works fine.

At least one other person mentioned seeing a similar problem.

Reproducer

(require '[cognitect.aws.client.api :as aws])
(require '[cognitect.aws.credentials :as credentials])

;; Running this on an EC2 instance hangs.
;; NOTE: http-client is created elsewhere; its construction is not shown in this report.
(def my-client
  (aws/client {:api :license-manager
               :credentials-provider (credentials/default-credentials-provider http-client)}))
;; Using (credentials/profile-credentials-provider aws-profile) on my laptop works fine.

(aws/invoke my-client {:op :ListReceivedLicenses})
;; ... never completes ...

Stacktraces

When I run the above code in a socket REPL, the calling thread gets stuck:

"Clojure Connection repl 1" #52 daemon prio=5 os_prio=0 cpu=272.22ms elapsed=1398.25s tid=0x0000ffff4c00e520 nid=0x165 waiting on condition  [0x0000ffff23ffc000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.14/Native Method)
        - parking to wait for  <0x00000000c8d8e290> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.14/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.14/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.14/AbstractQueuedSynchronizer.java:1047)
        at java.util.concurrent.CountDownLatch.await(java.base@17.0.14/CountDownLatch.java:230)
        at clojure.core$promise$reify__8621.deref(core.clj:7257)
        at clojure.core$deref.invokeStatic(core.clj:2337)
        at clojure.core$deref.invoke(core.clj:2323)
        at clojure.core.async$fn__43145.invokeStatic(async.clj:138)
        at clojure.core.async$fn__43145.invoke(async.clj:127)
        at cognitect.aws.client.impl.Client._invoke(impl.clj:123)
        at cognitect.aws.client.api$invoke.invokeStatic(api.clj:131)
        at cognitect.aws.client.api$invoke.invoke(api.clj:112)
...

After closer examination, I found another thread that is parked while fetching the IMDSv2 token during region lookup, which suggests a problem with the instance metadata request:

"async-thread-macro-1" #58 daemon prio=5 os_prio=0 cpu=1.90ms elapsed=1013.14s tid=0x0000ffff600582f0 nid=0x169 waiting on condition  [0x0000ffff19dfd000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.14/Native Method)
        - parking to wait for  <0x00000000d0ec73a8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.14/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.14/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.14/AbstractQueuedSynchronizer.java:1047)
        at java.util.concurrent.CountDownLatch.await(java.base@17.0.14/CountDownLatch.java:230)
        at clojure.core$promise$reify__8621.deref(core.clj:7257)
        at clojure.core$deref.invokeStatic(core.clj:2337)
        at clojure.core$deref.invoke(core.clj:2323)
        at clojure.core.async$fn__43145.invokeStatic(async.clj:138)
        at clojure.core.async$fn__43145.invoke(async.clj:127)
        at cognitect.aws.ec2_metadata_utils$get_response_data.invokeStatic(ec2_metadata_utils.clj:63)
        at cognitect.aws.ec2_metadata_utils$get_response_data.invoke(ec2_metadata_utils.clj:62)
        at cognitect.aws.ec2_metadata_utils$IMDSv2_token.invokeStatic(ec2_metadata_utils.clj:157)
        at cognitect.aws.ec2_metadata_utils$IMDSv2_token.invoke(ec2_metadata_utils.clj:148)
        at cognitect.aws.region$instance_region_IMDS_v2_provider$reify__49949.fetch(region.clj:112)
        at cognitect.aws.region$fn__49916$G__49912__49918.invoke(region.clj:24)
        at cognitect.aws.region$fn__49916$G__49911__49921.invoke(region.clj:24)
        at clojure.core$some.invokeStatic(core.clj:2718)
        at clojure.core$some.invoke(core.clj:2709)
        at cognitect.aws.region$chain_region_provider$reify__49927.fetch(region.clj:37)
        at cognitect.aws.region$fn__49916$G__49912__49918.invoke(region.clj:24)
        at cognitect.aws.region$fn__49916$G__49911__49921.invoke(region.clj:24)
        at cognitect.aws.util$fetch_async$fn__49627$fn__49628.invoke(util.clj:297)
        - locked <0x00000000d20cf868> (a cognitect.aws.region$chain_region_provider$reify__49927)
        at cognitect.aws.util$fetch_async$fn__49627.invoke(util.clj:296)
        at clojure.core.async$thread_call$fn__43264.invoke(async.clj:487)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.14/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.14/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(java.base@17.0.14/Thread.java:840)
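
Since the thread above is parked in IMDSv2_token, one way to narrow this down is to request the token directly from the same JVM, with an explicit timeout and without going through aws-api. The following is only a minimal sketch under stated assumptions: it uses the JDK's java.net.http client (JDK 11+), the function names token-request and imds-token-status are made up for illustration, and the 2-second timeouts are arbitrary. The endpoint and TTL header are the standard IMDSv2 ones.

```clojure
(import '(java.net URI)
        '(java.net.http HttpClient HttpRequest
                        HttpRequest$BodyPublishers
                        HttpResponse$BodyHandlers)
        '(java.time Duration))

;; Standard IMDSv2 token endpoint (link-local address, only reachable on EC2).
(def imds-token-uri "http://169.254.169.254/latest/api/token")

(defn token-request
  "Builds the IMDSv2 token request: a PUT carrying the TTL header.
  Hypothetical helper, not part of aws-api."
  ^HttpRequest []
  (-> (HttpRequest/newBuilder (URI/create imds-token-uri))
      (.timeout (Duration/ofSeconds 2))
      (.header "X-aws-ec2-metadata-token-ttl-seconds" "21600")
      (.PUT (HttpRequest$BodyPublishers/noBody))
      (.build)))

(defn imds-token-status
  "Sends the token request and returns the HTTP status code,
  or the exception if the request fails or times out."
  []
  (let [client (-> (HttpClient/newBuilder)
                   (.connectTimeout (Duration/ofSeconds 2))
                   (.build))]
    (try
      (.statusCode (.send client
                          (token-request)
                          (HttpResponse$BodyHandlers/discarding)))
      (catch Exception e e))))
```

Evaluating (imds-token-status) in the same REPL on the instance should return promptly either way. A status code would suggest the metadata endpoint is reachable from the process, while a timeout exception would suggest the hang originates below aws-api.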

Workaround

Downgrading to the prior version of aws-api (0.8.692) solves the problem.

@marcobiscaro2112
Collaborator

I tried to reproduce the issue without success on an EC2 instance running Amazon Linux 2023.6.20250211, using IMDSv2 to resolve credentials:

$ java --version
openjdk 21.0.6 2025-01-21 LTS
OpenJDK Runtime Environment Corretto-21.0.6.7.1 (build 21.0.6+7-LTS)
OpenJDK 64-Bit Server VM Corretto-21.0.6.7.1 (build 21.0.6+7-LTS, mixed mode, sharing)

$ cat deps.edn
{:deps {com.cognitect.aws/api {:mvn/version "0.8.730-beta01"}
        com.cognitect.aws/endpoints {:mvn/version "871.2.30.11"}
        com.cognitect.aws/license-manager {:mvn/version "871.2.29.35"}}}

$ cat src/repro.clj
(ns repro
  (:require [cognitect.aws.client.api :as aws]))

(def my-client (aws/client {:api :license-manager}))

(defn -main [& _args]
  (prn (aws/invoke my-client {:op :ListReceivedLicenses})))

$ clojure -M -m repro
{:Licenses [{:Entitlements [....

Is there anything special about your setup? I see in the snippet you shared that you create the credentials provider with an explicit http-client. How is your HTTP client created?

@marcobiscaro2112
Collaborator

@jumarko Also, any chance there is some unhandled exception being printed to stderr?

Looking at the stack trace in the thread dump, it seems that nothing is ever delivered to the promise-chan used to fetch instance metadata. That may indicate that some core.async block is throwing an exception that is not being handled (this would be printed to stderr by default, assuming no UncaughtExceptionHandler is set).
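
If stderr is hard to inspect in the affected environment, a default uncaught exception handler can make such failures visible. This is a minimal sketch, assuming the exception really does escape on a background thread. The names uncaught and install-uncaught-handler! are hypothetical and not part of aws-api or core.async.

```clojure
;; Collects uncaught exceptions so they can be inspected from the REPL.
(def uncaught (atom []))

(defn install-uncaught-handler!
  "Installs a JVM-wide default handler that records and prints any
  exception that escapes a thread without being caught."
  []
  (Thread/setDefaultUncaughtExceptionHandler
   (reify Thread$UncaughtExceptionHandler
     (uncaughtException [_ thread ex]
       ;; Keep the exception for later inspection...
       (swap! uncaught conj {:thread (.getName thread) :exception ex})
       ;; ...and also report it on stderr.
       (binding [*out* *err*]
         (println "Uncaught exception on" (.getName thread)
                  "-" (ex-message ex)))))))
```

After calling (install-uncaught-handler!) and re-running the failing aws/invoke call, any recorded entries can be inspected with @uncaught. An empty vector would suggest that no exception is escaping and that the request is simply never completing.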

@marcobiscaro2112 marcobiscaro2112 added bug Something isn't working cannot reproduce labels Feb 14, 2025