
Cirrus: Failed to stop: java something #18065

Closed
edsantiago opened this issue Apr 5, 2023 · 11 comments
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments), stale-issue

Comments

@edsantiago
Member

Seeing this a lot lately, multiple times per day:

Failed to stop: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1b21f19b[
   Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2a087959[
      Wrapped task = TrustedListenableFutureTask@5768a9f6[
         status=PENDING, info=[
            task=[running=[NOT STARTED YET],
            com.google.api.gax.rpc.CheckingAttemptCallable@5137f960]
         ]
      ]
   ]
] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@728a8117[
   Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4438
]

e.g. here, here. Unfortunately my flake logger has a bug: it isn't preserving log links for these. I'll track that down when I have time. In the meantime, this is still worth filing.

Obvious first thought is that it's a problem on the Cirrus end, but I see nothing reported there.

We're also seeing lots of Failed to start, but those aren't as bad, because those happen within a few minutes. The stop ones happen after the task has finished, and all tests have run (possibly even passed), so it's a big waste of energy and time.

@edsantiago added the "flakes" label (Flakes from Continuous Integration) on Apr 5, 2023
@cevich
Member

cevich commented Apr 5, 2023

> it's a problem on the Cirrus end

Could also be on the google-cloud end. Not only does Cirrus-CI (the mother-ship) run there, but so do most of the VMs. I noticed off-hand there were some problems reported there the other day. There's an outage history page but there are so many services, I have no idea what to look for 😞

> so it's a big waste of energy and time.

And money: I'd be willing to bet we see a few VMs turn up on the orphan report in a day or so.

@cevich
Member

cevich commented Apr 5, 2023

I'll pass this on to Cirrus support; they'll know better what the Java traceback actually means.

@cevich
Member

cevich commented Apr 5, 2023

No current orphans on GCP, but I remember we did have a handful on EC2. Unfortunately there's no easy way to associate EC2 VMs with the task that spawned them. So this is just speculation.

@edsantiago
Member Author

"EC2" means "aarch64", doesn't it? I don't see any aarch64 failures in my flake history. (I do have flake history, just, no logs).

@cevich
Member

cevich commented Apr 5, 2023

EC2 is for the aarch64, podman-machine, and windows tests. IIRC, the orphaned VMs I killed the other day were for all three (one of each).

@fkorotkov

> Unfortunately there's no easy way to associate EC2 VMs with the task that spawned them.

If the IAM role/user Cirrus is using has the ec2:CreateTags permission, you can enable the experimental flag on ec2_instance tasks:

task:
  experimental: true
  ec2_instance:
    # ...

This way you'll see pretty names for the VMs on AWS.

As for the two reported tasks: these are on GCP, and it seems the error happens after Cirrus successfully requests deletion of an instance, so things should be cleaned up. I've added handling for this error so tasks won't end up in a failed state.

The orphan AWS instances are concerning, and I'd appreciate it if you can provide some info about those VMs if you have it. I might be able to find something in internal logs. Ideally, please add the permissions and enable the experimental flag so we can catch task IDs for such VMs.
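The suggestion above hinges on the IAM identity Cirrus uses being allowed to call ec2:CreateTags. A minimal sketch (not from this thread) of checking that with boto3's IAM policy simulator; the role ARN below is a hypothetical placeholder:

# Sketch: check whether the IAM role Cirrus uses may call ec2:CreateTags.
# The role ARN is a hypothetical placeholder -- substitute the real one.
import boto3

iam = boto3.client("iam")
ROLE_ARN = "arn:aws:iam::123456789012:role/cirrus-ci"  # hypothetical placeholder

resp = iam.simulate_principal_policy(
    PolicySourceArn=ROLE_ARN,
    ActionNames=["ec2:CreateTags"],
)
for result in resp["EvaluationResults"]:
    # EvalDecision is "allowed", "explicitDeny", or "implicitDeny"
    print(result["EvalActionName"], "->", result["EvalDecision"])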

@cevich
Member

cevich commented Apr 5, 2023

@fkorotkov thanks. IIRC, I already set up the EC2 tagging permissions, so I think all that's needed is turning on the experimental flag. I'll open a PR to enable that for podman, which has the heaviest usage.

The other thing that occurred to me is: I/we can examine the EC2 instance user-data (script). I think the task or build ID is passed on the Cirrus-agent command-line, no?

In any case, I'll keep an eye out for these and let you know next time it happens.

cevich added a commit to cevich/podman that referenced this issue Apr 5, 2023
In GCP, user-specified VM names are required upon creation.  Cirrus-CI
generates helpful names containing the task ID.  Unfortunately, in EC2
the VM IDs are auto-generated, and special permissions are required
to allow secondary setting of a `Name` tag.  Since this permission has
been granted, enable the `experimental` flag on EC2 tasks so that Cirrus
can update VM name tags.  This is especially useful in troubleshooting
orphaned VMs.

Ref:
containers#18065 (comment)

Signed-off-by: Chris Evich <cevich@redhat.com>
@fkorotkov

@cevich good catch! Yes, the user data contains a bootstrap script that has a task id and credentials for reporting updates and logs.
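Since the user data holds that bootstrap script, one rough way to tie an orphaned EC2 VM back to its task is to pull and decode the instance's user data and look for the task ID. A minimal boto3 sketch, assuming a placeholder instance ID and no particular bootstrap-script layout (it just prints lines mentioning "task"):

# Sketch: recover Cirrus task info from an orphaned EC2 instance's user data.
# The instance ID is a hypothetical placeholder; the bootstrap-script layout is
# not documented here, so we just print any lines that mention "task".
import base64
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical orphan VM

attr = ec2.describe_instance_attribute(InstanceId=INSTANCE_ID, Attribute="userData")
encoded = attr.get("UserData", {}).get("Value", "")
script = base64.b64decode(encoded).decode("utf-8", errors="replace") if encoded else ""

for line in script.splitlines():
    if "task" in line.lower():
        print(line)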

@cevich
Member

cevich commented Apr 5, 2023

Thanks for confirming. I'll keep my eyes out and let you know.

@github-actions

github-actions bot commented May 6, 2023

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

I think this is fixed -- we have not seen "Failed to stop: java something" since April 5. "Failed to start" continues to hit us, but that's an old classic that's been with us forever and ever, and since it fails quickly it's not as big a deal.

[Cirrus] Failed to start

[Cirrus] Instance failed to start!

@github-actions bot added the "locked - please file new issue/PR" label on Aug 24, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Aug 24, 2023