Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux logs chooses wrong pod with aks flux-extension #3944

Closed
1 task done
eff917 opened this issue Jun 1, 2023 · 6 comments
Closed
1 task done

flux logs chooses wrong pod with aks flux-extension #3944

eff917 opened this issue Jun 1, 2023 · 6 comments

Comments

@eff917
Copy link

eff917 commented Jun 1, 2023

Describe the bug

flux logs command tries to get the manager container from the fluxconfig-agent pod, however when installing the flux extension in AKS, the manager container is in the fluxconfig-controller pod.

Versions:
azure AKS cluster 1.24.10
microsoft.flux extension: 1.7.3 (https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/azure-arc/kubernetes/extensions-release.md#173-april-2023)
flux client: flux version 2.0.0-rc.5

running flux logs returns the following result:

✗ container manager is not valid for pod fluxconfig-agent-....

Steps to reproduce

  1. create and configure aks with flux extension
  2. configure authentication
  3. run flux logs

Expected behavior

The logs should be streamed from the fluxconfig-controller pod instead of the fluxconfig-agent pod

Screenshots and recordings

No response

OS / Distro

arch linux

Flux version

2.0.0-rc.5

Flux check

► checking prerequisites
✔ Kubernetes 1.24.10 >=1.20.6-0
► checking controllers
✔ fluxconfig-agent: deployment ready
► mcr.microsoft.com/azurek8sflux/fluxconfig-agent:1.7.3
► mcr.microsoft.com/azurek8sflux/fluent-bit:1.7.3
✔ fluxconfig-controller: deployment ready
► mcr.microsoft.com/azurek8sflux/fluxconfig-controller:1.7.3
► mcr.microsoft.com/azurek8sflux/fluent-bit:1.7.3
✔ helm-controller: deployment ready
► mcr.microsoft.com/oss/fluxcd/helm-controller:v0.31.2
✔ image-automation-controller: deployment ready
► mcr.microsoft.com/oss/fluxcd/image-automation-controller:v0.31.0
✔ image-reflector-controller: deployment ready
► mcr.microsoft.com/oss/fluxcd/image-reflector-controller:v0.26.1
✔ kustomize-controller: deployment ready
► mcr.microsoft.com/oss/fluxcd/kustomize-controller:v0.35.1
✔ notification-controller: deployment ready
► mcr.microsoft.com/oss/fluxcd/notification-controller:v0.33.0
✔ source-controller: deployment ready
► mcr.microsoft.com/oss/fluxcd/source-controller:v0.36.1
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ fluxconfigs.clusterconfig.azure.com/v1alpha1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta2
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1beta2
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@hiddeco
Copy link
Member

hiddeco commented Jun 1, 2023

Both pods mentioned are not part of the official Flux distribution, but rather part of the custom AKS distribution maintained by Microsoft.

They should either ensure that they are not labeled as "part of Flux", which is what the flux logs command filters on. Or ensure that the containers within the Pod specification match the further requirements of the command.

@hiddeco
Copy link
Member

hiddeco commented Jun 1, 2023

In addition to this, we could make the command more graceful when it encounters Deployments and/or Pods that are apparently doing weird things. Instead of failing with a hard stop.

@makkes
Copy link
Member

makkes commented Jun 1, 2023

In addition to this, we could make the command more graceful when it encounters Deployments and/or Pods that are apparently doing weird things. Instead of failing with a hard stop.

I agree. We should log an error and carry on here and here.

makkes pushed a commit that referenced this issue Jun 1, 2023
In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit that referenced this issue Jun 1, 2023
In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

A behavioral difference now is that the command will not exit with an
error, anymore, except for the case where it can't write to
stdout/stderr in the parallel case.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit that referenced this issue Jun 2, 2023
In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit that referenced this issue Jun 2, 2023
In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit that referenced this issue Jun 2, 2023
In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out. It still returns a
non-zero exit code.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit that referenced this issue Jun 5, 2023
UX changes:

- Only print an error when a pod doesn't have a matching container
  instead of exiting early.
- Return a non-zero status code when no pod is found at all.

Details:

In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out. It still returns a
non-zero exit code.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
makkes pushed a commit that referenced this issue Jun 5, 2023
UX changes:

- Only print an error when a pod doesn't have a matching container
  instead of exiting early.
- Return a non-zero status code when no pod is found at all.

Details:

In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out. It still returns a
non-zero exit code.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
@makkes
Copy link
Member

makkes commented Jun 5, 2023

@eff917 a PR addressing your issue has been merged and the change will be released with the next Flux version that I'd love you to try out if you can and provide your feedback on it and if it solves your issue.

@eff917
Copy link
Author

eff917 commented Jun 7, 2023

@eff917 a PR addressing your issue has been merged and the change will be released with the next Flux version that I'd love you to try out if you can and provide your feedback on it and if it solves your issue.

Thank you, it seems to be working, with all the errors visible at the start and at the end:
eff@eff-tp:~/polirepo/flux2/bin (main =)$ ./flux version

fluent-bit: 1.7.3
flux: v0.0.0-dev.0
fluxconfig-agent: 1.7.3
fluxconfig-controller: 1.7.3
helm-controller: v0.31.2
image-automation-controller: v0.31.0
image-reflector-controller: v0.26.1
kustomize-controller: v0.35.1
notification-controller: v0.33.0
source-controller: v0.36.1

eff@eff-tp:~/polirepo/flux2/bin (main =)$ ./flux logs

failed getting logs: container manager is not valid for pod fluxconfig-agent-65886ccfdf-wwhlf
2023-06-07T12:00:43.085Z info Kustomization/... - Source is not ready, artifact not found 
2023-06-07T12:00:45.116Z info Kustomization/... - server-side apply completed 
2023-06-07T12:00:45.863Z info Kustomization/... - Dependencies do not meet ready condition, retrying in 30s 
2023-06-07T12:00:45.892Z info Kustomization/... - Dependencies do not meet ready condition, retrying in 30s 
2023-06-07T12:00:45.991Z info Kustomization/... - server-side apply completed 
...
2023-06-07T12:12:58.401Z info GitRepository/... 
✗ failed to collect logs from all Flux pods

@makkes
Copy link
Member

makkes commented Jun 7, 2023

@eff917 a PR addressing your issue has been merged and the change will be released with the next Flux version that I'd love you to try out if you can and provide your feedback on it and if it solves your issue.

Thank you, it seems to be working, with all the errors visible at the start and at the end:

Great. Just to clarify: Don't expect the failed getting logs messages to always appear at the start of the output. They can be anywhere inbetween the actual logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants