
Workflow logs button does not fetch archived logs but node summary logs button does #12759

Closed
3 of 4 tasks
duc00 opened this issue Mar 7, 2024 · 11 comments · Fixed by #12932
Assignees
agilgur5
Labels
area/archive-logs, area/ui, P2, solution/workaround, type/bug, type/regression

Comments

duc00 commented Mar 7, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Hi!

The LOGS button on a workflow page does not load the archived logs when archiveLogs is set to true and the workflow pods have been GC'd. The LOGS button on the node summary does work properly, i.e. it loads and displays logs from the archive.

I don't know if this is relevant, but my artifact repository is configured in the workflow-controller-configmap and is an AWS S3 bucket.

[Screenshot]

From the workflow LOGS button:
[Screenshot]

A GET is performed with URI /api/v1/workflows/mwaa-data-prod/omniscient-dragon-zkc9v/log?logOptions.container=main&grep=&logOptions.follow=true&podName=omniscient-dragon-zkc9v

From the node summary LOGS button:
[Screenshot]

A GET is performed with URI /artifact-files/mwaa-data-prod/workflows/omniscient-dragon-zkc9v/omniscient-dragon-zkc9v/outputs/main-logs which properly fetches archived logs.
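
For reference, a minimal sketch of how the two endpoints could be compared directly (this is not from the issue; the localhost:2746 address and ARGO_TOKEN variable are assumptions, while the namespace and workflow name are the ones above):

// Minimal sketch (assumptions noted above); requires Node 18+ for the built-in fetch.
const base = 'https://localhost:2746';
const headers = {Authorization: `Bearer ${process.env.ARGO_TOKEN}`};

// Endpoint used by the workflow-level LOGS button: live pod logs, which return nothing once the pod is GC'd.
const live = await fetch(
  `${base}/api/v1/workflows/mwaa-data-prod/omniscient-dragon-zkc9v/log?logOptions.container=main&podName=omniscient-dragon-zkc9v`,
  {headers}
);

// Endpoint used by the node summary LOGS button: reads the main-logs artifact from the archive.
const archived = await fetch(
  `${base}/artifact-files/mwaa-data-prod/workflows/omniscient-dragon-zkc9v/omniscient-dragon-zkc9v/outputs/main-logs`,
  {headers}
);

console.log('live:', live.status, 'archived:', archived.status);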

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

{
  "metadata": {
    "generateName": "lovely-whale-",
    "namespace": "mwaa-data-prod"
  },
  "spec": {
    "entrypoint": "argosay",
    "archiveLogs": true,
    "serviceAccountName": "argo-executor",
    "templates": [
      {
        "name": "argosay",
        "container": {
          "name": "main",
          "image": "argoproj/argosay:v2",
          "command": [
            "/argosay"
          ],
          "args": [
            "echo",
            "This is a test"
          ]
        }
      }
    ],
    "podGC": {
      "strategy": "OnPodCompletion"
    }
  }
}

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}


time="2024-03-07T11:08:22.123Z" level=info msg="Processing workflow" Phase= ResourceVersion=399612776 namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.132Z" level=info msg="Task-result reconciliation" namespace=mwaa-data-prod numObjs=0 workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.132Z" level=info msg="Updated phase  -> Running" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.133Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.133Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.133Z" level=info msg="Pod node lovely-whale-rbdpx initialized Pending" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.186Z" level=info msg="Created pod: lovely-whale-rbdpx (lovely-whale-rbdpx)" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.186Z" level=info msg="TaskSet Reconciliation" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.186Z" level=info msg=reconcileAgentPod namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:22.200Z" level=info msg="Workflow update successful" namespace=mwaa-data-prod phase=Running resourceVersion=399612781 workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.185Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=399612781 namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.185Z" level=info msg="Task-result reconciliation" namespace=mwaa-data-prod numObjs=1 workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.185Z" level=info msg="task-result changed" namespace=mwaa-data-prod nodeID=lovely-whale-rbdpx workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.185Z" level=info msg="node changed" namespace=mwaa-data-prod new.message= new.phase=Succeeded new.progress=0/1 nodeID=lovely-whale-rbdpx old.message= old.phase=Pending old.progress=0/1 workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.185Z" level=info msg="TaskSet Reconciliation" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.185Z" level=info msg=reconcileAgentPod namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.186Z" level=info msg="Updated phase Running -> Succeeded" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.186Z" level=info msg="Marking workflow completed" namespace=mwaa-data-prod workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:32.191Z" level=info msg="cleaning up pod" action=deletePod key=mwaa-data-prod/lovely-whale-rbdpx-1340600742-agent/deletePod
time="2024-03-07T11:08:32.201Z" level=info msg="Queueing Succeeded workflow mwaa-data-prod/lovely-whale-rbdpx for delete in 48h0m0s due to TTL"
time="2024-03-07T11:08:32.202Z" level=info msg="Workflow update successful" namespace=mwaa-data-prod phase=Succeeded resourceVersion=399612901 workflow=lovely-whale-rbdpx
time="2024-03-07T11:08:37.235Z" level=info msg="cleaning up pod" action=deletePod key=mwaa-data-prod/lovely-whale-rbdpx/deletePod

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

I did a run with podGC.strategy removed and was able to capture the following logs:

time="2024-03-07T11:19:41.293Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-03-07T11:19:41.293Z" level=info msg="No output parameters"
time="2024-03-07T11:19:41.293Z" level=info msg="No output artifacts"
time="2024-03-07T11:19:41.293Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: lovely-whale-dz4f7/lovely-whale-dz4f7/main.log"
time="2024-03-07T11:19:41.298Z" level=info msg="Creating minio client using IAM role"
time="2024-03-07T11:19:41.298Z" level=info msg="Saving file to s3" bucket=argo-workflows-artifacts-f617ed6 endpoint=s3.amazonaws.com key=lovely-whale-dz4f7/lovely-whale-dz4f7/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-03-07T11:19:41.370Z" level=info msg="Save artifact" artifactName=main-logs duration=76.573752ms error="<nil>" key=lovely-whale-dz4f7/lovely-whale-dz4f7/main.log
time="2024-03-07T11:19:41.370Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-03-07T11:19:41.370Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-03-07T11:19:41.409Z" level=info msg="Alloc=12094 TotalAlloc=16904 Sys=23397 NumGC=3 Goroutines=12"

agilgur5 commented Mar 8, 2024

This sounds like a duplicate of #12590. I haven't had a chance to review the PR yet.

agilgur5 closed this as completed Mar 8, 2024
agilgur5 added the P2, type/regression, solution/duplicate, and solution/workaround labels Mar 8, 2024

duc00 commented Mar 8, 2024

Hi @agilgur5, while the issues appear to be similar, they are not. #12590 is about missing logs for archived workflows. This one is about missing archived logs, but from non-archived workflows (only the pods have been GC'd).

Also, I don't have the "Should specify a node when we get archived logs" error message.

agilgur5 removed the area/workflow-archive and solution/duplicate labels Mar 8, 2024
agilgur5 reopened this Mar 8, 2024

agilgur5 commented Mar 8, 2024

  • I don't have the Should specify a node when we get archived logs error message.

I did notice that this one was missing that error message, but the URI differences were the same.

This one is about missing archived logs, but from non-archived workflows (only pods have been GCd).

The existing logic should cover this case 🤔 This also would theoretically impact 3.4.x as well, before the Archived/Live merge, meaning this is not a regression either.

Are there any errors in your browser or in the Server logs? Could you check against 3.4.x?

agilgur5 removed the type/regression label Mar 8, 2024

duc00 commented Mar 8, 2024

This one is about missing archived logs, but from non-archived workflows (only pods have been GCd).

The existing logic should cover this case 🤔 This also would theoretically impact 3.4.x as well, before the Archived/Live merge, meaning this is not a regression either.

As mentioned in the issue, this case is properly covered when opening the log viewer from the LOGS button in the node summary view. It is not (or not properly) covered when opening the log viewer from the LOGS button in the workflow view. Maybe the arguments passed to getContainerLogs differ between the two cases? I am unfortunately not comfortable enough in React to be able to tell.

Are there any errors in your browser or in the Server logs? Could you check against 3.4.x?

The only error I see is one appearing even when I don't click on any LOGS button. I don't think it is related:
[Screenshot]

And the log reflecting that from the server:

time="2024-03-08T09:05:09.054Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = getting archived workflows not supported" grpc.code=Internal grpc.method=GetArchivedWorkflow grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2024-03-08T09:05:09Z" grpc.time_ms=0.129 span.kind=server system=grpc

agilgur5 commented Mar 8, 2024

Maybe the arguments passed to getContainerLogs differ between the two cases? I am unfortunately not comfortable enough in React to be able to tell.

Yes, that would be the suspicion, as one is correct and the other is not. Unfortunately, they are the same (note that archived in that case refers solely to whether the Workflow was archived, as the workflow-service itself handles archive logs, abstracted away from the components). The node view actually sets the logs viewer's props (React's version of arguments).

The only difference is that in the node view, you're getting logs for a specific Pod, as opposed to all logs. But you did specify a Pod in your screenshot 🤔
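
To make the comparison concrete, a rough and purely hypothetical sketch of the props in question (the names below are illustrative, not the actual argo-ui code):

// Hypothetical sketch only -- prop names are illustrative, not the real argo-ui code.
// Props the logs viewer effectively receives from the top-level LOGS button:
const fromWorkflowButton = {
  workflow: 'omniscient-dragon-zkc9v',
  container: 'main',
  archived: false, // the workflow itself is not archived; only its pods were GC'd
};

// Props it receives from the node summary LOGS button -- the same, except that
// a specific node/pod is selected:
const fromNodeSummaryButton = {
  ...fromWorkflowButton,
  nodeId: 'omniscient-dragon-zkc9v',
};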

The only error I see is one appearing even when I don't click on any LOGS button. I don't think it is related:

No, that just means you don't have archived workflows enabled. It really shouldn't 500, though (that's a separate bug; EDIT: tracked in #12814).

Could you check against 3.4.x?

What about this part? You didn't seem to reply to that, and the 500 error would not occur in 3.4.x.

duc00 commented Mar 8, 2024

The only difference is that in the node view, you're getting logs for a specific Pod, as opposed to all logs. But you did specify a Pod in your screenshot 🤔

Yes, indeed. I tried all combinations and nothing shows when coming from the workflow LOGS button.

Could you check against 3.4.x?

What about this part? You didn't seem to reply to that and the 500 error would not occur in 3.4.x

I missed this part; I will check in the coming days. Were you able to reproduce against 3.5.5?

agilgur5 changed the title from "Workflow LOGS button does not fetch archived logs but node summary LOGS button does" to "Workflow logs button does not fetch archived logs but node summary logs button does" Mar 8, 2024

duc00 commented Mar 16, 2024

Hey @agilgur5, I was able to test against v3.4.0 with a minimal deploy and the issue exists as well. So I guess it's not a regression?
Thanks!

agilgur5 commented Apr 13, 2024

v3.4.0

I'd be curious whether this reproduces on the latest 3.4.x.

Specifically, I did some backtracking and found #9434, which was fixed prior to 3.4.0 (in 3.4.0-rc4, it would seem), as well as #9644, which was fixed in 3.4.1.

It's possible there's a remaining bug from 3.4, although I'd be a little surprised it hadn't been reproduced and reported earlier 🤔. Particularly by the original reporters of those two issues; #9434 in particular had a lot of follow-up and is from an active contributor.

3.5, on the other hand, has had more changes (#11121 in particular), so a bug there wouldn't be as surprising; #12590 might actually have the same root cause as this issue, though with different symptoms.

agilgur5 commented Apr 13, 2024

I'm able to reproduce this on latest, so should be fixing it shortly, though will need to know whether this should be backported to 3.4.x.

Another note: if you click on the same Pod name in the drop-down, the archive logs work fine, so it's just the initial click of the "Logs" button that doesn't work. EDIT: Never mind, that's because I had the WIP fix for #12590 in my codebase.

agilgur5 commented Apr 13, 2024

#12590 might actually have the same root cause as this issue

Yea, exactly the same root cause; it's just that the initial fix in #12591 was not quite at the root cause and so missed the non-archived workflows case. The root cause was in the archive logs handling itself and not specific to archived workflows -- I fixed that in #12932.

though will need to know whether this should be backported to 3.4.x.

As far as I can tell, this bug does seem to exist in 3.4.x; my root cause fix is effectively a follow-up to #9464, which partially fixed #9434. Overall, this seems to be due to the change to POD_NAMES=v2 in 3.4 (#7279 et al), which had several bugs.
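
As a rough illustration of the naming change at play (simplified and hypothetical -- not the actual controller or UI code, and it glosses over special cases such as the root node of a plain container workflow):

// Simplified, hypothetical sketch of the POD_NAMES v1 vs v2 schemes.
// v1: the pod name is simply the node ID, so callers could reuse the node ID directly.
function podNameV1(nodeId: string): string {
  return nodeId;
}

// v2: the template name is embedded, so the pod name must be derived from the
// workflow name, the template name, and a hash of the node name.
function podNameV2(workflowName: string, templateName: string, nodeNameHash: string): string {
  return `${workflowName}-${templateName}-${nodeNameHash}`;
}

// Hypothetical usage:
console.log(podNameV1('my-workflow-1234567890'));                   // "my-workflow-1234567890"
console.log(podNameV2('my-workflow', 'my-template', '1234567890')); // "my-workflow-my-template-1234567890"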

This also would theoretically impact 3.4.x as well, before the Archived/Live merge, meaning this is not a regression either.

So this is a regression, but not a 3.5 regression; it's actually a 3.4 regression that somehow hadn't been caught until now.

It's possible there's a remaining bug from 3.4, although I'd be a little surprised it hadn't been reproduced and reported earlier 🤔. Particularly by the original reporters of those two issues; #9434 in particular had a lot of follow-up and is from an active contributor.

I'm still surprised by this, but maybe people don't use the top-level logs button as often? Or, perhaps more likely, when it doesn't work, they click a node and get logs from there which works and so no one really thought too much about it or filed a bug. I think I may have done something similar subconsciously while testing actually 😅

agilgur5 added the type/regression label Apr 13, 2024
agilgur5 self-assigned this Apr 13, 2024
agilgur5 added this to the v3.5.x patches milestone Apr 13, 2024

duc00 commented Apr 13, 2024

I'm still surprised by this, but maybe people don't use the top-level logs button as often? Or, perhaps more likely, when it doesn't work, they click a node and get logs from there which works and so no one really thought too much about it or filed a bug. I think I may have done something similar subconsciously while testing actually 😅

I was surprised too; this was pretty important to me, and for some time I thought I had just misconfigured something. Good thing you found the root cause.

And thank you for looking into this 🙏
