upper: no more rows in this result set
#2333
Comments
The GC should never delete any UID+version that is currently live, i.e. if a UID+version can be found in etcd, then we should never delete it. For e2e tests, this is configured so that every 30s we try to delete any offloaded record older than 30s. This means that any running workflow's record would be GC'd after about 1m, i.e. any problems should appear after 1m. However, at some point greater than 1m, records appear to be getting deleted. I think this may be happening after some kind of restart - maybe the lister returns zero records?
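For readers following along, here is a minimal, self-contained sketch (in Go, the language the controller is written in) of the invariant described above: an offload record is only a deletion candidate if its exact UID+version is not the live version of some workflow. The type and function names below are made up for illustration and are not the controller's actual code.

```go
package main

import "fmt"

// offloadRecord is a stand-in for a row in the offload table
// (workflow UID plus node-status version). Illustrative only.
type offloadRecord struct {
	UID     string
	Version string
}

// garbageCollect sketches the intended invariant: never delete a record whose
// exact UID+version is still live (i.e. still referenced by a workflow in etcd).
// Everything here is an approximation of the controller's logic, not its code.
func garbageCollect(oldRecords []offloadRecord, liveVersions map[string]string) (toDelete []offloadRecord) {
	for _, record := range oldRecords {
		liveVersion, ok := liveVersions[record.UID]
		if ok && liveVersion == record.Version {
			// Still the current version for a live workflow: must keep it.
			continue
		}
		toDelete = append(toDelete, record)
	}
	return toDelete
}

func main() {
	old := []offloadRecord{
		{UID: "88e182ad", Version: "fnv:2438327164"}, // live, current version
		{UID: "88e182ad", Version: "fnv:900512843"},  // superseded version
	}
	live := map[string]string{"88e182ad": "fnv:2438327164"}
	fmt.Println(garbageCollect(old, live)) // only the superseded record is a deletion candidate
}
```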
If |
Bug in tests. |
@alexec we're still seeing this issue in 2.10.0 |
Please upgrade to v2.11.8 |
I have been seeing this issue in v2.12.2, it continued happening after increasing |
Interesting... You should NEVER see this error.
My question for @markterm is - did you see this in the controller or argo-server? |
I got the error from the argo-server, but I have confirmed that the workflow-controller deleted the record prematurely. I did a run with some extra logging, and in this case saw that 'fnv:900512843' was deleted while the workflow was still running, and the liveOffloadNodeStatusVersions for that workflow in the workflowGarbageCollector function was set to an empty string. |
It appears that if I change workflow/controller/controller.go:442 to `if !ok || (nodeStatusVersion != record.Version && nodeStatusVersion != "") {`, then the problem doesn't occur (see the sketch after this comment). Also, I have logged the liveOffloadNodeStatusVersions value for one of the workflows in progress, here it is:
However I did kubectl -o yaml on this same workflow at 02:38, and got:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  ...
  uid: 88e182ad-d01b-4e30-8b16-34f9a311f364
  ...
status:
  finishedAt: null
  offloadNodeStatusVersion: fnv:2438327164
  phase: Running
  progress: 2/2
  resourcesDuration:
    cpu: 46
    memory: 28
  startedAt: "2021-01-11T02:11:22Z"
```

As you can see, the problem is that it has an offload version in etcd, but the workflow-controller doesn't seem to be seeing it. This is from the argo_workflows table in the DB: |
I'm not sure that is the right fix. It is possible (but very unlikely) for a workflow to have a non-empty value for a nodeStatusVersion, and subsequently for that to be set to empty. This might happen if there was an update conflict. That may be what happened to you. But that doesn't explain to me why we would delete the current version. For the controller, we only ever care about the most current version. If we can't get it, we'll error the workflow. Do you see errored workflows? |
I agree it’s a patch, not the right fix.
I’m not sure if I saw the workflows going into error state with the status deleted, but they certainly got stuck.
One note is that we are configured to always offload, so the nodeStatusVersion should never be empty.
Mark.
|
> always offload

This is only intended for dev environments - you might be better off using the workflow archive.

> nodeStatusVersion should never be empty

So there presumably is a bug. |
Yes, I agree that indicates a bug - it seems like it’s seeing the nodeStatusVersion as empty when it isn’t.
We are using node offload because otherwise, when a workflow is big enough to be using compression, calling the ‘node set’ API intermittently fails to resume the selected node (or maybe it does resume it and that gets overwritten). I’ve not been able to track down more details for now.
|
ah - so maybe you're using argo node set and there is a bug there? You must use that with ARGO_SERVER. |
That’s correct, we did.
|
This sounds like an edge case bug to me. Can I ask why you need to have ALWAYS_OFFLOAD_NODE_STATUS? |
We're using it to avoid the bug in argo node set that occurs when the node status is compressed. |
It is expensive to run using ALWAYS_OFFLOAD_NODE_STATUS: CPU+memory+network+disk cost will all be much higher, so your AWS bills will be higher too. That one change may stop the issue in most cases - and reduce your bills. I think your problem could be caused by this, in fact. Do you have ALWAYS_OFFLOAD_NODE_STATUS when you call argo set?

@simster7 I've inspected SetWorkflow and I can see a minor bug in SetWorkflow on line 480: we do not hydrate the workflow on this line. |
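A minimal sketch of the hydrate-before-mutate pattern the last sentence refers to, assuming a deliberately simplified workflow type; the helper names (setNodeOutput, hydrate, dehydrate) are placeholders for illustration, not the server's real API.

```go
package main

import (
	"errors"
	"fmt"
)

// workflow is a drastically simplified stand-in for a Workflow whose node
// status may be offloaded or compressed instead of stored on the object.
type workflow struct {
	offloadVersion string            // non-empty means nodes are offloaded
	nodes          map[string]string // node name -> status, only valid once hydrated
}

// setNodeOutput sketches the pattern: hydrate first, then mutate, then
// dehydrate before persisting. Skipping the hydrate step means the lookup
// below runs against an empty node map and silently does the wrong thing.
func setNodeOutput(wf *workflow, nodeName, status string,
	hydrate func(*workflow) error, dehydrate func(*workflow) error) error {

	if err := hydrate(wf); err != nil { // load offloaded/compressed nodes
		return err
	}
	if _, ok := wf.nodes[nodeName]; !ok {
		return errors.New("node not found: " + nodeName)
	}
	wf.nodes[nodeName] = status
	return dehydrate(wf) // offload/compress again before saving
}

func main() {
	wf := &workflow{offloadVersion: "fnv:123"}
	hydrate := func(w *workflow) error { // pretend to fetch the offloaded nodes
		w.nodes = map[string]string{"step-a": "Running"}
		return nil
	}
	dehydrate := func(w *workflow) error { return nil } // pretend to re-offload
	fmt.Println(setNodeOutput(wf, "step-a", "Succeeded", hydrate, dehydrate))
}
```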
The 'argo node set' API is working when ALWAYS_OFFLOAD_NODE_STATUS is true, and not working when ALWAYS_OFFLOAD_NODE_STATUS is false and the node status is compressed.
We always have ALWAYS_OFFLOAD_NODE_STATUS consistent between the API Server and the workflow-controller.
|
You must use ALWAYS_OFFLOAD_NODE_STATUS with the CLI too: `env ALWAYS_OFFLOAD_NODE_STATUS=true argo node set ...`
But as I said, you should not be using this option. |
I’m not using the CLI; we only do the node set via the API server.
|
upper: no more rows in this result set
Still seeing this issue locally. We should never see this for offloaded workflows.