Intermittent "InvalidTaskResultReference" when running tasks with big results #4529

Closed
skaegi opened this issue Jan 27, 2022 · 9 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug), priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release)

Comments

@skaegi (Contributor) commented Jan 27, 2022

We've been seeing an increasing number of teams reporting InvalidTaskResultReference problems in pipelines that might otherwise run successfully. We see this infrequently (roughly 1 in 40 runs), but we've now seen it in a number of different, unrelated pipelines. Something like this...


[screenshot: InvalidTaskResultReference error reported by the PipelineRun]


The "results" in question are generally large-ish -- 2K+ (correction here thanks @pritidesai) and are present when we look ;)

We suspect that what's happening is that there is a race somewhere. Perhaps a race between the "next" task running and the current TaskRun being updated with the Result. That's just a guess, but does this seem possible/likely?

I'll try to create a good test case, but this sort of race condition is not easy to trigger on demand, even though it clearly occurs with some regularity.
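
For whoever picks up the test case: below is a minimal sketch of a step that produces a result in the size range reported above. It assumes the default /tekton/results/<name> result path and a hypothetical result named big-result, so treat it as illustrative rather than a ready-made reproducer.

// Hypothetical reproducer step: emit a ~2 KB result the way a Tekton step would,
// by writing it to the file that $(results.big-result.path) is assumed to resolve to.
package main

import (
	"os"
	"strings"
)

func main() {
	// Build a payload a little over 2 KB, comparable to the results reported above.
	payload := strings.Repeat("0123456789abcdef", 160) // 2560 bytes

	// Tekton's entrypoint reads this file after the step exits and carries its
	// content forward as the step result.
	if err := os.WriteFile("/tekton/results/big-result", []byte(payload), 0o644); err != nil {
		panic(err)
	}
}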

@skaegi skaegi added the kind/bug Categorizes issue or PR as related to a bug. label Jan 27, 2022
@pritidesai (Member) commented Feb 7, 2022

I think it's very likely possible, and I'm surprised to see this happening intermittently.

The taskRun status is updated once all containers in the pod have terminated (pod succeeded or failed).

pipeline/pkg/pod/status.go, lines 109 to 115 in 6cb0f4c:

	complete := areStepsComplete(pod) || pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed

	if complete {
		updateCompletedTaskRunStatus(logger, trs, pod)
	} else {
		updateIncompleteTaskRunStatus(trs, pod)
	}

updateCompletedTaskRunStatus includes updating the condition and the completion time.

But more updates are done after the status is marked completed.

In the end, the taskRun status is updated with the results:

trs.TaskRunResults = removeDuplicateResults(trs.TaskRunResults)

So if it's possible for the pipelineRun controller to pick up the completed status before the task results are populated, it's also possible to run into this issue.
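
To make the ordering hazard concrete, here is a minimal sketch with simplified stand-in types (not the actual Tekton controller code) showing how a status whose condition has already flipped to succeeded, but whose TaskRunResults have not yet been written back, yields exactly this error:

// A minimal sketch of the race described above: "succeeded" is only a safe
// signal that results are ready if the results land in the same status update.
package main

import "fmt"

// Simplified stand-ins for the TaskRun status fields mentioned above.
type TaskRunResult struct {
	Name  string
	Value string
}

type TaskRunStatus struct {
	Succeeded      bool            // condition set by updateCompletedTaskRunStatus
	TaskRunResults []TaskRunResult // populated later in the same reconcile
}

// resolveResult mimics what the pipelineRun controller has to do when the next
// task references $(tasks.<name>.results.<result>). If it only checks the
// condition, a status that is "succeeded but results not yet populated"
// produces the InvalidTaskResultReference symptom.
func resolveResult(trs TaskRunStatus, name string) (string, error) {
	if !trs.Succeeded {
		return "", fmt.Errorf("task not finished yet")
	}
	for _, r := range trs.TaskRunResults {
		if r.Name == name {
			return r.Value, nil
		}
	}
	// The window described above: condition already flipped,
	// TaskRunResults not yet written back.
	return "", fmt.Errorf("InvalidTaskResultReference: result %q not found", name)
}

func main() {
	racy := TaskRunStatus{Succeeded: true} // results not populated yet
	if _, err := resolveResult(racy, "big-result"); err != nil {
		fmt.Println(err)
	}
}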

@lbernick (Member) commented Feb 7, 2022

/priority important-soon

@tekton-robot tekton-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2022
@pritidesai (Member)

@lbernick thanks for adding the priority 👍 Let me know if you want to take a stab at it; otherwise I'm happy to try fixing it 🙏

/assign

@chitrangpatel (Contributor)

/assign

@vdemeester (Member)

@skaegi @pritidesai if you manage to reproduce this, can you look at the container's termination message? We have some similar failures that happen because, in the failing case, the termination message content (JSON) is cut off in the middle (and is thus invalid).
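
In case it helps with that, here is a small client-go sketch for dumping each container's termination message from the TaskRun's pod (the kubeconfig path, namespace, and pod name are placeholders):

// Fetch the TaskRun's pod and print every terminated container's message.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod, err := client.CoreV1().Pods("my-namespace").Get(context.TODO(), "my-taskrun-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Tekton carries step results in the step containers' termination messages,
	// so a truncated or invalid JSON payload here would explain the missing result.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated != nil {
			fmt.Printf("container %s (%d bytes):\n%s\n\n",
				cs.Name, len(cs.State.Terminated.Message), cs.State.Terminated.Message)
		}
	}
}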

@vdemeester (Member)

/assign
I think I know why this is happening. I'll create an additional issue around this, but in short, it's because of the termination message behavior in Kubernetes with multiple containers.

The total size of the termination messages of all containers in a pod cannot exceed 12 KB. If the total size exceeds 12 KB, the Kubernetes state manager limits the termination message size of each container. For example, if a pod contains four init containers and eight application containers, each container's termination message is limited to 1 KB, meaning only the first 1 KB of each container's termination message is retained.
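
To illustrate the failure mode, here is a rough sketch of the size math quoted above (purely illustrative numbers, and only an approximation of the JSON that ends up in the termination message, not Tekton's exact serialization): a JSON payload cut at the per-container budget no longer unmarshals, which is exactly the "cut in the middle (and thus invalid)" case.

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	const totalBudget = 12 * 1024 // 12 KiB across the whole pod, per the docs quoted above
	containers := 12              // e.g. 4 init containers + 8 application containers
	perContainer := totalBudget / containers
	fmt.Printf("per-container budget: %d bytes\n", perContainer) // 1024

	// A result payload shaped roughly like the key/value entries Tekton writes
	// into the termination message.
	payload, _ := json.Marshal([]map[string]string{
		{"key": "big-result", "value": strings.Repeat("x", 2500)},
	})
	fmt.Printf("payload size: %d bytes\n", len(payload))

	// Truncate at the per-container budget, as described above.
	truncated := payload[:perContainer]

	var parsed []map[string]string
	if err := json.Unmarshal(truncated, &parsed); err != nil {
		// The truncated JSON is invalid, so the result can no longer be read back.
		fmt.Println("unmarshal failed:", err)
	}
}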

@vdemeester (Member)

@skaegi @pritidesai @chitrangpatel see #4808

@vdemeester (Member)

@skaegi #4826 has a "regression" test that reproduces this consistently between 0.32 and 0.35 😉

@dibyom (Member) commented Jun 14, 2022

Closing in favor of #4808, which tracks larger results in general.

dibyom closed this as completed Jun 14, 2022