
Work results stream sometimes exits before all data has been received #597

Closed · shanemcd opened this issue Apr 21, 2022 · 2 comments

shanemcd (Member) commented Apr 21, 2022

Background

We've seen several reports of AWX jobs entering the "error" state even though the playbook actually runs to completion.

See:

There may be more. If I find them, I'll update the list above.


The problem

AWX uses Receptor to launch remote work. We use ansible-runner's streaming interface to convert both the job's payload and its results into a format that can be delivered over a network.

The flow is something like this: AWX serializes the job into a payload with ansible-runner's transmit phase and submits it to Receptor as a work unit; Receptor delivers the payload to the target node and runs ansible-runner worker there; the work results (the job's stdout stream) are then streamed back over the network, where AWX consumes them with ansible-runner's process phase.

Now, with this context, let's take another look at ansible/ansible-runner#998. You can see that, for some reason, the data it's trying to decode is not valid base64.
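One easy way to end up with invalid base64 is a stream that gets cut off mid-transfer. A minimal Go sketch (the payload string here is made up purely for illustration) shows that chopping an encoded stream short leaves a remainder the decoder rejects:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// Encode a stand-in payload, then chop the encoded stream off
	// mid-transfer, the way a premature connection close would.
	full := base64.StdEncoding.EncodeToString([]byte("zipfile payload bytes"))
	truncated := full[:len(full)-3]

	if _, err := base64.StdEncoding.DecodeString(truncated); err != nil {
		fmt.Println(err) // illegal base64 data at input byte ...
	}
}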

With that in mind, I added some debug logging to see what was going on when ansible-runner blew up. What I found was that subsequent reads from the socket were returning an empty byte string:

>>> newline = self._input.readline()
>>> newline
b''

If you try it out yourself, this is simply the behavior of reading from a socket.makefile() after the writer has closed the connection: instead of blocking, reads return an empty byte string.
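The same behavior is easy to reproduce in Go, Receptor's implementation language: once the writer hangs up, every subsequent read returns zero bytes with io.EOF, the analogue of the empty byte string above. A self-contained sketch:

package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}

	// The "writer" sends a partial line and then closes the connection.
	go func() {
		conn, _ := ln.Accept()
		conn.Write([]byte("partial output, no trailing newline"))
		conn.Close()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	r := bufio.NewReader(conn)

	line, err := r.ReadString('\n')
	fmt.Printf("%q %v\n", line, err) // the partial data, plus io.EOF
	line, err = r.ReadString('\n')
	fmt.Printf("%q %v\n", line, err) // "" io.EOF -- just like b'' above
}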

So, why is the connection getting closed?

You can see that Receptor closes the connection once it thinks all results have been written to the channel whose contents are being streamed back to the client:

resultChan, err := c.w.GetResults(ctx, unitid, startPos)
if err != nil {
        return nil, err
}
err = cfo.WriteToConn(fmt.Sprintf("Streaming results for work unit %s\n", unitid), resultChan)
if err != nil {
        return nil, err
}
err = cfo.Close()

If we look inside GetResults, a goroutine writes to the channel that is read and sent to the client. The loop ejects when it sees an EOF while reading the stdout file; if the work unit is complete at that point, it considers the stream finished too:

if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {
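Paraphrased, that tail-follow loop has roughly the shape below. This is a sketch with stand-in types (statusData, streamResults, and fileSize are illustrative names, not Receptor's identifiers); the point is where the completion check happens relative to the reads:

package main

import (
	"fmt"
	"io"
	"strings"
	"time"
)

// Stand-ins for Receptor's types, just enough to show the loop's shape.
type statusData struct {
	State      int
	StdoutSize int64
}

type workUnit struct{ status statusData }

func (u *workUnit) Status() statusData { return u.status }

func IsComplete(state int) bool { return state == 2 } // 2 ~ a terminal state

// streamResults paraphrases GetResults' inner loop: copy stdout chunks to
// the channel, and on EOF decide whether the stream is really finished.
func streamResults(stdout io.Reader, unit *workUnit, fileSize func() int64, resultChan chan<- []byte) {
	buf := make([]byte, 1024)
	var filePos int64
	for {
		n, err := stdout.Read(buf)
		if n > 0 {
			filePos += int64(n)
			resultChan <- append([]byte(nil), buf[:n]...)
		}
		if err == io.EOF {
			stdoutSize := fileSize() // stat the file *after* seeing the EOF
			if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {
				close(resultChan) // stream considered finished
				return
			}
			time.Sleep(250 * time.Millisecond) // not done yet; wait and re-read
		}
	}
}

func main() {
	out := "all of the job's stdout\n"
	unit := &workUnit{status: statusData{State: 2, StdoutSize: int64(len(out))}}
	resultChan := make(chan []byte)
	go streamResults(strings.NewReader(out), unit, func() int64 { return int64(len(out)) }, resultChan)
	for chunk := range resultChan {
		fmt.Print(string(chunk))
	}
}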

Let's add some debug logging to inspect the variables when this condition hits:

bash-4.4$ git diff pkg/workceptor/workceptor.go 
diff --git a/pkg/workceptor/workceptor.go b/pkg/workceptor/workceptor.go
index 41bde23..f4fdd38 100644
--- a/pkg/workceptor/workceptor.go
+++ b/pkg/workceptor/workceptor.go
@@ -575,6 +575,10 @@ func (w *Workceptor) GetResults(ctx context.Context, unitID string, startPos int
                                unitStatus := unit.Status()
                                if IsComplete(unitStatus.State) && stdoutSize >= unitStatus.StdoutSize && filePos >= unitStatus.StdoutSize {
                                        logger.Debug("Stdout complete - closing channel for: %s \n", unitID)
+                                       logger.Debug("filePos for %s: %d", unitID, filePos)
+                                       logger.Debug("unit.Status().StdoutSize for %s: %d", unitID, unit.Status().StdoutSize)
+                                       logger.Debug("stdoutSize %s: %d", unitID, stdoutSize)
+                                       logger.Debug("Last bytes for %s: %d", unitID, lastBytes)
 
                                        return
                                }

For a job that blew up, I saw:

DEBUG 2022/04/20 15:27:25 Stdout complete - closing channel for: EIlF5ofA
DEBUG 2022/04/20 15:27:25 filePos for EIlF5ofA: 20394593
DEBUG 2022/04/20 15:27:25 unit.Status().StdoutSize for EIlF5ofA: 20418334
DEBUG 2022/04/20 15:27:25 stdoutSize EIlF5ofA: 20418334

Wait... why is filePos less than stdoutSize?

If you look closely, the file size is obtained directly above the suspect conditional:

stdoutSize := stdoutSize(unitdir)
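For reference, that helper just stats the unit's stdout file on disk. Its shape is roughly the following (a sketch; the real implementation may differ slightly):

package workceptor

import (
	"os"
	"path"
)

// stdoutSize reports the current on-disk size of the unit's stdout file,
// or -1 if the file cannot be stat'ed.
func stdoutSize(unitdir string) int64 {
	stat, err := os.Stat(path.Join(unitdir, "stdout"))
	if err != nil {
		return -1
	}
	return stat.Size()
}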

So now we're getting to the root of the problem:

  • The loop breaks because we saw an EOF event.
  • More data was still coming across the network.
  • Between the loop breaking and the point where we stat the file, the stdout file continued to grow slightly.

Another potential bug, and another race condition waiting to bite us, is the fact that we make two calls to unit.Status() on the same line; each call can return a different snapshot of the unit's state:

if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {

Putting all of these pieces together, the fix captures unit.Status() once so both comparisons see the same snapshot, and additionally requires filePos to have caught up to StdoutSize before closing. It looks something like this:

bash-4.4$ git diff pkg/workceptor/workceptor.go 
diff --git a/pkg/workceptor/workceptor.go b/pkg/workceptor/workceptor.go
index a38594b..41bde23 100644
--- a/pkg/workceptor/workceptor.go
+++ b/pkg/workceptor/workceptor.go
@@ -572,7 +572,8 @@ func (w *Workceptor) GetResults(ctx context.Context, unitID string, startPos int
                        }
                        if err == io.EOF {
                                stdoutSize := stdoutSize(unitdir)
-                               if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {
+                               unitStatus := unit.Status()
+                               if IsComplete(unitStatus.State) && stdoutSize >= unitStatus.StdoutSize && filePos >= unitStatus.StdoutSize {
                                        logger.Debug("Stdout complete - closing channel for: %s \n", unitID)
 
                                        return

@stanislav-zaprudskiy commented:

I just tried to reproduce ansible/ansible-runner#998 (comment) using the latest binary 1.2.0+gd5c6315 in both AWX's ee container and the EE Pod container, but AWX's task container still throws the same zipfile.BadZipFile: File is not a zip file error.

@hesmithrh commented:

Related PR: #600?
