
Work results stream sometimes exits before all data has been received #597

Closed · shanemcd opened this issue Apr 21, 2022 · 2 comments

shanemcd (Member) commented Apr 21, 2022

Background

We've seen several reports of AWX jobs entering the "error" state even though the playbook actually runs to completion.

See:

There may be more. If I find them, I'll update the list above.


The problem

AWX uses Receptor to launch remote work. We use ansible-runner's streaming interface to convert both the job's payload and its results into a format that can be delivered over a network.

The flow is something like this: AWX serializes the job into a payload with ansible-runner's transmit phase and submits it to Receptor as a work unit; Receptor delivers the payload to the target node and runs ansible-runner worker there; the work results (the job's stdout stream) are then streamed back over the network, where AWX consumes them with ansible-runner's process phase.

Now, with this context, let's take another look at ansible/ansible-runner#998. You can see that, for some reason, the data it's trying to decode is not valid base64.
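One easy way to end up with invalid base64 is a stream that gets cut off mid-transfer. A minimal Go sketch (the payload string here is made up purely for illustration) shows that chopping an encoded stream short leaves a remainder the decoder rejects:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// Encode a stand-in payload, then chop the encoded stream off
	// mid-transfer, the way a premature connection close would.
	full := base64.StdEncoding.EncodeToString([]byte("zipfile payload bytes"))
	truncated := full[:len(full)-3]

	if _, err := base64.StdEncoding.DecodeString(truncated); err != nil {
		fmt.Println(err) // illegal base64 data at input byte ...
	}
}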

With that in mind, I added some debug logging to see what was going on when ansible-runner blew up. What I found was that subsequent reads from the socket were returning an empty byte string:

>>> newline = self._input.readline()
>>> newline
b''

If you try it out yourself, this is simply the behavior of reading from a socket.makefile() after the writer has closed the connection: instead of blocking, reads return an empty byte string.
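The same behavior is easy to reproduce in Go, Receptor's implementation language: once the writer hangs up, every subsequent read returns zero bytes with io.EOF, the analogue of the empty byte string above. A self-contained sketch:

package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}

	// The "writer" sends a partial line and then closes the connection.
	go func() {
		conn, _ := ln.Accept()
		conn.Write([]byte("partial output, no trailing newline"))
		conn.Close()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	r := bufio.NewReader(conn)

	line, err := r.ReadString('\n')
	fmt.Printf("%q %v\n", line, err) // the partial data, plus io.EOF
	line, err = r.ReadString('\n')
	fmt.Printf("%q %v\n", line, err) // "" io.EOF -- just like b'' above
}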

So, why is the connection getting closed?

You can see that Receptor closes the connection once it thinks all results have been written to the channel whose contents are being streamed back to the client:

resultChan, err := c.w.GetResults(ctx, unitid, startPos)
if err != nil {
        return nil, err
}
err = cfo.WriteToConn(fmt.Sprintf("Streaming results for work unit %s\n", unitid), resultChan)
if err != nil {
        return nil, err
}
err = cfo.Close()

If we look inside GetResults, a goroutine writes to the channel that is read and sent to the client. The loop ejects when it sees an EOF while reading the stdout file; if the work unit is complete at that point, it considers the stream finished too:

if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {
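Paraphrased, that tail-follow loop has roughly the shape below. This is a sketch with stand-in types (statusData, streamResults, and fileSize are illustrative names, not Receptor's identifiers); the point is where the completion check happens relative to the reads:

package main

import (
	"fmt"
	"io"
	"strings"
	"time"
)

// Stand-ins for Receptor's types, just enough to show the loop's shape.
type statusData struct {
	State      int
	StdoutSize int64
}

type workUnit struct{ status statusData }

func (u *workUnit) Status() statusData { return u.status }

func IsComplete(state int) bool { return state == 2 } // 2 ~ a terminal state

// streamResults paraphrases GetResults' inner loop: copy stdout chunks to
// the channel, and on EOF decide whether the stream is really finished.
func streamResults(stdout io.Reader, unit *workUnit, fileSize func() int64, resultChan chan<- []byte) {
	buf := make([]byte, 1024)
	var filePos int64
	for {
		n, err := stdout.Read(buf)
		if n > 0 {
			filePos += int64(n)
			resultChan <- append([]byte(nil), buf[:n]...)
		}
		if err == io.EOF {
			stdoutSize := fileSize() // stat the file *after* seeing the EOF
			if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {
				close(resultChan) // stream considered finished
				return
			}
			time.Sleep(250 * time.Millisecond) // not done yet; wait and re-read
		}
	}
}

func main() {
	out := "all of the job's stdout\n"
	unit := &workUnit{status: statusData{State: 2, StdoutSize: int64(len(out))}}
	resultChan := make(chan []byte)
	go streamResults(strings.NewReader(out), unit, func() int64 { return int64(len(out)) }, resultChan)
	for chunk := range resultChan {
		fmt.Print(string(chunk))
	}
}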

Let's add some debug logging to inspect the variables when this condition hits:

bash-4.4$ git diff pkg/workceptor/workceptor.go 
diff --git a/pkg/workceptor/workceptor.go b/pkg/workceptor/workceptor.go
index 41bde23..f4fdd38 100644
--- a/pkg/workceptor/workceptor.go
+++ b/pkg/workceptor/workceptor.go
@@ -575,6 +575,10 @@ func (w *Workceptor) GetResults(ctx context.Context, unitID string, startPos int
                                unitStatus := unit.Status()
                                if IsComplete(unitStatus.State) && stdoutSize >= unitStatus.StdoutSize && filePos >= unitStatus.StdoutSize {
                                        logger.Debug("Stdout complete - closing channel for: %s \n", unitID)
+                                       logger.Debug("filePos for %s: %d", unitID, filePos)
+                                       logger.Debug("unit.Status().StdoutSize for %s: %d", unitID, unit.Status().StdoutSize)
+                                       logger.Debug("stdoutSize %s: %d", unitID, stdoutSize)
+                                       logger.Debug("Last bytes for %s: %d", unitID, lastBytes)
 
                                        return
                                }

For a job that blew up, I saw:

DEBUG 2022/04/20 15:27:25 Stdout complete - closing channel for: EIlF5ofA
DEBUG 2022/04/20 15:27:25 filePos for EIlF5ofA: 20394593
DEBUG 2022/04/20 15:27:25 unit.Status().StdoutSize for EIlF5ofA: 20418334
DEBUG 2022/04/20 15:27:25 stdoutSize EIlF5ofA: 20418334

Wait... why is filePos less than stdoutSize?

If you look closely, the file size is obtained directly above the suspect conditional:

stdoutSize := stdoutSize(unitdir)
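For reference, that helper just stats the unit's stdout file on disk. Its shape is roughly the following (a sketch; the real implementation may differ slightly):

package workceptor

import (
	"os"
	"path"
)

// stdoutSize reports the current on-disk size of the unit's stdout file,
// or -1 if the file cannot be stat'ed.
func stdoutSize(unitdir string) int64 {
	stat, err := os.Stat(path.Join(unitdir, "stdout"))
	if err != nil {
		return -1
	}
	return stat.Size()
}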

So now we're getting to the root of the problem:

  • The loop breaks because we saw an EOF event.
  • More data was still coming across the network.
  • Between the loop breaking and the point where we stat the file, the stdout file continued to grow slightly.

Another potential bug, and another race condition waiting to bite us, is the fact that we make two calls to unit.Status() on the same line; each call can return a different snapshot of the unit's state:

if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {

Putting all of these pieces together, the fix captures unit.Status() once so both comparisons see the same snapshot, and additionally requires filePos to have caught up to StdoutSize before closing. It looks something like this:

bash-4.4$ git diff pkg/workceptor/workceptor.go 
diff --git a/pkg/workceptor/workceptor.go b/pkg/workceptor/workceptor.go
index a38594b..41bde23 100644
--- a/pkg/workceptor/workceptor.go
+++ b/pkg/workceptor/workceptor.go
@@ -572,7 +572,8 @@ func (w *Workceptor) GetResults(ctx context.Context, unitID string, startPos int
                        }
                        if err == io.EOF {
                                stdoutSize := stdoutSize(unitdir)
-                               if IsComplete(unit.Status().State) && stdoutSize >= unit.Status().StdoutSize {
+                               unitStatus := unit.Status()
+                               if IsComplete(unitStatus.State) && stdoutSize >= unitStatus.StdoutSize && filePos >= unitStatus.StdoutSize {
                                        logger.Debug("Stdout complete - closing channel for: %s \n", unitID)
 
                                        return

@stanislav-zaprudskiy commented:

I just tried to reproduce ansible/ansible-runner#998 (comment) using the latest binary 1.2.0+gd5c6315 in both AWX's ee container and the EE Pod container, but AWX's task container still throws the same zipfile.BadZipFile: File is not a zip file error.

@hesmithrh commented:

Related PR: #600?
