Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Result_traceback should not include job stdout #12961

Merged
merged 1 commit into from
Jan 3, 2023

Conversation

fosterseth
Copy link
Member

@fosterseth fosterseth commented Sep 27, 2022

related #12644

SUMMARY

If a job fails, we do receptor work results and put that output into result_traceback.

We should only do this if

  1. Receptor unit has failed
  2. Runner callback processed 0 events

Otherwise we risk putting too much data into this field.

Would require this receptor change ansible/receptor#672

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
AWX VERSION
awx: 21.5.1.dev221+gaeb614e45d

# massive, only ask for last 1000 bytes
startpos = max(stdout_size - 1000, 0)
resultsock = receptor_ctl.get_work_results(self.unit_id, startpos=startpos, return_sockfile=True)
lines = resultsock.readlines()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about this hanging if the stream doesn't end with a \n. We may need to put the socket in non-blocking mode before doing this and just read the raw bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some local testing and readlines() seems to be okay with files that don't end in \n. If it gets an EOF it just stops there.

we could also just do read() or read(1000)

as for non-blocking I added resultsock.setblocking(False). Hard to test this exactly, but I will make sure this codepath doesn't throw errors

@fosterseth fosterseth force-pushed the fix_results_traceback branch from aeb614e to 94410bf Compare September 27, 2022 16:43
Copy link
Member

@AlanCoding AlanCoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shanemcd's comment may be a valid concern, but this doesn't seem to make it any worse than it was before in that respect, and should prevent some other disaster cases.

@fosterseth fosterseth force-pushed the fix_results_traceback branch 2 times, most recently from dd8458a to f6cffa7 Compare September 28, 2022 04:51
# contain useful information about why the job failed. In case stdout is
# massive, only ask for last 1000 bytes
startpos = max(stdout_size - 1000, 0)
resultsock, resultfile = receptor_ctl.get_work_results(self.unit_id, startpos=startpos, return_socket=True, return_sockfile=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's are the version compatibility issues with this? Does this requires an updated receptor, and does that make it unreasonable to backport? Is there any idea to work on a different fix for backports?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the receptor change is minimal, so backporting shouldn't be too hard.

a more minimal version of this that doesn't require a receptor change is to not do get_work_results at all if stdout size > 1000.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so we could merge that and develop that solution you mention and test it separately. Fine by me.

@fosterseth fosterseth force-pushed the fix_results_traceback branch from f6cffa7 to 7d7458f Compare December 9, 2022 21:33
@AlanCoding
Copy link
Member

I really want to get this merged!

If a job fails, we do receptor work results and put that output
into result_traceback.

We should only do this if
1. Receptor unit has failed
2. Runner callback processed 0 events

Otherwise we risk putting too much data into this field.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants