Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local retry behavior keeps previous run's (now duplicate) entries #10916

Closed
MattGal opened this issue Sep 19, 2022 · 16 comments
Closed

Local retry behavior keeps previous run's (now duplicate) entries #10916

MattGal opened this issue Sep 19, 2022 · 16 comments
Assignees

Comments

@MattGal
Copy link
Member

MattGal commented Sep 19, 2022

Related to https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home

I first noticed that the problem was, suspiciously, held to a relatively low number of flaky tests:

let ids=
Messages
| where ingestion_time() > ago(2d)
| where MessageText contains "(Caused by ResponseError('too many 500 error responses'))"
| project WorkItemId;
WorkItems
| where WorkItemId in (ids)
| summarize  count() by FriendlyName

(14 in the past 2 days, out of about 1M work items hit this)

I observed this only happens in tests where a local retry had occurred.

The code is basically

while(should_execute):
  <run some tests>
  if (some results got made):
    run_results += [results]
    if (we need to retry)
      should_execute = true

if (we have an azdo config and should_execute= false)
  Submit (run_results)

I think this is a one-liner fix and will make a PR but will absolutely wait for @ChadNedzlek approval before merging. As he's on FR this week, preemptively assigning to him.

@MattGal
Copy link
Member Author

MattGal commented Sep 19, 2022

Created https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25935 to change the += to an =
EDIT This was wrong

@ChadNedzlek
Copy link
Member

This is definitely intentional. What bad behavior are you seeing?

@ChadNedzlek
Copy link
Member

We are gathering up all the runs so that we can analyze the set of them together to see, for any given test, did it pass in at least one of the result sets. If you remove the previous results from the set, flaky test analysis will break.

@ChadNedzlek
Copy link
Member

Let me double check the code path to make sure it's doing what I think it's doing. But this code has worked correctly for years, so I'm skeptical it's this fundamentally incorrect.

@ChadNedzlek
Copy link
Member

I double checked. The code is definitely doing what it was intended to do, it creating a List[List[TestResults]] where each List[TestResults] is for a single execution. It is 100% intentional that we are gathering up all the various independent runs together to do aggregate analysis on.

@ChadNedzlek
Copy link
Member

ChadNedzlek commented Sep 19, 2022

It's possible the just broke reporting of sub-tests with the flaky attributes on them, since these would be the only tests that do that. We are using a bunch of undocumented behaviors there (in particular, we add "IsTestResultFlaky:True" and "AttemptId:1" to those tests, and se the result_group_type to "rerun"), which is pretty unusual (but it's the only way I could make the AzDO UI do what I wanted).

So it sounds like they just broke flaky tests with their latest change. Or if they want us to report them in some other way, we can do that too, so that it gets the little "flaky" designator in AzDO.

@ChadNedzlek
Copy link
Member

I'm considering a change to the python module that will use the... mutable... nature of python to save the request body if we get a failed response form AzDO. They are blaming us for doing something wrong, but I don't believe that's actually happening, so I need access to one of the requests we are sending.

@ChadNedzlek
Copy link
Member

I pushed a PR that should save the body of the requests when we get this Exception: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26163

Hopefully that will gather some data we can use to figure out what's going on.

@ChadNedzlek
Copy link
Member

I tried catching the broken body, but I failed, because it wasn't a string yet, so I needed to add another PR to capture it right: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26555

@ChadNedzlek
Copy link
Member

Once we have the broken body, we can go back to the IcM showing them what we uploaded.

@ChadNedzlek ChadNedzlek removed their assignment Oct 6, 2022
@missymessa
Copy link
Member

Followed up with Chad on next steps:

Look up workitems that hit this error with this query:

WorkItems 
| where Queued > datetime(2022-10-12 19:00:00) and ExitCode == -4 
| join kind=inner(Messages | where MessageText has "too many 500") on WorkItemId 
| join kind=inner(Logs) on WorkItemId

Check the logs for run_client.py and verify the request was uploaded.

@missymessa missymessa self-assigned this Oct 12, 2022
@missymessa
Copy link
Member

Provided the IcM with logs to show what we're uploading to AzDO, and verified that the tests in that log only contained IsTestResultFlaky once if it contains it at all.

@missymessa
Copy link
Member

AzDO replied on the IcM yesterday: "We have synced the configurations of 'dnceng' org to 'dnceng-public' org, will monitor if this solves the issue or not." We'll have to verify if that actually fixed things or not.

@missymessa
Copy link
Member

According to our Kusto log, the last instance of the "too many 500" occurred on 2022-10-19T08:34:32.98Z. If we start seeing this issue crop up again, we can reopen this issue.

@ChadNedzlek
Copy link
Member

So it was something on their end the whole time?

@missymessa
Copy link
Member

It seems that way. Their last message was that they were syncing the config between dnceng and dnceng-public, which MattGal pointed out was them setting the same feature flags in denceng-public as were set in dnceng.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants