Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query: flaky test - ToListAsync_with_canceled_token #27033

Closed
maumar opened this issue Dec 17, 2021 · 22 comments · Fixed by #27036
Closed

Query: flaky test - ToListAsync_with_canceled_token #27033

maumar opened this issue Dec 17, 2021 · 22 comments · Fixed by #27036
Assignees
Labels
area-query area-test closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-bug
Milestone

Comments

@maumar
Copy link
Contributor

maumar commented Dec 17, 2021

Message: 
Assert.DoesNotContain() Failure
Found: Microsoft.EntityFrameworkCore.Query.QueryIterationFailed
In value: SelectListIterator<ValueTuple<LogLevel, EventId, String, Object, Exception>, EventId> [Microsoft.EntityFrameworkCore.Query.QueryCompilationStarting, Microsoft.EntityFrameworkCore.Query.QueryExecutionPlanned, Microsoft.EntityFrameworkCore.Query.QueryCompilationStarting, Microsoft.EntityFrameworkCore.Query.QueryExecutionPlanned, Microsoft.EntityFrameworkCore.Query.QueryCompilationStarting, ...]

fails on my box when I run all tests at once, passes if ran individually

@smitpatel
Copy link
Member

I also ran into this.

@roji
Copy link
Member

roji commented Dec 17, 2021

Will look into it.

BTW does it fail consistently for you (when running all tests at once)?

@vonzshik
Copy link
Contributor

That's to be expected, with it not clearing the log (unlike ToListAsync_can_be_canceled does) and potentially running in parallel...

[ConditionalFact]
public virtual async Task ToListAsync_with_canceled_token()
{
using var context = CreateContext();
await Assert.ThrowsAsync<OperationCanceledException>(() => context.Employees.ToListAsync(new CancellationToken(true)));
Assert.Contains(CoreEventId.QueryCanceled, Fixture.ListLoggerFactory.Log.Select(l => l.Id));
Assert.DoesNotContain(CoreEventId.QueryIterationFailed, Fixture.ListLoggerFactory.Log.Select(l => l.Id));
}

@roji
Copy link
Member

roji commented Dec 18, 2021

@vonzshik can you elaborate? Fixture (along with ListLoggerFactory) is an xunit class fixture, and should never get used concurrently (that's how it works in xunit, not the same as nunit). The clearing in ToListAsync_can_be_canceled is because that test performs the same operation in a loop, and we don't want log messages from different iterations to leak - it's not there to guard against concurrency issues.

@ajcvickers
Copy link
Member

ajcvickers commented Dec 18, 2021

@roji Agreed there are no concurrency issues. However, not clearing the log means that it will contain arbitrary context depending on the order that the tests in the fixture are run. And that can be any order, and different from run to run.

@roji
Copy link
Member

roji commented Dec 18, 2021

@ajcvickers but we clear the logs in the constructor of the test class, which gets executed before every test, right? Put another way, our tests almost never clear the logs themselves - that's handled by our infrastructure, no?

@roji
Copy link
Member

roji commented Dec 18, 2021

BTW does it fail consistently for you (when running all tests at once)?

@maumar @smitpatel BTW on which provider does it fail (SQL Server/SQLite)? It doesn't repro for me here (though Linux...).

@ajcvickers
Copy link
Member

@roji Only on SQL Server, right?

@roji
Copy link
Member

roji commented Dec 18, 2021

We do this consistently everywhere AFAIK. If some provider neglects to do this in their test classes, wouldn't all their tests start failing immediately, as logs from previous tests appear for later ones?

@ajcvickers
Copy link
Member

Never mind, we do it in SQLite too.

But no, I don't think tests will generally fail. We don't always look at logs.

@roji
Copy link
Member

roji commented Dec 18, 2021

OK. Certainly most query test classes assert on SQL (on SQL Server at least).

One thing I'm not clear about, is why we don't simply clear the logs in the test base classes (Fixture.ListLoggerFactory.Clear) instead of doing it on each concrete test class... @maumar any idea?

@roji
Copy link
Member

roji commented Dec 18, 2021

Ahhh, maybe retries again? If the first attempt fails because of some transient problem, the second attempt will already have a QueryIterationFailed message in the logs. Since we can't clear the logs between execution retries, best to just remove the assertion against QueryIterationFailed.

roji added a commit to roji/efcore that referenced this issue Dec 18, 2021
Remove assertions that QueryIterationFailed doesn't appear in the logs,
since it may be there because of transient failures from previous
retries.

Fixes dotnet#27033
@roji roji self-assigned this Dec 18, 2021
@roji roji added this to the 7.0.0 milestone Dec 18, 2021
@roji roji added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label Dec 18, 2021
@vonzshik
Copy link
Contributor

@vonzshik can you elaborate? Fixture (along with ListLoggerFactory) is an xunit class fixture, and should never get used concurrently (that's how it works in xunit, not the same as nunit). The clearing in ToListAsync_can_be_canceled is because that test performs the same operation in a loop, and we don't want log messages from different iterations to leak - it's not there to guard against concurrency issues.

@ajcvickers but we clear the logs in the constructor of the test class, which gets executed before every test, right? Put another way, our tests almost never clear the logs themselves - that's handled by our infrastructure, no?

Welp, I was wrong on both accounts, teaches me to always check beforehand whether everything works exactly as I thought :D

Ahhh, maybe retries again? If the first attempt fails because of some transient problem, the second attempt will already have a QueryIterationFailed message in the logs.

But what's the reason for a retry, especially locally? This might indicate some other test suite issue...

@ajcvickers
Copy link
Member

@vonzshik

But what's the reason for a retry, especially locally? This might indicate some other test suite issue...

Spoken like a true PostgreSQLer. 😜 SQL Server/SqlClient sometimes just fails...

@roji
Copy link
Member

roji commented Dec 18, 2021

Welp, I was wrong on both accounts, teaches me to always check beforehand whether everything works exactly as I thought :D

Very normal for someone not yet very familiar with EF Core internals, xunit, etc... That's how one learns!

Spoken like a true PostgreSQLer.

Ahahaha indeed :)

@roji
Copy link
Member

roji commented Dec 18, 2021

BTW I wonder if all the SQL Server/SqlClient reliability issues are actually LocalDB issues... it's possible that it wasn't meant for the kind of massive, concurrent load that our tests do (it's hard to imagine that real SQL Server is this unreliable in the world). We could simply use real SQL Server in our CI pipeline (and also on local dev machines), though there's also an advantage in having these reliability issues in the CI pipeline - it's good coverage for our retry logic etc. (chaos monkey!).

@ajcvickers
Copy link
Member

See #25448 (comment)

But LocalDb is easy for contributors, and in the past we have had trouble getting images for Helix/AzDo with SQL Server set up appropriately. That may not be an issue anymore.

@roji
Copy link
Member

roji commented Dec 18, 2021

Yeah. We can still leave the default test setup to be LocalDB (for contributors), but configure our CI tests to run on real SQL Server (assuming Helix/AzDo can handle that today). But I still there's some value in the chaos flakiness introduced by LocalDB.

@ErikEJ
Copy link
Contributor

ErikEJ commented Dec 18, 2021

I wonder why LocalDB should be flaky, it is the same engine as SQL Server Full.

I assume attachdbfilename is not used.

Only difference would be named pipes communication.

Also instant file initialization rights could play a role.

@ajcvickers
Copy link
Member

@ErikEJ

Only difference would be named pipes communication.

I have suspected for many years that this might be the issue. Which also explains why it never get fixed; nobody would use named pipes for anything real.

roji added a commit that referenced this issue Dec 18, 2021
Remove assertions that QueryIterationFailed doesn't appear in the logs,
since it may be there because of transient failures from previous
retries.

Fixes #27033
@vonzshik
Copy link
Contributor

@roji even taking into account SqlClient/SqlServer issues, the flaky test passes an already cancelled token. At no point there should be any IO or retries at all...

@roji
Copy link
Member

roji commented Dec 18, 2021

That should be true, although... whenever EF executes a query, it opens and closes the underlying DbConnection. As you yourself pointed out, the login sequence seems like it's synchronous, so it's possible the token isn't even consulted at that point...

Basically at this point I don't want to make any assumptions about the driver which I don't have to 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-query area-test closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants