
Reenable Transparent Compiler tests #17933

Closed · wants to merge 47 commits

Conversation

majocha
Contributor

@majocha majocha commented Oct 28, 2024

Experiment branched out from #17872

The async flavor of graph processing is simplified a bit and runs with Async.Parallel, given that it is async already.

If there are any remaining instabilities in Transparent Compiler tests they should be easier to weed out in a parallel test run, where the thread pool is way more stressed.
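The parallel graph processing mentioned above can be pictured with a small asyncio analogy (Python here, purely illustrative; the node names and structure are made up, not the actual compiler graph):

```python
import asyncio

async def process_node(name, deps):
    """Hypothetical graph node: wait for its dependencies, then do its own work."""
    await asyncio.gather(*deps)   # all dependencies run concurrently
    await asyncio.sleep(0)        # stand-in for the node's real work
    return name

async def process_graph():
    # A tiny dependency graph: c depends on a and b, which run in parallel,
    # loosely analogous to running already-async work with Async.Parallel.
    a = asyncio.create_task(process_node("a", []))
    b = asyncio.create_task(process_node("b", []))
    c = process_node("c", [a, b])
    return await c

result = asyncio.run(process_graph())
```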

All the tests from #16766 but two should be unskipped, unless I missed some.

Contributor

github-actions bot commented Oct 28, 2024

❗ Release notes required

@majocha,

Caution

No release notes found for the changed paths (see table below).

Please make sure to add an entry with an informative description of the change as well as a link to this pull request, the issue, and the language suggestion if applicable. Release notes for this repository are based on the Keep A Changelog format.

The following format is recommended for this repository:

* <Informative description>. ([PR #XXXXX](https://github.com/dotnet/fsharp/pull/XXXXX))

See examples in the files listed in the table below, or in the full documentation at https://fsharp.github.io/fsharp-compiler-docs/release-notes/About.html.

If you believe that release notes are not necessary for this PR, please add NO_RELEASE_NOTES label to the pull request.

You can open this PR in browser to add release notes: open in github.dev

Change path | Release notes path | Description
src/Compiler | docs/release-notes/.FSharp.Compiler.Service/9.0.200.md | No release notes found or release notes format is not correct

@majocha
Contributor Author

majocha commented Oct 31, 2024

EDIT: Nope, it's not this.

Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

I think there's potential to simplify things here by introducing an AsyncLazy type similar to the one Roslyn uses internally. We could just cache values of a cancellable AsyncLazy, reducing the risk of such errors.
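The suspected race can be pictured with a small Python analogy (AsyncMemoize itself is F#; the class and names below are made up for illustration): the cache update is posted to a background thread, so a lookup right after the getter returns may not see the completed entry yet.

```python
import threading
import time

class FireAndForgetCache:
    """Toy cache mimicking the suspected behavior: the 'job completed'
    update is applied on a background thread after get() has returned."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed = {}

    def get(self, key, compute):
        result = compute()
        # Fire-and-forget: the cache write races with the caller's next step.
        threading.Thread(target=self._post_completed, args=(key, result)).start()
        return result

    def _post_completed(self, key, result):
        time.sleep(0.01)  # simulate thread-pool scheduling delay under stress
        with self._lock:
            self._completed[key] = result

    def try_get(self, key):
        with self._lock:
            return self._completed.get(key)

cache = FireAndForgetCache()
value = cache.get("file.fs", lambda: "check results")
early = cache.try_get("file.fs")  # frequently None: the write hasn't landed yet
```

Under load the background write can be delayed arbitrarily, which is exactly the window a stressed thread pool widens.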

@0101
Contributor

0101 commented Oct 31, 2024

> Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

Get should be atomic and "blocking". It should be fully completed when it returns (in terms of what's in the cache). Then when jobs either complete or are canceled that's sent via Task.Run whenever it happens and there is no defined order of execution. It's just serialized in some order by the lock.

But maybe, under stress, by the time control was returned to whoever called Get, some further updates already happened on the requested job that were not expected? 🤔

@majocha
Contributor Author

majocha commented Oct 31, 2024

> Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

> Get should be atomic and "blocking". It should be fully completed when it returns (in terms of what's in the cache). Then when jobs either complete or are canceled that's sent via Task.Run whenever it happens and there is no defined order of execution. It's just serialized in some order by the lock.

> But maybe, under stress, by the time control was returned to whoever called Get, some further updates already happened on the requested job that were not expected? 🤔

Ah, yes, you're right. I was looking at this line:

    post (key, (JobCompleted(result, cachingLogger.CapturedDiagnostics)))

which does a fire-and-forget task, but now I see the cache already has a value at this point, set by the earlier GetOrCompute.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> TryGetRecentCheckResultsForFile returns result after first call to ParseAndCheckFileInProject — this fails when running in parallel?

> That's suspicious, my first thought would be either a shared instance of Checker or use of the filesystem. Although neither of those should really be the case. Maybe the options cache, but that looks like it should work too. Also, if these were the problem, other tests would probably be failing. Maybe you found an actual bug? 🤔

This showed up again in CI. I'm now thinking it's a genuine bug, not just the test case being flaky. Too bad I can only repro it locally on an underpowered laptop.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

> Get should be atomic and "blocking". It should be fully completed when it returns (in terms of what's in the cache). Then when jobs either complete or are canceled that's sent via Task.Run whenever it happens and there is no defined order of execution. It's just serialized in some order by the lock.

> But maybe, under stress, by the time control was returned to whoever called Get, some further updates already happened on the requested job that were not expected? 🤔

So, I was onto something after all. The test in question, TryGetRecentCheckResultsForFile returns result after first call to ParseAndCheckFileInProject, hits AsyncMemoize.TryGet immediately after checking the project. TryGet scans the cache for Completed jobs:

    member _.TryGet(key: 'TKey, predicate: 'TVersion -> bool) : 'TValue option =
        let versionsAndJobs = cache.GetAll(key)

        versionsAndJobs
        |> Seq.tryPick (fun (version, job) ->
            match predicate version, job with
            | true, Completed(completed, _) -> Some completed
            | _ -> None)

The problem is that there are none yet, despite the fact that the previous AsyncMemoize.Get already finished. Why? Because Get does not do cache.Set directly; it just queues the update with Task.Run:

    post (key, (JobCompleted(result, cachingLogger.CapturedDiagnostics)))

    let rec post msg =
        Task.Run(fun () -> processStateUpdate post msg :> Task) |> ignore

Effectively, a state is possible where the overall async computation has moved on while the cache has not yet caught up.

Ideally, the moment Get completes, the cache should be up to date with the current state of the given job.

I do have a fix for this but it involves some refactoring 😟
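The direction of such a fix can be sketched in a few lines of Python (again an analogy, not the actual F# refactoring): apply the Completed update under the lock before the getter returns, so a subsequent TryGet-style lookup is guaranteed to see it.

```python
import threading

class SynchronousCompleteCache:
    """Sketch of the fix direction: mark the job Completed under the lock
    *before* get() returns, so a later try_get always sees the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed = {}

    def get(self, key, compute):
        result = compute()
        with self._lock:  # the update is part of get(), not fire-and-forget
            self._completed[key] = result
        return result

    def try_get(self, key):
        with self._lock:
            return self._completed.get(key)

cache = SynchronousCompleteCache()
cache.get("file.fs", lambda: "check results")
assert cache.try_get("file.fs") == "check results"  # no race window left
```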

@psfinaki
Member

psfinaki commented Nov 5, 2024

Thanks for the further investigations :)

Refactoring is okay as long as it is justified and not too crazy.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> Refactoring is okay as long as it is justified and not too crazy.

Here's the gist of it: https://gist.github.com/majocha/6c075dcdabf45c3fd92b03e2cf3a07eb

@0101
Contributor

0101 commented Nov 5, 2024

Good catch @majocha !

I don't know if this is a real issue in practice, though. In this situation I think it's OK to say we don't have recent results yet. Alternatively, it means some extra delay for the original caller waiting for the cache update to be processed.

If it's used in the usual async way, you'd get back the Running job which will be completed very soon.

Unless I'm missing some other scenarios where this could be an issue, we could just update the test (maybe wait for cache event or something).

I didn't study your refactoring in depth yet but it's much less code, so if it does the same thing I'm all for it! But maybe could be in a separate PR.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> Good catch @majocha !

> I don't know if this is a real issue in practice though. In this situation I think it's ok to say we don't have recent results yet. Alternatively it means some extra delay for the original caller waiting for cache update to be processed.

> If it's used in the usual async way, you'd get back the Running job which will be completed very soon.

> Unless I'm missing some other scenarios where this could be an issue, we could just update the test (maybe wait for cache event or something).

Yes, in practice fixing this particular test would just be a matter of adding a wait for the Completed event.

> I didn't study your refactoring in depth yet but it's much less code, so if it does the same thing I'm all for it! But maybe could be in a separate PR.

That's what I'm thinking, first I want to get the tests up and running again.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> Unless I'm missing some other scenarios where this could be an issue, we could just update the test (maybe wait for cache event or something).

This actually cleared a huge blind spot for me. The whole thing I've been fighting here regarding the flakiness of these tests is that events can occur with some unpredictable delay. I did a lot of "clever" stuff here awaiting the events, but now it occurred to me:
a bool property is needed on the AsyncMemoize that indicates it has updates "in flight".
Then we can simply spin-wait until it is in a consistent state before asserting anything about the recorded events.
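The idea can be sketched as follows (a hypothetical Python model, not the actual AsyncMemoize API): count posted state updates still in flight and spin-wait until the count drops to zero before asserting on recorded events.

```python
import threading
import time

class UpdateTracker:
    """Hypothetical sketch of the proposed property: track posted state
    updates still in flight so tests can wait for a consistent state."""

    def __init__(self):
        self._pending = 0
        self._lock = threading.Lock()

    def post(self, update):
        with self._lock:
            self._pending += 1

        def run():
            try:
                update()  # the fire-and-forget state update
            finally:
                with self._lock:
                    self._pending -= 1

        threading.Thread(target=run).start()

    @property
    def updates_in_flight(self):
        with self._lock:
            return self._pending > 0

    def wait_until_idle(self, timeout=1.0):
        deadline = time.monotonic() + timeout
        while self.updates_in_flight:  # the spin-wait before asserting
            if time.monotonic() > deadline:
                raise TimeoutError("updates still in flight")
            time.sleep(0.001)

events = []
tracker = UpdateTracker()
tracker.post(lambda: events.append("Started"))
tracker.post(lambda: events.append("Finished"))
tracker.wait_until_idle()  # only now is it safe to assert on events
```

Incrementing the counter before the thread starts (not inside it) is what makes the flag reliable: the tracker never reports idle while a posted update has yet to run.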

@0101
Contributor

0101 commented Nov 6, 2024

Good idea! Wonder if it takes care of all situations. It just might!

We could potentially also consider waiting until there are 0 Running jobs.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> Good idea! Wonder if it takes care of all situations. It just might!

Yes, it does; I just redid the AsyncMemoize tests with it as well. Now I should probably cross-check the test cases against their historical versions. They kind of went full circle with all the changes as I was fighting (having fun) with them.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> We could potentially also consider waiting until there are 0 Running jobs.

Yes, I considered this, but the cache state alone gives no guarantee that there is no pending state update queued in the thread pool.

@0101
Contributor

0101 commented Nov 6, 2024

> We could potentially also consider waiting until there are 0 Running jobs.

> Yes, I considered this, but the cache state alone gives no guarantee that there is no pending state update queued in the thread pool.

Well yes, but also no state update (coming through post) can increase the number of Running jobs.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> We could potentially also consider waiting until there are 0 Running jobs.

> Yes, I considered this, but the cache state alone gives no guarantee that there is no pending state update queued in the thread pool.

> Well yes, but also no state update (coming through post) can increase the number of Running jobs.

I've been thinking in terms of testing, where we expect some exact list of events to be triggered. If there is an update posted, we have to wait, because an event is coming. Once there are no more updates posted and we also don't have any Get running, we can be sure we have all the events.

@majocha
Contributor Author

majocha commented Nov 6, 2024

This is now cherry-picked as a PR to main: #17966

@T-Gro
Member

T-Gro commented Nov 7, 2024

Thanks Jakub ==> I am closing this one then.

@T-Gro T-Gro closed this Nov 7, 2024