
Reenable Transparent Compiler tests #17933

Closed · wants to merge 47 commits

Conversation

majocha
Contributor

@majocha majocha commented Oct 28, 2024

Experiment branched out from #17872

The async flavor of graph processing is simplified a bit and runs with Async.Parallel, given that it is async already.

If there are any remaining instabilities in Transparent Compiler tests they should be easier to weed out in a parallel test run, where the thread pool is way more stressed.
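The parallel graph processing mentioned above can be pictured with a small asyncio analogy (Python here, purely illustrative; the node names and structure are made up, not the actual compiler graph):

```python
import asyncio

async def process_node(name, deps):
    """Hypothetical graph node: wait for its dependencies, then do its own work."""
    await asyncio.gather(*deps)   # all dependencies run concurrently
    await asyncio.sleep(0)        # stand-in for the node's real work
    return name

async def process_graph():
    # A tiny dependency graph: c depends on a and b, which run in parallel,
    # loosely analogous to running already-async work with Async.Parallel.
    a = asyncio.create_task(process_node("a", []))
    b = asyncio.create_task(process_node("b", []))
    c = process_node("c", [a, b])
    return await c

result = asyncio.run(process_graph())
```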

All the tests from #16766 but two should be unskipped, unless I missed some.

Contributor

github-actions bot commented Oct 28, 2024

❗ Release notes required

@majocha,

Caution

No release notes found for the changed paths (see table below).

Please make sure to add an entry with an informative description of the change as well as a link to this pull request, the issue, and the language suggestion if applicable. Release notes for this repository are based on the Keep A Changelog format.

The following format is recommended for this repository:

* <Informative description>. ([PR #XXXXX](https://github.com/dotnet/fsharp/pull/XXXXX))

See examples in the files listed in the table below, or in the full documentation at https://fsharp.github.io/fsharp-compiler-docs/release-notes/About.html.

If you believe that release notes are not necessary for this PR, please add NO_RELEASE_NOTES label to the pull request.

You can open this PR in browser to add release notes: open in github.dev

Change path | Release notes path | Description
src/Compiler | docs/release-notes/.FSharp.Compiler.Service/9.0.200.md | No release notes found or release notes format is not correct

@majocha
Contributor Author

majocha commented Oct 31, 2024

EDIT: Nope, it's not this.

Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

I think there's potential to simplify things here by introducing an AsyncLazy type similar to the one Roslyn uses internally. We could just cache values of a cancellable AsyncLazy, reducing the risk of such errors.
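The suspected race can be pictured with a small Python analogy (AsyncMemoize itself is F#; the class and names below are made up for illustration): the cache update is posted to a background thread, so a lookup right after the getter returns may not see the completed entry yet.

```python
import threading
import time

class FireAndForgetCache:
    """Toy cache mimicking the suspected behavior: the 'job completed'
    update is applied on a background thread after get() has returned."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed = {}

    def get(self, key, compute):
        result = compute()
        # Fire-and-forget: the cache write races with the caller's next step.
        threading.Thread(target=self._post_completed, args=(key, result)).start()
        return result

    def _post_completed(self, key, result):
        time.sleep(0.01)  # simulate thread-pool scheduling delay under stress
        with self._lock:
            self._completed[key] = result

    def try_get(self, key):
        with self._lock:
            return self._completed.get(key)

cache = FireAndForgetCache()
value = cache.get("file.fs", lambda: "check results")
early = cache.try_get("file.fs")  # frequently None: the write hasn't landed yet
```

Under load the background write can be delayed arbitrarily, which is exactly the window a stressed thread pool widens.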

@0101
Contributor

0101 commented Oct 31, 2024

> Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

Get should be atomic and "blocking". It should be fully completed when it returns (in terms of what's in the cache). Then when jobs either complete or are canceled that's sent via Task.Run whenever it happens and there is no defined order of execution. It's just serialized in some order by the lock.

But maybe, under stress, by the time control was returned to whoever called Get, some further updates already happened on the requested job that were not expected? 🤔

@majocha
Contributor Author

majocha commented Oct 31, 2024

> Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

> Get should be atomic and "blocking". It should be fully completed when it returns (in terms of what's in the cache). Then when jobs either complete or are canceled that's sent via Task.Run whenever it happens and there is no defined order of execution. It's just serialized in some order by the lock.

> But maybe, under stress, by the time control was returned to whoever called Get, some further updates already happened on the requested job that were not expected? 🤔

Ah, yes, you're right. I was looking at this line:

    post (key, (JobCompleted(result, cachingLogger.CapturedDiagnostics)))

which does a fire-and-forget task, but now I see the cache already has a value at this point, set by the earlier GetOrCompute.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> TryGetRecentCheckResultsForFile returns result after first call to ParseAndCheckFileInProject — this fails when running in parallel?

> That's suspicious, my first thought would be either a shared instance of Checker or use of the filesystem. Although neither of those should really be the case. Maybe the options cache, but that looks like it should work too. Also, if these were the problem, other tests would probably be failing. Maybe you found an actual bug? 🤔

This showed up again in CI. I'm now thinking it's a genuine bug, not just the test case being flaky. Too bad I can only repro it locally on an underpowered laptop.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> Ok, I think I see where the bug is. AsyncMemoize does not do cache updates atomically but posts them with Task.Run. There is no guarantee of order of execution so under stress it's possible to proceed to checkFile in the test workflow while the cache is not yet updated, despite the fact Get method returned. I'll try to verify this later.

> Get should be atomic and "blocking". It should be fully completed when it returns (in terms of what's in the cache). Then when jobs either complete or are canceled that's sent via Task.Run whenever it happens and there is no defined order of execution. It's just serialized in some order by the lock.

> But maybe, under stress, by the time control was returned to whoever called Get, some further updates already happened on the requested job that were not expected? 🤔

So, I was onto something after all. The test in question, TryGetRecentCheckResultsForFile returns result after first call to ParseAndCheckFileInProject, hits AsyncMemoize.TryGet immediately after checking the project. TryGet scans the cache for Completed jobs:

    member _.TryGet(key: 'TKey, predicate: 'TVersion -> bool) : 'TValue option =
        let versionsAndJobs = cache.GetAll(key)

        versionsAndJobs
        |> Seq.tryPick (fun (version, job) ->
            match predicate version, job with
            | true, Completed(completed, _) -> Some completed
            | _ -> None)

The problem is that there are none yet, despite the fact that the previous AsyncMemoize.Get already finished. Why? Because Get does not do cache.Set directly; it just queues the update with Task.Run:

    post (key, (JobCompleted(result, cachingLogger.CapturedDiagnostics)))

    let rec post msg =
        Task.Run(fun () -> processStateUpdate post msg :> Task) |> ignore

Effectively, a state is possible where the overall async computation has moved on while the cache has not yet caught up.

Ideally, the moment Get completes, the cache should be up to date with the current state of the given job.

I do have a fix for this but it involves some refactoring 😟
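The direction of such a fix can be sketched in a few lines of Python (again an analogy, not the actual F# refactoring): apply the Completed update under the lock before the getter returns, so a subsequent TryGet-style lookup is guaranteed to see it.

```python
import threading

class SynchronousCompleteCache:
    """Sketch of the fix direction: mark the job Completed under the lock
    *before* get() returns, so a later try_get always sees the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed = {}

    def get(self, key, compute):
        result = compute()
        with self._lock:  # the update is part of get(), not fire-and-forget
            self._completed[key] = result
        return result

    def try_get(self, key):
        with self._lock:
            return self._completed.get(key)

cache = SynchronousCompleteCache()
cache.get("file.fs", lambda: "check results")
assert cache.try_get("file.fs") == "check results"  # no race window left
```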

@psfinaki
Member

psfinaki commented Nov 5, 2024

Thanks for the further investigations :)

Refactoring is okay as long as it is justified and not too crazy.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> Refactoring is okay as long as it is justified and not too crazy.

Here's the gist of it: https://gist.github.com/majocha/6c075dcdabf45c3fd92b03e2cf3a07eb

@0101
Contributor

0101 commented Nov 5, 2024

Good catch @majocha !

I don't know if this is a real issue in practice, though. In this situation I think it's OK to say we don't have recent results yet. Alternatively, it means some extra delay for the original caller waiting for the cache update to be processed.

If it's used in the usual async way, you'd get back the Running job which will be completed very soon.

Unless I'm missing some other scenarios where this could be an issue, we could just update the test (maybe wait for cache event or something).

I didn't study your refactoring in depth yet but it's much less code, so if it does the same thing I'm all for it! But maybe could be in a separate PR.

@majocha
Contributor Author

majocha commented Nov 5, 2024

> Good catch @majocha !

> I don't know if this is a real issue in practice though. In this situation I think it's ok to say we don't have recent results yet. Alternatively it means some extra delay for the original caller waiting for cache update to be processed.

> If it's used in the usual async way, you'd get back the Running job which will be completed very soon.

> Unless I'm missing some other scenarios where this could be an issue, we could just update the test (maybe wait for cache event or something).

Yes, in practice fixing this particular test would just be a matter of adding a wait for the Completed event.

> I didn't study your refactoring in depth yet but it's much less code, so if it does the same thing I'm all for it! But maybe could be in a separate PR.

That's what I'm thinking, first I want to get the tests up and running again.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> Unless I'm missing some other scenarios where this could be an issue, we could just update the test (maybe wait for cache event or something).

This actually cleared a huge blind spot for me. The whole thing I've been fighting here regarding the flakiness of these tests is that events can occur with some unpredictable delay. I did a lot of "clever" stuff here awaiting the events, but now it occurred to me:
a bool property is needed on the AsyncMemoize that indicates it has updates "in flight".
Then we can simply spin-wait until it is in a consistent state before asserting anything about the recorded events.
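The idea can be sketched as follows (a hypothetical Python model, not the actual AsyncMemoize API): count posted state updates still in flight and spin-wait until the count drops to zero before asserting on recorded events.

```python
import threading
import time

class UpdateTracker:
    """Hypothetical sketch of the proposed property: track posted state
    updates still in flight so tests can wait for a consistent state."""

    def __init__(self):
        self._pending = 0
        self._lock = threading.Lock()

    def post(self, update):
        with self._lock:
            self._pending += 1

        def run():
            try:
                update()  # the fire-and-forget state update
            finally:
                with self._lock:
                    self._pending -= 1

        threading.Thread(target=run).start()

    @property
    def updates_in_flight(self):
        with self._lock:
            return self._pending > 0

    def wait_until_idle(self, timeout=1.0):
        deadline = time.monotonic() + timeout
        while self.updates_in_flight:  # the spin-wait before asserting
            if time.monotonic() > deadline:
                raise TimeoutError("updates still in flight")
            time.sleep(0.001)

events = []
tracker = UpdateTracker()
tracker.post(lambda: events.append("Started"))
tracker.post(lambda: events.append("Finished"))
tracker.wait_until_idle()  # only now is it safe to assert on events
```

Incrementing the counter before the thread starts (not inside it) is what makes the flag reliable: the tracker never reports idle while a posted update has yet to run.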

@0101
Contributor

0101 commented Nov 6, 2024

Good idea! Wonder if it takes care of all situations. It just might!

We could potentially also consider waiting until there are 0 Running jobs.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> Good idea! Wonder if it takes care of all situations. It just might!

Yes, it does; I just redid the AsyncMemoize tests with it as well. Now I should probably cross-check the test cases against their historical versions. They kind of went full circle with all the changes as I was fighting (having fun) with them.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> We could potentially also consider waiting until there are 0 Running jobs.

Yes, I considered this, but the cache state alone gives no guarantee that there is no pending state update queued in the thread pool.

@0101
Contributor

0101 commented Nov 6, 2024

> We could potentially also consider waiting until there are 0 Running jobs.

> Yes, I considered this, but the cache state alone gives no guarantee that there is no pending state update queued in the thread pool.

Well yes, but also no state update (coming through post) can increase the number of Running jobs.

@majocha
Contributor Author

majocha commented Nov 6, 2024

> We could potentially also consider waiting until there are 0 Running jobs.

> Yes, I considered this, but the cache state alone gives no guarantee that there is no pending state update queued in the thread pool.

> Well yes, but also no state update (coming through post) can increase the number of Running jobs.

I've been thinking in terms of testing, where we expect some exact list of events to be triggered. If there is an update posted, we have to wait, because an event is coming. Once there are no more updates posted and we also don't have any Get running, we can be sure we have all the events.

@majocha
Contributor Author

majocha commented Nov 6, 2024

This is now cherry-picked as a PR to main: #17966

@T-Gro
Member

T-Gro commented Nov 7, 2024

Thanks Jakub ==> I am closing this one then.

@T-Gro T-Gro closed this Nov 7, 2024