Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronize initialization in ServiceFabric instrumentation #105

Merged
merged 2 commits into from
Apr 7, 2021
Merged

Synchronize initialization in ServiceFabric instrumentation #105

merged 2 commits into from
Apr 7, 2021

Conversation

pellared
Copy link
Member

@pellared pellared commented Apr 6, 2021

Why

initialized field is not synchronized at all which can lead to loss of traces.

What

Initialization is the only use-case where I find using volatile to be OK.
http://www.albahari.com/threading/part4.aspx#_Memory_Barriers_and_Volatility

@pellared pellared requested a review from a team April 6, 2021 10:52
Copy link
Contributor

@pjanotti pjanotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtle, but it is true :)

@@ -13,7 +13,7 @@ internal static class ServiceRemotingClient
private static readonly Logging.IDatadogLogger Log = Logging.DatadogLogging.GetLoggerFor(typeof(ServiceRemotingClient));

private static int _firstInitialization = 1;
private static bool _initialized;
private static volatile bool _initialized;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we need to make this change. Initialization is protected by _firstInitialization

Copy link
Member Author

@pellared pellared Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not enough. The methods checking if it is initialized are checking _initialized field. _firstInitialization only makes sure that it would be initialized at most once. It does not guarantee that other threads see that it is already initialized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see what you're saying that the read of _initialized may be stale. I asked @kevingosse about this and it doesn't actually fix the issue you're trying to solve because the volatile read doesn't flush the cache. To truly fix this, a full fence would be needed, but that seems like high overhead for a scenario that has a very low probability to happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q.: is it safe to call ServiceRemotingHelpers.FinishSpan if the corresponding start span was not called?

Copy link
Member Author

@pellared pellared Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zacharycmontoya @kevingosse
IMO it should be enough

The volatile keyword instructs the compiler to generate an acquire-fence on every read from that field, and a release-fence on every write to that field. An acquire-fence prevents other reads/writes from being moved before the fence; a release-fence prevents other reads/writes from being moved after the fence.

Because _initilized = true is the last line in initialization therefore I think this PR makes it thread-safe.

@pjanotti
I quickly looked at ServiceRemotingHelpers.FinishSpan and looked correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pjanotti I do not think it would be possible

Copy link
Contributor

@kevingosse kevingosse Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify what volatile does and doesn't:

  • Volatile ensures that the operation is translated into an actual read/write to memory. Without volatile, .NET is allowed to just read/write from a register and never commit the changes to memory. Note that all writes in .NET 2.0+ are implicitly volatile (in the official implementation at least, I don't know about Mono), so we should only worry about reads. Given that _initialized is read only once in the method, there is no opportunity to cache the value in a register so this isn't a concern. Unless the method is inlined into its caller, but this is an event so it can't be inlined.
  • Volatile ensures that operations are consistent related to one another. Basically, it there's is an operation A followed by an operation B, proper usage of volatile ensures that B is visible only if A is visible. But, and that's the subtlety here, it does not ensure that A is visible to begin with. With a concrete example, imagine a thread is writing true to A then true to B. From another thread, without volatile you could in theory read A false and B true. With volatile, you will either see A and B both false, or A true and B false, or A true and B true, but never A false and B true.

But volatile has no impact on how soon the value is published to other threads. If that's important, you need an operation that forces the cache to be flushed, that is Thread.MemoryBarrier or any Interlocked operation.
If you're skeptical, you can try playing with this code (taken from https://stackoverflow.com/q/49203936/869621) :

class Program
    {
        class Foo
        {
            public bool a;
            public bool b;
            public bool c;

            public void SetA()
            {
                a = true;
                TestAB();
            }

            public void SetB()
            {
                b = true;
                TestAB();
            }

            public void TestAB()
            {
                if (a && b)
                {
                    c = true;
                }
            }
        }

        static void Main(string[] args)
        {
            int timesCWasFalse = 0;
            for (int i = 0; i < 100000; i++)
            {
                var f = new Foo();
                var t1 = Task.Run(() => f.SetA());
                var t2 = Task.Run(() => f.SetB());
                Task.WaitAll(t1, t2);
                if (!f.c)
                {
                    timesCWasFalse++;
                }
            }
            Console.WriteLine($"timesCWasFalse: {timesCWasFalse}");
            Console.WriteLine("Finished. Press Enter to exit");
            Console.ReadLine();
        }
    }

You'll see that no amount of volatile will prevent timesCWasFalse to fail. Only inserting a memory barrier (or an interlocked operation) in TestAB does.

Copy link
Member Author

@pellared pellared Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevingosse @zacharycmontoya
TL;DR;
What do you think about adding Thread.MemoryBarrier(); after assigning _initialized and before reading _initialized for sake of "simplicity" and more trust. I added volatile because often in other PRs you were extremely concerned about performance 😄

@kevingosse (if you have a few minutes) 😉

As I have written before

Because _initilized = true is the last line in initialization, therefore, I think this PR makes it thread-safe.

Regarding:

But volatile has no impact on how soon the value is published to other threads.

It depends on how it is used. If done correctly we can make sure that it is published like doing a memory barrier. We just need to make sure that the same field is used and used in the correct manner (pattern).

The code you presented is different b/c there are loops and volatile writes before volatile reads. And it uses different fields

This is beautifully explained here: http://www.albahari.com/threading/part4.aspx#_Memory_Barriers_and_Volatility

Notice that applying volatile doesn’t prevent a write followed by a read from being swapped

But here it is not a problem as we have only one field initialized which is assigned to true only once.

in the official implementation at least, I don't know about Mono

And also we have .NET Core and .NET >= 5.0.... dotnet/runtime#6257

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I don't think it's worth debating hours on this. Adding volatile won't affect the codegen (except on ARM but I don't see ServiceFabric coming to ARM anytime soon), so if adding it makes you feel better I really have no reason to oppose

Copy link
Member Author

@pellared pellared Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can affect codegen.

IL_0007: volatile.
IL_0009: ldsfld requiredmodifier Datadog.Trace.ServiceFabric.ServiceRemotingService::_initialized

vs

IL_0007: ldsfld bool Datadog.Trace.ServiceFabric.ServiceRemotingService::_initialized

The interesting thing is that on my dev container with .NET 5.0 when I build with Release configuration. Then the .volatile is ALWAYS added to the IL 😄

I am almost certain that volatile can help on x86/x64 from practice. It solved an issue on Intel Atom (which is x86/x64) (AFAIK AMD processors also have different caching architecture) . It was a very similar issue to the current one here 😄

Found in background: dotnet/fsharp#122

@pellared pellared requested a review from zacharycmontoya April 6, 2021 17:34
@pellared
Copy link
Member Author

pellared commented Apr 7, 2021

Unstable test failure:

[xUnit.net 00:08:08.72] Datadog.Trace.ClrProfiler.IntegrationTests: FAILURE: Datadog.Trace.ClrProfiler.IntegrationTests.CallTargetNativeTests.MethodArgumentsInstrumentation(1, True, True) (2.1781022s)
[xUnit.net 00:08:08.72]     Datadog.Trace.ClrProfiler.IntegrationTests.CallTargetNativeTests.MethodArgumentsInstrumentation(numberOfArguments: 1, fastPath: True, inlining: True) [FAIL]
[xUnit.net 00:08:08.72] Datadog.Trace.ClrProfiler.IntegrationTests: STARTED: Datadog.Trace.ClrProfiler.IntegrationTests.Elasticsearch5Tests.SubmitsTraces(5.5.0, False, False)
  X Datadog.Trace.ClrProfiler.IntegrationTests.CallTargetNativeTests.MethodArgumentsInstrumentation(numberOfArguments: 1, fastPath: True, inlining: True) [2s 178ms]
  Error Message:
   Process exited with code 139
Expected: True
Actual:   False
  Stack Trace:
     at Datadog.Trace.ClrProfiler.IntegrationTests.CallTargetNativeTests.MethodArgumentsInstrumentation(Int32 numberOfArguments, Boolean fastPath, Boolean inlining) in /project/test/Datadog.Trace.ClrProfiler.IntegrationTests/CallTargetNativeTests.cs:line 42

@pjanotti
Copy link
Contributor

pjanotti commented Apr 7, 2021

Merging per earlier conversation on SIG meeting (also not waiting on CI since it was triggered just because branch update).

@pjanotti pjanotti merged commit 716974d into open-telemetry:main Apr 7, 2021
@pellared pellared deleted the fix-race-condition-in-servicefabric branch April 7, 2021 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants