[Event Hubs] Fix memory leak in EHBufferedProducerClient #26748

Merged: 55 commits, Sep 22, 2023

Conversation


@HarshaNalluru HarshaNalluru commented Aug 9, 2023

Issue

First reported in #25426; thanks to @pkraj0 for reporting it.
EventHubBufferedProducerClient leaks memory while it is idle, that is, when there are no events to send to Event Hubs.

EventHubBufferedProducerClient

  • EventHubBufferedProducerClient is used to publish events to Event Hub
  • It does not publish events immediately. Instead, events are buffered so they can be efficiently batched and published when the batch is full or the maxWaitTimeInMs has elapsed with no new events enqueued.
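
The buffering rule above (publish when the batch is full, or when maxWaitTimeInMs passes with no new event) can be sketched roughly as follows. This is a minimal illustration, not the SDK implementation: the class name `TinyBufferedSender`, its constructor parameters, and the `send` callback are all hypothetical.

```typescript
// Hypothetical, simplified stand-in for the buffering behaviour described above.
// It is NOT the EventHubBufferedProducerClient implementation.
class TinyBufferedSender<T> {
  private buffer: T[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly send: (batch: T[]) => void;
  private readonly maxBatchSize: number;
  private readonly maxWaitTimeInMs: number;

  constructor(send: (batch: T[]) => void, maxBatchSize: number, maxWaitTimeInMs: number) {
    this.send = send;
    this.maxBatchSize = maxBatchSize;
    this.maxWaitTimeInMs = maxWaitTimeInMs;
  }

  enqueue(event: T): void {
    this.buffer.push(event);
    // Every new event restarts the idle timer.
    this.clearTimer();
    if (this.buffer.length >= this.maxBatchSize) {
      this.flush(); // batch is full: publish right away
    } else {
      this.timer = setTimeout(() => this.flush(), this.maxWaitTimeInMs);
    }
  }

  // Publishes whatever is buffered, if anything.
  flush(): void {
    this.clearTimer();
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }

  private clearTimer(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
  }
}
```

In this sketch the timer is reset on every enqueue, so a steady stream of events is flushed only when a batch fills up, while a lone event is flushed after the wait time.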

Stress Testing

The test sends about 5000 events and then stays idle for a long time, during which memory usage keeps growing.

Investigation

Comparing heap snapshots taken at various points during the test run hinted at the issue.
The problem lies in the BatchingPartitionChannel#_startPublishLoop and AwaitableQueue#shift implementations.

BatchingPartitionChannel#_startPublishLoop

  • Starts the loop that creates batches and sends them to the Event Hub
  • Runs forever or until the abortSignal is invoked

AwaitableQueue#shift

  • Returns a Promise that will resolve with the next item in the queue.
  • If there are no items in the queue, the returned Promise stays pending until an item is added, which may be forever.
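
To make the shift() behaviour concrete, here is a minimal sketch of a queue with the same contract. The class name `SimpleAwaitableQueue` and its internals are assumptions for illustration; the real AwaitableQueue in @azure/event-hubs may be structured differently.

```typescript
// Simplified sketch of an awaitable queue; names and internals are illustrative only.
class SimpleAwaitableQueue<T> {
  private readonly items: T[] = [];
  private readonly waiters: Array<(item: T) => void> = [];

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item); // hand the item straight to the oldest pending shift()
    } else {
      this.items.push(item);
    }
  }

  shift(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    // Queue is empty: the promise stays pending until push() is called.
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }
}
```

If nothing is ever pushed, the promise returned by shift() on an empty queue simply never settles, which is the situation the idle stress test creates.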

Problem

BatchingPartitionChannel#_startPublishLoop relies on a while loop to run forever.

  • The root cause of the leak in #_startPublishLoop is a race between two promises (AwaitableQueue#shift() and delay()).
  • While the queue is idle, the promise from AwaitableQueue#shift() never settles.
  • Promise.race settles as soon as one of its input promises settles, which here is always delay().
  • Although Promise.race settles, the still-pending promise from AwaitableQueue#shift() stays referenced, and because the while loop races it again on every iteration, those references accumulate and cause the leak.
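
The pattern can be sketched as below. This is a reduced illustration, not the actual loop: `idleLoop` is a hypothetical name, the loop is bounded so it terminates, and `futureEvent` is a promise that never settles, standing in for shift() on an idle queue.

```typescript
function delay(ms: number): Promise<"timeout"> {
  return new Promise<"timeout">((resolve) => setTimeout(() => resolve("timeout"), ms));
}

// Stand-in for a shift() promise on an idle queue: it never settles.
const futureEvent = new Promise<string>(() => {});

// Bounded version of the publish loop's idle path (the real loop is unbounded).
async function idleLoop(iterations: number): Promise<number> {
  let timeouts = 0;
  for (let i = 0; i < iterations; i++) {
    // Promise.race subscribes to futureEvent on every call. Because futureEvent
    // is still pending, each of those subscriptions stays attached to it.
    const winner = await Promise.race([futureEvent, delay(1)]);
    if (winner === "timeout") timeouts++;
  }
  return timeouts;
}
```

Note that the same `futureEvent` is reused across iterations; what grows is the set of subscriptions that each Promise.race call attaches to it.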

Solution

The fix is to abort the pending promise from AwaitableQueue#shift() once the race has been won by the other delay() promise.

What's in the PR?

@azure/core-util

  • Added a helper method racePromisesAndAbortLosers, an abstraction over Promise.race() that aborts the losers of the race as soon as the first promise settles.
  • Changelog
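
A helper of this kind might look roughly like the sketch below. The function name `raceAndAbortLosers`, the builder type, and the options shape are assumptions for illustration and should not be read as the actual @azure/core-util signature. It uses a named abort handler so the same function can be removed again, and it does not handle a caller signal that is already aborted.

```typescript
// Hypothetical shape: each builder receives a signal and returns an abortable promise.
type AbortablePromiseBuilder<T> = (options: { abortSignal: AbortSignal }) => Promise<T>;

async function raceAndAbortLosers<T>(
  builders: AbortablePromiseBuilder<T>[],
  options?: { abortSignal?: AbortSignal },
): Promise<T> {
  const aborter = new AbortController();
  // Named handler so add/removeEventListener refer to the same function.
  const abortHandler = (): void => aborter.abort();
  options?.abortSignal?.addEventListener("abort", abortHandler);
  try {
    return await Promise.race(
      builders.map((build) => build({ abortSignal: aborter.signal })),
    );
  } finally {
    // The race is over: tell every remaining promise to give up.
    aborter.abort();
    options?.abortSignal?.removeEventListener("abort", abortHandler);
  }
}
```

The builders are invoked inside the helper so that every competing promise is tied to the helper's own AbortController from the start.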

@azure/event-hubs

  • Updated BatchingPartitionChannel#_startPublishLoop to use racePromisesAndAbortLosers instead of Promise.race.
  • Updated AwaitableQueue#shift() to return a cancellable promise, so that the pending promise is cancelled once the other promise wins the race.
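
One way a cancellable shift() could work is sketched below. The class `CancellableQueue`, the `waiterCount` method, and the use of a plain Error named "AbortError" are assumptions made for this illustration, not the code in the PR.

```typescript
// Illustrative sketch only; not the AwaitableQueue implementation from the PR.
class CancellableQueue<T> {
  private readonly items: T[] = [];
  private readonly waiters: Array<(item: T) => void> = [];

  // Exposed only so the sketch can show that aborted waiters are released.
  waiterCount(): number {
    return this.waiters.length;
  }

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  shift(options?: { abortSignal?: AbortSignal }): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    const signal = options?.abortSignal;
    return new Promise<T>((resolve, reject) => {
      const abortError = (): Error => {
        const error = new Error("The operation was aborted.");
        error.name = "AbortError"; // assumption: stand-in for the SDK's abort error
        return error;
      };
      const waiter = (item: T): void => {
        signal?.removeEventListener("abort", onAbort);
        resolve(item);
      };
      const onAbort = (): void => {
        // Drop the waiter so the queue no longer references this promise.
        const index = this.waiters.indexOf(waiter);
        if (index !== -1) this.waiters.splice(index, 1);
        reject(abortError());
      };
      if (signal?.aborted) {
        reject(abortError());
        return;
      }
      signal?.addEventListener("abort", onAbort, { once: true });
      this.waiters.push(waiter);
    });
  }
}
```

Aborting settles the promise and removes its waiter, so nothing keeps the losing shift() alive after the race.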

Moved the stress testing to #27184


azure-sdk commented Aug 9, 2023

API change check

APIView has identified API-level changes in this PR and created the following API reviews.

azure-core-util

@HarshaNalluru HarshaNalluru changed the title from "[Stress Test] EH 5.11.1 mem leak - no activity after sending for a while" to "[Event Hubs] Fix memory leak in EHBufferedProducerClient" Sep 8, 2023
@HarshaNalluru HarshaNalluru marked this pull request as draft September 8, 2023 08:52
@jeremymeng

  • So these pending promises from AwaitableQueue#shift() keep accumulating because of the while loop causing the leak.

In each loop iteration we only get the next one when there's an event; otherwise futureEvent is still the same one. How could it cause the accumulation?

eventToAddToBatch = event;
// We received an event, so get a promise for the next one.
futureEvent = this._eventQueue.shift();

@HarshaNalluru

/azp run js - event-hubs - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@deyaaeldeen deyaaeldeen left a comment

Looks great, thanks for the awesome fix!

@xirzec xirzec left a comment

One concern with the new core-util code, but otherwise good

);
} finally {
aborter.abort();
options?.abortSignal?.removeEventListener("abort", () => aborter.abort());

I don't think this will work since the anonymous function () => aborter.abort() will have a different identity than the one registered above. I believe you need to pull it out into a local like

function abortHandler() {
    aborter.abort();
}
options?.abortSignal?.addEventListener("abort", abortHandler);
try {
    // etc
} finally {
    options?.abortSignal?.removeEventListener("abort", abortHandler);
}

In the future we can use addEventListener with the 'once' option to avoid this dance, but the above will work everywhere.

Maybe this event handler not being unregistered is the source of your small leak?
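
The listener-identity point can be checked with a small, self-contained sketch. The function `demo` and its counters are illustrative names; it uses only the standard EventTarget API.

```typescript
// Counts how often each kind of listener fires after an attempted removal.
function demo(): { anonymous: number; named: number; once: number } {
  const counts = { anonymous: 0, named: 0, once: 0 };

  // Two separate arrow functions: removeEventListener does not match the
  // registered one, so the listener stays attached.
  const a = new EventTarget();
  a.addEventListener("abort", () => { counts.anonymous++; });
  a.removeEventListener("abort", () => { counts.anonymous++; });
  a.dispatchEvent(new Event("abort"));

  // Same function reference for add and remove: the listener is removed.
  const b = new EventTarget();
  const handler = (): void => { counts.named++; };
  b.addEventListener("abort", handler);
  b.removeEventListener("abort", handler);
  b.dispatchEvent(new Event("abort"));

  // { once: true } detaches the listener automatically after its first call.
  const c = new EventTarget();
  c.addEventListener("abort", () => { counts.once++; }, { once: true });
  c.dispatchEvent(new Event("abort"));
  c.dispatchEvent(new Event("abort"));

  return counts;
}
```

With `once`, no removal call is needed when the event fires; a listener whose event never fires would still need explicit removal.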


@HarshaNalluru HarshaNalluru Sep 21, 2023

Ah, a great catch, will check that, thanks.

});

it("should resolve with the first promise that resolves, abort the rest", async function () {
const startTime = Date.now();

I'm a little wary of doing math with clock time in tests unless we're mocking the clock to begin with.

Member Author

Replaced the delay function with an in-place function that has my trackers in it :)
Also got rid of the clocks and math. Let me know if that looks good.

@HarshaNalluru

/azp run js - event-hubs - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeremymeng jeremymeng left a comment

Looks good!

Co-authored-by: Deyaaeldeen Almahallawi <dealmaha@microsoft.com>
@HarshaNalluru HarshaNalluru enabled auto-merge (squash) September 21, 2023 23:45
@HarshaNalluru HarshaNalluru merged commit 6f67147 into Azure:main Sep 22, 2023
24 checks passed
@HarshaNalluru HarshaNalluru deleted the harshan/issue/25426 branch September 22, 2023 11:12
HarshaNalluru added a commit that referenced this pull request Sep 22, 2023
Packages have not been released yet.
Followup of #26748