EventHub: consumer memory issues processing event backlog #27253

Closed
2 of 6 tasks
thomasstoermer opened this issue Sep 26, 2023 · 9 comments
Assignees
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close.

Comments

@thomasstoermer

  • Package Name: @azure/event-hubs
  • Package Version: 5.11.1, 5.11.2
  • Operating system:
  • nodejs
    • version: Docker: node:18.17.1-bullseye-slim
  • browser
    • name/version:
  • typescript
    • version: 4.7.4
  • Is the bug related to documentation in

Describe the bug
We observe Event Hub consumer client issues when processing a large event backlog across several partitions. The client's memory consumption grows until it reaches the container limit and triggers pod restarts in Kubernetes (the limit was already increased to 4 GB and the pod still restarts).
The issue goes away with the older version @azure/event-hubs 5.8.0, which runs stably without any memory problems.

To Reproduce
Steps to reproduce the behavior:

  1. Event Hub with multiple partitions (32).
  2. Stop the Event Hub consumer and produce a backlog of events on all partitions (e.g. 100,000 events per partition, event body >500 bytes).
  3. Start one Event Hub consumer with a checkpoint store, processing all partitions.
  4. Important: the consumer test client must simulate a delay in event processing (e.g. setTimeout(x)), because a production client might, for example, wait for a DB operation to complete.
  5. Observe heavy memory consumption until the limits are reached.
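
For readers reproducing step 4, a minimal TypeScript sketch of a deliberately slow handler follows. It is not the reporter's test client: `makeSlowHandler`, `ReceivedEvent`, and the 200 ms default are hypothetical names and values chosen for illustration, and the returned `processEvents` function only mirrors the shape of the handler passed to `subscribe`.

```typescript
// Hypothetical minimal event shape; the SDK's received-event type carries more fields.
interface ReceivedEvent {
  body: unknown;
  sequenceNumber: number;
}

// Resolves after `ms` milliseconds, standing in for a slow DB call.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Builds a handler that waits `delayMs` per non-empty batch before counting its events.
function makeSlowHandler(delayMs: number = 200) {
  let processed = 0;
  return {
    async processEvents(events: ReceivedEvent[]): Promise<void> {
      if (events.length === 0) return; // nothing to do for an empty batch
      await sleep(delayMs);            // simulated processing delay
      processed += events.length;
    },
    processedCount: (): number => processed,
  };
}
```

In an actual test, the `processEvents` function would be passed to `subscribe` on the consumer client, together with a `processError` handler and the SubscribeOptions listed under Additional context.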

Expected behavior
We expect the consumer client to run stably when processing a huge event backlog on several partitions, as it did with the old version 5.8.0. Otherwise the Event Hub client cannot be operated reliably in production.

Additional context

  • Consumer is created with:
    • loadBalancingOptions: { strategy: "greedy" }
    • Checkpointstore on Azure storage
    • SubscribeOptions: { maxWaitTimeInSeconds: 0.5, maxBatchSize: 100}
@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-triage Workflow: This issue needs the team to triage. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Sep 26, 2023
@deyaaeldeen deyaaeldeen added bug This issue requires a change to an existing behavior in the product in order to be resolved. and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that needs-team-triage Workflow: This issue needs the team to triage. labels Sep 26, 2023
@deyaaeldeen deyaaeldeen self-assigned this Sep 26, 2023
@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Sep 26, 2023
@HarshaNalluru HarshaNalluru self-assigned this Sep 27, 2023
@deyaaeldeen
Member

Thanks for filing this issue. Did you try any other versions, such as 5.9? Does setting the prefetch count help?

@thomasstoermer
Author

We only tried versions:

  • 5.11.1
  • 5.11.2
  • 5.8.0

We do not set prefetchCount, but as far as I can see it gets a default based on maxBatchSize. We can run another test with prefetchCount set explicitly.

We did not test version 5.9.0, because it had another memory issue, which has since been fixed.

Thanks for your support.

@ManuelTreu

ManuelTreu commented Sep 29, 2023

I'm following up on @thomasstoermer's issue:
We have now tested with 5.9.0 and have not found any memory problems.
We have also tested version 5.11.2 with prefetchCount = 50 and 100 and still see out-of-memory problems.

@thomasstoermer
Author

Have you been able to reproduce the issue?

@HarshaNalluru
Member

@thomasstoermer thanks for your patience, we did reproduce the issue.

Will post the investigation updates once we have anything more significant.

@deyaaeldeen
Member

deyaaeldeen commented Nov 3, 2023

Hi @thomasstoermer and @ManuelTreu,

I would like to give you a brief update on this issue. After further investigation, @HarshaNalluru and I have identified the root cause: automatic flow management was enabled in Rhea, as detailed in this GitHub pull request: Prefetch Events.

To clarify, with automatic flow management enabled, Rhea forwards all events to the consumer client as soon as they are received. Consequently, if the processEvents handler cannot process incoming events promptly, they accumulate in the client's internal queue, leading to a memory leak.
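
The arithmetic behind this can be sketched in a few lines of TypeScript. This is an illustrative model, not SDK or Rhea code: `simulateQueueDepth` and the per-tick rates are hypothetical. It assumes the sender pushes a fixed number of events per tick with no credit limit while the handler drains a smaller fixed number.

```typescript
// Models an unbounded internal queue: every received event is queued at once,
// and the handler removes at most `handledPerTick` events per tick.
function simulateQueueDepth(
  arrivalsPerTick: number,
  handledPerTick: number,
  ticks: number,
): number[] {
  const depths: number[] = [];
  let depth = 0;
  for (let t = 0; t < ticks; t++) {
    depth += arrivalsPerTick;                 // forwarded as soon as received
    depth -= Math.min(depth, handledPerTick); // the slow handler drains what it can
    depths.push(depth);
  }
  return depths;
}

// Hypothetical rates: 1000 events arrive per tick, 100 are handled per tick.
// Queue depth after each of 5 ticks: 900, 1800, 2700, 3600, 4500.
console.log(simulateQueueDepth(1000, 100, 5));
```

In this model the depth grows by `arrivalsPerTick - handledPerTick` on every tick, so with a large backlog it is bounded only by available memory.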

As a temporary solution, you should be able to use v5.10.0 without encountering this issue, since the feature was introduced in v5.11.0. In the meantime, I will consult with our team to determine the best way to address it. My initial recommendation is to implement custom prefetching in the client itself so that the internal queue does not become overloaded.
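
Such client-side prefetching can be pictured as a queue with a fixed capacity that only grants the sender as much credit as it has free slots. The TypeScript below is a sketch of that idea under stated assumptions, not the fix that shipped: `PrefetchWindow`, its methods, and the capacity value are hypothetical.

```typescript
// Hypothetical bounded prefetch buffer: the queue never holds more than `capacity` events.
class PrefetchWindow<T> {
  private readonly queue: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Number of events the client may still request from the sender.
  availableCredit(): number {
    return this.capacity - this.queue.length;
  }

  // Called when an event arrives; refuses anything beyond the granted credit.
  deliver(event: T): boolean {
    if (this.queue.length >= this.capacity) return false;
    this.queue.push(event);
    return true;
  }

  // The handler takes up to `maxBatchSize` events, which frees credit again.
  takeBatch(maxBatchSize: number): T[] {
    return this.queue.splice(0, maxBatchSize);
  }

  get depth(): number {
    return this.queue.length;
  }
}

const prefetch = new PrefetchWindow<number>(3);
const accepted = [1, 2, 3, 4, 5].filter((e) => prefetch.deliver(e));
console.log(accepted, prefetch.depth);          // accepted holds 1, 2, 3; depth is 3
console.log(prefetch.takeBatch(2));             // the batch holds 1, 2
console.log(prefetch.availableCredit());        // 2 slots are free again
```

The design choice here is that memory use is capped by `capacity` no matter how slow the handler is; the cost is that the sender sits idle whenever the queue is full.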

Again, thanks for your patience and I'll keep you posted.

@thomasstoermer
Author

Thanks for the update and the details, @deyaaeldeen 👍

Unfortunately, v5.10.0 will not work for us because of an earlier issue: #25928

deyaaeldeen added a commit that referenced this issue Nov 7, 2023
### Packages impacted by this PR
@azure/event-hubs

### Issues associated with this PR
#27253

### Describe the problem that is addressed by this PR

#27253 (comment)

### What are the possible designs available to address the problem? If there are more than one possible design, why was the one in this PR chosen?


### Are there test cases added in this PR? _(If not, why?)_
To be tested using stress testing framework.

UPDATE: The results are in and it is confirmed there is no more space
leak!

### Provide a list of related PRs _(if any)_
#26065

### Command used to generate this PR _(Applicable only to SDK release request PRs)_

### Checklists
- [x] Added impacted package name to the issue description
- [ ] Does this PR need any fixes in the SDK Generator? _(If so, create an Issue in the [Autorest/typescript](https://github.com/Azure/autorest.typescript) repository and link it here)_
- [x] Added a changelog (if necessary)
@deyaaeldeen
Member

@azure/event-hubs@5.11.3 has been released and it fixes this issue 😊 Please let me know if you have any questions!

@deyaaeldeen deyaaeldeen added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Nov 7, 2023

github-actions bot commented Nov 7, 2023

Hi @thomasstoermer. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Nov 7, 2023
@xirzec xirzec closed this as completed Nov 14, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Feb 12, 2024