-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow #28142
Comments
Hi, could you please raise a customer issue to Dataflow, as jobId / job graph are needed for triaging ? The symptom reported here is general and hard to find cause without the job info available |
one thing at least could check is to see if 2.47 and 2.48 had the same symptom thus narrow down the issue |
Sure, where I can submit this? Can't find anything in GCP, do you have a link? Thanks |
This issue appears to occur in 2.48 as well with a pipeline just consuming from Cloud Pubsub.
|
Please check this: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests |
@boolangery to confirm, was this a Go or Python pipeline? |
A Go one |
Issue has been created: https://issuetracker.google.com/issues/297918533 |
Adding the following service option when starting the job will let you get / provide CPU and HEAP profiles of the SDK worker in dataflow:
From |
FYI, we have observed a memory leak in Python SDK, which we correlated with a protobuf dependency upgrade: #28246. This issue may or may not be similar in nature. |
If this makes the Go SDK unusable in 2.49.0 and beyond then per https://beam.apache.org/contribute/issue-priorities/ I would agree with P1. If it is usable in some cases then P2 is appropriate. |
And if P1 it should not be unassigned and should have ~daily updates and block releases. |
This issue is still here in 2.53 |
@boolangery where does the heap profile show the memory is being held? The heap profile can be collected as described in the earlier comment: Otherwise, additional information would be useful for me to replicate the issue. A rough throughput, and message size would be very useful. |
Ah I see #28142 (comment) has been updated with profiles! Thank you. |
The allocation is in makeChannels, which likely means it's the map from instruction/bundle ids to element channels. Something isn't getting cleaned up for some reason. I believe it's a quick fix, and as the 2.54.0 release manager, I'm going to cherry pick it in once I've got it, since we're still in the "stabilization" phase of the new release. Thank you for your patience and cooperation. |
I've successfully locally reproduced the issue locally using a lightly adjusted local prism runner, executing the pipeline in loopback mode and pprof, and narrowed down the leak to the channel cache in the read loop. It's not as aware of finished instructions as it should be. Very localized as a fix at least. |
The root cause is a subtle thing from the design of the Beam FnAPI protocol, but otherwise going to be on an SDK to SDK basis. Essentially, the data channel and the control channel are coordinated. But they are independant. The data could come in before the bundle that processes that data, but we need to hold onto it. Similarly, the ProcessBundle request could come in earlier, and it needs to wait until the data is ready. Or any particular interleaving of the two. The leak in the code is from that former case, where we're able to pull in all the data before ProcessBundle even starts up. Unfortunately, the Data channel doesn't know if it may close the Go Channel (elementsChan in the code) that sends the elements to the execution layer, until it knows what BundleDescriptor is being used so it can see if the Bundle uses Timers or not, and if so, how many transforms. In practice, there's likely to only be 2 streams, One for data, and the other for timers, but per the protocol, it could be arbitrary, so the SDK can't make an assumptions. So the flow causing the leak is: But leak is because the read loop never "learns" that the data is complete and it can evict that reader from its cache, since the read loop never sees the instructionID again. PubSub ends up triggering this behavior because outside of backlog catch up, each bundle is for a single element, so this causes a great deal of readers in the cache. I should have a PR shortly. |
Thank you for the explanation and the fix! |
FYI, 2.54.0 is now available. While I'm pretty sure this issue is now resolved, it's good to get confirmation from affected users too. |
What happened?
Hi,
We updated a Pubsub streaming job on Dataflow from 2.46.0 to 2.49.0. See these memory diagrams:
2.46.0 memory utilisation:
2.49.0 memory utilisation:
We sent back on the 2.46.0 for this job as workers were running out of memory and a lot of lag was introduced.
Do you have an explanation? What changed on memory management between 2.46.0 and 2.49.0?
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
The text was updated successfully, but these errors were encountered: