[Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow #28142

boolangery · 2023-08-24T07:50:29Z

Abacn · 2023-08-24T15:06:39Z

Hi, could you please raise a customer issue with Dataflow? The job ID and job graph are needed for triaging.

The symptom reported here is general, and the cause is hard to find without the job info.

Abacn · 2023-08-24T15:07:37Z

One thing you could at least check is whether 2.47 and 2.48 show the same symptom, to narrow down the issue.

boolangery · 2023-08-25T15:00:16Z

> Hi, could you please raise a customer issue with Dataflow? The job ID and job graph are needed for triaging.
>
> The symptom reported here is general, and the cause is hard to find without the job info.

Sure, where can I submit this? I can't find anything in GCP. Do you have a link? Thanks

scwhittle · 2023-08-28T10:26:31Z

This issue appears to occur in 2.48 as well with a pipeline just consuming from Cloud Pubsub.

    _ = pipeline | "Read pubsub" >> io.ReadFromPubSub(
        subscription=sub, with_attributes=True
    )

liferoad · 2023-08-28T13:33:18Z

> > Hi, could you please raise a customer issue with Dataflow? The job ID and job graph are needed for triaging.
> > The symptom reported here is general, and the cause is hard to find without the job info.
>
> Sure, where can I submit this? I can't find anything in GCP. Do you have a link? Thanks

Please check this: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests

tvalentyn · 2023-08-28T16:53:14Z

@boolangery to confirm, was this a Go or Python pipeline?

chleech · 2023-08-28T20:11:35Z

I’ve been experiencing the same issue. To validate, we also set up a pipeline that only reads from a single subscription and ran it for 2 weeks; its memory usage increased constantly.

We got a response from the Dataflow team, and their suggestion was to try 2.46.0. Will update here once we manage to test.

boolangery · 2023-08-29T07:46:52Z

> @boolangery to confirm, was this a Go or Python pipeline?

A Go one

boolangery · 2023-08-29T07:56:17Z

Issue has been created: https://issuetracker.google.com/issues/297918533

lostluck · 2023-08-29T23:43:00Z

Adding the following service option when starting the job will let you collect and share CPU and heap profiles of the SDK worker in Dataflow:

    --dataflow_service_options=enable_google_cloud_profiler

From https://cloud.google.com/dataflow/docs/guides/profiling-a-pipeline#enable_for_pipelines

tvalentyn · 2023-08-31T00:39:28Z

FYI, we have observed a memory leak in Python SDK, which we correlated with a protobuf dependency upgrade: #28246. This issue may or may not be similar in nature.

kennknowles · 2023-09-12T13:36:17Z

If this makes the Go SDK unusable in 2.49.0 and beyond then per https://beam.apache.org/contribute/issue-priorities/ I would agree with P1. If it is usable in some cases then P2 is appropriate.

kennknowles · 2023-09-12T13:36:37Z

And if P1 it should not be unassigned and should have ~daily updates and block releases.

boolangery · 2024-01-25T10:19:22Z

This issue is still present in 2.53.

lostluck · 2024-01-25T14:04:39Z

@boolangery where does the heap profile show the memory is being held? The heap profile can be collected as described in the earlier comment:

#28142 (comment)

Otherwise, additional information would be useful for me to replicate the issue. A rough throughput, and message size would be very useful.

lostluck · 2024-01-25T16:59:48Z

Ah I see #28142 (comment) has been updated with profiles! Thank you.

lostluck · 2024-01-25T17:10:40Z

The allocation is in makeChannels, which likely means it's the map from instruction/bundle ids to element channels. Something isn't getting cleaned up for some reason.

I believe it's a quick fix, and as the 2.54.0 release manager, I'm going to cherry pick it in once I've got it, since we're still in the "stabilization" phase of the new release. Thank you for your patience and cooperation.

lostluck · 2024-01-25T18:31:33Z

I've successfully reproduced the issue locally using a lightly adjusted local prism runner, executing the pipeline in loopback mode with pprof, and narrowed down the leak to the channel cache in the read loop. It's not as aware of finished instructions as it should be.

The fix is very localized, at least.
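For readers who want to look for this kind of growth in their own local runs, here is a minimal sketch of grabbing a heap profile from inside a Go process with the standard `runtime/pprof` package. It is not the setup used for the reproduction above; `captureHeapProfile` is a hypothetical helper name, and how you wire it into a pipeline is left as an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile is a hypothetical helper. It forces a GC so the
// profile reflects up-to-date live-object statistics, then returns the
// heap profile in pprof's gzip-compressed protobuf format.
func captureHeapProfile() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	prof, err := captureHeapProfile()
	if err != nil {
		panic(err)
	}
	// In practice the bytes would be written to a file and inspected
	// with `go tool pprof`.
	fmt.Printf("captured %d bytes of heap profile\n", len(prof))
}
```

Comparing two profiles taken some time apart is usually enough to see which allocation site keeps growing.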

lostluck · 2024-01-25T22:29:01Z

The root cause is a subtle consequence of the design of the Beam FnAPI protocol, but otherwise it is going to be handled on an SDK-by-SDK basis.

Essentially, the data channel and the control channel are coordinated, but they are independent. The data could come in before the bundle that processes it, in which case we need to hold onto it. Similarly, the ProcessBundle request could come in earlier, and it needs to wait until the data is ready. Or any particular interleaving of the two.
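A minimal sketch of that coordination, under stated assumptions: `dataManager`, `readerFor`, `onData`, and `processBundle` are hypothetical names for illustration and are not the SDK's actual types. Whichever side shows up first creates the per-instruction channel, and the other side finds it in the map.

```go
package main

import (
	"fmt"
	"sync"
)

// dataManager is a hypothetical stand-in for the data channel's state.
// It hands out one buffered Go channel per instruction ID, creating it
// on first use regardless of which side asks first.
type dataManager struct {
	mu      sync.Mutex
	readers map[string]chan string
}

func newDataManager() *dataManager {
	return &dataManager{readers: make(map[string]chan string)}
}

// readerFor returns the element channel for instID, creating it if needed.
func (m *dataManager) readerFor(instID string) chan string {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch, ok := m.readers[instID]
	if !ok {
		ch = make(chan string, 16)
		m.readers[instID] = ch
	}
	return ch
}

// onData models the data stream delivering one element for instID.
func (m *dataManager) onData(instID, elem string) {
	m.readerFor(instID) <- elem
}

// processBundle models a ProcessBundle request consuming one element.
func (m *dataManager) processBundle(instID string) string {
	return <-m.readerFor(instID)
}

func main() {
	m := newDataManager()

	// Case 1: data arrives before the ProcessBundle request and is held.
	m.onData("inst-1", "a")
	fmt.Println(m.processBundle("inst-1"))

	// Case 2: the ProcessBundle request arrives first and waits for data.
	done := make(chan string)
	go func() { done <- m.processBundle("inst-2") }()
	m.onData("inst-2", "b")
	fmt.Println(<-done)
}
```

Note that nothing in this sketch ever removes an entry from `readers`, which is the kind of gap the rest of this comment describes.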

The leak in the code is from that former case, where we're able to pull in all the data before ProcessBundle even starts up. Unfortunately, the data channel doesn't know whether it may close the Go channel (elementsChan in the code) that sends the elements to the execution layer until it knows which BundleDescriptor is being used, so it can see whether the bundle uses timers and, if so, for how many transforms. In practice, there are likely to be only 2 streams, one for data and the other for timers, but per the protocol the number could be arbitrary, so the SDK can't make any assumptions.

So the flow causing the leak is:
1. The read loop sees data for an unseen instruction.
2. It creates and caches an elementsChan.
3. It gets all the data.
4. It marks off how many "is-last" signals it sees. (Once we have all the IsLasts, the read loop never sees a reference to that instruction again.)
5. The ProcessBundle request is received.
6. We know we have everything, so we close the channel and the ProcessBundle can terminate.

But the leak happens because the read loop never "learns" that the data is complete and that it can evict that reader from its cache, since the read loop never sees the instruction ID again.

PubSub ends up triggering this behavior because, outside of backlog catch-up, each bundle is for a single element, so a great many readers accumulate in the cache.
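The flow above can be sketched roughly as follows. This is a simplified, single-threaded model and not the SDK's code: `reader`, `cache`, `onData`, `onProcessBundle`, and `simulate` are hypothetical names, and the real read loop is concurrent. The `evict` switch stands in for the idea of dropping a reader from the cache once its channel is closed.

```go
package main

import "fmt"

// reader is a hypothetical stand-in for one cached per-instruction reader.
type reader struct {
	elems       chan string
	lastsSeen   int // "is-last" signals observed so far
	lastsWanted int // streams expected; 0 until ProcessBundle arrives
	closed      bool
}

// cache maps instruction IDs to readers, like the read loop's channel cache.
type cache struct {
	readers map[string]*reader
	evict   bool // when true, drop a reader as soon as it is closed
}

func newCache(evict bool) *cache {
	return &cache{readers: make(map[string]*reader), evict: evict}
}

func (c *cache) get(id string) *reader {
	r, ok := c.readers[id]
	if !ok {
		r = &reader{elems: make(chan string, 8)}
		c.readers[id] = r
	}
	return r
}

// onData models the read loop receiving one element for an instruction.
func (c *cache) onData(id, elem string, isLast bool) {
	r := c.get(id)
	r.elems <- elem
	if isLast {
		r.lastsSeen++
	}
	c.maybeClose(id, r)
}

// onProcessBundle models the control side learning the stream count.
func (c *cache) onProcessBundle(id string, streams int) {
	r := c.get(id)
	r.lastsWanted = streams
	c.maybeClose(id, r)
}

// maybeClose closes the channel once every expected is-last has been seen.
func (c *cache) maybeClose(id string, r *reader) {
	if r.closed || r.lastsWanted == 0 || r.lastsSeen < r.lastsWanted {
		return
	}
	close(r.elems)
	r.closed = true
	if c.evict {
		delete(c.readers, id)
	}
}

// simulate runs n single-element bundles whose data arrives before the
// ProcessBundle request, and reports how many readers remain cached.
func simulate(evict bool, n int) int {
	c := newCache(evict)
	for i := 0; i < n; i++ {
		id := fmt.Sprintf("inst-%d", i)
		c.onData(id, "msg", true)
		c.onProcessBundle(id, 1)
	}
	return len(c.readers)
}

func main() {
	fmt.Println("without eviction:", simulate(false, 1000))
	fmt.Println("with eviction:", simulate(true, 1000))
}
```

In this model, 1000 single-element bundles leave 1000 closed readers in the map without eviction and none with it, which mirrors the steady growth seen with PubSub sources.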

I should have a PR shortly.

boolangery · 2024-01-26T08:20:23Z

Thank you for the explanation and the fix!


lostluck · 2024-02-15T22:56:34Z

FYI, 2.54.0 is now available. While I'm pretty sure this issue is now resolved, it's good to get confirmation from affected users too.

boolangery added awaiting triage bug labels Aug 24, 2023

github-actions bot added go dataflow P2 labels Aug 24, 2023

tvalentyn added P1 and removed P2 labels Aug 28, 2023

tvalentyn changed the title ~~[Bug]: Memory seems to be leaking on 2.49.0 with Dataflow~~ [Bug]: [Go] Memory seems to be leaking on 2.49.0 with Dataflow Aug 31, 2023

tvalentyn changed the title ~~[Bug]: [Go] Memory seems to be leaking on 2.49.0 with Dataflow~~ [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow Aug 31, 2023

jrmccluskey removed the awaiting triage label Sep 28, 2023

lostluck self-assigned this Jan 25, 2024

lostluck added a commit to lostluck/beam that referenced this issue Jan 25, 2024

[apache#28142] Evict closed readers from the cache.

5233c87

lostluck added a commit to lostluck/beam that referenced this issue Jan 25, 2024

[apache#28142][Go SDK] Evict closed readers from the cache.

4a09c43

lostluck mentioned this issue Jan 25, 2024

[#28142][Go SDK] Evict closed readers from the cache. #30119

Merged

3 tasks

jrmccluskey closed this as completed in #30119 Jan 26, 2024

jrmccluskey pushed a commit that referenced this issue Jan 26, 2024

[#28142][Go SDK] Evict closed readers from the cache. (#30119)

e0e20a1

Co-authored-by: lostluck <13907733+lostluck@users.noreply.github.com>

github-actions bot added this to the 2.55.0 Release milestone Jan 26, 2024

lostluck mentioned this issue Jan 26, 2024

[Cherry Pick #30119] [Go SDK] Evict closed readers from the cache. #30133

Merged

3 tasks
