JoinNode ignores Delete BarrierNode messages. #2436

Closed

m4ce opened this issue Nov 28, 2020 · 33 comments

@m4ce
Contributor

m4ce commented Nov 28, 2020

After some testing, I found that the JoinNode's cardinality doesn't decrease when a BarrierMessage is emitted for a group that should expire. As a result, the JoinNode's cardinality grows forever, which amounts to a memory leak.

@docmerlin
Contributor

Are you using the .delete(TRUE) method on the barrier node? If you are not, it will not decrease the cardinality.

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

@docmerlin, I am using .delete(TRUE), but the join() cardinality keeps increasing nonetheless.

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

From my testing, it looks like join() is ignoring barrier messages?

@docmerlin
Contributor

@m4ce
Can you post your script here?

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

var data1 = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement('kubernetes_pod_container')
        .where(lambda: isPresent("cpu_usage_nanocores"))
        .groupBy(groupBy)
    |barrier()
        .idle(5m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(1m)
        .align()
    |mean('cpu_usage_nanocores')
        .as('avgCpuUsageNanoCores')
    // Convert from nanocores to millicores.
    |eval(lambda: float("avgCpuUsageNanoCores") / 1000000.0)
        .as('avgCpuUsageMilliCores')

var data2 = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement('kubernetes_pod_container_status')
        .groupBy(groupBy)
    |default()
        // default to one core
        .field('resource_limits_millicpu_units', oneCpu)
    |barrier()
        .idle(5m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(1m)
        .align()
    |last('resource_limits_millicpu_units')
        .as('milliCores')
    // When no cpus limits are defined, set the default to 1 CPU instead of defaulting to the number of cores available on the box.
    // We want to catch run-away processes that do not have any limits set.
    |eval(lambda: if("milliCores" > 0, float("milliCores"), float(oneCpu)))
        .as('maxUsageMilliCores')

var data = data1
    |join(data2)
        .as('stats', 'limits')
        .tolerance(15s)
    |barrier()
        .idle(5m)
        .delete(TRUE)
    |eval(lambda: "stats.avgCpuUsageMilliCores" / "limits.maxUsageMilliCores" * 100.0)
        .as('usage_percent')
    |stateCount(lambda: "usage_percent" > crit)
        .as('critCount')
    |stateCount(lambda: "usage_percent" <= crit)
        .as('critResetCount')
    |stateCount(lambda: "usage_percent" > warn AND "usage_percent" <= crit)
        .as('warnCount')
    |stateCount(lambda: "usage_percent" <= warn)
        .as('warnResetCount')

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

I even had to add a barrier after join() in order to reduce the cardinality on the other nodes. I can see the cardinality decreasing in the nodes before and after the join, but the join itself keeps increasing!

@docmerlin docmerlin added the bug label Dec 9, 2020
@docmerlin docmerlin changed the title JoinNode cardinality and BarrierNode - bug JoinNode ignores Delete BarrierNode messages. Dec 9, 2020
@docmerlin
Contributor

OK, I've been able to confirm that the join node is ignoring barrier node delete messages.

@docmerlin
Contributor

NewMultiConsumerWithStats isn't handling the DeleteGroupMessage.
Ordinary barrier messages are handled properly, but DeleteGroupMessages are not.
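
To illustrate the failure mode, here is a minimal, self-contained Go sketch (hypothetical type names, not the actual Kapacitor code) of a grouped-consumer dispatch loop that only handles the message kinds it already knows about, and therefore silently drops delete-group messages so per-group state is never pruned:

package main

import "fmt"

// Hypothetical message kinds, standing in for the edge package's point,
// barrier, and delete-group messages.
type message interface{ group() string }

type pointMessage struct{ g string }
type barrierMessage struct{ g string }
type deleteGroupMessage struct{ g string }

func (m pointMessage) group() string       { return m.g }
func (m barrierMessage) group() string     { return m.g }
func (m deleteGroupMessage) group() string { return m.g }

// consumer tracks per-group state, roughly the way a join node tracks cardinality.
type consumer struct{ groups map[string]struct{} }

func (c *consumer) handle(m message) {
	switch m.(type) {
	case pointMessage:
		c.groups[m.group()] = struct{}{} // group becomes (or stays) live
	case barrierMessage:
		// a barrier advances time for the group but does not remove it
	case deleteGroupMessage:
		// without this case the group map grows forever; with it, the
		// cardinality drops when the barrier node emits a delete
		delete(c.groups, m.group())
	}
}

func main() {
	c := &consumer{groups: map[string]struct{}{}}
	c.handle(pointMessage{g: "pod=a"})
	c.handle(deleteGroupMessage{g: "pod=a"})
	fmt.Println("live groups:", len(c.groups)) // prints 0 once deletes are handled
}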

@docmerlin
Contributor

A tentative plan is to see whether it is easier to add DeleteGroupMessage handling to MultiConsumerWithStats or to create a NewMultiConsumerWithStatsAndDelete that can handle the DeleteGroupMessage.

@docmerlin docmerlin added this to the 1.5.8 milestone Dec 9, 2020
@docmerlin
Contributor

Tentatively adding this to the 1.5.8 milestone, because OOMs.

@m4ce
Contributor Author

m4ce commented Jan 8, 2021

@docmerlin, we just had another OOM due to this. Do you think you guys will be able to get it in for this or the next release? Thanks.

@psteinbachs psteinbachs removed this from the 1.5.8 milestone Jan 13, 2021
@joshdurbin

I'm also impacted by this. It looks like the 1.5.8 milestone was removed from this issue; any idea when, if not in 1.5.8, this might be resolved?

@vj6

vj6 commented Feb 2, 2021

@docmerlin, we are experiencing frequent OOMs due to this. Can you please let us know the next milestone this will be added to?

@docmerlin
Contributor

docmerlin commented Feb 2, 2021 via email

@m4ce
Contributor Author

m4ce commented Feb 24, 2021

Looks like it's not planned for 1.5.9? Will it make it into 1.6.0?

@docmerlin
Contributor

@m4ce We found an honest-to-goodness, old-fashioned memory leak in the join node, so we fixed that instead. That will likely fix most people's problems.


@m4ce
Contributor Author

m4ce commented Mar 25, 2021

But didn't you say that the join node ignores barrier node delete messages? How is that going to fix this, then?

@docmerlin
Contributor

docmerlin commented Mar 29, 2021

@m4ce said:

But didn't you say that the join node ignores barrier node delete messages? How is that going to fix this, then?

Because of a memory leak, the old join node kept a copy of every point that passed through it, even points it was done joining. The fix removes the leak, so memory usage should be vastly improved.
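
Roughly speaking (this is an illustrative, self-contained sketch, not the actual JoinNode code), the leak was equivalent to buffering points per timestamp and never releasing the buffered entry once the join is emitted; the fix amounts to the delete in the sketch below:

package main

import "fmt"

type point struct {
	time  int64
	value float64
}

// joinBuffer holds points that are waiting for their counterpart from the
// other stream, keyed by timestamp.
type joinBuffer struct {
	pending map[int64][]point
}

// add buffers a point and, once both sides have arrived, emits the pair.
func (b *joinBuffer) add(p point) ([]point, bool) {
	b.pending[p.time] = append(b.pending[p.time], p)
	if len(b.pending[p.time]) < 2 {
		return nil, false
	}
	matched := b.pending[p.time]
	// The fix amounts to this line: release the buffered points once they
	// have been joined instead of keeping a copy of every point forever.
	delete(b.pending, p.time)
	return matched, true
}

func main() {
	b := &joinBuffer{pending: map[int64][]point{}}
	b.add(point{time: 1, value: 10})
	pair, ok := b.add(point{time: 1, value: 20})
	fmt.Println(ok, pair, "buffered timestamps:", len(b.pending)) // true [...] 0
}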

@joshdurbin

joshdurbin commented Mar 29, 2021

Any idea when this will be merged and released? Is this issue fixed by:

@m4ce
Contributor Author

m4ce commented Mar 29, 2021

@docmerlin - Okay, let's see how the new release performs before closing this issue.

@vj6

vj6 commented Apr 6, 2021

@docmerlin this still persists in 1.5.9.

Just curious: which milestone is the fix for the earlier-mentioned "old fashioned memory leak" planned for? Thanks.

@m4ce
Contributor Author

m4ce commented Apr 6, 2021

@docmerlin - the issue here is that we have high cardinality in our join node, and it will not decrease unless barrier messages are supported so that groups can be expired. I am sure the bugs you fixed will help in several cases, but not here, I am afraid. Unless barrier messages are handled, this won't solve the issue outlined in this ticket.

@vj6

vj6 commented May 4, 2021

@docmerlin any updates, please?

@docmerlin
Contributor

docmerlin commented May 5, 2021

@vj-gsr sorry, I am currently working on other Kapacitor tasks. I can say that this is about 80% done, however. I'm trying really hard to squeeze it into 1.6.0, but I can't promise anything yet.

Edit: I've been told we are holding 1.6.0 until we can get this in, to make sure it makes the release.

@vj6

vj6 commented May 6, 2021

OK, many thanks @docmerlin. Is there a tentative date for the 1.6.0 release?

@vj6

vj6 commented Jun 1, 2021

@docmerlin did you get a chance to work on this? Any update, please?

@m4ce
Contributor Author

m4ce commented Jun 14, 2021

ts=2021-06-14T07:27:50.277Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=barrier2 err="unexpected message of type *edge.deleteGroupMessage"
ts=2021-06-14T07:27:50.279Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=from1 err="edge aborted"
ts=2021-06-14T07:27:50.387Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=stream0 err="edge aborted"

The TICKscript was working with 1.5.9; trying 1.6.0-rc2, it fails with the errors above. It seems there's something wrong with how barrier messages are handled in the join node?
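
For what it's worth, that error looks like a receiver that returns an error for any message kind it does not explicitly handle. Below is a minimal, self-contained Go sketch of that failure mode (hypothetical names, not the actual Kapacitor code), which would also explain the "edge aborted" errors that follow:

package main

import (
	"errors"
	"fmt"
)

type message interface{ name() string }

type barrierMessage struct{}
type deleteGroupMessage struct{}

func (barrierMessage) name() string     { return "barrierMessage" }
func (deleteGroupMessage) name() string { return "deleteGroupMessage" }

// receive mimics a node that only knows how to handle barrier messages;
// anything else is treated as fatal, which aborts the edge and the task.
func receive(m message) error {
	switch m.(type) {
	case barrierMessage:
		return nil // handled
	default:
		return errors.New("unexpected message of type " + m.name())
	}
}

func main() {
	fmt.Println(receive(barrierMessage{}))     // <nil>
	fmt.Println(receive(deleteGroupMessage{})) // unexpected message of type deleteGroupMessage
}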

@m4ce
Contributor Author

m4ce commented Jun 14, 2021

I actually see issues with TICKscripts that don't use the join node as well. In effect, I see the problem with any TICKscript that uses the barrier node. I think we have a regression here.

@prashanthjbabu
Contributor

@m4ce @docmerlin I have a fix for:

ts=2021-06-14T07:27:50.277Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=barrier2 err="unexpected message of type *edge.deleteGroupMessage"
ts=2021-06-14T07:27:50.279Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=from1 err="edge aborted"
ts=2021-06-14T07:27:50.387Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=stream0 err="edge aborted"

PR #2585 takes care of it. I don't see the error message with this fix, and I haven't noticed any memory leaks either.

@vj6

vj6 commented Jun 29, 2021

@docmerlin can you please confirm this has been sorted in v1.6.0?

@vj6

vj6 commented Aug 16, 2021

@docmerlin the memory leak is still present in v1.6.1. It has slowed down a bit with .deleteAll(TRUE), but it is not yet resolved. We are using the same script mentioned here.

@docmerlin
Contributor

I'm going to go ahead and close this issue, as it was closed by #2562.

@vj-gsr I believe your leak is coming from a different source than this issue.
