JoinNode ignores Delete BarrierNode messages. #2436

Closed

m4ce opened this issue Nov 28, 2020 · 33 comments

@m4ce
Contributor

m4ce commented Nov 28, 2020

After some testing, I found that the JoinNode's cardinality doesn't decrease when a BarrierMessage is emitted for a group that should expire. As a result, the JoinNode's cardinality grows forever, which amounts to a memory leak.

@docmerlin
Contributor

Are you using the .delete(TRUE) method on the barrier node? If you are not, it will not decrease the cardinality.

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

@docmerlin, I am using .delete(TRUE), but the join() cardinality keeps increasing nonetheless.

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

From my testing, it looks like join() is ignoring barrier messages?

@docmerlin
Contributor

@m4ce
Can you post your script here?

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

var data1 = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement('kubernetes_pod_container')
        .where(lambda: isPresent("cpu_usage_nanocores"))
        .groupBy(groupBy)
    |barrier()
        .idle(5m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(1m)
        .align()
    |mean('cpu_usage_nanocores')
        .as('avgCpuUsageNanoCores')
    // Convert from nanocores to millicores.
    |eval(lambda: float("avgCpuUsageNanoCores") / 1000000.0)
        .as('avgCpuUsageMilliCores')

var data2 = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement('kubernetes_pod_container_status')
        .groupBy(groupBy)
    |default()
        // default to one core
        .field('resource_limits_millicpu_units', oneCpu)
    |barrier()
        .idle(5m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(1m)
        .align()
    |last('resource_limits_millicpu_units')
        .as('milliCores')
    // When no cpus limits are defined, set the default to 1 CPU instead of defaulting to the number of cores available on the box.
    // We want to catch run-away processes that do not have any limits set.
    |eval(lambda: if("milliCores" > 0, float("milliCores"), float(oneCpu)))
        .as('maxUsageMilliCores')

var data = data1
    |join(data2)
        .as('stats', 'limits')
        .tolerance(15s)
    |barrier()
        .idle(5m)
        .delete(TRUE)
    |eval(lambda: "stats.avgCpuUsageMilliCores" / "limits.maxUsageMilliCores" * 100.0)
        .as('usage_percent')
    |stateCount(lambda: "usage_percent" > crit)
        .as('critCount')
    |stateCount(lambda: "usage_percent" <= crit)
        .as('critResetCount')
    |stateCount(lambda: "usage_percent" > warn AND "usage_percent" <= crit)
        .as('warnCount')
    |stateCount(lambda: "usage_percent" <= warn)
        .as('warnResetCount')

@m4ce
Contributor Author

m4ce commented Dec 8, 2020

I even had to add a barrier after join() in order to reduce the cardinality on the other nodes. I can see the cardinality decreasing in the nodes before and after the join, but the join itself keeps increasing!

@docmerlin docmerlin added the bug label Dec 9, 2020
@docmerlin docmerlin changed the title JoinNode cardinality and BarrierNode - bug JoinNode ignores Delete BarrierNode messages. Dec 9, 2020
@docmerlin
Contributor

OK, I've been able to confirm that the join node is ignoring barrier node delete messages.

@docmerlin
Contributor

NewMultiConsumerWithStats isn't handling the DeleteGroupMessage.
Ordinary barrier messages are handled properly, but DeleteGroupMessages are not.
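
To illustrate the failure mode, here is a minimal, self-contained Go sketch (hypothetical type names, not the actual Kapacitor code) of a grouped-consumer dispatch loop that only handles the message kinds it already knows about, and therefore silently drops delete-group messages so per-group state is never pruned:

package main

import "fmt"

// Hypothetical message kinds, standing in for the edge package's point,
// barrier, and delete-group messages.
type message interface{ group() string }

type pointMessage struct{ g string }
type barrierMessage struct{ g string }
type deleteGroupMessage struct{ g string }

func (m pointMessage) group() string       { return m.g }
func (m barrierMessage) group() string     { return m.g }
func (m deleteGroupMessage) group() string { return m.g }

// consumer tracks per-group state, roughly the way a join node tracks cardinality.
type consumer struct{ groups map[string]struct{} }

func (c *consumer) handle(m message) {
	switch m.(type) {
	case pointMessage:
		c.groups[m.group()] = struct{}{} // group becomes (or stays) live
	case barrierMessage:
		// a barrier advances time for the group but does not remove it
	case deleteGroupMessage:
		// without this case the group map grows forever; with it, the
		// cardinality drops when the barrier node emits a delete
		delete(c.groups, m.group())
	}
}

func main() {
	c := &consumer{groups: map[string]struct{}{}}
	c.handle(pointMessage{g: "pod=a"})
	c.handle(deleteGroupMessage{g: "pod=a"})
	fmt.Println("live groups:", len(c.groups)) // prints 0 once deletes are handled
}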

@docmerlin
Contributor

A tentative plan is to see whether it is easier to add DeleteGroupMessage handling to MultiConsumerWithStats or to create a NewMultiConsumerWithStatsAndDelete that can handle the DeleteGroupMessage.

@docmerlin docmerlin added this to the 1.5.8 milestone Dec 9, 2020
@docmerlin
Contributor

Tentatively adding this to the 1.5.8 milestone, because OOMs.

@m4ce
Contributor Author

m4ce commented Jan 8, 2021

@docmerlin, we just had another OOM due to this. Do you think you guys will be able to get it in for this or the next release? Thanks.

@psteinbachs psteinbachs removed this from the 1.5.8 milestone Jan 13, 2021
@joshdurbin

I'm also impacted by this. It looks like the 1.5.8 milestone was removed from this issue; any idea when, if not in 1.5.8, this might be resolved?

@vj6

vj6 commented Feb 2, 2021

@docmerlin, we are experiencing frequent OOMs due to this. Can you please let us know the next milestone this will be added to?

@docmerlin
Contributor

docmerlin commented Feb 2, 2021 via email

@m4ce
Contributor Author

m4ce commented Feb 24, 2021

Looks like it's not planned for 1.5.9? Will it make it into 1.6.0?

@docmerlin
Contributor

@m4ce We found an honest-to-goodness, old-fashioned memory leak in the join node, so we fixed that instead. That will likely fix most people's problems.


@m4ce
Contributor Author

m4ce commented Mar 25, 2021

But didn't you say that the join node ignores barrier node delete messages? How is that going to fix this, then?

@docmerlin
Contributor

docmerlin commented Mar 29, 2021

@m4ce said:

But didn't you say that the join node ignores barrier node delete messages? How is that going to fix this, then?

Because of a memory leak, the old join node kept a copy of every point that passed through it, even points it was done joining. The fix removes the leak, so memory usage should be vastly improved.
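
Roughly speaking (this is an illustrative, self-contained sketch, not the actual JoinNode code), the leak was equivalent to buffering points per timestamp and never releasing the buffered entry once the join is emitted; the fix amounts to the delete in the sketch below:

package main

import "fmt"

type point struct {
	time  int64
	value float64
}

// joinBuffer holds points that are waiting for their counterpart from the
// other stream, keyed by timestamp.
type joinBuffer struct {
	pending map[int64][]point
}

// add buffers a point and, once both sides have arrived, emits the pair.
func (b *joinBuffer) add(p point) ([]point, bool) {
	b.pending[p.time] = append(b.pending[p.time], p)
	if len(b.pending[p.time]) < 2 {
		return nil, false
	}
	matched := b.pending[p.time]
	// The fix amounts to this line: release the buffered points once they
	// have been joined instead of keeping a copy of every point forever.
	delete(b.pending, p.time)
	return matched, true
}

func main() {
	b := &joinBuffer{pending: map[int64][]point{}}
	b.add(point{time: 1, value: 10})
	pair, ok := b.add(point{time: 1, value: 20})
	fmt.Println(ok, pair, "buffered timestamps:", len(b.pending)) // true [...] 0
}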

@joshdurbin

joshdurbin commented Mar 29, 2021

Any idea when this will be merged and released? Is this issue fixed by:

@m4ce
Contributor Author

m4ce commented Mar 29, 2021

@docmerlin - Okay, let's see how the new release performs before closing this issue.

@vj6

vj6 commented Apr 6, 2021

@docmerlin this still persists in 1.5.9.

Just curious: which milestone is the fix for the earlier-mentioned "old fashioned memory leak" planned for? Thanks.

@m4ce
Contributor Author

m4ce commented Apr 6, 2021

@docmerlin - the issue here is that we have high cardinality in our join node, and it will not decrease unless barrier messages are supported so that groups can be expired. I am sure the bugs you fixed will help in several cases, but not here, I am afraid. Unless barrier messages are handled, this won't solve the issue outlined in this ticket.

@vj6

vj6 commented May 4, 2021

@docmerlin any updates, please?

@docmerlin
Contributor

docmerlin commented May 5, 2021

@vj-gsr sorry, I am currently working on other Kapacitor tasks. I can say that this is about 80% done, however. I'm trying really hard to squeeze it into 1.6.0, but I can't promise anything yet.

Edit: I've been told we are holding 1.6.0 until we can get this in, to make sure it makes the release.

@vj6

vj6 commented May 6, 2021

OK, many thanks @docmerlin. Is there a tentative date for the 1.6.0 release?

@vj6

vj6 commented Jun 1, 2021

@docmerlin did you get a chance to work on this? Any update, please?

@m4ce
Contributor Author

m4ce commented Jun 14, 2021

ts=2021-06-14T07:27:50.277Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=barrier2 err="unexpected message of type *edge.deleteGroupMessage"
ts=2021-06-14T07:27:50.279Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=from1 err="edge aborted"
ts=2021-06-14T07:27:50.387Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=stream0 err="edge aborted"

The TICKscript was working with 1.5.9; trying 1.6.0-rc2, it fails with the errors above. It seems there's something wrong with how barrier messages are handled in the join node?
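
For what it's worth, that error looks like a receiver that returns an error for any message kind it does not explicitly handle. Below is a minimal, self-contained Go sketch of that failure mode (hypothetical names, not the actual Kapacitor code), which would also explain the "edge aborted" errors that follow:

package main

import (
	"errors"
	"fmt"
)

type message interface{ name() string }

type barrierMessage struct{}
type deleteGroupMessage struct{}

func (barrierMessage) name() string     { return "barrierMessage" }
func (deleteGroupMessage) name() string { return "deleteGroupMessage" }

// receive mimics a node that only knows how to handle barrier messages;
// anything else is treated as fatal, which aborts the edge and the task.
func receive(m message) error {
	switch m.(type) {
	case barrierMessage:
		return nil // handled
	default:
		return errors.New("unexpected message of type " + m.name())
	}
}

func main() {
	fmt.Println(receive(barrierMessage{}))     // <nil>
	fmt.Println(receive(deleteGroupMessage{})) // unexpected message of type deleteGroupMessage
}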

@m4ce
Contributor Author

m4ce commented Jun 14, 2021

I actually see issues with TICKscripts that don't use the join node as well. In effect, I see the problem with any TICKscript that uses the barrier node. I think we have a regression here.

@prashanthjbabu
Contributor

@m4ce @docmerlin I have a fix for:

ts=2021-06-14T07:27:50.277Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=barrier2 err="unexpected message of type *edge.deleteGroupMessage"
ts=2021-06-14T07:27:50.279Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=from1 err="edge aborted"
ts=2021-06-14T07:27:50.387Z lvl=error msg="node failed" service=kapacitor task_master=main task=Infra-Kubernetes-ContainerCPUUsage node=stream0 err="edge aborted"

PR #2585 takes care of it. I don't see the error message with this fix, and I haven't noticed any memory leaks either.

@vj6

vj6 commented Jun 29, 2021

@docmerlin can you please confirm this has been sorted in v1.6.0?

@vj6

vj6 commented Aug 16, 2021

@docmerlin the memory leak is still present in v1.6.1. It has slowed down a bit with .deleteAll(TRUE), but it is not yet resolved. We are using the same script mentioned here.

@docmerlin
Contributor

I'm going to go ahead and close this issue, as it was closed by #2562.

@vj-gsr I believe your leak is coming from a different source than this issue.
