kapacitor version: 1.5.2
distribution: Ubuntu Xenial 16.04.3
Installed from the official prebuilt .deb downloaded from the InfluxData website.
We upgraded to kapacitor 1.5.2 to take advantage of the delete feature of barrier nodes, as our stream kapacitor tasks (mostly rollups) deal with ephemeral series whose memory was previously never released.
We have two sets of kapacitor nodes:
- type1: a few rollup tick scripts (high-cardinality data)
- type2: around 228 tick scripts with a mix of alerts and rollups
After upgrading to version 1.5.2, we added barrier nodes with delete enabled before all window and union nodes in our tick scripts.
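For illustration, the change looks roughly like the following sketch (the measurement, field, and durations here are hypothetical placeholders, not our actual task definitions):

```
stream
    |from()
        .measurement('cpu')
        .groupBy(*)
    // barrier added before the window node;
    // .delete(TRUE) (new in 1.5.2) drops idle series state
    |barrier()
        .idle(10m)
        .delete(TRUE)
    |window()
        .period(5m)
        .every(1m)
    |mean('usage')
```

Without `.delete(TRUE)`, the barrier only emits barrier messages and the state for ephemeral series is retained indefinitely.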
With type1, we have had only two instances of kapacitor becoming unresponsive (tasks don't proceed and the kapacitor API is unresponsive; the only way to recover is to restart kapacitor). We've been running with this change for ~2 weeks with otherwise good uptime, and memory is released as expected.
With type2, kapacitor becomes unresponsive quite regularly (roughly every 3 hours). When kapacitor hangs, the machine has 64K TCP connections to our InfluxDB cluster stuck in CLOSE_WAIT and has exhausted all file descriptors (the systemd unit raises the limit to 64K). This does not happen when barrier nodes are added without the delete option.
Attached is a stacktrace (stacktrace.txt) obtained with SIGQUIT while the kapacitor process was hung, along with graphs of the growth in CLOSE_WAIT connections in comparison to a sibling kapacitor machine without barrier nodes, and a sample tick script (rollup-infra-cassandra.tick.txt).