Kapacitor leaks sockets when using barrier nodes with delete #2144

Closed
pdvyas opened this issue Jan 7, 2019 · 0 comments · Fixed by #2181

pdvyas commented Jan 7, 2019

Hi,

kapacitor version: 1.5.2
distribution: Ubuntu xenial 16.04.3
installed from official prebuilt deb downloaded from influxdata website.

We upgraded to kapacitor 1.5.2 to take advantage of the delete feature of barrier nodes as our stream kapacitor tasks (mostly rollups) deal with ephemeral series for which memory was previously not released.

We have two sets of kapacitor nodes:

  • type1: few rollup tick scripts (high cardinality data)
  • type2: Around 228 tick scripts with a mix of alerts and rollups

After upgrading to version 1.5.2, we added barrier nodes with delete before all window and union nodes in our tick scripts.

With type1, we have had only two instances of kapacitor becoming unresponsive (tasks don't proceed and the kapacitor API does not respond; the only way to recover is to restart kapacitor). We've been running with this change for ~2 weeks with good uptime, and it does release memory as expected.

With type2, kapacitor becomes unresponsive quite regularly (about every 3 hours). When it does, we observe that the machine has 64K TCP connections to our influxdb cluster in CLOSE_WAIT and has exhausted all file descriptors (the systemd unit raises the limit to 64K). This does not happen when barrier nodes are added without the delete option.
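For anyone trying to reproduce the measurement, the following is a minimal sketch of one way to count sockets per TCP state on Linux by parsing `/proc/net/tcp`. It is not the method used for the graphs in this report. The function names and the optional `remote_port` filter are hypothetical, the count is system-wide rather than per process, and only IPv4 sockets are covered (`/proc/net/tcp6` has the same layout for IPv6).

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text, remote_port=None):
    """Count sockets per TCP state from the text of /proc/net/tcp.

    remote_port (hypothetical helper argument): if given, only sockets whose
    remote port matches are counted, e.g. the port your InfluxDB listens on.
    """
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        rem_address, state_hex = fields[2], fields[3]
        if remote_port is not None:
            # Addresses are "HEXIP:HEXPORT"; the port is hexadecimal.
            port = int(rem_address.rsplit(":", 1)[1], 16)
            if port != remote_port:
                continue
        counts[TCP_STATES.get(state_hex.upper(), state_hex)] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp", remote_port=None):
    """Read the given proc file and return per-state socket counts."""
    with open(path) as f:
        return count_tcp_states(f.read(), remote_port)


if __name__ == "__main__":
    for state, n in read_tcp_states().most_common():
        print(f"{state:12s} {n}")
```

Sampling this periodically and plotting the `CLOSE_WAIT` count should show whether connections accumulate over time, which is the pattern described above.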

Attached is a stack trace obtained with SIGQUIT while the kapacitor process was hung. Also attached are graphs of the growth in CLOSE_WAIT connections compared to a sibling kapacitor machine without barrier nodes, and a sample tick script.

closed_wait_conns

stacktrace.txt
rollup-infra-cassandra.tick.txt
