
Memory leak using appsignal lib #820

Closed
Kalaww opened this issue Feb 2, 2023 · 8 comments

@Kalaww

Kalaww commented Feb 2, 2023

Hello,

Describe the bug

Some of my elixir applications have a memory leak, and it started when I added appsignal to them. I then ran the same applications without appsignal to see if the leak was still there; the leak disappeared, so appsignal looks like the culprit.

To Reproduce

I don't really know how to make this reproducible. Here is everything I can share about my setup and what I have observed.

erlang 24.3
elixir 1.13.3
appsignal 2.5.0
appsignal_phoenix 2.2.1

My apps are running in a docker container inside a kubernetes cluster.

My appsignal config:

config :appsignal, :config,
  otp_app: :myapp,
  name: "myapp",
  revision: Mix.Project.config[:version],
  push_api_key: "mykey",
  active: true,
  send_session_data: false,
  env: "prod"

App1 is a simple phoenix web server where

  • I have added use Appsignal.Phoenix to my phoenix endpoint
  • I instrument some functions myself calling Appsignal.instrument
  • I add some tags to the root span when I received http requests
  • I trace ecto queries with my own telemetry handlers that build a span for the query

App2 just has a job that pushes metrics to some appsignal gauges every minute.

The two apps use appsignal in different ways, but both have the memory leak, so I guess it might not be linked to my appsignal integration.

I checked the memory usage of the GenServers from the appsignal supervisor, but they look normal:

Process.whereis(Appsignal.Tracer) |> Process.info(:memory)
{:memory, 2816}

Process.whereis(Appsignal.Probes) |> Process.info(:memory)
{:memory, 29552}

Process.whereis(Appsignal.Monitor) |> Process.info(:memory)
{:memory, 16664}

Here is a graph of the memory usage of one of my apps. You can see the memory keeps growing until the kubernetes pod reaches its limit and restarts. The last part of the graph is after I removed appsignal, when the memory stayed stable.

[image: memory usage graph]

I don't know where to look to get more information to trace back the source of the leak. If you have any ideas I would highly appreciate them. Thanks for taking the time to read.

@Kalaww Kalaww added the bug label Feb 2, 2023
@jeffkreeftmeijer
Member

Hey @Kalaww,

Thanks for opening this issue. If AppSignal leaks memory through the Elixir process, you should be able to see that in Process.info/1, or the ETS table should keep growing, so I’m wondering what’s up here.

Do you have any apps running AppSignal without any custom instrumentation? If not, could you try removing your instrumentation to see if the problem persists? I’ll work on a testing setup to try and reproduce what you’re seeing. Assuming there could be something with your custom instrumentation, could you send me everything custom you’ve added in your app?

Also, what value is that graph you posted showing, exactly?

@jeffkreeftmeijer jeffkreeftmeijer self-assigned this Feb 2, 2023
@Kalaww
Author

Kalaww commented Feb 2, 2023

Thank you for the quick response.

Here are more details on my custom instrumentation.
In App1, I add user_id to the root span in my authentication plug:

    with %Appsignal.Span{} = span <- Appsignal.Tracer.root_span() do
      tags = %{user_id: user.id}
      Appsignal.Span.set_sample_data(span, "tags", tags)
    end

Also in App1, here is the span I create to trace ecto queries:

    now = :os.system_time()
    start_time = now - total_time
    current_span = Appsignal.Tracer.current_span()

    Appsignal.Tracer.create_span("postgres", current_span, start_time: start_time)
    |> Appsignal.Span.set_name(query_name)
    |> Appsignal.Span.set_attribute("appsignal:category", "sql:#{query_name}")
    |> Appsignal.Span.set_sql(query)
    |> Appsignal.Tracer.close_span(end_time: now)

In App2, I have only set up two gauges that count oban jobs, one by job state and one by worker:

defmodule App2.ObanMetrics do
  import Ecto.Query

  def register_gauges() do
    Appsignal.Probes.register(:oban_job_state_count, &__MODULE__.update_oban_job_state_counts_gauge/0)
    Appsignal.Probes.register(:oban_job_worker_count, &__MODULE__.update_oban_job_worker_counts_gauge/0)
  end

  def update_oban_job_state_counts_gauge() do
    if Oban.Peer.leader?() do
      counts = get_oban_job_state_counts() |> Map.new()
      total = Map.values(counts) |> Enum.sum()

      Oban.Job.states()
      |> Enum.each(fn state_atom ->
        state = Atom.to_string(state_atom)
        count = Map.get(counts, state, 0)
        tags = %{state: state}

        Appsignal.set_gauge("oban_job_state_count", count, tags)
      end)

      Appsignal.set_gauge("oban_job_total_count", total)
    end
  end

  def update_oban_job_worker_counts_gauge() do
    if Oban.Peer.leader?() do
      get_oban_job_worker_counts()
      |> Enum.each(fn {worker, count} ->
        tags = %{worker: worker}
        Appsignal.set_gauge("oban_job_worker_count", count, tags)
      end)
    end
  end

  def get_oban_job_state_counts() do
    repo = Core.Oban.repo()
    prefix = Core.Oban.prefix()

    query =
      Oban.Job
      |> group_by([job], job.state)
      |> select([job], {job.state, count(job.id)})

    repo.all(query, prefix: prefix)
  end

  def get_oban_job_worker_counts() do
    repo = Core.Oban.repo()
    prefix = Core.Oban.prefix()

    query =
      Oban.Job
      |> group_by([job], job.worker)
      |> select([job], {job.worker, count(job.id)})

    repo.all(query, prefix: prefix)
  end
end
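For context, `register_gauges/0` only needs to run once at boot. A typical place to call it (a sketch with hypothetical module names, assuming a standard `Application` module; this is not from the original thread) would be:

```elixir
defmodule App2.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # ... repo, Oban, and other children go here
    ]

    # Register the probes once at startup; AppSignal then invokes the
    # registered functions periodically (roughly every minute).
    App2.ObanMetrics.register_gauges()

    Supervisor.start_link(children, strategy: :one_for_one, name: App2.Supervisor)
  end
end
```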

Here is a better graph where you can see the memory values (blue is memory used, red is the memory limit):

[image: memory usage graph with limit]

I am going to set up my app without any custom instrumentation and will let you know what I notice after letting it run for a few hours.

@jeffkreeftmeijer
Member

Thanks, I’m curious to see if that changes the situation.

Another place where the memory could potentially go is in the ets table. That should be cleaned of closed spans automatically, but I’d like to know if that’s properly working in your setup. Could you check to see if :ets.info(:"$appsignal_registry", :memory) increases over time?
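(A throwaway sketch for sampling that over time from an attached iex session — the module name is hypothetical, and note that `:ets.info/2` reports memory in words, not bytes:)

```elixir
defmodule EtsMemorySampler do
  # Log the registry table's memory every `interval_ms` until the
  # process running this is killed. Purely a debugging aid.
  def run(interval_ms \\ 60_000) do
    words = :ets.info(:"$appsignal_registry", :memory)
    IO.puts("#{DateTime.utc_now()} $appsignal_registry memory: #{inspect(words)} words")
    Process.sleep(interval_ms)
    run(interval_ms)
  end
end
```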

@Kalaww
Author

Kalaww commented Feb 3, 2023

The memory leak is still there when using appsignal without any custom instrumentation. I also tried with elixir 1.14.3 and erlang 25 without success. The only setup so far where I didn't have a memory leak is when I removed appsignal completely.
I used the observer to see if there were any processes or ets tables with high memory, but everything looks fine.
:ets.info(:"$appsignal_registry", :memory) returns values around 300-330.
The docker image is built from alpine:3.15.

@jeffkreeftmeijer
Member

Thanks for helping us get to the bottom of this. It seems like this isn’t an issue in the Elixir code, then, if you don’t see anything out of the ordinary in the observer. The ets table seems to correctly remove the samples after they’re closed as well.

We might be seeing a problem in our extension. Could you check if the memory is increasing over time for the appsignal-agent process?
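(One way to sample that from inside the container — assuming a procps-compatible `ps` is available in the alpine image, which may need `apk add procps` — is:)

```shell
# Print the resident set size (in KB) of any appsignal-agent process.
ps -eo rss,comm | awk '$2 ~ /appsignal-agent/ {printf "%s RSS: %d KB\n", $2, $1}'
```

Running that periodically and comparing the values would show whether the agent's memory grows over time.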

@Kalaww
Author

Kalaww commented Feb 3, 2023

I have noticed something interesting. For my apps using appsignal, there are a lot of epmd and inet_gethost processes. I checked other elixir containers that don't use appsignal, and there are only one or two of them.
The number of these processes keeps growing over time (3k processes after 70 minutes of uptime, 20k after 6 hours, and 40k after 13 hours), so it looks like a good culprit for the memory leak.

htop run inside the docker container of an app using appsignal (there are hundreds of these processes):

[image: htop showing many epmd and inet_gethost processes]

htop run inside the docker container of an app not using appsignal:

[image: htop showing only a few processes]
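(A quick way to count zombie, i.e. defunct, processes inside the container — assuming a procps-compatible `ps` — is to filter on the Z state:)

```shell
# Count processes in the zombie (Z) state; a steadily growing number
# indicates PID 1 is not reaping its orphaned children.
ps -eo stat,comm | awk '$1 ~ /^Z/' | wc -l
```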

@Kalaww
Author

Kalaww commented Feb 3, 2023

I have set the entrypoint of my docker image to use tini, which is supposed to reap these zombie processes. I don't fully understand everything here and I don't know what creates these zombie processes, but with this fix I no longer see the number of processes increase. I am going to watch it over the next few hours to make sure it stays stable.

RUN apk add --no-cache tini
ENTRYPOINT [ "/sbin/tini", "--" ]

Sources:
http://erlang.org/pipermail/erlang-questions/2019-August/098267.html
https://www.slideshare.net/Elixir-Meetup/kubernetes-docker-elixir-alexei-sholik-andrew-dryga-elixir-club-ukraine
https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
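(As an aside, not from the thread itself: when running the image directly with Docker, `docker run --init` injects Docker's bundled init as PID 1 and achieves the same reaping. Either way, you can confirm the fix took effect by checking what PID 1 is inside the running container:)

```shell
# Inside the container: PID 1 should now report as tini rather than beam.smp.
cat /proc/1/comm
```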

@Kalaww
Author

Kalaww commented Feb 6, 2023

No more memory leak with the fix above. I am closing this issue, as the problem came from my docker configuration and not from the appsignal lib. Thank you for the help and the fast responses!

@Kalaww Kalaww closed this as completed Feb 6, 2023