
Memory leak using appsignal lib #820

Closed
Kalaww opened this issue Feb 2, 2023 · 8 comments

@Kalaww

Kalaww commented Feb 2, 2023

Hello,

Describe the bug

Some of my elixir applications have a memory leak, and it started when I added appsignal to them. I then ran the same applications without appsignal to see if the leak was still there; the leak disappeared, so appsignal looks like the culprit.

To Reproduce

I don't really know how to make this reproducible. Here is everything I can share about my setup and what I have observed.

erlang 24.3
elixir 1.13.3
appsignal 2.5.0
appsignal_phoenix 2.2.1

My apps are running in a docker container inside a kubernetes cluster.

My appsignal config:

config :appsignal, :config,
  otp_app: :myapp,
  name: "myapp",
  revision: Mix.Project.config[:version],
  push_api_key: "mykey",
  active: true,
  send_session_data: false,
  env: "prod"

App1 is a simple phoenix web server where

  • I have added use Appsignal.Phoenix to my phoenix endpoint
  • I instrument some functions myself calling Appsignal.instrument
  • I add some tags to the root span when I received http requests
  • I trace ecto queries with my own telemetry handlers that build a span for the query

App2 just has a job that pushes metrics to some appsignal gauges every minute.

The two apps use appsignal in different ways, but both have the memory leak, so I guess it might not be linked to my appsignal integration.

I checked the memory usage of the GenServers from the appsignal supervisor, but they look normal:

Process.whereis(Appsignal.Tracer) |> Process.info(:memory)
{:memory, 2816}

Process.whereis(Appsignal.Probes) |> Process.info(:memory)
{:memory, 29552}

Process.whereis(Appsignal.Monitor) |> Process.info(:memory)
{:memory, 16664}

Here is a graph of the memory usage of one of my apps. You can see the memory keeps growing until the kubernetes pod reaches its limit and restarts. The last part of the graph is after I removed appsignal, when the memory stayed stable.

[image: memory usage graph]

I don't know where to look to get more information to trace back the source of the leak. If you have any ideas I would highly appreciate them. Thanks for taking the time to read.

@Kalaww Kalaww added the bug label Feb 2, 2023
@jeffkreeftmeijer
Member

Hey @Kalaww,

Thanks for opening this issue. If AppSignal leaks memory through the Elixir process, you should be able to see that in Process.info/1, or the ETS table should keep growing, so I’m wondering what’s up here.

Do you have any apps running AppSignal without any custom instrumentation? If not, could you try removing your instrumentation to see if the problem persists? I’ll work on a testing setup to try and reproduce what you’re seeing. Assuming there could be something with your custom instrumentation, could you send me everything custom you’ve added in your app?

Also, what value is that graph you posted showing, exactly?

@jeffkreeftmeijer jeffkreeftmeijer self-assigned this Feb 2, 2023
@Kalaww
Author

Kalaww commented Feb 2, 2023

Thank you for the quick response.

Here are more details on my custom instrumentation.
In App1, I add user_id to the root span in my authentication plug:

    with %Appsignal.Span{} = span <- Appsignal.Tracer.root_span() do
      tags = %{user_id: user.id}
      Appsignal.Span.set_sample_data(span, "tags", tags)
    end

Also in App1, here is the span I create to trace ecto queries:

    now = :os.system_time()
    start_time = now - total_time
    current_span = Appsignal.Tracer.current_span()

    Appsignal.Tracer.create_span("postgres", current_span, start_time: start_time)
    |> Appsignal.Span.set_name(query_name)
    |> Appsignal.Span.set_attribute("appsignal:category", "sql:#{query_name}")
    |> Appsignal.Span.set_sql(query)
    |> Appsignal.Tracer.close_span(end_time: now)

In App2, I have only set up two gauges that count oban jobs, one by job state and one by worker:

defmodule App2.ObanMetrics do
  import Ecto.Query

  def register_gauges() do
    Appsignal.Probes.register(:oban_job_state_count, &__MODULE__.update_oban_job_state_counts_gauge/0)
    Appsignal.Probes.register(:oban_job_worker_count, &__MODULE__.update_oban_job_worker_counts_gauge/0)
  end

  def update_oban_job_state_counts_gauge() do
    if Oban.Peer.leader?() do
      counts = get_oban_job_state_counts() |> Map.new()
      total = Map.values(counts) |> Enum.sum()

      Oban.Job.states()
      |> Enum.each(fn state_atom ->
        state = Atom.to_string(state_atom)
        count = Map.get(counts, state, 0)
        tags = %{state: state}

        Appsignal.set_gauge("oban_job_state_count", count, tags)
      end)

      Appsignal.set_gauge("oban_job_total_count", total)
    end
  end

  def update_oban_job_worker_counts_gauge() do
    if Oban.Peer.leader?() do
      get_oban_job_worker_counts()
      |> Enum.each(fn {worker, count} ->
        tags = %{worker: worker}
        Appsignal.set_gauge("oban_job_worker_count", count, tags)
      end)
    end
  end

  def get_oban_job_state_counts() do
    repo = Core.Oban.repo()
    prefix = Core.Oban.prefix()

    query =
      Oban.Job
      |> group_by([job], job.state)
      |> select([job], {job.state, count(job.id)})

    repo.all(query, prefix: prefix)
  end

  def get_oban_job_worker_counts() do
    repo = Core.Oban.repo()
    prefix = Core.Oban.prefix()

    query =
      Oban.Job
      |> group_by([job], job.worker)
      |> select([job], {job.worker, count(job.id)})

    repo.all(query, prefix: prefix)
  end
end
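For context, `register_gauges/0` only needs to run once at boot. A typical place to call it (a sketch with hypothetical module names, assuming a standard `Application` module; this is not from the original thread) would be:

```elixir
defmodule App2.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # ... repo, Oban, and other children go here
    ]

    # Register the probes once at startup; AppSignal then invokes the
    # registered functions periodically (roughly every minute).
    App2.ObanMetrics.register_gauges()

    Supervisor.start_link(children, strategy: :one_for_one, name: App2.Supervisor)
  end
end
```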

Here is a better graph where you can see the memory values (blue is memory used, red is the memory limit):

[image: memory usage graph with limit]

I am going to set up my app without any custom instrumentation and will let you know what I notice after letting it run for a few hours.

@jeffkreeftmeijer
Member

Thanks, I’m curious to see if that changes the situation.

Another place where the memory could potentially go is in the ets table. That should be cleaned of closed spans automatically, but I’d like to know if that’s properly working in your setup. Could you check to see if :ets.info(:"$appsignal_registry", :memory) increases over time?
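(A throwaway sketch for sampling that over time from an attached iex session — the module name is hypothetical, and note that `:ets.info/2` reports memory in words, not bytes:)

```elixir
defmodule EtsMemorySampler do
  # Log the registry table's memory every `interval_ms` until the
  # process running this is killed. Purely a debugging aid.
  def run(interval_ms \\ 60_000) do
    words = :ets.info(:"$appsignal_registry", :memory)
    IO.puts("#{DateTime.utc_now()} $appsignal_registry memory: #{inspect(words)} words")
    Process.sleep(interval_ms)
    run(interval_ms)
  end
end
```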

@Kalaww
Author

Kalaww commented Feb 3, 2023

The memory leak is still there when using appsignal without any custom instrumentation. I also tried with elixir 1.14.3 and erlang 25 without success. The only setup so far where I didn't have a memory leak is when I removed appsignal completely.
I used the observer to see if there were any processes or ets tables with high memory, but everything looks fine.
:ets.info(:"$appsignal_registry", :memory) returns values around 300-330.
The docker image is built from alpine:3.15.

@jeffkreeftmeijer
Member

Thanks for helping us get to the bottom of this. It seems like this isn’t an issue in the Elixir code, then, if you don’t see anything out of the ordinary in the observer. The ets table seems to correctly remove the samples after they’re closed as well.

We might be seeing a problem in our extension. Could you check if the memory is increasing over time for the appsignal-agent process?
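(One way to sample that from inside the container — assuming a procps-compatible `ps` is available in the alpine image, which may need `apk add procps` — is:)

```shell
# Print the resident set size (in KB) of any appsignal-agent process.
ps -eo rss,comm | awk '$2 ~ /appsignal-agent/ {printf "%s RSS: %d KB\n", $2, $1}'
```

Running that periodically and comparing the values would show whether the agent's memory grows over time.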

@Kalaww
Author

Kalaww commented Feb 3, 2023

I have noticed something interesting. For my apps using appsignal, there are a lot of epmd and inet_gethost processes. I checked other elixir containers that don't use appsignal, and there are only one or two of them.
The number of these processes keeps growing over time (3k processes after 70 minutes of uptime, 20k after 6 hours, and 40k after 13 hours), so it looks like a good culprit for the memory leak.

htop run inside the docker container of an app using appsignal (there are hundreds of these processes):

[image: htop showing many epmd and inet_gethost processes]

htop run inside the docker container of an app not using appsignal:

[image: htop showing only a few processes]
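(A quick way to count zombie, i.e. defunct, processes inside the container — assuming a procps-compatible `ps` — is to filter on the Z state:)

```shell
# Count processes in the zombie (Z) state; a steadily growing number
# indicates PID 1 is not reaping its orphaned children.
ps -eo stat,comm | awk '$1 ~ /^Z/' | wc -l
```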

@Kalaww
Author

Kalaww commented Feb 3, 2023

I have set the entrypoint of my docker image to use tini, which is supposed to reap these zombie processes. I don't fully understand everything here and I don't know what creates these zombie processes, but with this fix I no longer see the number of processes increase. I am going to watch it over the next few hours to make sure it stays stable.

RUN apk add --no-cache tini
ENTRYPOINT [ "/sbin/tini", "--" ]

Sources:
http://erlang.org/pipermail/erlang-questions/2019-August/098267.html
https://www.slideshare.net/Elixir-Meetup/kubernetes-docker-elixir-alexei-sholik-andrew-dryga-elixir-club-ukraine
https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
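(As an aside, not from the thread itself: when running the image directly with Docker, `docker run --init` injects Docker's bundled init as PID 1 and achieves the same reaping. Either way, you can confirm the fix took effect by checking what PID 1 is inside the running container:)

```shell
# Inside the container: PID 1 should now report as tini rather than beam.smp.
cat /proc/1/comm
```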

@Kalaww
Author

Kalaww commented Feb 6, 2023

No more memory leak with the fix above. I am closing this issue, as the problem came from my docker configuration and not from the appsignal lib. Thank you for the help and the fast responses!

@Kalaww Kalaww closed this as completed Feb 6, 2023