Runner unable to run again after OOM error #66

Open
barrelltech opened this issue Oct 25, 2024 · 0 comments

I'm testing some FLAME runners for KMeans clustering. I'm trying to figure out how many embeddings I can process per Fly.io machine size, so I'm testing a range of values until one fails.

When the value is too high, the process crashes, the machine is cleaned up, and subsequent invocations of the runner fail with:

** (ArgumentError) errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

    (stdlib 5.2.3) :ets.lookup_element(:kmeans_md, :meta, 2)
    (flame 0.5.1) lib/flame/pool.ex:381: FLAME.Pool.lookup_meta/1
    (flame 0.5.1) lib/flame/pool.ex:315: FLAME.Pool.caller_checkout!/5
    iex:1: (file)

I'm running these pools with single_use: true, so I'm not too concerned about the OOM error itself. If this happened in production, though, it would be a serious problem that every subsequent invocation fails.

I'm not sure how to resolve the issue either: sometimes a restart fixes it, sometimes I have to wait 10+ minutes. It's not a blocker for me (the point of the stress testing is to avoid this in prod), but it seemed report-worthy!
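
For anyone hitting the same thing, here is a minimal sketch of a defensive check. It assumes the pool's named ETS table shares the pool name, which the stack trace suggests (`:ets.lookup_element(:kmeans_md, :meta, 2)`). `KMeans.PoolGuard`, `pool_table_alive?/1`, and `safe_call/2` are hypothetical names, not part of FLAME. The check is also racy, since the table can disappear between the lookup and the call, so it only turns the common failure into a tagged error rather than fixing the underlying problem.

```elixir
defmodule KMeans.PoolGuard do
  # Hypothetical helper, not part of FLAME. It assumes the pool owns a
  # named ETS table with the same name as the pool (as the trace suggests).

  @doc "Returns true when a named ETS table called `pool_name` currently exists."
  def pool_table_alive?(pool_name) when is_atom(pool_name) do
    # :ets.whereis/1 returns :undefined when no named table is registered.
    :ets.whereis(pool_name) != :undefined
  end

  @doc "Runs `fun` on the pool, or returns a tagged error if the table is gone."
  def safe_call(pool_name, fun) when is_atom(pool_name) and is_function(fun, 0) do
    if pool_table_alive?(pool_name) do
      {:ok, FLAME.call(pool_name, fun)}
    else
      {:error, :pool_unavailable}
    end
  end
end
```

With this in place, `KMeans.PoolGuard.safe_call(:kmeans_md, fn -> :ok end)` would return `{:error, :pool_unavailable}` instead of raising the ArgumentError above whenever the table is missing.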

---

PS: FLAME configuration:

defmodule KMeans.Supervisor do
  use Supervisor
  require Logger
  alias Machine.Size

  def start_link(_) do
    Supervisor.start_link(__MODULE__, nil, name: __MODULE__)
  end

  @impl true
  def init(_) do
    config = Application.get_env(:flame, :kmeans)

    children =
      [Size.sm(), Size.md(), Size.lg()]
      |> Enum.map(fn size ->
        backend =
          case config do
            {FLAME.FlyBackend, defaults} ->
              {FLAME.FlyBackend,
               Keyword.merge(defaults,
                 memory_mb: size.memory * 1024,
                 cpus: size.cpu
               )}

            _ ->
              {FLAME.LocalBackend, []}
          end

        {
          FLAME.Pool,
          backend: backend,
          name: size.name,
          min: 0,
          max: 10,
          max_concurrency: 1,
          single_use: true,
          log: :info,
          boot_timeout: :timer.minutes(5),
          timeout: :timer.minutes(10),
          on_grow_start: &IO.inspect(&1, label: "[FLAME | growing | #{size.name}]"),
          on_grow_end: &IO.inspect(Map.put(&2, :status, &1), label: "[FLAME | success | #{size.name}]"),
          on_shrink: &IO.inspect(&1, label: "[FLAME | closing | #{size.name}]")
        }
      end)

    Supervisor.init(children, strategy: :one_for_one)
  end
end