
SPARK 256MB OOM #267

Closed
juliangruber opened this issue Jun 26, 2023 · 9 comments
Labels
bug 🐛 Something isn't working

Comments

@juliangruber
Member

[Fly.io] core-fly ran out of memory and crashed
Out of memory: Killed process 529 (zinniad) total-vm:2019184kB, anon-rss:132416kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1080kB oom_score_adj:0

The machine had 256MB of RAM; I've scaled it up to 512MB for now.

@bajtos: Maybe there is a memory leak, maybe the integration with Go is not running GC often enough (or not at all)

This process was running for ~2 days, until it OOMed:

$ ENV... zinniad modules/spark/main.js

@juliangruber: I wonder whether this is an OOM or whether zinnia/deno programs are configured to require more than 256MB of RAM
@bajtos: Great point! It is possible that V8 does detect how much memory is available in our setup.

@bajtos bajtos added the bug 🐛 Something isn't working label Jun 26, 2023
@bajtos
Member

bajtos commented Jun 26, 2023

This seems to be a problem in Deno and V8, see:

Related:

I think we should improve Zinnia to give us more visibility into memory usage.

  • We can add a new API similar to Deno.memoryUsage() to allow module developers to observe patterns in memory usage
  • I am thinking about adding instrumentation & telemetry to zinniad. We can periodically report memory usage, event loop latency and other useful information.
  • Should we expose --v8-flags in zinnia and zinniad CLI to give users more control?
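
To make the first point concrete, a Zinnia-side equivalent of Deno.memoryUsage() could be used by a module roughly like this. This is only a hypothetical sketch; Zinnia.memoryUsage() does not exist today, and the shape simply mirrors Deno's API:

// Hypothetical sketch: a Zinnia.memoryUsage() API mirroring Deno.memoryUsage().
// Nothing below exists in Zinnia today.
setInterval(() => {
  const { rss, heapTotal, heapUsed } = Zinnia.memoryUsage();
  console.log(
    `rss=${(rss / 1e6).toFixed(1)}MB`,
    `heapUsed=${(heapUsed / 1e6).toFixed(1)}MB`,
    `heapTotal=${(heapTotal / 1e6).toFixed(1)}MB`,
  );
}, 10_000);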

WDYT, @juliangruber?

@bajtos
Member

bajtos commented Jun 26, 2023

Also: Deno.systemMemoryInfo

@juliangruber
Member Author

We can add a new API similar to Deno.memoryUsage() to allow module developers to observe patterns in memory usage
Another possibly useful piece of information: Deno.loadavg()

How would this be actionable? I can see how in advanced cases the module author would use this, but in our SPARK scenario I don't see why SPARK would want to capture this 🤔

I am thinking about adding instrumentation & telemetry to zinniad. We can periodically report memory usage, event loop latency and other useful information.

I'm not so sure about that. Zinnia can run many different things (outside of Station); would it be useful to collect information when we don't know what it's running?

Therefore I think it's more useful to collect this information in Station Core, assuming we can get all the info we need from outside the process, which in the current case we can. Or do you think we'll need to measure event loop latency fairly soon?
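
For illustration, measuring from the outside could look roughly like this in Station Core (Node.js). This is only a sketch and assumes the third-party pidusage npm package; it is not anything Station Core does today:

// Sketch only: sample a child process's memory from the outside, assuming the
// third-party `pidusage` npm package (not existing Station Core code).
import pidusage from 'pidusage';

async function logModuleMemory(pid) {
  const stats = await pidusage(pid); // { cpu, memory (bytes), ... }
  console.log(`module rss: ${(stats.memory / 1024 / 1024).toFixed(1)} MB`);
}

// e.g. after spawning zinniad: setInterval(() => logModuleMemory(child.pid), 10_000);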

And even if we decide to measure stats from inside Zinnia, I wonder if Zinnia should report it itself, or whether it could expose those stats for Station Core to query and submit.

Should we expose --v8-flags in zinnia and zinniad CLI to give users more control?

I think that could work. The quick fix would either be to document that zinnia needs more memory, or to let users control how much memory it has available (if Deno/V8 don't auto-adjust)

@bajtos
Member

bajtos commented Jun 27, 2023

Great discussion!

We can add a new API similar to Deno.memoryUsage() to allow module developers to observe patterns in memory usage
Another possibly useful piece of information: Deno.loadavg()

How would this be actionable? I can see how in advanced cases the module author would use this, but in our SPARK scenario I don't see why SPARK would want to capture this 🤔

Let's say we have a suspicion that SPARK has a memory leak or maybe that it is triggering a memory leak in Zinnia. I would like to run SPARK on my computer using zinnia run and observe memory usage patterns to see if I can spot the problem.

I think the only tool I have available right now is to use OS-level tools like macOS Activity Monitor and the ps command to observe the amount of memory used by the Zinnia process.

Debugging memory leaks would be easier if my Zinnia module could observe how much memory is used at given points in time.

It would be even better if there were a way to create heap snapshots for inspection in Chrome DevTools. And even more awesome would be the option to inspect the Zinnia process by connecting it to Chrome DevTools; DevTools provides functionality for creating memory snapshots and comparing them.

I am thinking about adding instrumentation & telemetry to zinniad. We can periodically report memory usage, event loop latency and other useful information.

I'm not so sure about that. Zinnia can run many different things (outside of Station); would it be useful to collect information when we don't know what it's running?

Therefore I think it's more useful to collect this information in Station Core, assuming we can get all the info we need from outside the process, which in the current case we can. Or do you think we'll need to measure event loop latency fairly soon?

Regarding event loop latency - I think this metric can help us better understand how much of the host machine's CPU we are using. It's not needed until there is a problem, at which point we'd want it to already be in place :)
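
As a rough illustration of what such a metric involves (a sketch, not an existing zinniad feature), event-loop latency can be approximated by scheduling a timer and recording how late it fires:

// Sketch: approximate event-loop lag by measuring how late a timer fires.
const INTERVAL_MS = 500;
let expected = performance.now() + INTERVAL_MS;
setInterval(() => {
  const lag = performance.now() - expected;
  console.log(`event loop lag: ${Math.max(0, lag).toFixed(1)} ms`);
  expected = performance.now() + INTERVAL_MS;
}, INTERVAL_MS);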

OTOH, if we can measure the CPU consumption of module processes (Zinnia, Bacalhau, etc.) from outside, then I think that's good enough for now.

As for submitting the metrics:

  1. This would be implemented in zinniad only. Do we want to support zinniad usage outside of Filecoin Station? I think it will be much easier for us if we keep zinniad tailored specifically to Station needs.

  2. I am thinking about introducing a new optional env variable to specify InfluxDB writer URL to use for submitting the metrics. This way, zinniad does not report any metrics by default. People running zinniad in their custom setup can provide their own InfluxDB URL to let "their" zinniad instance report metrics to their database.
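
For illustration, such an opt-in reporter could map onto InfluxDB's v2 HTTP write API roughly like this. This is a sketch only; the env variable names are placeholders, and nothing here is implemented:

// Sketch only: opt-in metrics reporting via InfluxDB's v2 HTTP write API.
// The env variable names are placeholders, not an agreed-on interface.
const url = Deno.env.get('INFLUXDB_URL');
const token = Deno.env.get('INFLUXDB_TOKEN');
const org = Deno.env.get('INFLUXDB_ORG');
const bucket = Deno.env.get('INFLUXDB_BUCKET');

async function reportMemory() {
  const rss = Deno.memoryUsage().rss;
  // InfluxDB line protocol: <measurement> <field>=<value> <timestamp>
  const line = `zinniad_memory rss=${rss} ${Math.floor(Date.now() / 1000)}`;
  await fetch(`${url}/api/v2/write?org=${org}&bucket=${bucket}&precision=s`, {
    method: 'POST',
    headers: { Authorization: `Token ${token}` },
    body: line,
  });
}

// Only report when the operator opted in by providing a URL.
if (url) setInterval(reportMemory, 60_000);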

And even if we decide to measure stats from inside Zinnia, I wonder if Zinnia should report it itself, or whether it could expose those stats for Station Core to query and submit.

This is a fair point. I guess it depends on how frequently we want to report these metrics and what the overhead is of sending them via Station polling as opposed to submitting them to InfluxDB directly.

@bajtos
Member

bajtos commented Jun 27, 2023

@juliangruber To mitigate the impact of OOM, is it perhaps time to improve Station Core to automatically restart Zinnia when it crashes?
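
A minimal sketch of what that could look like in Station Core (Node.js); the command and arguments below are placeholders, not Station Core's actual configuration:

// Sketch only: restart zinniad whenever it exits (e.g. after an OOM kill).
import { spawn } from 'node:child_process';

function startZinniad() {
  const child = spawn('zinniad', ['modules/spark/main.js'], { stdio: 'inherit' });
  child.on('exit', (code, signal) => {
    console.error(`zinniad exited (code=${code}, signal=${signal}); restarting in 5s`);
    setTimeout(startZinniad, 5_000);
  });
}

startZinniad();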

@juliangruber
Member Author

I think the only tool I have available right now is to use OS-level tools like macOS Activity Monitor and the ps command to observe the amount of memory used by the Zinnia process.

Debugging memory leaks would be easier if my Zinnia module could observe how much memory is used at given points in time.

Can you please help me understand why Zinnia exposing this information would be easier than the OS exposing it? 🤔

It would be even better if there were a way to create heap snapshots for inspection in Chrome DevTools. And even more awesome would be the option to inspect the Zinnia process by connecting it to Chrome DevTools; DevTools provides functionality for creating memory snapshots and comparing them.

+100 That would be perfect

Regarding event loop latency - I think this metric can help us better understand how much of the host machine's CPU we are using. It's not needed until there is a problem, at which point we'd want it to already be in place :)

Given our small team size my vote is to add it once we require it :P I very much agree that it will be useful eventually.

As for submitting the metrics:

This would be implemented in zinniad only. Do we want to support zinniad usage outside of Filecoin Station? I think it will be much easier for us if we keep zinniad tailored specifically to Station needs.

Fair point. To me, zinniad is more independent of Station, but it's absolutely fair to keep them coupled for now.

I am thinking about introducing a new optional env variable to specify InfluxDB writer URL to use for submitting the metrics. This way, zinniad does not report any metrics by default. People running zinniad in their custom setup can provide their own InfluxDB URL to let "their" zinniad instance report metrics to their database.

It will need more config variables, right?

  • url
  • token
  • org
  • bucket
  • precision

If you want to add Influx reporting to zinnia I'm not opposed at all; from a Station perspective it looks like a relatively straightforward change to implement.

@bajtos
Member

bajtos commented Jun 28, 2023

Can you please help me understand why Zinnia exposing this information would be easier than the OS exposing it?

With a Zinnia API, I can write a Zinnia script that repeatedly calls the code suspected of leaking memory and collects memory usage information over time.

An example from denoland/deno#18369 (comment):

console.log('Deno version', Deno.version.deno);

let timestamp = new Date();

while (true) {
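  // Roughly once per second, log the current RSS in MB while repeatedly hitting the endpoint under test.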
  if (Date.now() >= timestamp.valueOf()) {
    const bytes = Deno.memoryUsage().rss;
    console.log(timestamp.toISOString(), Math.floor(bytes / (1024 * 1024) * 10) / 10);
    timestamp = new Date(timestamp.valueOf() + 1000);
  }
  const response = await fetch('http://localhost:8080');
  await response.text();
}

To measure this from the outside, I would need to figure out which ps options give me memory usage of a process and then write a script repeatedly calling ps. It's doable, but it seems to me like more work. (Also, ps is not available on Windows.)

@bajtos
Member

bajtos commented Jun 28, 2023

I opened two follow-up issues:

@juliangruber is there anything else we want to change in the near future based on what we learned here, or can we close this issue as resolved?

@juliangruber
Member Author

To measure this from the outside, I would need to figure out which ps options give me memory usage of a process and then write a script repeatedly calling ps. It's doable, but it seems to me like more work. (Also, ps is not available on Windows.)

Just my 2 cents, but to me writing this script is more work than running ps / Activity Monitor.app.

@juliangruber is there anything else we want to change in the near future based on what we learned here, or can we close this issue as resolved?

I think that captures it all :)
