
SPARK 256MB OOM #267

Closed
juliangruber opened this issue Jun 26, 2023 · 9 comments
Labels
bug 🐛 Something isn't working

Comments

@juliangruber
Member

[Fly.io] core-fly ran out of memory and crashed
Out of memory: Killed process 529 (zinniad) total-vm:2019184kB, anon-rss:132416kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1080kB oom_score_adj:0

The machine had 256MB of RAM; I've scaled it up to 512MB for now.

@bajtos: Maybe there is a memory leak, maybe the integration with Go is not running GC often enough (or not at all)

This process was running for ~2 days, until it OOMed:

$ ENV... zinniad modules/spark/main.js

@juliangruber: I wonder whether this is an OOM or whether zinnia/deno programs are configured to require more than 256MB of RAM
@bajtos: Great point! It is possible that V8 does detect how much memory is available in our setup.

@bajtos bajtos added the bug 🐛 Something isn't working label Jun 26, 2023
@bajtos
Member

bajtos commented Jun 26, 2023

This seems to be a problem in Deno and V8, see:

Related:

I think we should improve Zinnia to give us more visibility into memory usage.

  • We can add a new API similar to Deno.memoryUsage() to allow module developers to observe patterns in memory usage
  • I am thinking about adding instrumentation & telemetry to zinniad. We can periodically report memory usage, event loop latency and other useful information.
  • Should we expose --v8-flags in zinnia and zinniad CLI to give users more control?
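
To make the first point concrete, a Zinnia-side equivalent of Deno.memoryUsage() could be used by a module roughly like this. This is only a hypothetical sketch; Zinnia.memoryUsage() does not exist today, and the shape simply mirrors Deno's API:

// Hypothetical sketch: a Zinnia.memoryUsage() API mirroring Deno.memoryUsage().
// Nothing below exists in Zinnia today.
setInterval(() => {
  const { rss, heapTotal, heapUsed } = Zinnia.memoryUsage();
  console.log(
    `rss=${(rss / 1e6).toFixed(1)}MB`,
    `heapUsed=${(heapUsed / 1e6).toFixed(1)}MB`,
    `heapTotal=${(heapTotal / 1e6).toFixed(1)}MB`,
  );
}, 10_000);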

WDYT, @juliangruber?

@bajtos
Member

bajtos commented Jun 26, 2023

Also: Deno.systemMemoryInfo

@juliangruber
Member Author

We can add a new API similar to Deno.memoryUsage() to allow module developers to observe patterns in memory usage
Another possibly useful piece of information: Deno.loadavg()

How would this be actionable? I can see how in advanced cases the module author would use this, but in our SPARK scenario I don't see why SPARK would want to capture this 🤔

I am thinking about adding instrumentation & telemetry to zinniad. We can periodically report memory usage, event loop latency and other useful information.

I'm not so sure about that. Zinnia can run many different things (outside of Station); would it be useful to collect information when we don't know what it's running?

Therefore I think it's more useful to collect this information in Station Core, assuming we can get all the info we need from outside the process, which in the current case we can. Or do you think we'll need to measure event loop latency fairly soon?
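
For illustration, measuring from the outside could look roughly like this in Station Core (Node.js). This is only a sketch and assumes the third-party pidusage npm package; it is not anything Station Core does today:

// Sketch only: sample a child process's memory from the outside, assuming the
// third-party `pidusage` npm package (not existing Station Core code).
import pidusage from 'pidusage';

async function logModuleMemory(pid) {
  const stats = await pidusage(pid); // { cpu, memory (bytes), ... }
  console.log(`module rss: ${(stats.memory / 1024 / 1024).toFixed(1)} MB`);
}

// e.g. after spawning zinniad: setInterval(() => logModuleMemory(child.pid), 10_000);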

And even if we decide to measure stats from inside Zinnia, I wonder if Zinnia should report it itself, or whether it could expose those stats for Station Core to query and submit.

Should we expose --v8-flags in zinnia and zinniad CLI to give users more control?

I think that could work. The quick fix would either be to document that zinnia needs more memory, or to let users control how much memory it has available (if Deno/V8 don't auto-adjust)

@bajtos
Member

bajtos commented Jun 27, 2023

Great discussion!

We can add a new API similar to Deno.memoryUsage() to allow module developers to observe patterns in memory usage
Another possibly useful piece of information: Deno.loadavg()

How would this be actionable? I can see how in advanced cases the module author would use this, but in our SPARK scenario I don't see why SPARK would want to capture this 🤔

Let's say we have a suspicion that SPARK has a memory leak or maybe that it is triggering a memory leak in Zinnia. I would like to run SPARK on my computer using zinnia run and observe memory usage patterns to see if I can spot the problem.

I think the only tool I have available right now is to use OS-level tools like macOS Activity Monitor and the ps command to observe the amount of memory used by the Zinnia process.

Debugging memory leaks would be easier if my Zinnia module could observe how much memory is used at given points in time.

It would be even better if there were a way to create heap snapshots for inspection in Chrome DevTools. And even more awesome would be the option to inspect the Zinnia process by connecting it to Chrome DevTools; DevTools provides functionality for creating memory snapshots and comparing them.

I am thinking about adding instrumentation & telemetry to zinniad. We can periodically report memory usage, event loop latency and other useful information.

I'm not so sure about that. Zinnia can run many different things (outside of Station); would it be useful to collect information when we don't know what it's running?

Therefore I think it's more useful to collect this information in Station Core, assuming we can get all the info we need from outside the process, which in the current case we can. Or do you think we'll need to measure event loop latency fairly soon?

Regarding event loop latency - I think this metric can help us better understand how much of the host machine's CPU we are using. It's not needed until there is a problem, at which point we'd want it to already be in place :)
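
As a rough illustration of what such a metric involves (a sketch, not an existing zinniad feature), event-loop latency can be approximated by scheduling a timer and recording how late it fires:

// Sketch: approximate event-loop lag by measuring how late a timer fires.
const INTERVAL_MS = 500;
let expected = performance.now() + INTERVAL_MS;
setInterval(() => {
  const lag = performance.now() - expected;
  console.log(`event loop lag: ${Math.max(0, lag).toFixed(1)} ms`);
  expected = performance.now() + INTERVAL_MS;
}, INTERVAL_MS);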

OTOH, if we can measure the CPU consumption of module processes (Zinnia, Bacalhau, etc.) from outside, then I think that's good enough for now.

As for submitting the metrics:

  1. This would be implemented in zinniad only. Do we want to support zinniad usage outside of Filecoin Station? I think it will be much easier for us if we keep zinniad tailored specifically to Station needs.

  2. I am thinking about introducing a new optional env variable to specify InfluxDB writer URL to use for submitting the metrics. This way, zinniad does not report any metrics by default. People running zinniad in their custom setup can provide their own InfluxDB URL to let "their" zinniad instance report metrics to their database.
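
For illustration, such an opt-in reporter could map onto InfluxDB's v2 HTTP write API roughly like this. This is a sketch only; the env variable names are placeholders, and nothing here is implemented:

// Sketch only: opt-in metrics reporting via InfluxDB's v2 HTTP write API.
// The env variable names are placeholders, not an agreed-on interface.
const url = Deno.env.get('INFLUXDB_URL');
const token = Deno.env.get('INFLUXDB_TOKEN');
const org = Deno.env.get('INFLUXDB_ORG');
const bucket = Deno.env.get('INFLUXDB_BUCKET');

async function reportMemory() {
  const rss = Deno.memoryUsage().rss;
  // InfluxDB line protocol: <measurement> <field>=<value> <timestamp>
  const line = `zinniad_memory rss=${rss} ${Math.floor(Date.now() / 1000)}`;
  await fetch(`${url}/api/v2/write?org=${org}&bucket=${bucket}&precision=s`, {
    method: 'POST',
    headers: { Authorization: `Token ${token}` },
    body: line,
  });
}

// Only report when the operator opted in by providing a URL.
if (url) setInterval(reportMemory, 60_000);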

And even if we decide to measure stats from inside Zinnia, I wonder if Zinnia should report it itself, or whether it could expose those stats for Station Core to query and submit.

This is a fair point. I guess it depends on how frequently we want to report these metrics and what the overhead is of sending them via Station polling as opposed to submitting them to InfluxDB directly.

@bajtos
Member

bajtos commented Jun 27, 2023

@juliangruber To mitigate the impact of OOM, is it perhaps time to improve Station Core to automatically restart Zinnia when it crashes?
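
A minimal sketch of what that could look like in Station Core (Node.js); the command and arguments below are placeholders, not Station Core's actual configuration:

// Sketch only: restart zinniad whenever it exits (e.g. after an OOM kill).
import { spawn } from 'node:child_process';

function startZinniad() {
  const child = spawn('zinniad', ['modules/spark/main.js'], { stdio: 'inherit' });
  child.on('exit', (code, signal) => {
    console.error(`zinniad exited (code=${code}, signal=${signal}); restarting in 5s`);
    setTimeout(startZinniad, 5_000);
  });
}

startZinniad();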

@juliangruber
Member Author

I think the only tool I have available right now is to use OS-level tools like macOS Activity Monitor and the ps command to observe the amount of memory used by the Zinnia process.

Debugging memory leaks would be easier if my Zinnia module could observe how much memory is used at given points in time.

Can you please help me understand why Zinnia exposing this information would be easier than the OS exposing it? 🤔

It would be even better if there were a way to create heap snapshots for inspection in Chrome DevTools. And even more awesome would be the option to inspect the Zinnia process by connecting it to Chrome DevTools; DevTools provides functionality for creating memory snapshots and comparing them.

+100 That would be perfect

Regarding event loop latency - I think this metric can help us better understand how much of the host machine's CPU we are using. It's not needed until there is a problem, at which point we'd want it to already be in place :)

Given our small team size my vote is to add it once we require it :P I very much agree that it will be useful eventually.

As for submitting the metrics:

This would be implemented in zinniad only. Do we want to support zinniad usage outside of Filecoin Station? I think it will be much easier for us if we keep zinniad tailored specifically to Station needs.

Fair point. To me, zinniad is more independent of Station, but it's absolutely fair to keep them coupled for now.

I am thinking about introducing a new optional env variable to specify InfluxDB writer URL to use for submitting the metrics. This way, zinniad does not report any metrics by default. People running zinniad in their custom setup can provide their own InfluxDB URL to let "their" zinniad instance report metrics to their database.

It will need more config variables, right?

  • url
  • token
  • org
  • bucket
  • precision

If you want to add Influx reporting to zinnia I'm not opposed at all; from a Station perspective it looks like a relatively straightforward change to implement.

@bajtos
Member

bajtos commented Jun 28, 2023

Can you please help me understand why Zinnia exposing this information would be easier than the OS exposing it?

With a Zinnia API, I can write a Zinnia script that repeatedly calls the code suspected of leaking memory and collects memory usage information over time.

An example from denoland/deno#18369 (comment):

console.log('Deno version', Deno.version.deno);

let timestamp = new Date();

while (true) {
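  // Roughly once per second, log the current RSS in MB while repeatedly hitting the endpoint under test.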
  if (Date.now() >= timestamp.valueOf()) {
    const bytes = Deno.memoryUsage().rss;
    console.log(timestamp.toISOString(), Math.floor(bytes / (1024 * 1024) * 10) / 10);
    timestamp = new Date(timestamp.valueOf() + 1000);
  }
  const response = await fetch('http://localhost:8080');
  await response.text();
}

To measure this from the outside, I would need to figure out which ps options give me memory usage of a process and then write a script repeatedly calling ps. It's doable, but it seems to me like more work. (Also, ps is not available on Windows.)

@bajtos
Member

bajtos commented Jun 28, 2023

I opened two follow-up issues:

@juliangruber is there anything else we want to change in the near future based on what we learned here, or can we close this issue as resolved?

@juliangruber
Member Author

To measure this from the outside, I would need to figure out which ps options give me memory usage of a process and then write a script repeatedly calling ps. It's doable, but it seems to me like more work. (Also, ps is not available on Windows.)

Just my 2 cents, but to me writing this script is more work than running ps / Activity Monitor.app.

@juliangruber is there anything else we want to change in the near future based on what we learned here, or can we close this issue as resolved?

I think that captures it all :)
