Hanging forever when low memory #103

Open
gygabyte017 opened this issue Aug 12, 2021 · 12 comments
Labels
bug something broken P3 not needed for current cycle

Comments

@gygabyte017

Hi, I am experiencing kaleido randomly freezing in our production environment (Unix with Kubernetes).

I noticed that when the container is low on memory, perhaps because the main Python program has consumed a lot of resources (for instance holding the dataframe data to be plotted), the call to write_image hangs forever.

The kaleido process never terminates, there are no errors about a low-memory condition; it just sits there with zero CPU consumption forever.

How can this be improved?

This behavior is very frustrating: sometimes I just find containers stuck running forever, and if I manually relaunch them under the very same conditions they may run correctly, so I have no way to monitor whether they got stuck.

It would be OK if kaleido returned a memory error or a process-failed exception, which could then be handled. But freezing forever... is just bad.

Any advice? Thank you

@jonmmease
Collaborator

Hi @gygabyte017, thanks for the report. I'm not sure if this is possible for you, but it would be helpful to see if any logging is collected (but not displayed) before it hangs.

Are you able to reproduce the issue from a Python REPL? If so, the instructions in this issue might yield some extra info that would be helpful (#36 (comment)).

If possible, what would be most helpful would be a reproducible example consisting of:

  • A Dockerfile
  • A memory limit
  • A python script

Thanks!

@gygabyte017
Author

Hi, unfortunately it is hard for me to give you what you asked, sorry about that :( It doesn't happen on my local PC while testing; it only happens on serverless containers spawned on EKS, and the plotting happens after a lot of complex calculations involving other resources.
However, here's what I found out, hoping it might be useful somehow:

  • The container has a memory limit of 2GB. If I increase it to 3GB, it never happens.
  • At the end of the calculations, write_image is called dozens of times to produce every needed plot, and the freeze never happens at the first image, always after a few.
  • Using the trick proposed in Kaleido hangs on repeated write_image calls #42 with scope._shutdown_kaleido() (see the sketch below), it almost never freezes anymore, even though I'm using v0.2.1 (it still randomly happens on 1-2% of executions, while before it happened half of the time, so that's a good result).
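
For reference, a minimal sketch of that workaround, assuming plotly's default kaleido scope at plotly.io.kaleido.scope; the figures and file names here are only illustrative:

import plotly.graph_objects as go
import plotly.io as pio

figures = [go.Figure(data=go.Scatter(y=[1, 3, 2])) for _ in range(3)]

for i, fig in enumerate(figures):
    fig.write_image(f"plot_{i}.png")
    # Shut down the kaleido subprocess so the next call starts from a fresh
    # process instead of reusing one whose memory keeps growing.
    pio.kaleido.scope._shutdown_kaleido()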

(Not sure how I could access the frozen container, send an interrupt, and interact with the REPL to provide more info.)

Thanks

@jonmmease jonmmease added the bug something broken label Aug 13, 2021
@jonmmease
Collaborator

Thanks for this info @gygabyte017, that's helpful. Marking as a bug.

@bhachauk

bhachauk commented Sep 13, 2021

The same is happening for the to_image call.
Is there something that can be done to avoid this?

@jonmmease
Collaborator

@Bhanuchander210, are you seeing this behavior related to low memory as well?

@bhachauk

@jonmmease
I am not sure, but I suspect so.
It happens only in the production environment, randomly (the production environment has other servers too, so I didn't track it down).
After making the change to call scope._shutdown_kaleido() (as you suggested), it has not hung so far.

@jonmmease
Collaborator

Ok, thanks @Bhanuchander210.

@jonmmease
Collaborator

Notes:
Cross reference #43, which added some internal tracking of JavaScript heap usage, periodically clearing memory by refreshing the active page. If manually running scope._shutdown_kaleido() works around the issue, then I assume this internal page refresh to clear memory would do the same.

We're already refreshing the page when the heap reaches 50% of the maximum allowed. But I don't know whether this maximum limit (as returned by window.performance.memory.jsHeapSizeLimit) takes into account the available system memory. If not, then this might explain the trouble we're running into in memory-constrained environments. Two ideas (not mutually exclusive):

  1. See if the chromium API provides a way to access the system memory currently available, and incorporate that when deciding whether to refresh the page (see the sketch after this list).
  2. Make this memory limit configurable through the API.
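
As an illustration only (not kaleido's actual internals), a Python sketch of idea 1; the function name and thresholds are hypothetical, and psutil is assumed to be available for reading system memory:

import psutil

def should_refresh_page(js_used_heap, js_heap_limit, fraction=0.5):
    # Current rule: refresh once the JS heap crosses a fraction of its limit.
    if js_used_heap > fraction * js_heap_limit:
        return True
    # Extra rule: if jsHeapSizeLimit ignores the container's memory limit, also
    # refresh once the heap is large relative to what the system can still provide.
    available = psutil.virtual_memory().available
    return js_used_heap > fraction * (js_used_heap + available)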

@LukasRauth

LukasRauth commented Nov 3, 2021

Hi,
I get the same issue on my local PC with the first plotly figure I want to statically export (to a PDF), when I manually limit the Python process's virtual memory ("RLIMIT_AS") to, for example, 6 GB.

Is there any progress on that topic?

Here is the code that I use to limit the virtual memory. (That's something we need to do for that specific program to make sure it won't conflict with the production processes...)

import resource

def limit_memory():
    # Cap this process's virtual address space (RLIMIT_AS) at roughly 6 GB.
    max_memory_mb = 6000
    soft_limit = max_memory_mb * 1024 * 1024  # convert MB to bytes
    # Only the soft limit is set; the hard limit stays unlimited.
    resource.setrlimit(resource.RLIMIT_AS, (soft_limit, resource.RLIM_INFINITY))

Since it is the first export anyway, I cannot use the proposed workaround with scope._shutdown_kaleido()...

I'm running on:

kaleido==0.2.1
plotly==5.3.1

@MaartenBW

Hi @gygabyte017

Did you ever resolve this issue? Did downgrading to v0.1.0 work to solve this issue?

Thanks.

@gygabyte017
Author

Hi @MaartenBW, unfortunately I didn't; every version seems equally random. I don't believe there is a reason to prefer 0.1.0 over 0.2.1 or any other version; it's just luck depending on the machine's resources.

I managed to develop an ugly workaround, that is: 1) increase the maximum RAM on the containers, even though it shouldn't be necessary, and 2) execute write_image in a separate process with a timeout; if after e.g. 30 seconds it is still running, I kill the separate process and try again, up to 5 tries (sketched below).

In this way it's very rare that all 5 tries fail, but it may still happen.
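
A rough sketch of that retry workaround; the timeout, retry count, and output path are illustrative, and fig is assumed to be a plotly figure that can be handed to a worker process:

from multiprocessing import Process

def export_with_retries(fig, path, timeout=30, max_tries=5):
    for _ in range(max_tries):
        # Run the export in a separate process so a hung kaleido can be killed.
        p = Process(target=fig.write_image, args=(path,))
        p.start()
        p.join(timeout)  # wait at most `timeout` seconds
        if not p.is_alive():
            return  # the export finished within the timeout
        p.terminate()  # kaleido hung: kill the worker and retry
        p.join()
    raise RuntimeError(f"write_image still hanging after {max_tries} tries")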

Now I want to try the solution described here, maybe it can work in a stable way? #110 (comment)

@MaartenBW

@gygabyte017 Wow, thanks for your fast reply.

@gvwilson gvwilson self-assigned this Jul 26, 2024
@gvwilson gvwilson removed their assignment Aug 3, 2024
@gvwilson gvwilson added the P3 not needed for current cycle label Aug 14, 2024