Reoccurring errors with checkpointing #688
Comments
Hi @hazelnut-99 — thanks for reporting! Can you provide some more details about how you're running Arroyo? In particular:
Thanks!
Scheduler: Kubernetes Scheduler
And to double-check: is this a distributed Kubernetes cluster or a local one (like running in minikube or similar)? All nodes in the cluster need to be able to read the state, so if you're running on a distributed Kubernetes cluster, you need some kind of distributed state storage (typically an object store like S3, but a shared filesystem like NFS works as well). This is described in the docs (https://doc.arroyo.dev/deployment/kubernetes), but it could be made clearer.
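To illustrate the shared-storage point above, here is a minimal sketch of a sanity check on a checkpoint location. It is not Arroyo code or configuration: the function name `is_shared_storage`, the `shared_mounts` parameter, and the set of URL schemes are all assumptions made up for this example.

```python
from urllib.parse import urlparse

# Hypothetical set of object-store schemes, chosen for illustration only;
# this is not a list taken from Arroyo.
OBJECT_STORE_SCHEMES = {"s3", "gs", "abfs"}


def is_shared_storage(url: str, shared_mounts: tuple[str, ...] = ()) -> bool:
    """Return True if `url` plausibly points at storage every node can read."""
    parsed = urlparse(url)
    if parsed.scheme in OBJECT_STORE_SCHEMES:
        return True
    if parsed.scheme in ("", "file"):
        # A filesystem path only counts as shared if it sits under a mount
        # the operator has declared as shared (for example, an NFS mount).
        path = parsed.path
        return any(
            path == mount or path.startswith(mount.rstrip("/") + "/")
            for mount in shared_mounts
        )
    return False


if __name__ == "__main__":
    print(is_shared_storage("s3://my-bucket/checkpoints"))
    print(is_shared_storage("/tmp/arroyo"))
    print(is_shared_storage("file:///mnt/nfs/arroyo", shared_mounts=("/mnt/nfs",)))
```

A check like this only catches the obvious case of a node-local path on a multi-node cluster; it says nothing about permissions or whether the mount is actually present on every node.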
Hi! To provide more context, this error is also reproducible when I run it locally using the default process scheduler. Also, when I launch a new pipeline, the first 100 or so checkpoints typically succeed, but checkpointing then gets stuck on the reported error. Thanks so much!
Thanks so much for providing the query. I'm able to reproduce the issue.
Thanks for the great bug report and for helping me diagnose this issue! I've merged a fix that I believe should address it. Please try out the latest master (available as the docker image
Wow, that was swift!!! I'll test it out, thanks so much! |
Hi!
We are currently running Arroyo version 0.11.0. After launching a pipeline and allowing it to run for an extended period, the checkpointing process appears to become stuck. The controller continuously logs the following error messages:
Subsequent attempts to launch new pipelines also encounter the same errors.