if loki is not reachable and loki-docker-driver is activated, containers apps stops and cannot be stopped/killed #2361
Comments
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
:-( |
This issue is being closed without any comment/feedback? For me/us this is a major issue/blocker. @owen-d Can you please comment? Thank you! |
#2017 fixed the same problem for me |
Do you mean setting the non-blocking mode? |
I could reproduce the problem: running Loki and the above client container, then stopping Loki, the client container fails: |
Yeah I meant the non-blocking mode, I haven't noticed it in the original issue, sorry. |
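For anyone following along, non-blocking mode can also be set per container. A minimal sketch, assuming the loki driver plugin is installed; the URL and buffer size are placeholders, and `mode`/`max-buffer-size` are generic Docker logging options:

```sh
# Non-blocking mode: the container's stdout/stderr goes into an in-memory ring buffer,
# so a slow or unreachable Loki endpoint no longer blocks the application's writes
# (at the cost of possibly dropping log lines when the buffer fills up).
docker run -d --name nonblocking-demo \
  --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  alpine sh -c 'while true; do date; sleep 1; done'
```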
No response? 😢 |
Hi, I found out that stopping a container (any container) incurs a "penalty" of between 5 and 15 minutes when loki is the logging driver and the destination server (either loki or promtail) is unreachable.
At the moment we are trying with the … Is there any viable fix available at the moment? |
I'm investigating! You can even reproduce it by directly starting any container with the loki logger and some unreachable loki-url.
In case 1 you can stop/kill the container; the docker daemon log is not that useful either.
|
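A minimal reproduction along those lines, assuming the loki-docker-driver plugin is installed and using a deliberately unreachable `loki-url` (names and values are placeholders):

```sh
# Start a chatty container whose logs are routed to a Loki endpoint that does not exist.
docker run -d --name loki-hang-test \
  --log-driver=loki \
  --log-opt loki-url="http://unreachable.invalid:3100/loki/api/v1/push" \
  alpine sh -c 'while true; do date; sleep 1; done'

# With the default driver settings this can block for several minutes
# instead of returning within the usual stop grace period.
time docker stop loki-hang-test
```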
I probably figured out the reason why it takes so much time, and I can say my suspicion was right; I think this is probably intended behavior. If we try to start a container reducing the backoff options to the (almost) minimum, we can see the container stops (almost) immediately. As far as my understanding goes, though, if the driver is unable to send the logs within the backoff window, the logs will be lost (so I would consider the keep-file option seriously...). In my opinion the best thing to do would be to cache the logs locally if the client is unable to send them within the backoff window, and send them later on. |
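A sketch of what "reducing the backoff options to the (almost) minimum" can look like; the values are illustrative, and the option names (`loki-retries`, `loki-max-backoff`, `loki-timeout`, `keep-file`) are the driver's documented settings:

```sh
# Shrink retries/backoff so a failing push gives up quickly; the trade-off is that
# logs which cannot be delivered inside that window are dropped, hence keep-file=true
# to retain the local JSON log file as a fallback.
docker run -d --name quick-stop-test \
  --log-driver=loki \
  --log-opt loki-url="http://unreachable.invalid:3100/loki/api/v1/push" \
  --log-opt loki-retries=1 \
  --log-opt loki-max-backoff=500ms \
  --log-opt loki-timeout=1s \
  --log-opt keep-file=true \
  alpine sh -c 'while true; do date; sleep 1; done'

time docker stop quick-stop-test   # should now return in seconds rather than minutes
```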
Agree with the backoff logic. Tested with the fluentd log driver and it looks like the same there as well, except maybe fluentd has a lower default backoff time (so the container stops more quickly). And I see this in the daemon log:
Also, another small improvement could be to add a check to see if the loki-url is reachable during start of the container and fail immediately. |
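The driver itself does not do such a check, but it can be approximated outside the driver; a hypothetical pre-flight wrapper, assuming Loki's standard `/ready` endpoint and a placeholder address:

```sh
#!/bin/sh
# Hypothetical wrapper: refuse to start the container if Loki is not answering,
# so a misconfigured or down endpoint fails fast instead of hanging later.
LOKI_BASE="http://loki:3100"   # placeholder address
if ! curl -sf --max-time 2 "${LOKI_BASE}/ready" > /dev/null; then
  echo "Loki at ${LOKI_BASE} is unreachable; not starting the container" >&2
  exit 1
fi
docker run -d --name myapp \
  --log-driver=loki \
  --log-opt loki-url="${LOKI_BASE}/loki/api/v1/push" \
  alpine sh -c 'while true; do date; sleep 1; done'
```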
Also, the 5-minute time limit comes from the default max-backoff we use: https://github.com/grafana/loki/blob/master/pkg/promtail/client/config.go#L19 |
I disagree, as starting a service may be more important than having its logs (and debugging may not be that easy). As I said, in my opinion the best option would be to cache the logs and send them as soon as a Loki endpoint becomes available; in the meantime, find a way to warn the user about the unreachable endpoint and cache the logs. |
Agree that having a better way to maintain control over a docker container when the endpoint is unavailable is critical. I've been experimenting with different architecture deployments of Loki and found that even a kill of the docker container doesn't work. Not being able to shut down/restart a container because the Loki driver can't send logs out shouldn't impact my container. Will look to change my container property defaults to get around this. |
Maybe we should accept the behaviour of the docker driver plugin and send the logfiles to a local "kind of daemonset" promtail, which supports the loki push api? https://grafana.com/docs/loki/latest/clients/promtail/#loki-push-api |
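A sketch of what that could look like from the Docker side, assuming each host runs a Promtail with the Loki push API enabled on a placeholder port 3500; the remote Loki being down then only affects Promtail's own buffering, not the containers:

```sh
# Point the driver at a host-local Promtail push endpoint instead of the remote Loki.
# Promtail is assumed to expose the Loki push API on localhost:3500 (placeholder).
docker run -d --name via-local-promtail \
  --log-driver=loki \
  --log-opt loki-url="http://localhost:3500/loki/api/v1/push" \
  --log-opt mode=non-blocking \
  alpine sh -c 'while true; do date; sleep 1; done'
```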
Hi everyone, does somebody have a working fork with changes that allow losing data if that situation occurs? |
Hi, you don't need a fork. |
Pretty sad this will block a https://grafana.com/docs/loki/latest/clients/docker-driver/configuration/ |
That's what the file based discovery already does. The logging driver is really for local use cases, and the Docker service discovery is for when you don't have the permissions to mount the logging directory. |
**What this PR does / why we need it**: This pulls @Pandry's [workaround](#2361 (comment)) for the seemingly deadlocked Docker daemon into the documentation.

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
What do you mean by "That's what the file based discovery already does"? Is there a better way of sending the logs to loki than using the docker-driver, one that does not risk blocking the way the docker-driver does? |
@andoks Yes. There's the service discovery, or you could use file discovery or journald. |
From reading through the docs and this issue, I can see there are three main solutions:
Which is the officially recommended solution to use for new projects? |
## Which problem is this PR solving?
Currently, the `grafana-integration` example doesn't work properly: if you run `docker-compose up` in that folder, services will start but only logging will work; the metrics and tracing won't.

## Description of the changes
* Fix traces endpoint config for hotrod (`traces export: Post "http://localhost:4318/v1/traces": dial tcp 127.0.0.1:4318: connect: connection refused`)
* Fix hotrod metrics scraping (endpoint has moved to the frontend service)
* Pin Jaeger/hotrod tags to prevent future issues
* Fix Grafana dashboard (metric names, labels, migrate to new time series panel)
* Add default Grafana credentials to README
* Fix the loki container being stuck on shutdown by setting shorter timeouts (bug with the driver: grafana/loki#2361 (comment))
* Update Grafana/Loki/Prom pinned versions (especially to get Grafana UI improvements)

## How was this change tested?
`docker-compose up` 🙂

<img width="2304" alt="SCR-20231201-cohy" src="https://github.com/jaegertracing/jaeger/assets/11699655/22016bd9-0f99-40c7-be18-eb733561572a">
<img width="2304" alt="SCR-20231201-cxos" src="https://github.com/jaegertracing/jaeger/assets/11699655/db761bc3-53ac-41fa-914d-803c73233ad7">
<img width="2304" alt="SCR-20231201-coke" src="https://github.com/jaegertracing/jaeger/assets/11699655/004c99f0-0d1f-46f1-a9da-f50f0148d377">

## Checklist
- [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [ ] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

Signed-off-by: Stanislas Lange <git@slange.me>
Thanks. I haven't been able to find any working solution yet. As soon as the Loki container goes offline, I'm unable to restart it or otherwise do useful stuff with docker, and only a shutdown or powerdown command properly brings my docker down and back up again. |
This is quite straightforwardly mentioned in the deadlock section: |
When I raised the issue? Or now? |
It was added |
I believe I tried that once, but it's been a long time. |
I wonder if we should finally close this issue. |
I have the same problem. So I guess it's still a problem. :/ |
@longGr did you try one of the documented workarounds? |
Even the docs are a deadlock: this one suggests promtail, this one suggests the driver. What a ... |
I've been struggling for a few days, but it seems like I've solved the problem. The reason for this issue is that the official documentation is terribly bad and hasn't been maintained. The solution is not to follow the documentation, but to manually install the latest version instead. Listen! When I wrote this comment, the loki version is Step:
In sum. I use When you install https://grafana.com/docs/loki/latest/setup/upgrade/ The official documentation is really outdated and wasted a lot of my time. Time is valuable, and I hate this documentation. I despise it. I've never seen worse documentation. I really don't know why they haven't updated it. Why are their employees in such chaos? When your product has good documentation, it saves everyone time and benefits everyone. I have started a company myself, and I really wouldn't purchase a product without proper documentation. |
@daimalou Hi, do you no longer have the docker daemon hang? |
@Potrek1337 "log-driver": "loki",
"log-opts": {
"mode":"non-blocking",
"loki-url": "http://localhost:3100/loki/api/v1/push",
"loki-batch-size": "400",
"loki-retries": "2",
"loki-max-backoff":"800ms",
"loki-timeout":"1s"
} |
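If you would rather not change the daemon-wide default, the same idea can be expressed per container. A sketch mirroring the options above (values illustrative; the short `loki-retries`, `loki-max-backoff` and `loki-timeout` are what bound the shutdown delay):

```sh
# Per-container equivalent of the daemon.json settings above (illustrative values).
# The small retry/backoff/timeout values keep "docker stop" from hanging for minutes
# when the Loki endpoint is unreachable; undeliverable logs are dropped.
docker run -d --name myapp \
  --log-driver=loki \
  --log-opt mode=non-blocking \
  --log-opt loki-url="http://localhost:3100/loki/api/v1/push" \
  --log-opt loki-batch-size=400 \
  --log-opt loki-retries=2 \
  --log-opt loki-max-backoff=800ms \
  --log-opt loki-timeout=1s \
  alpine sh -c 'while true; do date; sleep 1; done'
```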
@daimalou Setting those options is what made the difference. Now, you do raise a good point that the docs on the driver and Promtail should be clearer and use the latest version, but updating to the latest version did not solve the deadlock; changing the configuration like you did is what did. |
@jeschkies These solutions (#2361 (comment)) and configurations were obtained after searching all over GitHub, including the Grafana and Docker repositories and other projects. |
Describe the bug
We have installed the loki-docker-driver on all our devices.
The Loki server runs on a separate server. If the Loki server is updated/restarted or just not reachable, then after a short time all containers get stuck (docker logs does not update anymore).
If the Loki server is not reachable, the containers can neither be stopped/killed nor restarted.
To Reproduce
Steps to reproduce the behavior:
1. Configure `/etc/docker/daemon.json` with the loki log driver:
   ```json
   {
     "live-restore": true,
     "log-driver": "loki",
     "log-opts": {
       "loki-url": "http://loki:3100/api/prom/push",
       "mode": "non-blocking",
       "loki-batch-size": "400",
       "max-size": "1g"
     }
   }
   ```
2. (client) Start a test container:
   ```sh
   docker run --rm --name der-container -d debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
   ```
3. (client) `docker exec -it der-container tail -f /tmp/ts` shows the time every second.
4. (client) `docker logs -f der-container` shows numbers from 0-1000000.
5. (client) `docker stop der-container`

Expected behavior
A clear and concise description of what you expected to happen.
I would like all containers to continue to run as desired even if Loki is not accessible.
Containers should be able to be started/stopped even if Loki is not reachable.
Environment:
Screenshots, Promtail config, or terminal output
loki-docker-driver version: loki-docker-driver:master-616771a (from that version on, the driver option "non-blocking" is supported)
loki server: 1.5.0
I am very grateful for any help; this problem has caused our whole system to collapse.