if loki is not reachable and loki-docker-driver is activated, containers apps stops and cannot be stopped/killed #2361
Comments
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
:-( |
This issue is being closed without any comment/feedback? For me/us this is a major issue/blocker. @owen-d Can you please comment? Thank you! |
#2017 fixed the same problem for me |
Do you mean setting the non-blocking mode? |
I could reproduce the problem: running Loki and the above client container, then stopping Loki, the client container fails: |
Yeah I meant the non-blocking mode, I haven't noticed it in the original issue, sorry. |
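For anyone following along, non-blocking mode can also be set per container. A minimal sketch, assuming the loki driver plugin is installed; the URL and buffer size are placeholders, and `mode`/`max-buffer-size` are generic Docker logging options:

```sh
# Non-blocking mode: the container's stdout/stderr goes into an in-memory ring buffer,
# so a slow or unreachable Loki endpoint no longer blocks the application's writes
# (at the cost of possibly dropping log lines when the buffer fills up).
docker run -d --name nonblocking-demo \
  --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  alpine sh -c 'while true; do date; sleep 1; done'
```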
No response? 😢 |
Hi, I found out that stopping a container (any container) incurs a "penalty" of between 5 and 15 minutes when loki is the logging driver and the destination server (either loki or promtail) is unreachable.
At the moment we are trying with the … Is there any viable fix available at the moment? |
I'm investigating! You can even reproduce it by directly starting any container with the loki logger and some unreachable loki-url.
In case 1 you can stop/kill the container; the docker daemon log is not that useful either.
|
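A minimal reproduction along those lines, assuming the loki-docker-driver plugin is installed and using a deliberately unreachable `loki-url` (names and values are placeholders):

```sh
# Start a chatty container whose logs are routed to a Loki endpoint that does not exist.
docker run -d --name loki-hang-test \
  --log-driver=loki \
  --log-opt loki-url="http://unreachable.invalid:3100/loki/api/v1/push" \
  alpine sh -c 'while true; do date; sleep 1; done'

# With the default driver settings this can block for several minutes
# instead of returning within the usual stop grace period.
time docker stop loki-hang-test
```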
I probably figured out the reason why it takes so much time, and I can say my suspicion was right; I think this is probably intended behavior. If we try to start a container reducing the backoff options to the (almost) minimum, we can see the container stops (almost) immediately. As far as my understanding goes, though, if the driver is unable to send the logs within the backoff window, the logs will be lost (so I would consider the keep-file option seriously...). In my opinion the best thing to do would be to cache the logs locally if the client is unable to send them within the backoff window, and send them later on. |
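A sketch of what "reducing the backoff options to the (almost) minimum" can look like; the values are illustrative, and the option names (`loki-retries`, `loki-max-backoff`, `loki-timeout`, `keep-file`) are the driver's documented settings:

```sh
# Shrink retries/backoff so a failing push gives up quickly; the trade-off is that
# logs which cannot be delivered inside that window are dropped, hence keep-file=true
# to retain the local JSON log file as a fallback.
docker run -d --name quick-stop-test \
  --log-driver=loki \
  --log-opt loki-url="http://unreachable.invalid:3100/loki/api/v1/push" \
  --log-opt loki-retries=1 \
  --log-opt loki-max-backoff=500ms \
  --log-opt loki-timeout=1s \
  --log-opt keep-file=true \
  alpine sh -c 'while true; do date; sleep 1; done'

time docker stop quick-stop-test   # should now return in seconds rather than minutes
```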
Agree with the backoff logic. Tested with the fluentd log driver and it looks like the same there as well, except maybe fluentd has a lower default backoff time (so the container stops more quickly). And I see this in the daemon log:
Also, another small improvement could be to add a check to see if the loki-url is reachable during start of the container and fail immediately. |
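The driver itself does not do such a check, but it can be approximated outside the driver; a hypothetical pre-flight wrapper, assuming Loki's standard `/ready` endpoint and a placeholder address:

```sh
#!/bin/sh
# Hypothetical wrapper: refuse to start the container if Loki is not answering,
# so a misconfigured or down endpoint fails fast instead of hanging later.
LOKI_BASE="http://loki:3100"   # placeholder address
if ! curl -sf --max-time 2 "${LOKI_BASE}/ready" > /dev/null; then
  echo "Loki at ${LOKI_BASE} is unreachable; not starting the container" >&2
  exit 1
fi
docker run -d --name myapp \
  --log-driver=loki \
  --log-opt loki-url="${LOKI_BASE}/loki/api/v1/push" \
  alpine sh -c 'while true; do date; sleep 1; done'
```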
Also, the 5-minute time limit comes from the default max-backoff we use: https://github.com/grafana/loki/blob/master/pkg/promtail/client/config.go#L19 |
I disagree, as starting a service may be more important than having its logs (and debugging may not be that easy). As I said, in my opinion the best option would be to cache the logs and send them as soon as a Loki endpoint becomes available; in the meantime, find a way to warn the user about the unreachable endpoint and cache the logs. |
Agree that having a better way to maintain control over a docker container when the endpoint is unavailable is critical. I've been experimenting with different architecture deployments of Loki and found that even a kill of the docker container doesn't work. Not being able to shut down/restart a container because the Loki driver can't send logs out shouldn't impact my container. Will look to change my container property defaults to get around this. |
Maybe we should accept the behaviour of the docker driver plugin and send the logfiles to a local "kind of daemonset" promtail, which supports the loki push api? https://grafana.com/docs/loki/latest/clients/promtail/#loki-push-api |
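A sketch of what that could look like from the Docker side, assuming each host runs a Promtail with the Loki push API enabled on a placeholder port 3500; the remote Loki being down then only affects Promtail's own buffering, not the containers:

```sh
# Point the driver at a host-local Promtail push endpoint instead of the remote Loki.
# Promtail is assumed to expose the Loki push API on localhost:3500 (placeholder).
docker run -d --name via-local-promtail \
  --log-driver=loki \
  --log-opt loki-url="http://localhost:3500/loki/api/v1/push" \
  --log-opt mode=non-blocking \
  alpine sh -c 'while true; do date; sleep 1; done'
```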
Hi everyone, does somebody have a working fork with changes that allow losing data if that situation occurs? |
Hi, you don't need a fork. |
Pretty sad this will block a https://grafana.com/docs/loki/latest/clients/docker-driver/configuration/ |
That's what the file based discovery already does. The logging driver is really for local use cases, and the Docker service discovery is for when you don't have the permissions to mount the logging directory. |
**What this PR does / why we need it**: This pulls @Pandry's [workaround](#2361 (comment)) for the seemingly deadlocked Docker daemon into the documentation.

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
What do you mean by "That's what the file based discovery already does"? Is there a better way of sending the logs to loki than using the docker-driver, one that does not risk blocking the way the docker-driver does? |
@andoks Yes. There's the service discovery, or you could use file discovery or journald. |
From reading through the docs and this issue, I can see there are three main solutions:
Which is the officially recommended solution to use for new projects? |
## Which problem is this PR solving?
Currently, the `grafana-integration` example doesn't work properly: if you run `docker-compose up` in that folder, services will start but only logging will work; the metrics and tracing won't.

## Description of the changes
* Fix traces endpoint config for hotrod (`traces export: Post "http://localhost:4318/v1/traces": dial tcp 127.0.0.1:4318: connect: connection refused`)
* Fix hotrod metrics scraping (endpoint has moved to the frontend service)
* Pin Jaeger/hotrod tags to prevent future issues
* Fix Grafana dashboard (metric names, labels, migrate to new time series panel)
* Add default Grafana credentials to README
* Fix the loki container being stuck on shutdown by setting shorter timeouts (bug with the driver: grafana/loki#2361 (comment))
* Update Grafana/Loki/Prom pinned versions (especially to get Grafana UI improvements)

## How was this change tested?
`docker-compose up` 🙂

<img width="2304" alt="SCR-20231201-cohy" src="https://github.com/jaegertracing/jaeger/assets/11699655/22016bd9-0f99-40c7-be18-eb733561572a">
<img width="2304" alt="SCR-20231201-cxos" src="https://github.com/jaegertracing/jaeger/assets/11699655/db761bc3-53ac-41fa-914d-803c73233ad7">
<img width="2304" alt="SCR-20231201-coke" src="https://github.com/jaegertracing/jaeger/assets/11699655/004c99f0-0d1f-46f1-a9da-f50f0148d377">

## Checklist
- [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [ ] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

Signed-off-by: Stanislas Lange <git@slange.me>
Thanks. I haven't been able to find any working solution yet. As soon as the Loki container goes offline, I'm unable to restart it or otherwise do useful stuff with docker, and only a shutdown or powerdown command properly brings my docker down and back up again. |
This is quite straightforwardly mentioned in the deadlock section: |
When I raised the issue? Or now? |
It was added |
I believe I tried that once, but it's been a long time. |
I wonder if we should finally close this issue. |
I have the same problem. So I guess it's still a problem. :/ |
@longGr did you try one of the documented workarounds? |
Even the docs are a deadlock: this one suggests promtail, this one suggests the driver. What a ... |
I've been struggling for a few days, but it seems like I've solved the problem. The reason for this issue is that the official documentation is terribly bad and hasn't been maintained. The solution is not to follow the documentation, but to manually install the latest version instead. Listen! When I wrote this comment, the loki version is Step:
In sum. I use When you install https://grafana.com/docs/loki/latest/setup/upgrade/ The official documentation is really outdated and wasted a lot of my time. Time is valuable, and I hate this documentation. I despise it. I've never seen worse documentation. I really don't know why they haven't updated it. Why are their employees in such chaos? When your product has good documentation, it saves everyone time and benefits everyone. I have started a company myself, and I really wouldn't purchase a product without proper documentation. |
@daimalou Hi, do you no longer have the docker daemon hang? |
@Potrek1337 "log-driver": "loki",
"log-opts": {
"mode":"non-blocking",
"loki-url": "http://localhost:3100/loki/api/v1/push",
"loki-batch-size": "400",
"loki-retries": "2",
"loki-max-backoff":"800ms",
"loki-timeout":"1s"
} |
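If you would rather not change the daemon-wide default, the same idea can be expressed per container. A sketch mirroring the options above (values illustrative; the short `loki-retries`, `loki-max-backoff` and `loki-timeout` are what bound the shutdown delay):

```sh
# Per-container equivalent of the daemon.json settings above (illustrative values).
# The small retry/backoff/timeout values keep "docker stop" from hanging for minutes
# when the Loki endpoint is unreachable; undeliverable logs are dropped.
docker run -d --name myapp \
  --log-driver=loki \
  --log-opt mode=non-blocking \
  --log-opt loki-url="http://localhost:3100/loki/api/v1/push" \
  --log-opt loki-batch-size=400 \
  --log-opt loki-retries=2 \
  --log-opt loki-max-backoff=800ms \
  --log-opt loki-timeout=1s \
  alpine sh -c 'while true; do date; sleep 1; done'
```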
@daimalou Setting those options is what made the difference. Now, you do raise a good point that the docs on the driver and Promtail should be clearer and use the latest version, but updating to the latest version did not solve the deadlock; changing the configuration like you did is what did. |
@jeschkies These solutions (#2361 (comment)) and configurations were obtained after searching all over GitHub, including the Grafana and Docker repositories and other projects. |
Describe the bug
We have installed the loki-docker-driver on all our devices.
The Loki server runs on a separate server. If the Loki server is updated/restarted or just not reachable, then after a short time all containers get stuck (docker logs does not update anymore).
If the Loki server is not reachable, the containers can neither be stopped/killed nor restarted.
To Reproduce
Steps to reproduce the behavior:
1. Configure `/etc/docker/daemon.json` with the loki log driver:
   ```json
   {
     "live-restore": true,
     "log-driver": "loki",
     "log-opts": {
       "loki-url": "http://loki:3100/api/prom/push",
       "mode": "non-blocking",
       "loki-batch-size": "400",
       "max-size": "1g"
     }
   }
   ```
2. (client) Start a test container:
   ```sh
   docker run --rm --name der-container -d debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
   ```
3. (client) `docker exec -it der-container tail -f /tmp/ts` shows the time every second.
4. (client) `docker logs -f der-container` shows numbers from 0-1000000.
5. (client) `docker stop der-container`

Expected behavior
A clear and concise description of what you expected to happen.
I would like all containers to continue to run as desired even if Loki is not accessible.
Containers should be able to be started/stopped even if Loki is not reachable.
Environment:
Screenshots, Promtail config, or terminal output
loki-docker-driver version: loki-docker-driver:master-616771a (from that version on, the driver option "non-blocking" is supported)
loki server: 1.5.0
I am very grateful for any help; this problem has caused our whole system to collapse.