
Swarm's tasks.db takes up lots of disk space #2367

Closed
joelchen opened this issue Sep 6, 2017 · 24 comments

Comments

@joelchen

joelchen commented Sep 6, 2017

Servers running a couple of Docker containers on Swarm have tasks.db files using a few GB of disk space. The containers are deployed in global mode on 2 servers, both Swarm managers, and their tasks.db files gradually fill up 8 GB of each server's disk space. The Docker version is 17.06.1-ce.

What could be taking up such enormous space in tasks.db? How do I prevent tasks.db from growing?

@nishanttotla
Contributor

@joelchen can you provide more information about the services you're running?

@christopherobin

@nishanttotla I just had the same issue happen; the node was a worker running about 10 stacks, making for 30+ services.

The Docker process was using about 10GB of memory, so I decided to restart it. I went with the following procedure:

  1. Set the node to drain mode
  2. Restart docker

Upon restarting I noticed two things:

  • The node would not connect to the Swarm (running docker node ls on the manager showed it as down)
  • It would try to spawn very old versions of some stacks; for example, we are running gitlab/gitlab-ce:10.1.3-ce.0 but it started creating containers using tag 9.5.3-ce.0

After digging around I thought the task DB was probably corrupted and checked /var/lib/docker/swarm/worker, only to find that tasks.db was 8.8GB in size. Since I needed the cluster up, I went with the dangerous decision of stopping Docker, renaming the worker folder, and restarting the node.
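
Roughly what that looked like, assuming systemd and the default data root (your paths and init system may differ, and this throws away the node's local worker state, hence "dangerous"):

systemctl stop docker
mv /var/lib/docker/swarm/worker /var/lib/docker/swarm/worker.bak
systemctl start docker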

It then proceeded to properly join the swarm and start the correct versions of the containers, but now I have ghost tasks that are marked as being on that node even though the node has no information about them.

I saved a copy of the tasks.db and I'm going to try taking a look at the data in it; I will post more information here once this is done.

@christopherobin

bolt check tasks.db

page 2282443: unreachable unfreed
1 errors found
invalid value

bolt stats tasks.db

Aggregate statistics for 1 buckets

Page count statistics
	Number of logical branch pages: 18
	Number of physical branch overflow pages: 0
	Number of logical leaf pages: 1257
	Number of physical leaf overflow pages: 222
Tree statistics
	Number of keys/value pairs: 11531
	Number of levels in B+tree: 5
Page size utilization
	Bytes allocated for physical branch pages: 73728
	Bytes actually used for branch data: 34195 (46%)
	Bytes allocated for physical leaf pages: 6057984
	Bytes actually used for leaf data: 3324514 (54%)
Bucket statistics
	Total number of buckets: 5321
	Total number on inlined buckets: 4873 (91%)
	Bytes used for inlined buckets: 1018794 (30%)

bolt pages tasks.db | awk '{ print $2 }' | sort | uniq -c

      1 ==========
      1 TYPE
     19 branch
2276487 free
      1 freelist
   1259 leaf
      2 meta

bolt compact -o tasks.compressed.db tasks.db && ls -l tasks.*

9365671936 -> 8388608 bytes (gain=1116.48x)
-rw-r--r-- 1 crobin crobin    8388608 Nov 13 15:35 tasks.compressed.db
-rw-r--r-- 1 crobin crobin 9365671936 Nov 13 15:33 tasks.db

So it seems to be an issue with docker never compacting the database?

@donswa

donswa commented Feb 27, 2018

@nishanttotla We are also seeing the same issue. /var/lib/docker/swarm/worker/tasks.db is 5 GB.
docker version - 17.06.1-ce
Manager Status - Leader
No of Nodes - One
Is there any way to recover the space? Could you please advise on cleanup steps?

@tanmng

tanmng commented May 8, 2018

Any updates on this, guys? I would like to recover some hard disk space on a small experimental Docker swarm I'm running right now.

@pouicr

pouicr commented Aug 1, 2018

Hi,

Same for me... can we just stop the daemon, remove this file, and restart?

@pouicr

pouicr commented Aug 1, 2018

Answering my own question: it works.

@Davidian1024

I believe I'm running into this issue as well. /var/lib/docker/swarm/worker/tasks.db has grown to 12GB and it seems that it's never going to stop.

I'd rather not stop the daemon, delete the tasks.db file, and then start the daemon again, if possible.

Is it possible to determine what's filling up this tasks database? Could the containers that I'm running be leaving stale tasks in the database?

The bolt commands that @christopherobin used above are a bit of a mystery to me. Some Google searches have me thinking that this tasks.db file is a Bolt database.
https://github.com/boltdb/bolt
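
For what it's worth, here is my guess at how to reproduce those commands on a copy of the file (installing the bolt CLI assumes a standard Go setup; I'd work on a copy since the live daemon keeps the database locked, and the copy may be slightly inconsistent while the daemon is writing):

go get github.com/boltdb/bolt/...   # installs the bolt CLI into $GOPATH/bin
sudo cp /var/lib/docker/swarm/worker/tasks.db /tmp/tasks-copy.db
bolt stats /tmp/tasks-copy.db       # key and page statistics, like the output above
bolt pages /tmp/tasks-copy.db | awk '{ print $2 }' | sort | uniq -c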

Docker version 17.06.0-ce, build 02c1d87

@drdaeman

drdaeman commented Sep 9, 2018

Just a note: here's an even safer approach, but it requires some downtime (even longer than just removing tasks.db).

# Install bolt CLI if you don't have it. Run as your normal user.
[ -e ${GOPATH:-$HOME/go}/bin/bolt ] || go get github.com/boltdb/bolt/...

# Become root and get to the database directory
sudo -s # YMMV
cd /var/lib/docker/swarm/worker

# Stop Docker daemon. This is systemd invocation, your init system may vary.
systemctl stop docker

# Compact the database. May take a while.
${GOPATH:-$HOME/go}/bin/bolt compact -o tasks.db.new tasks.db

# Replace old database with a new compacted version
rm tasks.db
mv tasks.db.new tasks.db

# Start Docker daemon. Again, YMMV.
systemctl start docker

This should do the trick if your swarm has tasks scheduled that you don't want to risk, but you can afford to shut down a manager node for a while (especially if it's redundant).

Bolt databases cannot be shared between multiple processes (by design), so while the Docker daemon is alive there is no way of compacting it.

@Davidian1024

Seeing this again. I've since updated to Docker version 18.03.1-ce, build 9ee9f40.
/var/lib/docker/swarm/worker/tasks.db grew to 9.1GB in about 3 weeks on one host.
I'm stopping the service and deleting tasks.db. Fortunately this wasn't production.

@marcwaz

marcwaz commented Nov 4, 2018

Same bug with Docker 18.03.1-ce on Ubuntu 16.
Just 20 containers running, and a tasks.db file which consumes 10 GB...
Do you plan to fix it? Nothing has been done in a year.

@olljanat
Contributor

What kind of workloads are those of you who have seen this running on Swarm?

The only situation I can imagine is that there must be some service which is constantly crashing, so swarm is scheduling new containers to be created all the time (creating new tasks). That can easily be tested by creating a broken service with a command like:
docker service create --name broken --detach --restart-delay 1ms does-not-exists

That situation can easily be avoided by using the --restart-max-attempts 10 parameter on the service create command (and if ten restarts are not enough, you should fix your unstable service(s)).
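
A minimal sketch of that (the service and image names are just placeholders; both flags are standard docker service create options):

docker service create \
  --name my-unstable-service \
  --restart-delay 5s \
  --restart-max-attempts 10 \
  myorg/my-unstable-image:latest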

By the way, you can fix this issue by leaving the swarm and joining again. That will clean up everything under /var/lib/docker/swarm/.
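
For a worker node in a multi-node swarm, that is roughly the following (node name, token and manager address are placeholders; a single-node manager would instead need docker swarm leave --force, which destroys the swarm state):

# on a manager: drain the node so its tasks get rescheduled elsewhere
docker node update --availability drain <node-name>
# on the affected node: leave the swarm (this clears /var/lib/docker/swarm/)
docker swarm leave
# on a manager: print the join command for workers
docker swarm join-token worker
# on the affected node: rejoin with the printed token
docker swarm join --token <token> <manager-ip>:2377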

@lybroman

lybroman commented Dec 28, 2018

Met the same issue on AWS EC2 Swarm nodes. It is continuously eating up disk space. For some long-running tasks, it would be a disaster :(.
Linux version 4.9.36-moby
docker Version: 17.06.0-ce

@olljanat
Contributor

@lybroman to be able to fix this, we first need to understand why only some users are seeing it.

Can you tell us more about what kind of workloads you have? (See my earlier message.)

@lybroman

lybroman commented Jan 3, 2019

@olljanat I am trying to investigate this possibility. BTW, is there any recommended tool to inspect the tasks.db file? That may help me figure out the potential issue.

@olljanat
Contributor

olljanat commented Jan 3, 2019

I was able to look inside it using a general boltdb viewer, but new records are only created there when new tasks are scheduled, so you should be able to see much more useful data with docker service ps <servicename> about how often services are restarted by swarm.

Or if you prefer a UI, you can also use, for example, Portainer to see those tasks.
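
A rough snippet for that, run on a manager node (nothing official, just how I would eyeball it): list each service's shutdown task history together with its error, which grows quickly for services that swarm keeps restarting.

for svc in $(docker service ls --format '{{.Name}}'); do
  echo "== $svc =="
  docker service ps "$svc" --filter desired-state=shutdown --format 'table {{.Name}}\t{{.CurrentState}}\t{{.Error}}'
done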

@korigod

korigod commented Mar 3, 2019

I'm facing this issue too. Yes, swarm restarts some containers every few minutes as recurring tasks. Of course this could be done another way, but unfortunately, as far as I know, swarm still doesn't have a convenient way to schedule recurring containers.

@mcast

mcast commented Nov 12, 2019

I came here for that bug. Thanks @olljanat for the tip,

ubuntu@pettle:~/tmp/oxo$ for srv in $( docker service ls | tail -n+2 | cut -f1 -d' '); do docker service ps $srv; done
ID                  NAME                IMAGE                                                        NODE                DESIRED STATE       CURRENT STATE           ERROR                       PORTS
j8clv0tnvv2w        lcm_csm-api.1       why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Ready               Ready 4 seconds ago                                 
kipcyelb6n8b         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 4 seconds ago    "task: non-zero exit (1)"   
qi57pj7a3jgg         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 11 seconds ago   "task: non-zero exit (1)"   
1gx0jkr3xza3         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 18 seconds ago   "task: non-zero exit (1)"   
uqa6hb2us5m4         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 25 seconds ago   "task: non-zero exit (1)"   
[...dull, like below...]
ID                  NAME                 IMAGE               NODE                DESIRED STATE       CURRENT STATE          ERROR               PORTS
9zo0w0hfgvm2        lcm_postgres.1       postgres:11.2       pettle              Running             Running 3 weeks ago                        
mm4kbbv1esmp         \_ lcm_postgres.1   postgres:11.2       pettle              Shutdown            Complete 3 weeks ago                       
rb7jvw7p40p6         \_ lcm_postgres.1   postgres:11.2       pettle              Shutdown            Complete 3 weeks ago                       

So I have a crashing service lcm_csm-api - not surprising on a dev node, but it has been doing it for weeks and has eaten 1 GiB of disk - and some others which are stable.

Another (brutally stupid) approach to the same problem is sudo strings /var/lib/docker/swarm/worker/tasks.db | sort | uniq -c | sort -rn | less, then seeing what repeats most. Here the most frequent unique bits of text look like text-guids, so I started digging (again with little insight into the workings) with

(sudo find /var/lib/docker*; docker image ls; docker service ls; docker network ls; docker secret ls; docker stack ls; docker info) | grep -5E 'j97puk982g93zpgztsdt00bb8|ndizbqqwbpnb6bs0hcquza8se|mz5ch9ywlyzzuywt3x7hb0bo0|83rrhgmzszw1eiekqcrkv0nzc'

and I see /var/lib/docker/network/files/lb_j97puk982g93zpgztsdt00bb8/, one secret, and what seems to be the swarm itself. I don't know how to find out what the other repeaters are.

Another clue: the crashing service gets a new text-guid each restart, so I looked for those in the .db. I'm using the Advanced String and Chewing Gum approach here,

ubuntu@pettle:~/tmp/oxo$ sudo strings /var/lib/docker/swarm/worker/tasks.db | grep -E "$( echo $( docker service ps lcm_csm-api | tail -n+2 | cut -f1 -d' ' ) | tr ' ' '|' )" | sort | uniq -c 
      1 2kqneq788czkmae219leuln2d
      1 2kqneq788czkmae219leuln2dK
      1 fwkkhfcxlciu4ec287e17kwpp
      1 fwkkhfcxlciu4ec287e17kwppR
      3 krt7024k65foj5i9xw4fh6iw8
      1 krt7024k65foj5i9xw4fh6iw8a
      2 onnwb0y2qsmlqhpri1y14hxji
      1 vkbp58edgsdlxivbkgm8c9tlc
      1 vkbp58edgsdlxivbkgm8c9tlcZ

but it looks like 2 ~ 4 records per restart, every 5~10 seconds.

docker logs <failing-service>.1.<txtguid> with tab completion shows that logs remain available while the docker service ps entry hangs around (a minute or so). Mine is having a database login failure.

I'll keep the database around for a week or two in case it helps with debugging...

@mcast

mcast commented Nov 12, 2019

I forgot version info. The Ubuntu might be a bit of a dog's dinner of partial upgrades.

ubuntu@pettle:~/tmp/oxo$ docker version 
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:23:02 2019
  OS/Arch:          linux/amd64
  Experimental:     false
ubuntu@pettle:~/tmp/oxo$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan
ubuntu@pettle:~/tmp/oxo$ dpkg -l docker*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name             Version                     Architecture Description
+++-================-===========================-============-========================================================
un  docker           <none>                      <none>       (no description available)
ii  docker-ce        5:18.09.7~3-0~ubuntu-bionic amd64        Docker: the open-source application container engine
ii  docker-ce-cli    5:18.09.7~3-0~ubuntu-bionic amd64        Docker CLI: the open-source application container engine
un  docker-engine    <none>                      <none>       (no description available)
un  docker-engine-cs <none>                      <none>       (no description available)
un  docker.io        <none>                      <none>       (no description available)

@zipy124

zipy124 commented Nov 30, 2019

I've also got this issue, unfortunately on prod with a 16GB tasks.db.

Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May 4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May 4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false

On Ubuntu 16.04 Xenial

@mcast

mcast commented Dec 2, 2019 via email

@yuklia

yuklia commented Aug 10, 2020

Hello! I have the same issue in one of our environments.
Preconditions:

Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:54:09 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:40 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • 4 stacks
  • around 30 services
  • 2 services are running in global mode with Restart delay: 1h
  • Manager Status - Leader
  • No of Nodes - One
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.2 LTS
Release:	18.04
Codename:	bionic

I guess the restart delay is the issue.

@olljanat
Contributor

olljanat commented Aug 11, 2020

Please note that a fix for this bug was released as part of version 19.03.9: https://docs.docker.com/engine/release-notes/#19039

@thaJeztah this issue can be closed.

@thaJeztah
Member

Thanks; yes, it looks like this was fixed through #2938.
