
Swarm's tasks.db takes up lots of disk space #2367

Closed
joelchen opened this issue Sep 6, 2017 · 24 comments

Comments

@joelchen

joelchen commented Sep 6, 2017

Servers running a couple of Docker containers on Swarm have tasks.db files using a few GB of disk space. The containers are deployed in global mode on 2 servers, both Swarm managers, and their tasks.db files gradually fill up 8 GB of each server's disk space. The Docker version is 17.06.1-ce.

What could be taking up such enormous space in tasks.db? How do I prevent tasks.db from growing?

@nishanttotla
Contributor

@joelchen can you provide more information about the services you're running?

@christopherobin

@nishanttotla I just had the same issue happen; the node was a worker running about 10 stacks, making for 30+ services.

The Docker process was using about 10GB of memory, so I decided to restart it. I went with the following procedure:

  1. Set the node to drain mode
  2. Restart docker

Upon restarting I noticed two things:

  • The node would not connect to the Swarm (running docker node ls on the manager showed it as down)
  • It would try to spawn very old versions of some stacks; for example, we are running gitlab/gitlab-ce:10.1.3-ce.0 but it started creating containers using tag 9.5.3-ce.0

After digging around I thought the task DB was probably corrupted and checked /var/lib/docker/swarm/worker, only to find that tasks.db was 8.8GB in size. Since I needed the cluster up, I went with the dangerous decision of stopping Docker, renaming the worker folder, and restarting the node.
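
Roughly what that looked like, assuming systemd and the default data root (your paths and init system may differ, and this throws away the node's local worker state, hence "dangerous"):

systemctl stop docker
mv /var/lib/docker/swarm/worker /var/lib/docker/swarm/worker.bak
systemctl start docker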

It then proceeded to properly join the swarm and start the correct versions of the containers, but now I have ghost tasks that are marked as being on that node even though the node has no information about them.

I saved a copy of the tasks.db and I'm going to try taking a look at the data in it; I will post more information here once this is done.

@christopherobin

bolt check tasks.db

page 2282443: unreachable unfreed
1 errors found
invalid value

bolt stats tasks.db

Aggregate statistics for 1 buckets

Page count statistics
	Number of logical branch pages: 18
	Number of physical branch overflow pages: 0
	Number of logical leaf pages: 1257
	Number of physical leaf overflow pages: 222
Tree statistics
	Number of keys/value pairs: 11531
	Number of levels in B+tree: 5
Page size utilization
	Bytes allocated for physical branch pages: 73728
	Bytes actually used for branch data: 34195 (46%)
	Bytes allocated for physical leaf pages: 6057984
	Bytes actually used for leaf data: 3324514 (54%)
Bucket statistics
	Total number of buckets: 5321
	Total number on inlined buckets: 4873 (91%)
	Bytes used for inlined buckets: 1018794 (30%)

bolt pages tasks.db | awk '{ print $2 }' | sort | uniq -c

      1 ==========
      1 TYPE
     19 branch
2276487 free
      1 freelist
   1259 leaf
      2 meta

bolt compact -o tasks.compressed.db tasks.db && ls -l tasks.*

9365671936 -> 8388608 bytes (gain=1116.48x)
-rw-r--r-- 1 crobin crobin    8388608 Nov 13 15:35 tasks.compressed.db
-rw-r--r-- 1 crobin crobin 9365671936 Nov 13 15:33 tasks.db

So it seems to be an issue with docker never compacting the database?

@donswa

donswa commented Feb 27, 2018

@nishanttotla We are also seeing the same issue. /var/lib/docker/swarm/worker/tasks.db is 5 GB.
docker version - 17.06.1-ce
Manager Status - Leader
No of Nodes - One
Is there any way to recover the space? Could you please advise on cleanup steps?

@tanmng

tanmng commented May 8, 2018

Any updates on this, guys? I would like to recover some hard disk space on a small experimental Docker swarm I'm running right now.

@pouicr

pouicr commented Aug 1, 2018

Hi,

Same for me... can we just stop the daemon, remove this file, and restart?

@pouicr

pouicr commented Aug 1, 2018

Answering my own question: it works.

@Davidian1024

I believe I'm running into this issue as well. /var/lib/docker/swarm/worker/tasks.db has grown to 12GB and it seems that it's never going to stop.

I'd rather not stop the daemon, delete the tasks.db file, and then start the daemon again, if possible.

Is it possible to determine what's filling up this tasks database? Could the containers that I'm running be leaving stale tasks in the database?

The bolt commands that @christopherobin used above are a bit of a mystery to me. Some Google searches have me thinking that this tasks.db file is a Bolt database.
https://github.com/boltdb/bolt
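
For what it's worth, here is my guess at how to reproduce those commands on a copy of the file (installing the bolt CLI assumes a standard Go setup; I'd work on a copy since the live daemon keeps the database locked, and the copy may be slightly inconsistent while the daemon is writing):

go get github.com/boltdb/bolt/...   # installs the bolt CLI into $GOPATH/bin
sudo cp /var/lib/docker/swarm/worker/tasks.db /tmp/tasks-copy.db
bolt stats /tmp/tasks-copy.db       # key and page statistics, like the output above
bolt pages /tmp/tasks-copy.db | awk '{ print $2 }' | sort | uniq -c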

Docker version 17.06.0-ce, build 02c1d87

@drdaeman

drdaeman commented Sep 9, 2018

Just a note: here's an even safer approach, but it requires some downtime (even longer than just removing tasks.db).

# Install bolt CLI if you don't have it. Run as your normal user.
[ -e ${GOPATH:-$HOME/go}/bin/bolt ] || go get github.com/boltdb/bolt/...

# Become root and get to the database directory
sudo -s # YMMV
cd /var/lib/docker/swarm/worker

# Stop Docker daemon. This is systemd invocation, your init system may vary.
systemctl stop docker

# Compact the database. May take a while.
${GOPATH:-$HOME/go}/bin/bolt compact -o tasks.db.new tasks.db

# Replace old database with a new compacted version
rm tasks.db
mv tasks.db.new tasks.db

# Start Docker daemon. Again, YMMV.
systemctl start docker

This should do the trick if your swarm has tasks scheduled that you don't want to risk, but you can afford to shut down a manager node for a while (especially if it's redundant).

Bolt databases cannot be shared between multiple processes (by design), so while the Docker daemon is alive there is no way of compacting it.

@Davidian1024

Seeing this again. I've since updated to Docker version 18.03.1-ce, build 9ee9f40.
/var/lib/docker/swarm/worker/tasks.db grew to 9.1GB in about 3 weeks on one host.
I'm stopping the service and deleting tasks.db. Fortunately this wasn't production.

@marcwaz

marcwaz commented Nov 4, 2018

Same bug with Docker 18.03.1-ce on Ubuntu 16.
Just 20 containers running, and a tasks.db file which consumes 10 GB...
Do you plan to fix it? Nothing has been done in a year.

@olljanat
Contributor

What kind of workloads are those of you who have seen this running on Swarm?

The only situation I can imagine is that there must be some service which is constantly crashing, so swarm is scheduling new containers to be created all the time (creating new tasks). That can easily be tested by creating a broken service with a command like:
docker service create --name broken --detach --restart-delay 1ms does-not-exists

That situation can easily be avoided by using the --restart-max-attempts 10 parameter on the service create command (and if ten restarts are not enough, you should fix your unstable service(s)).
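
A minimal sketch of that (the service and image names are just placeholders; both flags are standard docker service create options):

docker service create \
  --name my-unstable-service \
  --restart-delay 5s \
  --restart-max-attempts 10 \
  myorg/my-unstable-image:latest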

By the way, you can fix this issue by leaving the swarm and joining again. That will clean up everything under /var/lib/docker/swarm/.
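
For a worker node in a multi-node swarm, that is roughly the following (node name, token and manager address are placeholders; a single-node manager would instead need docker swarm leave --force, which destroys the swarm state):

# on a manager: drain the node so its tasks get rescheduled elsewhere
docker node update --availability drain <node-name>
# on the affected node: leave the swarm (this clears /var/lib/docker/swarm/)
docker swarm leave
# on a manager: print the join command for workers
docker swarm join-token worker
# on the affected node: rejoin with the printed token
docker swarm join --token <token> <manager-ip>:2377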

@lybroman

lybroman commented Dec 28, 2018

Met the same issue on AWS EC2 Swarm nodes. It is continuously eating up disk space. For some long-running tasks, it would be a disaster :(.
Linux version 4.9.36-moby
docker Version: 17.06.0-ce

@olljanat
Contributor

@lybroman to be able to fix this, we first need to understand why only some users are seeing it.

Can you tell us more about what kind of workloads you have? (See my earlier message.)

@lybroman

lybroman commented Jan 3, 2019

@olljanat I am trying to investigate this possibility. BTW, is there any recommended tool to inspect the tasks.db file? That may help me figure out the potential issue.

@olljanat
Contributor

olljanat commented Jan 3, 2019

I was able to look inside it using a general boltdb viewer, but new records are only created there when new tasks are scheduled, so you should be able to see much more useful data with docker service ps <servicename> about how often services are restarted by swarm.

Or if you prefer a UI, you can also use, for example, Portainer to see those tasks.
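
A rough snippet for that, run on a manager node (nothing official, just how I would eyeball it): list each service's shutdown task history together with its error, which grows quickly for services that swarm keeps restarting.

for svc in $(docker service ls --format '{{.Name}}'); do
  echo "== $svc =="
  docker service ps "$svc" --filter desired-state=shutdown --format 'table {{.Name}}\t{{.CurrentState}}\t{{.Error}}'
done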

@korigod

korigod commented Mar 3, 2019

I'm facing this issue too. Yes, swarm restarts some containers every few minutes as recurring tasks. Of course this could be done another way, but unfortunately, as far as I know, swarm still doesn't have a convenient way to schedule recurring containers.

@mcast

mcast commented Nov 12, 2019

I came here for that bug. Thanks @olljanat for the tip,

ubuntu@pettle:~/tmp/oxo$ for srv in $( docker service ls | tail -n+2 | cut -f1 -d' '); do docker service ps $srv; done
ID                  NAME                IMAGE                                                        NODE                DESIRED STATE       CURRENT STATE           ERROR                       PORTS
j8clv0tnvv2w        lcm_csm-api.1       why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Ready               Ready 4 seconds ago                                 
kipcyelb6n8b         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 4 seconds ago    "task: non-zero exit (1)"   
qi57pj7a3jgg         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 11 seconds ago   "task: non-zero exit (1)"   
1gx0jkr3xza3         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 18 seconds ago   "task: non-zero exit (1)"   
uqa6hb2us5m4         \_ lcm_csm-api.1   why.docker.cgp-wr.sanger.ac.uk:5000/csm-api/develop:latest   pettle              Shutdown            Failed 25 seconds ago   "task: non-zero exit (1)"   
[...dull, like below...]
ID                  NAME                 IMAGE               NODE                DESIRED STATE       CURRENT STATE          ERROR               PORTS
9zo0w0hfgvm2        lcm_postgres.1       postgres:11.2       pettle              Running             Running 3 weeks ago                        
mm4kbbv1esmp         \_ lcm_postgres.1   postgres:11.2       pettle              Shutdown            Complete 3 weeks ago                       
rb7jvw7p40p6         \_ lcm_postgres.1   postgres:11.2       pettle              Shutdown            Complete 3 weeks ago                       

So I have a crashing service lcm_csm-api - not surprising on a dev node, but it has been doing it for weeks and has eaten 1 GiB of disk - and some others which are stable.

Another (brutally stupid) approach to the same problem is sudo strings /var/lib/docker/swarm/worker/tasks.db | sort | uniq -c | sort -rn | less, then seeing what repeats most. Here the most frequent unique bits of text look like text-guids, so I started digging (again with little insight into the workings) with

(sudo find /var/lib/docker*; docker image ls; docker service ls; docker network ls; docker secret ls; docker stack ls; docker info) | grep -5E 'j97puk982g93zpgztsdt00bb8|ndizbqqwbpnb6bs0hcquza8se|mz5ch9ywlyzzuywt3x7hb0bo0|83rrhgmzszw1eiekqcrkv0nzc'

and I see /var/lib/docker/network/files/lb_j97puk982g93zpgztsdt00bb8/, one secret, and what seems to be the swarm itself. I don't know how to find out what the other repeaters are.

Another clue: the crashing service gets a new text-guid each restart, so I looked for those in the .db. I'm using the Advanced String and Chewing Gum approach here,

ubuntu@pettle:~/tmp/oxo$ sudo strings /var/lib/docker/swarm/worker/tasks.db | grep -E "$( echo $( docker service ps lcm_csm-api | tail -n+2 | cut -f1 -d' ' ) | tr ' ' '|' )" | sort | uniq -c 
      1 2kqneq788czkmae219leuln2d
      1 2kqneq788czkmae219leuln2dK
      1 fwkkhfcxlciu4ec287e17kwpp
      1 fwkkhfcxlciu4ec287e17kwppR
      3 krt7024k65foj5i9xw4fh6iw8
      1 krt7024k65foj5i9xw4fh6iw8a
      2 onnwb0y2qsmlqhpri1y14hxji
      1 vkbp58edgsdlxivbkgm8c9tlc
      1 vkbp58edgsdlxivbkgm8c9tlcZ

but it looks like 2 ~ 4 records per restart, every 5~10 seconds.

docker logs <failing-service>.1.<txtguid> with tab completion shows that logs remain available while the docker service ps entry hangs around (a minute or so). Mine is having a database login failure.

I'll keep the database around for a week or two in case it helps with debugging...

@mcast

mcast commented Nov 12, 2019

I forgot version info. The Ubuntu might be a bit of a dog's dinner of partial upgrades.

ubuntu@pettle:~/tmp/oxo$ docker version 
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:23:02 2019
  OS/Arch:          linux/amd64
  Experimental:     false
ubuntu@pettle:~/tmp/oxo$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan
ubuntu@pettle:~/tmp/oxo$ dpkg -l docker*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name             Version                     Architecture Description
+++-================-===========================-============-========================================================
un  docker           <none>                      <none>       (no description available)
ii  docker-ce        5:18.09.7~3-0~ubuntu-bionic amd64        Docker: the open-source application container engine
ii  docker-ce-cli    5:18.09.7~3-0~ubuntu-bionic amd64        Docker CLI: the open-source application container engine
un  docker-engine    <none>                      <none>       (no description available)
un  docker-engine-cs <none>                      <none>       (no description available)
un  docker.io        <none>                      <none>       (no description available)

@zipy124

zipy124 commented Nov 30, 2019

I've also got this issue, unfortunately on prod with a 16GB tasks.db.

Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May 4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May 4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false

On Ubuntu 16.04 Xenial

@mcast

mcast commented Dec 2, 2019 via email

@yuklia

yuklia commented Aug 10, 2020

Hello! I have the same issue in one of our environments.
Preconditions:

Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:54:09 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:40 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • 4 stacks
  • around 30 services
  • 2 services are running in global mode with Restart delay: 1h
  • Manager Status - Leader
  • No of Nodes - One
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.2 LTS
Release:	18.04
Codename:	bionic

I guess the restart delay is the issue.

@olljanat
Contributor

olljanat commented Aug 11, 2020

Please note that a fix for this bug was released as part of version 19.03.9: https://docs.docker.com/engine/release-notes/#19039

@thaJeztah this issue can be closed.

@thaJeztah
Member

Thanks; yes, it looks like this was fixed through #2938.
