
Fix leaking tasks.db #2938

Merged: 1 commit merged into moby:master on Mar 25, 2020

Conversation

dperny (Collaborator) commented Mar 23, 2020

- What I did

For a long time, it's been a known fault that tasks.db grows out of control. The reason is that this database is only cleaned up, and old tasks removed, during initialization.

Previously, when an assignment was removed from a worker, the task was merely marked as removed in the tasks database. However, an assignment is only removed from a worker once the task itself has been removed on the manager, so the worker no longer needs to keep track of that task.

- How I did it

Instead of marking a task as no longer assigned, we now simply delete it from the database when its assignment is removed (a rough sketch of the idea follows this description).

I'm not 100% sure of everything the task database is responsible for, or why it needs to be persisted, so this change is targeted to have minimal impact on the system.

This is sort of like @olljanat's fix in #2917, but instead of running a routine in a loop, we just delete tasks on demand.

- How to test it

Updates the worker test to reflect the new behavior.

- Description for the changelog

Fix longstanding issue with tasks.db growing out of control.
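
For illustration only (this is not the actual swarmkit diff): a minimal sketch of the idea, assuming the worker keeps its task records in a BoltDB bucket keyed by task ID. The bucket name and helper function below are hypothetical; the real layout lives in swarmkit's agent/storage package.

```go
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

// tasksBucket is a hypothetical bucket name used for this sketch only.
var tasksBucket = []byte("tasks")

// removeTaskAssignment sketches the new behavior: when the manager withdraws
// an assignment, the task record is deleted outright instead of being marked
// "no longer assigned", which is what previously left stale entries behind.
func removeTaskAssignment(db *bolt.DB, taskID string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(tasksBucket)
		if b == nil {
			// Nothing has been stored yet, so there is nothing to delete.
			return nil
		}
		return b.Delete([]byte(taskID))
	})
}

func main() {
	db, err := bolt.Open("tasks.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Seed one fake assignment so the delete has something to act on.
	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(tasksBucket)
		if err != nil {
			return err
		}
		return b.Put([]byte("task-1"), []byte("serialized task would go here"))
	}); err != nil {
		log.Fatal(err)
	}

	if err := removeTaskAssignment(db, "task-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("task-1 deleted from tasks.db")
}
```

The old behavior would instead rewrite the record with an "unassigned" marker, so the key and the serialized task stayed in the file indefinitely.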

dperny (Collaborator, Author) commented Mar 23, 2020

/cc @cpuguy83

cpuguy83 (Member) commented:

--- FAIL: TestWorkerAssign (0.01s)

    worker_test.go:185: 

        	Error Trace:	worker_test.go:185

        	Error:      	Not equal: 

        	            	expected: []*api.Task(nil)

        	            	actual  : []*api.Task{(*api.Task)(0xc0001bc5a0)}

        	            	

        	            	Diff:

        	            	--- Expected

        	            	+++ Actual

        	            	@@ -1,2 +1,4 @@

        	            	-([]*api.Task) <nil>

        	            	+([]*api.Task) (len=1) {

        	            	+ (*api.Task)(&Task{ID:,Meta:Meta{Version:Version{Index:0,},CreatedAt:(timestamp: nil types.Timestamp),UpdatedAt:(timestamp: nil types.Timestamp),},Spec:TaskSpec{Runtime:<nil>,Resources:nil,Restart:nil,Placement:nil,LogDriver:nil,Networks:[],ForceUpdate:0,ResourceReferences:[],},ServiceID:,Slot:0,NodeID:,Annotations:Annotations{Name:,Labels:map[string]string{},Indices:[],},ServiceAnnotations:Annotations{Name:,Labels:map[string]string{},Indices:[],},Status:TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:NEW,Message:,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),},DesiredState:NEW,Networks:[],Endpoint:nil,LogDriver:nil,SpecVersion:nil,AssignedGenericResources:[],JobIteration:nil,})

        	            	+}

        	            	 

        	Test:       	TestWorkerAssign

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Assign" len(assignments)=4

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=2

time="2020-03-23T16:49:32Z" level=debug msg=assigned task.desiredstate=NEW task.id=task-1

time="2020-03-23T16:49:32Z" level=debug msg="state changed" service.id= state.desired=NEW state.transition="NEW->ACCEPTED" task.id=task-1

time="2020-03-23T16:49:32Z" level=info msg="status update received" service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:ACCEPTED,Message:accepted,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-1

time="2020-03-23T16:49:32Z" level=debug msg=assigned task.desiredstate=NEW task.id=task-2

time="2020-03-23T16:49:32Z" level=debug msg="state changed" service.id= state.desired=NEW state.transition="NEW->ACCEPTED" task.id=task-2

time="2020-03-23T16:49:32Z" level=info msg="status update received" service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:ACCEPTED,Message:accepted,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-2

time="2020-03-23T16:49:32Z" level=info msg="status update received" module=taskmanager service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:NEW,Message:,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-2

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Assign" len(assignments)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=0

time="2020-03-23T16:49:32Z" level=info msg="status update received" module=taskmanager service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:NEW,Message:,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Assign" len(assignments)=3

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=1

time="2020-03-23T16:49:32Z" level=debug msg=assigned task.desiredstate=NEW task.id=task-1

time="2020-03-23T16:49:32Z" level=debug msg="state changed" service.id= state.desired=NEW state.transition="NEW->ACCEPTED" task.id=task-1

time="2020-03-23T16:49:32Z" level=info msg="status update received" service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:ACCEPTED,Message:accepted,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-1

time="2020-03-23T16:49:32Z" level=info msg="status update received" module=taskmanager service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:NEW,Message:,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Update" len(assignments)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Update" len(assignments)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=1

time="2020-03-23T16:49:32Z" level=debug msg=assigned task.desiredstate=NEW task.id=task-1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Update" len(assignments)=3

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=0 len(updatedSecrets)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=0 len(updatedConfigs)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=0 len(updatedTasks)=1

time="2020-03-23T16:49:32Z" level=debug msg=assigned task.desiredstate=NEW task.id=task-2

time="2020-03-23T16:49:32Z" level=debug msg="state changed" service.id= state.desired=NEW state.transition="NEW->ACCEPTED" task.id=task-2

time="2020-03-23T16:49:32Z" level=info msg="status update received" service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:ACCEPTED,Message:accepted,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-2

time="2020-03-23T16:49:32Z" level=info msg="status update received" module=taskmanager service.id= status="&TaskStatus{Timestamp:(timestamp: nil types.Timestamp),State:NEW,Message:,Err:,RuntimeStatus:<nil>,PortStatus:nil,AppliedBy:,AppliedAt:(timestamp: nil types.Timestamp),}" task.id=task-2

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Update" len(assignments)=5

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=1 len(updatedSecrets)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=1 len(updatedConfigs)=1

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=1 len(updatedTasks)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).Update" len(assignments)=6

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileSecrets" len(removedSecrets)=2 len(updatedSecrets)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileConfigs" len(removedConfigs)=2 len(updatedConfigs)=0

time="2020-03-23T16:49:32Z" level=debug msg="(*worker).reconcileTaskState" len(removedTasks)=2 len(updatedTasks)=0

time="2020-03-23T16:49:32Z" level=error msg="error removing de-assigned task task-1" error="bucket not found" task.id=task-1

This seems like it makes sense. Has this been tested in the case that a worker is promoted to manager?

dperny (Collaborator, Author) commented Mar 23, 2020

Just saw that failure. Can't reproduce it locally, so I'm not sure what's going on.

There shouldn't be any difference in a worker promoted to a manager. The databases for each component are completely separate.

Thomas131 commented:
If I am correct, this pull request won't shrink a huge tasks.db left over from previous versions; it will just prevent the file from continuing to grow. Is that correct?

cpuguy83 (Member) commented:

@Thomas131 That is correct.
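
For context on why the file cannot shrink on its own: Bolt-style databases never give disk space back in place; pages freed by deletes are only reused for later writes. Reclaiming space from an already-bloated tasks.db means rewriting the file, for example with bbolt's offline Compact helper. The sketch below is purely illustrative (neither swarmkit nor the engine does this for you), the file paths are placeholders, and the daemon must not be using the file while it runs.

```go
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

// Offline compaction: copy every bucket and key/value pair from the bloated
// file into a fresh one, which ends up sized to the live data only.
func main() {
	src, err := bolt.Open("tasks.db", 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := bolt.Open("tasks.compacted.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// txMaxSize of 0 means no per-transaction size limit, so everything is
	// copied in a single transaction.
	if err := bolt.Compact(dst, src, 0); err != nil {
		log.Fatal(err)
	}
}
```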

agent/worker.go: four review threads (outdated, resolved)
dperny force-pushed the fix-leaking-task-db branch from cb48309 to af58e4d on March 23, 2020, 19:18
dperny (Collaborator, Author) commented Mar 23, 2020

@Thomas131 I'm not Officially Endorsing this approach, but I've been told that Nothing Breaks if you delete tasks.db.

Of course, the Official Advice would be to leave the cluster, delete tasks.db, and then rejoin the cluster as a "new" node.

dperny force-pushed the fix-leaking-task-db branch from af58e4d to 4b3e5a7 on March 23, 2020, 20:06
codecov bot commented Mar 23, 2020

Codecov Report

Merging #2938 into master will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2938      +/-   ##
==========================================
+ Coverage   61.71%   61.78%   +0.07%     
==========================================
  Files         142      142              
  Lines       22994    22998       +4     
==========================================
+ Hits        14190    14210      +20     
+ Misses       7309     7297      -12     
+ Partials     1495     1491       -4     

dperny (Collaborator, Author) commented Mar 23, 2020

I found the race condition. I "fixed" it, but I'm not powerful enough to actually fix it the right way without the whole house of cards tumbling down.

dperny force-pushed the fix-leaking-task-db branch from 4b3e5a7 to b4dd558 on March 23, 2020, 20:27
agent/worker.go: review thread (outdated, resolved)
dperny force-pushed the fix-leaking-task-db branch from b4dd558 to 8372b96 on March 24, 2020, 14:54
agent/worker.go: review thread (outdated, resolved)
dperny force-pushed the fix-leaking-task-db branch from 8372b96 to 3b3a2ea on March 24, 2020, 15:59
For a long time, it's been a known fault that tasks.db grows out of
control. The reason is that this database is only cleaned up, and old
tasks removed, during initialization.

When an assignment to a worker is removed, previously, the assignment
was just marked as removed in the tasks database. However, an assignment
is only removed from a worker when the task is removed from the manager.
The worker does not need to continue to keep track of the task.

Instead of marking a task as no longer assigned, when a task is removed
as an assignment, we'll simply delete it from the database.

I'm not 100% sure of what all the task database is responsible for, or
why it needs to be persisted, so this change is targeted to have the
minimal impact on the system.

Signed-off-by: Drew Erny <derny@mirantis.com>
dperny force-pushed the fix-leaking-task-db branch from 3b3a2ea to 585521d on March 24, 2020, 16:56
thaJeztah (Member) commented:

One failure; is it a known flaky test, or actually related?

--- FAIL: TestUpdaterRollback (6.42s)
    --- FAIL: TestUpdaterRollback/pause/monitor_set/spec_version_set (3.59s)
        update_test.go:272: service was updated
        testutils.go:42: 
            	Error Trace:	testutils.go:42
            	            				update_test.go:343
            	            				update_test.go:21
            	Error:      	no task creation
            	Test:       	TestUpdaterRollback/pause/monitor_set/spec_version_set
FAIL

dperny (Collaborator, Author) commented Mar 25, 2020

Known flaky. There are two or three like that.

dperny merged commit ebe39a3 into moby:master on Mar 25, 2020
dperny added a commit to dperny/docker that referenced this pull request Mar 26, 2020
Bumps swarmkit vendoring. Includes moby/swarmkit#2938, which fixes
tasks.db growing out of control on worker nodes.

Signed-off-by: Drew Erny <derny@mirantis.com>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Mar 26, 2020
Bumps swarmkit vendoring. Includes moby/swarmkit#2938, which fixes
tasks.db growing out of control on worker nodes.

Signed-off-by: Drew Erny <derny@mirantis.com>
Upstream-commit: 1dbf34f3aab290cb5a9246f54d66d84d59a27cb6
Component: engine