SAN-4866 Cleanup Old Instances #107
Conversation
    log.info('CleanupInstances')

    return Promise.using(mongodbHelper(), function (mongoClient) {
      return mongoClient.fetchInstancesAsync({
To whoever reviews this: PLEASE PLEASE PLEASE make sure you think through this query properly. I don't want to delete the wrong things on prod.
        'contextVersion.created': { $lt: moment().subtract(7, 'days').toDate() },
        $or: [
          { isolated: { $exists: false } },
          { isIsolationGroupMaster: true }
Does deleting the isolation master delete the kids too?
lib/tasks/instances/cleanup.js (outdated)
      })
      .then(function (instances) {
        log.info({
          instanceCount: instances.length
Could we put some more data here? Instance id? Context version id, maybe? When I look at the logs for an instance getting deleted, it would be nice to be able to see how the instance got deleted.
Right, but here's the problem: if I delete 500 instances this way, how would you find your particular instance in that list? Do you want me to print a list of instance IDs?
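A minimal sketch of what that could look like, assuming the instance documents expose a Mongo `_id` (the field name is an assumption, it is not shown in this diff):

    log.info({
      instanceCount: instances.length,
      // _id is assumed here; any stable identifier would do
      instanceIds: instances.map(function (instance) { return instance._id })
    }, 'Found instances to cleanup')

That keeps the summary log while making each deleted instance traceable back to this task.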
lib/tasks/instances/cleanup.js (outdated)
        })
      }, 'Found instances to cleanup')
      instances.forEach(function (instance) {
        rabbitmq.publishTask('instance.delete', {
@anandkumarpatel is this the way we are going? khronos will publish a task that the API will pick up. Before, we were talking about services emitting events, not tasks, to avoid coupling. What is your latest opinion here?
Shouldn't we emit an instance.expired event here?
Interesting, so we'd make a new instance.expired event, and that would get picked up by the API, which would then emit instance.delete?
@anton you're right, this should emit an event. instance.expired sounds good.
Reasoning: tasks should only be used inside an application. When you emit a task from one service for another to handle, that couples the services. To send data to a different application we should use events. Using events makes flows harder to follow, but it greatly improves our resilience and helps us write fault-tolerant / graceful-degradation code up front. Events do not have a direct action the way a task does. A task says "do this now"; when you emit an event, you don't know what is going to consume it or what it is going to do with it. It might seem counter-intuitive to make things more complex, but this actually helps strengthen our system. Khronos just knows that this instance is expired; it does not know what needs to be updated. Say we need to update the network, add a PR comment, and send an email.
If we used tasks, there would be 3 tasks here, and the person writing Khronos would have to know how each service works, making it harder to just update Khronos. By making this an event, Khronos does not need to know what is going to happen; it just knows "I checked this at this time and it is expired."
The latest ponos actually enforces this behavior by not allowing tasks to be used across applications (it auto-namespaces them).
TLDR: make this an event, handle the delete in the API.
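A rough sketch of the contrast being described, assuming a ponos-style rabbitmq client that exposes both `publishTask` and `publishEvent` (the payload shape is illustrative, not taken from this diff):

    // Task: a direct instruction to one specific consumer ("do this now").
    rabbitmq.publishTask('instance.delete', { instanceId: instance._id }) // payload illustrative

    // Event: a statement of fact; the API, network updater, PR commenter, etc.
    // can each subscribe and decide for themselves what to do with it.
    rabbitmq.publishEvent('instance.expired', { instanceId: instance._id }) // payload illustrative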
lib/tasks/instances/cleanup.js (outdated)

    function CleanupInstances () {
      var log = logger.child({
        tx: true
nit: we don't need tx anymore, it's already passed in the logger
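In other words the child logger can drop that field; a small sketch, assuming the base logger already carries the transaction context (the `task` field name is illustrative):

    // tx comes from the shared logger, so the child only adds task context.
    var log = logger.child({ task: 'CleanupInstances' })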
Fix the test, remove tx, and this looks good to me.
test/unit/tasks/instances/cleanup.js (outdated)
    .then(function () {
      sinon.assert.calledTwice(rabbitmq.publishTask)
      sinon.assert.calledWith(
        rabbitmq.publishTask,
publishEvent
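That is, once the task emits an event, the assertions would target the event publisher instead; a sketch, assuming the rabbitmq client gains a `publishEvent` method and the payload carries an instance id (both are assumptions):

    // Hedged sketch of the updated assertion; method name and payload are assumed.
    sinon.assert.calledTwice(rabbitmq.publishEvent)
    sinon.assert.calledWith(
      rabbitmq.publishEvent,
      'instance.expired',
      sinon.match.has('instanceId')
    )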
To test this I went ahead and did a dry run to find the list of all the instances we were going to delete, and tried to see if any of them were incorrect. To do this I ran the query and looked at the resulting instances. The goal is to find instances whose code has not been touched in the past week. The biggest problem with that is that, since we're looking at contextVersions and not commit metadata, what we're really keying off is whether a container has been migrated or re-built in the last week. This is actually okay, since they took action on that container. Eventually we'll track "usage", which is defined as being viewed in navi/commit or viewed in runnable-angular.
When run in production this will remove approximately 940 instances and spare 270 that have been built in the last week...
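For reference, a dry run like that could look roughly like the following in the mongo shell, mirroring the criteria visible in the diff (the `instances` collection name is an assumption):

    // Inspect what would be deleted, without deleting anything.
    db.instances.find({
      'contextVersion.created': { $lt: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) },
      $or: [
        { isolated: { $exists: false } },
        { isIsolationGroupMaster: true }
      ]
    }).count()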
Added logic to clean up any instance that is older than 7 days (see the sketch after this list).
Anything that:
- Is not a master pod.
- Has a contextVersion created date older than 7 days.
- Is not isolated OR is the isolation group master.
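Piecing the diff fragments above together, the task might look roughly like the sketch below once the review feedback is applied (emit instance.expired instead of a task, drop tx from the child logger). This is a reconstruction under assumptions, not the merged code: the require paths, the masterPod field name, and the event payload are guesses.

    'use strict'

    var moment = require('moment')
    var Promise = require('bluebird')

    var logger = require('../../logger')          // path assumed
    var mongodbHelper = require('../../mongodb')  // path assumed
    var rabbitmq = require('../../rabbitmq')      // path assumed

    module.exports = CleanupInstances

    // Finds instances whose contextVersion is older than 7 days and announces
    // that they have expired; another service decides what deletion means.
    function CleanupInstances () {
      var log = logger.child({ task: 'CleanupInstances' }) // tx removed per review
      log.info('CleanupInstances')

      return Promise.using(mongodbHelper(), function (mongoClient) {
        return mongoClient.fetchInstancesAsync({
          masterPod: false, // "is not a master pod" per the description (field name assumed)
          'contextVersion.created': { $lt: moment().subtract(7, 'days').toDate() },
          $or: [
            { isolated: { $exists: false } },
            { isIsolationGroupMaster: true }
          ]
        })
          .then(function (instances) {
            log.info({
              instanceCount: instances.length,
              instanceIds: instances.map(function (i) { return i._id })
            }, 'Found instances to cleanup')
            instances.forEach(function (instance) {
              // Event, not task, per the review thread; payload shape is assumed.
              rabbitmq.publishEvent('instance.expired', { instanceId: instance._id })
            })
          })
      })
    }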