Add One-pager for agentless helix #11316
I have a proof of concept in this PR https://github.com/dotnet/arcade/pull/10342. The agentless job can be seen running here https://dev.azure.com/dnceng-public/public/_build/results?buildId=55561&view=logs&j=830c6850-7aa7-5384-6c90-1a1a71217f4b.

## Implementation details
This solution requires a web api endpoint on our server that handles the request from the agentless job, then a service running inside our cluster that starts and monitors jobs based on these requests. This service will need to fixup the job payloads with test run information, then monitor execution and report status back to azdo.
Is this service basically going to be similar to the "DataMigrationService" (getting the same notifications from the SQL processor and doing... stuff)?
My PoC just uses the helix api like the build agents do, and that seems simplest and easiest, but it can use notifications if we desire that too.
The notifications means it doesn't need to be a polling service, which will reduce the load on our service a TON, since there won't be dozens or hundreds of runs having to constantly ping our server for it to do all the math just to say "no, not done yet". It would also, presumably, be more real time, since you'd get the notification immediately, and wouldn't have to wait for the next poll interval.
It also wouldn't need to be stateful at that point... it wouldn't need to track what things are "live" and poll them all, it would just respond to the notifications and take the appropriate actions. And statelessness is always a win in service land.
Well, it still 100% needs the state: that is what holds the info about which azdo job to report back to, including the access token. I'd rather not throw more state into the helix jobs. It can use the notifications, though. It should still report in-progress status to the build.
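To make the trade-off in this thread concrete, here is a minimal sketch of a service that reacts to notifications rather than polling, while still keeping the per-job state needed to report back to azdo. Every name in it (`TrackedJob`, `JobTracker`, the callback signatures, the notification methods) is hypothetical and chosen for illustration; none of it is taken from the PoC or from the Helix API.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedJob:
    """State the service keeps for one Helix job (all field names are assumptions)."""
    collection_uri: str
    project_id: str
    build_id: int
    access_token: str
    finished_work_items: set = field(default_factory=set)


class JobTracker:
    """Responds to notifications instead of polling for job status."""

    def __init__(self, report_progress, report_completion):
        # The callbacks stand in for whatever calls report back to azdo.
        self._report_progress = report_progress
        self._report_completion = report_completion
        self._jobs = {}

    def register(self, helix_job_id, job):
        # Called when the agentless job's request arrives.
        self._jobs[helix_job_id] = job

    def on_work_item_finished(self, helix_job_id, work_item):
        job = self._jobs.get(helix_job_id)
        if job is None:
            return  # not a job this service is tracking
        job.finished_work_items.add(work_item)
        self._report_progress(job, f"work item {work_item} finished")

    def on_job_finished(self, helix_job_id, passed):
        # Drop the state once the final status has been reported.
        job = self._jobs.pop(helix_job_id, None)
        if job is None:
            return
        self._report_completion(job, passed)
```

In this shape the only state is the mapping from a Helix job to the build it reports to, so nothing has to be added to the Helix jobs themselves, and in-progress status is sent as each notification arrives.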
1. Start test runs
1. Start helix job execution
1. Wait for job completion - send progress logs back to azure pipelines
1. Finish test runs and check for job pass/fail status
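The four steps above can be sketched as a single driver function. This is only an illustration under stated assumptions: `azdo` and `helix` are hypothetical client objects, and every method called on them is a placeholder rather than a real Azure DevOps or Helix API. The wait step is written as a simple polling loop for brevity; a notification-driven service would replace that loop.

```python
import time


def run_agentless_job(azdo, helix, job_payload, poll_seconds=30.0, sleep=time.sleep):
    """Drive one Helix job through the four steps (all client methods are hypothetical)."""
    # 1. Start test runs so the job payload can carry the test run information.
    test_run_ids = azdo.start_test_runs(job_payload)

    # 2. Start helix job execution.
    job_id = helix.start_job(job_payload, test_run_ids)

    # 3. Wait for job completion, sending progress logs back to azure pipelines.
    while not helix.is_finished(job_id):
        azdo.append_log(f"helix job {job_id} still running")
        sleep(poll_seconds)

    # 4. Finish test runs and check for job pass/fail status.
    azdo.finish_test_runs(test_run_ids)
    passed = helix.passed(job_id)
    azdo.complete_task(passed)
    return passed
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.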
Is there a bit in here about reporting results for workitem execution? Right now that's handled in helix, but presumably wouldn't be in this new model.
Yes, that's included in here. The server reports test results for all the work items. I will add it.
## Proof of Concept
I have a proof of concept in this PR https://github.com/dotnet/arcade/pull/10342. The agentless job can be seen running here https://dev.azure.com/dnceng-public/public/_build/results?buildId=55561&view=logs&j=830c6850-7aa7-5384-6c90-1a1a71217f4b.

## Implementation details
Presumably we also need to augment the "new job" API a bit to include the extra information we'd need to properly handle this request. I think it would just be enough information to locate the build (so collection URI, project ID, and build ID)
It's a separate API, but yes, it would need this extra information.
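As a rough sketch of the extra information discussed here, the request to that separate API could carry the three values needed to locate the build. The JSON key names (`collectionUri`, `projectId`, `buildId`) and the `BuildLocator` type are assumptions made for illustration; the actual request shape is not specified in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildLocator:
    """Enough information to find the build to report back to (hypothetical type)."""
    collection_uri: str
    project_id: str
    build_id: int


def parse_build_locator(body: dict) -> BuildLocator:
    """Validate the request body and extract the build locator."""
    required = ("collectionUri", "projectId", "buildId")  # assumed key names
    missing = [key for key in required if key not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return BuildLocator(
        collection_uri=str(body["collectionUri"]),
        project_id=str(body["projectId"]),
        build_id=int(body["buildId"]),
    )
```

Rejecting an incomplete request up front means the service never starts a Helix job it would be unable to report on.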
## Overview
We want to reduce wasted compute caused by build jobs waiting for helix tests to complete. To do this we can use the "agentless job" feature in azure pipelines to remove the machine that waits for the helix job to be finished.

## Steakholders
I like steak, and would like to hold one, but I think this should be "Stakeholders"
I'm assuming the product teams will need to implement this on their end. Do you have any thoughts on what would need to be done on their end (if anything) and how we would coordinate such an effort (if needed)?
Overall, this is pretty fantastic and I look forward to seeing it implemented.
The implementation for product teams really depends on how much they have stuck on to the end of the current implementation. It should be just a simple switch of a property, and adding another stage and/or job to the build to do the waiting part.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
dotnet/dnceng#1213