Add One-pager for agentless helix #11316

Merged 3 commits on Nov 2, 2022

alexperovich
Member

I have a proof of concept in this PR https://github.com/dotnet/arcade/pull/10342. The agentless job can be seen running here https://dev.azure.com/dnceng-public/public/_build/results?buildId=55561&view=logs&j=830c6850-7aa7-5384-6c90-1a1a71217f4b.

## Implementation details
This solution requires a web API endpoint on our server that handles the request from the agentless job, and a service running inside our cluster that starts and monitors jobs based on these requests. This service will need to fix up the job payloads with test run information, then monitor execution and report status back to azdo.

Member

Is this service basically going to be similar to the "DataMigrationService" (getting the same notifications from the SQL processor and doing... stuff)?

Member Author

My PoC just uses the helix api like the build agents do, and that seems simplest and easiest, but it can use notifications if we desire that too.

Member

Using notifications means it doesn't need to be a polling service, which will reduce the load on our service a TON, since there won't be dozens or hundreds of runs constantly pinging our server and making it do all the math just to say "no, not done yet". It would also, presumably, be closer to real time, since you'd get the notification immediately and wouldn't have to wait for the next poll interval.

Member

It also wouldn't need to be stateful at that point... it wouldn't need to track what things are "live" and poll them all, it would just respond to the notifications and take the appropriate actions. And statelessness is always a win in service land.
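
To make the polling versus notification contrast in these two comments concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `JobNotification`, `NotificationDispatcher`, and `poll_until_finished` are hypothetical names, and the real Helix notification payload is not described in this thread.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class JobNotification:
    # Hypothetical event shape; the real notification payload is not shown in this thread.
    job_id: str
    status: str  # "running" or "finished" in this sketch


def poll_until_finished(get_status: Callable[[str], str], job_id: str, max_polls: int) -> int:
    """Polling baseline: ask for status repeatedly and return how many calls were made."""
    calls = 0
    for _ in range(max_polls):
        calls += 1
        if get_status(job_id) == "finished":
            break
    return calls


class NotificationDispatcher:
    """Push model: handlers run only when a notification arrives."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[JobNotification], None]] = []

    def subscribe(self, handler: Callable[[JobNotification], None]) -> None:
        self._handlers.append(handler)

    def publish(self, notification: JobNotification) -> None:
        for handler in self._handlers:
            handler(notification)


if __name__ == "__main__":
    # Polling: every "not done yet" answer costs the server a status call.
    statuses = iter(["running", "running", "finished"])
    print(poll_until_finished(lambda _job_id: next(statuses), "job-1", max_polls=10))

    # Notifications: the handler is invoked once, when the job actually finishes.
    received: List[JobNotification] = []
    dispatcher = NotificationDispatcher()
    dispatcher.subscribe(received.append)
    dispatcher.publish(JobNotification("job-1", "finished"))
    print(len(received))
```

In the polling path the number of server calls grows with how long the job runs; in the push path the handler does work only when something changes.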

Member Author

Well, it still 100% needs the state, which contains info about which azdo job to report back to, including the access token. I'd rather not throw more state into the helix jobs. It can use the notifications though. It still should report in-progress status to the build.

1. Start test runs
1. Start helix job execution
1. Wait for job completion - send progress logs back to azure pipelines
1. Finish test runs and check for job pass/fail status
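
The four steps above, together with the per-request state mentioned in the preceding comment, could be sketched roughly as follows. This is an illustration under stated assumptions, not the PoC's code: `AgentlessRun`, `Backend`, `FakeBackend`, and all field names are hypothetical, and the fake backend stands in for the real Helix and Azure Pipelines calls.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Protocol


@dataclass
class AgentlessRun:
    # Hypothetical per-request state; field names are assumptions for this sketch.
    helix_job_id: str
    azdo_build_ref: str
    access_token: str  # secret: keep it out of logs and out of the Helix job payload
    phases: List[str] = field(default_factory=list)
    progress_log: List[str] = field(default_factory=list)
    passed: Optional[bool] = None


class Backend(Protocol):
    # Hypothetical seam between the lifecycle and the real services.
    def start_test_runs(self, run: AgentlessRun) -> None: ...
    def start_job(self, run: AgentlessRun) -> None: ...
    def progress(self, run: AgentlessRun) -> Iterable[str]: ...
    def finish_test_runs(self, run: AgentlessRun) -> bool: ...


def run_lifecycle(run: AgentlessRun, backend: Backend) -> bool:
    backend.start_test_runs(run)                # 1. Start test runs
    run.phases.append("test_runs_started")
    backend.start_job(run)                      # 2. Start helix job execution
    run.phases.append("job_started")
    for message in backend.progress(run):       # 3. Wait; forward progress logs
        run.progress_log.append(message)
    run.phases.append("job_completed")
    run.passed = backend.finish_test_runs(run)  # 4. Finish test runs, check pass/fail
    run.phases.append("test_runs_finished")
    return run.passed


class FakeBackend:
    """In-memory stand-in so the sketch runs without any network calls."""

    def __init__(self, messages: List[str], job_passed: bool) -> None:
        self.messages = messages
        self.job_passed = job_passed

    def start_test_runs(self, run: AgentlessRun) -> None:
        pass

    def start_job(self, run: AgentlessRun) -> None:
        pass

    def progress(self, run: AgentlessRun) -> Iterable[str]:
        return list(self.messages)

    def finish_test_runs(self, run: AgentlessRun) -> bool:
        return self.job_passed


if __name__ == "__main__":
    run = AgentlessRun("job-1", "build-55561", "not-a-real-token")
    print(run_lifecycle(run, FakeBackend(["1/2 work items done", "2/2 work items done"], True)))
    print(run.phases)
```

Keeping the access token in the service's own state object, rather than in the job payload, mirrors the preference stated in the comment above.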

Member

Is there a bit in here about reporting results for workitem execution? Right now that's handled in helix, but presumably wouldn't be in this new model.

Member Author

Yes, that's included in here. The server reports test results for all the work items. I will add it.

ChadNedzlek previously approved these changes Oct 18, 2022
## Proof of Concept
I have a proof of concept in this PR https://github.com/dotnet/arcade/pull/10342. The agentless job can be seen running here https://dev.azure.com/dnceng-public/public/_build/results?buildId=55561&view=logs&j=830c6850-7aa7-5384-6c90-1a1a71217f4b.

## Implementation details

Member

Presumably we also need to augment the "new job" API a bit to include the extra information we'd need to properly handle this request. I think it would just be enough information to locate the build (so collection URI, project ID, and build ID).
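
As a rough sketch of what "enough information to locate the build" might look like on the receiving side, whichever API ends up carrying it: the `BuildLocator` type and the JSON field names `collectionUri`, `projectId`, and `buildId` are assumptions made for this example, not a confirmed request schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping
from urllib.parse import urlparse


@dataclass(frozen=True)
class BuildLocator:
    # The three values named in the comment above; attribute names are hypothetical.
    collection_uri: str
    project_id: str
    build_id: int

    @classmethod
    def from_request(cls, body: Mapping[str, Any]) -> "BuildLocator":
        """Validate a request body and extract the build-locating fields."""
        required = ("collectionUri", "projectId", "buildId")
        missing = [key for key in required if key not in body]
        if missing:
            raise ValueError("missing fields: " + ", ".join(missing))

        uri = str(body["collectionUri"])
        parsed = urlparse(uri)
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("collectionUri must be an absolute https URL")

        # int() raises ValueError for non-numeric build IDs.
        return cls(uri, str(body["projectId"]), int(body["buildId"]))


if __name__ == "__main__":
    locator = BuildLocator.from_request(
        {
            "collectionUri": "https://dev.azure.com/dnceng-public/",
            "projectId": "public",
            "buildId": 55561,
        }
    )
    print(locator)
```

Rejecting incomplete or malformed locators at the endpoint means the monitoring service never has to hold state for a build it cannot report back to.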

Member Author

It's a separate API, but yes, it would need this extra information.

## Overview
We want to reduce wasted compute caused by build jobs waiting for helix tests to complete. To do this we can use the "agentless job" feature in azure pipelines to remove the machine that waits for the helix job to be finished.

## Steakholders

Contributor

I like steak, and would like to hold one, but I think this should be "Stakeholders"

Contributor

@jonfortescue left a comment

I'm assuming the product teams will need to implement this on their end. Do you have any thoughts on what would need to be done on their end (if anything) and how we would coordinate such an effort (if needed)?

Contributor

@jonfortescue left a comment

Overall, this is pretty fantastic and I look forward to seeing it implemented.

@alexperovich
Member Author

The implementation for product teams really depends on how much they have stuck on to the end of the current implementation. It should be just a simple switch of a property, and adding another stage and/or job to the build to do the waiting part.

@alexperovich
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@alexperovich alexperovich merged commit 5be5561 into dotnet:main Nov 2, 2022
@alexperovich alexperovich deleted the agentlessHelixOnePager branch November 2, 2022 20:02