Proposal for changes to source-build to support parallelism

### Describe the Problem

Parallelism should significantly speed up the source-build build. A typical build takes around 50 minutes end to end; with parallelism enabled, I got it under 30 minutes.

### Describe the Solution

There are a few problems with enabling parallelism:
- The repo dependency graph, as represented in the repo projects, is incorrect and incomplete.
- The way that input versions for a component are determined could yield non-deterministic results in a parallel environment.

#### How to define the dependency graph correctly

This one is trickier than expected. What we should strive for is that each component project identifies only those dependencies that it directly depends on, and does so completely. Ideally, this information would be generated programmatically off of what the repo depends on in its Version.Details.xml file, as this is the source of information we use to determine what versions should be overridden.
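As a rough illustration of generating this information programmatically, the sketch below extracts the repos marked with a `SourceBuild` element from a Version.Details.xml document. The sample XML is a trimmed, hypothetical file (package names, versions, and SHAs are placeholders), and the assumption that each `SourceBuild` element carries a `RepoName` attribute should be checked against the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down Version.Details.xml content, for illustration only.
SAMPLE = """\
<Dependencies>
  <ProductDependencies>
    <Dependency Name="Example.Sdk.Package" Version="0.0.0">
      <Uri>https://github.com/dotnet/sdk</Uri>
      <Sha>0000000</Sha>
      <SourceBuild RepoName="sdk" />
    </Dependency>
    <Dependency Name="Example.NuGet.Package" Version="0.0.0">
      <Uri>https://github.com/nuget/nuget.client</Uri>
      <Sha>0000000</Sha>
    </Dependency>
  </ProductDependencies>
  <ToolsetDependencies>
    <Dependency Name="Example.Arcade.Package" Version="0.0.0">
      <Uri>https://github.com/dotnet/arcade</Uri>
      <Sha>0000000</Sha>
      <SourceBuild RepoName="arcade" />
    </Dependency>
  </ToolsetDependencies>
</Dependencies>
"""


def source_build_repos(xml_text: str) -> set[str]:
    """Return the RepoName of every dependency that has a SourceBuild child element."""
    root = ET.fromstring(xml_text)
    repos = set()
    for dependency in root.iter("Dependency"):
        marker = dependency.find("SourceBuild")
        # Dependencies without a SourceBuild element contribute no edge.
        if marker is not None and marker.get("RepoName"):
            repos.add(marker.get("RepoName"))
    return repos


baseline = source_build_repos(SAMPLE)
```

In this sample the NuGet dependency has no `SourceBuild` element, so it produces no edge.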

Theoretically, for every component, we could identify the set of `SourceBuild` dependencies in the Version.Details.xml, and use those as the `RepositoryReference` items. However, this graph would be incomplete in some areas and have too many edges in others. The main problem with this approach is that the SourceBuild elements are often centered around usage in repo-level source build, which has slightly different needs:

- Some repositories (e.g. NuGet.Client) have no SourceBuild elements.
- Some components will depend on repos that have no source build intermediate (e.g. NuGet.Client).
- Some components will overspecify and cause cycles in the graph. For example, arcade depends on sdk, which depends on arcade. These dependencies should come from previously source-built, not the live build. We may be able to tweak cases like this, but it's probably trivially easy to fix.
- Some components have no sources at all. dotnet.proj and source-build-packages do not correspond to sources in the repo, but do fit into the dependency graph.

All that said, the information from traversing the graph is very close to the desired build order. We just need to add a few new edges and remove some existing ones. I propose the following approach:

**Approach**

- If a component has sources and a Version.Details.xml file, the baseline set of dependencies is generated from `SourceBuild` marked dependencies.
- A component project may define two additional input ItemGroups, `AdditionalRepositoryReferences` and `RemoveRepositoryReferences`.
  - `AdditionalRepositoryReferences` is a set of `RepositoryReference` items that should be added to the generated dependencies. For sdk.proj, this might include nuget.client, adding to the existing set. For dotnet.proj, this would be installer and source-build-packages, adding to an empty set because there are no sources.
  - `RemoveRepositoryReferences` is a set of `RepositoryReference` items that should be removed from the generated set. This would include dependencies that cause circular references.

The set of repository references can be calculated as:

`RepositoryReferences = <Set of eng/Version.Details.xml SourceBuild dependencies, if available> + AdditionalRepositoryReferences - RemoveRepositoryReferences`

The set of final RepositoryReference elements is built before the current project is built.
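The formula above amounts to plain set arithmetic. Here is a minimal sketch, assuming repos are identified by simple string names; the baseline sets used in the examples are hypothetical and do not reflect the real contents of any Version.Details.xml.

```python
def repository_references(baseline, additional=(), remove=()):
    """Compute baseline + additional - remove as a set of repo names.

    Removal is applied last, so an entry listed in both `additional`
    and `remove` ends up excluded.
    """
    return (set(baseline) | set(additional)) - set(remove)


# sdk.proj: generated baseline (hypothetical) plus a repo with no SourceBuild element.
sdk_refs = repository_references({"arcade", "runtime"}, additional={"nuget.client"})

# dotnet.proj: no sources, so the baseline is empty and every edge is added by hand.
dotnet_refs = repository_references(set(), additional={"installer", "source-build-packages"})

# arcade.proj: drop the edge to sdk so the graph has no cycle.
arcade_refs = repository_references({"sdk"}, remove={"sdk"})
```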

#### How to ensure correctness

If the final generated graph has **cycles**, MSBuild will detect these cycles during evaluation and the build will fail. Ensuring that there are enough edges is slightly more difficult.
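To check a proposed set of edges for cycles outside of MSBuild, a standalone pre-check could look like the sketch below. It uses Python's `graphlib` and is not how MSBuild performs its own detection; the component names and edges are illustrative only.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(references):
    """Return one valid build order for the graph.

    `references` maps each component to the set of components it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(references).static_order())


# Illustrative acyclic graph: dependencies are always ordered before dependents.
acyclic = {"arcade": set(), "runtime": {"arcade"}, "sdk": {"arcade", "runtime"}}
order = build_order(acyclic)

# Illustrative cyclic graph: arcade and sdk reference each other.
cyclic = {"arcade": {"sdk"}, "sdk": {"arcade"}}
try:
    build_order(cyclic)
    cycle_found = False
except CycleError:
    cycle_found = True
```

A check like this only covers the "too many edges" side. It says nothing about missing edges, which is what the rest of this section deals with.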

The key to ensuring there are enough edges, and that they are in the right places, is to ensure that a repository's outputs go **only** to separate locations that are not used for inputs, **and** that input locations are not shared between repos. An approach similar to what ProdCon v1 did would work here.

- Components are given a unique output location. No writing to shared locations.
- When preparing the inputs for a given component build, the outputs of the final `RepositoryReferences` set for that component are combined into a feed unique to that component (including non-NuGet artifacts). To save space, symlinks or hardlinks could theoretically be used.
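The two rules above could be sketched as follows. The directory layout (`outputs/<repo>` and `feeds/<component>`), the function name, and the package file names are all assumptions made for illustration, not the layout source-build uses.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_input_feed(component, references, outputs_root, feeds_root):
    """Link the artifacts of every referenced repo into a feed used only by `component`."""
    feed = Path(feeds_root) / component
    feed.mkdir(parents=True, exist_ok=True)
    for repo in sorted(references):
        repo_out = Path(outputs_root) / repo
        if not repo_out.is_dir():
            # A declared reference with no outputs means it has not been built yet.
            raise FileNotFoundError(f"no outputs found for referenced repo '{repo}'")
        for artifact in repo_out.iterdir():
            if not artifact.is_file():
                continue
            target = feed / artifact.name
            try:
                os.symlink(artifact.resolve(), target)  # save space where supported
            except OSError:
                shutil.copy2(artifact, target)  # fall back to a plain copy
    return feed


# Demo with placeholder files in a temporary directory.
root = Path(tempfile.mkdtemp())
for repo in ("arcade", "runtime"):
    (root / "outputs" / repo).mkdir(parents=True)
    (root / "outputs" / repo / f"{repo}.1.0.0.nupkg").write_text("placeholder")

# sdk declares only arcade here, so runtime's output must not be visible to it.
feed = stage_input_feed("sdk", {"arcade"}, root / "outputs", root / "feeds")
staged = sorted(p.name for p in feed.iterdir())
```

Because each feed is built only from declared references, an undeclared dependency simply is not present in the feed, which is what makes missing edges observable.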

Because the input version of a package is determined from the previously source-built packages plus the input package feed, and that feed would now be unique per component, a missing edge means the wrong input version is used. This would usually result in either a poison failure (the package is taken from previously source-built) or a prebuilt (no edge at all).
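The following sketch shows how a missing edge would surface under this scheme. It assumes the per-component feed takes precedence over previously source-built packages, which is an assumption of this example rather than a statement about the actual resolution logic; the package names and versions are hypothetical.

```python
def resolve_input(package, component_feed, previously_source_built):
    """Return (source, version) for `package` as seen by a single component build.

    Assumed precedence: the component's own input feed first, then
    previously source-built packages, otherwise a prebuilt.
    """
    if package in component_feed:
        return ("live", component_feed[package])
    if package in previously_source_built:
        # Expected to be flagged later by the poison check.
        return ("previously-source-built", previously_source_built[package])
    return ("prebuilt", None)


previously_source_built = {"Example.Runtime": "1.0.0"}

# Edge declared: the live version from the component's feed is used.
with_edge = resolve_input("Example.Runtime", {"Example.Runtime": "2.0.0"}, previously_source_built)

# Edge missing: the feed lacks the package, so the older version is picked up.
missing_edge = resolve_input("Example.Runtime", {}, previously_source_built)

# No edge and no previously source-built copy: the package would be a prebuilt.
no_source = resolve_input("Example.Other", {}, previously_source_built)
```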

**Caveat**

There is one case that this approach would **not** catch: a missing edge (e.g. to arcade) that would not cause a poison failure but could cause unwanted build behavior. In that case, the previously source-built arcade would be used instead of the live arcade.

To fix this, we could go one step further and separate the previously source-built artifacts by component, and then apply the same methodology of restricting the inputs to a given component build only to declared dependencies. In this case, the declared repo references would need to avoid trimming away cycles in the graph (e.g. arcade would need to depend on arcade, sdk and runtime).

I don't think this step is necessary, unless it becomes clear that the number of edges that must be added or removed from the graph to ensure correctness is large (and thus easy to get wrong). My initial investigations suggest it is not, and only a few components need to alter their deps.

**T-Shirt Size: Medium**
