[RFC] From Identity to Ownership

**Status:** Closed for comments



This RFC takes a higher-level view of how we should go from the identity of the logged-in user to showing the correct list of entities that they own in the Backstage interface. This involves a number of actors and interactions.

## Need



The concept of ownership plays a key role in the Backstage system.

Backstage users expect to see clearly which entities they themselves are responsible for, collected with their current statuses in the interface. Users also expect to be able to go to the page for another user or group to learn about their responsibilities. They also want to be able to search for an entity, for example a malfunctioning service, and see who is responsible for it so that they can get help fixing it.

We also expect that these ownership concepts will be relevant outside of the Backstage website. For example, monitoring systems that integrate with the Backstage catalog may want to perform lookups based on a service ID and send automated alerts to the relevant owner team when things go wrong.

## Actors

The following moving pieces are potentially involved in a solution:

| System | Purpose | Responsibility in the scope of this RFC |
| - | - | - |
| Sign-in (Authentication) | Allows users to sign in to Backstage. | To negotiate an identity inside an external system, and to somehow translate that identity into an identity reference that makes sense inside Backstage. |
| Authorization | Allows users to authorize actions in other systems, by negotiating access tokens. | Not used currently, but could be used similarly to the authentication, to translate the external system identity into an identity that makes sense inside Backstage. |
| Entity Descriptor Files | Where users create YAML entity descriptor files by hand, or with the help of automation plugins. | To let the user write some form of explicit reference to an owner. |
| File Ingesting Processors | Processors whose purpose it is to populate the catalog with entities, most commonly based on entity descriptor files. | To automate the mirroring of entity descriptor files into the catalog. |
| Org Ingesting Processors | Processors that mirror organizational data out of sources such as LDAP, and store them as `User` and `Group` entities. | To produce `User`/`Group` entities whose metadata is rich enough to be used for lookups by other systems. |
| Enriching Processors | Processors that fill in gaps in the entity data, for example by reading `CODEOWNERS` files and setting ownership based on them. | To fill in owner references in entity descriptors where missing, in a way that aligns with the org ingested entities. |
| Extracting Processors | Processors that analyze entities and generate entity [relations](https://backstage.io/docs/features/software-catalog/well-known-relations) based on them. | To take the stored owner string in the entity, and translate it to a full relation reference to an actual `User` or `Group` entity. |
| Backstage Frontend | To present ownership in a cohesive way. | To read the catalog and, with as little logic as possible, deduce ownership across entities. |

## Meet John

Let's look at what John's setup looks like.

- John signs in to Backstage using Google as an auth provider, which internally leads to a JWT pointing to his GSuite identity `john.knowles@example.com`.
- He also makes use of some public GitHub features, so his Backstage usually holds an OAuth2 access token issued to his GitHub user `johnHaxxesBezt` (don't judge, he made it before joining the company). He does some open source work there.
- His company's internal systems actually use LDAP as the source of truth, and his userid there is just plain `john` since he was one of the early founders. He's a member of the InfraNinjas team, which is `infra-ninjas` in LDAP and can be reached on `infra-ninjas-team@example.com` because that's the naming standard.
- The company uses GitHub Enterprise as the main development platform. They sync their GHE teams and users with LDAP, but it's an ad hoc process where users set the teams up where necessary and have to check the box to sync with LDAP. There are a lot of non-LDAP teams too, like `system-blah-maintainers`. Well, at least he's just plain `john` as his user on GHE.
- They are heavy users of CODEOWNERS to facilitate distributed development in their monorepos.

John is setting up Backstage and wants to know what settings to use. The docs say that he should want to end up with entities that have ownership relations to other entities, and that his sign-in should somehow map to the users and groups.

But it's not clear how things relate to each other:

- The sign-in results in his public Google email address.
- He has some components in public GitHub and some in GitHub Enterprise, whose team/user setups differ from each other and don't map 1:1 to LDAP either.
- He knows there is a processor that can ingest GitHub teams/users, but should he be mirroring both GitHub and GHE, and what would that look like?
- He knows there is a processor that can ingest LDAP groups/users, but these don't seem to have an obvious way to relate to the GitHub/GHE ones.
- There are a lot of pre-existing CODEOWNERS files he wants to leverage, but they are expressed in terms of the team structure in GHE, and the processor seems to write those IDs down verbatim into the entities.

## Current State

The `AuthResponse` type as returned when logging in currently has [the following shape](https://github.com/backstage/backstage/blob/295a3e2fffe6c07ef662bddcef799c42f6f6c970/plugins/auth-backend/src/providers/types.ts#L129):

```yaml
backstageIdentity:
  id: "freben"
  idToken: "ey..."
profile:
  displayName: "Fredrik Adelöw"
  picture: "https://avatars2.githubusercontent.com/u/..."
providerInfo:
  accessToken: "e411..."
  scope: "read:user"
```

This is what the auth backend supplies to the frontend, based on the response from the auth provider. This is also the input data that the frontend has to work with in order to deduce ownership. Currently, we look for a `User` entity with `metadata.namespace` equal to "default", and `metadata.name` equal to `backstageIdentity.id`. However, the profile may have been extracted from the auth provider data, not from the catalog.
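
The lookup described above can be sketched roughly as follows. The `Entity` and `BackstageIdentity` interfaces are simplified stand-ins rather than the real Backstage types, and treating a missing namespace as `default` is an assumption made for this illustration.

```typescript
// Simplified stand-ins for catalog and auth types (not the real Backstage types).
interface Entity {
  kind: string;
  metadata: { namespace?: string; name: string };
}

interface BackstageIdentity {
  id: string;
  idToken: string;
}

// Finds the User entity in the "default" namespace whose name equals the
// identity ID. Assumption: a missing namespace counts as "default".
function findUserForIdentity(
  identity: BackstageIdentity,
  entities: Entity[],
): Entity | undefined {
  const matches = entities.filter(
    e =>
      e.kind === 'User' &&
      (e.metadata.namespace ?? 'default') === 'default' &&
      e.metadata.name === identity.id,
  );
  return matches[0];
}
```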

The way that the ID is chosen also varies. The Google provider [searches](https://github.com/backstage/backstage/blob/master/plugins/auth-backend/src/providers/google/provider.ts#L143) for a catalog user that has a Google-specific annotation matching the email. The Microsoft provider just grabs [the first part of the email address](https://github.com/backstage/backstage/blob/master/plugins/auth-backend/src/providers/microsoft/provider.ts#L192) and returns that. The GitHub provider just [uses the GitHub username](https://github.com/backstage/backstage/blob/57deaf01012c16bceade1c928ecd816501aceaea/plugins/auth-backend/src/providers/github/provider.ts#L78) verbatim.
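
To make the difference concrete, here is a minimal sketch of the three strategies side by side. It is not the providers' actual code: `ProviderResult`, `UserEntity`, the function names, and the annotation key `google.com/email` are all assumptions made for illustration.

```typescript
// Hypothetical, simplified view of what a provider knows after login.
interface ProviderResult {
  email?: string;
  username?: string;
}

interface UserEntity {
  metadata: { name: string; annotations?: Record<string, string> };
}

// Assumed annotation key, used here for illustration only.
const GOOGLE_EMAIL_ANNOTATION = 'google.com/email';

// Google-style: search the catalog for a user annotated with the email.
function googleStyleId(
  result: ProviderResult,
  users: UserEntity[],
): string | undefined {
  if (result.email === undefined) return undefined;
  for (const user of users) {
    if (user.metadata.annotations?.[GOOGLE_EMAIL_ANNOTATION] === result.email) {
      return user.metadata.name;
    }
  }
  return undefined;
}

// Microsoft-style: take the part of the email address before the "@".
function microsoftStyleId(result: ProviderResult): string | undefined {
  return result.email?.split('@')[0];
}

// GitHub-style: use the username verbatim.
function githubStyleId(result: ProviderResult): string | undefined {
  return result.username;
}
```

With John's accounts, these three strategies would yield `john` (given a suitably annotated catalog user), `john.knowles`, and `johnHaxxesBezt` respectively, which illustrates why the resulting IDs are hard to map to a single catalog user.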

The problem is that this strongly couples the behaviour of each provider in the auth backend to the shape of the catalog and to the behaviour and config of the org ingestion catalog plugins. The IDs chosen are often essentially keys in foreign spaces, not clearly mapped to catalog users. It becomes very hard to, for example, use Google for login but LDAP for org data.

The org ingestion plugins aren't particularly flexible; for example, the GitHub one doesn't let you set a specific namespace at the time of writing, so there may be collisions if you try to import more than one thing.

The CODEOWNERS processor just takes the entries in the lines of its file as-is and writes them down in the `spec`. This forces the operator to make sure that there are `User`/`Group` entries in the catalog that match those particular strings.

The processors system is pretty flexible. You can correct for some level of mismatch by writing an "adjusting" processor that tweaks the entities to fit the bill before they are stored. But having to do that is not the best adopter experience, and it can only solve part of the problem.
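
As a rough illustration of what such an adjustment might do, the sketch below rewrites an owner string taken verbatim from CODEOWNERS so that it lines up with a name that org ingestion produced. It is written as a plain function rather than against the real processor interface; the `acme` organization, the alias table, and the function name are hypothetical.

```typescript
// Hypothetical operator-maintained mapping: CODEOWNERS handle -> catalog name.
const ownerAliases: Record<string, string> = {
  '@acme/infra-ninjas-team': 'infra-ninjas',
};

// Returns a catalog-friendly owner name where one can be determined, and the
// raw string otherwise.
function adjustOwner(rawOwner: string, knownNames: string[]): string {
  const aliased = ownerAliases[rawOwner];
  if (aliased !== undefined) return aliased;
  // Strip a leading "@" and any "org/" prefix, e.g. "@acme/foo" -> "foo".
  const stripped = rawOwner.replace(/^@/, '').replace(/^[^/]+\//, '');
  // Only use the stripped form if org ingestion actually produced that name.
  return knownNames.indexOf(stripped) >= 0 ? stripped : rawOwner;
}
```

Note that a team with no counterpart in the ingested org data, such as `system-blah-maintainers` when only LDAP is ingested, passes through unchanged. That is the part of the problem this kind of adjustment cannot solve.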

## Proposal



### Phase One

In the first phase, we focus on the following:

- Expect there to be just one org hierarchy if any, which is from the most authoritative source available. For John, this would be LDAP. The users and groups are typically in the default namespace.
- Still support running Backstage with no org data ingested in the catalog, and logging in as an identity that does not match any catalog User at all.
- Expect the logged-in identity to match zero or one users, with no particular support for matching more than one.
- Do not focus on CODEOWNERS. This use case is still functional, but may fit best for those that get org data from their SCM provider. For others, some amount of adjustment can be done with a custom processor.

The proposed changes:

- Task the auth backend with explicitly linking the auth provider response with a catalog user where possible. This involves making it possible to **give a custom callback to the provider** code, that can make requests to the catalog.
- As part of this callback, we also gain the **ability to reject the login** by throwing an exception - for example, to ensure that the email was a work email rather than a personal/unrelated one.
- Extend the `AuthResponse` to contain either the entire User entity, or at least **the full kind + namespace + name** triplet. The latter, or at most some subset of the user, may be preferable to an entire entity, to better support the case where there actually was no matching User entity and instead some "mocked" data is returned.
- Make the auth backend use the catalog **User as a source for profile data**, where available.
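
One possible shape for the extended identity is sketched below, using the kind + namespace + name triplet. The field name `userRef`, the helper names, and the `unmatched:` fallback string are assumptions for illustration, not an agreed design.

```typescript
// Hypothetical triplet that points at a catalog User.
interface UserRef {
  kind: 'User';
  namespace: string;
  name: string;
}

// Hypothetical extension of the existing backstageIdentity block.
interface ExtendedBackstageIdentity {
  id: string;
  idToken: string;
  // Absent when no catalog User matched the login.
  userRef?: UserRef;
}

// Formats the triplet as a compact "kind:namespace/name" string.
function formatUserRef(ref: UserRef): string {
  return `${ref.kind.toLowerCase()}:${ref.namespace}/${ref.name}`;
}

// Covers both the matched case and the "no org data" case.
function describeIdentity(identity: ExtendedBackstageIdentity): string {
  return identity.userRef
    ? formatUserRef(identity.userRef)
    : `unmatched:${identity.id}`;
}
```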

We will update the way that the auth providers are initialized, so that the backend can pass in these custom callbacks.

```ts
// Current packages/backend/src/plugins/auth.ts
return await createRouter({ logger, config, database, discovery });
```

You can supply a `providerFactories` argument here to customize the set of supported providers, but these factories may need to change somewhat to make room for this callback to be supplied.

```ts
// Hypothetical future packages/backend/src/plugins/auth.ts
const providerFactories: ProviderFactories = {
  google: googleProviderFactory({
    // At this point, the authResponse can actually be typed and distinct for the
    // provider at hand, where applicable
    async onLogin({ authResponse, catalogClient, identity }) {
      // reject the response entirely by throwing, otherwise try to find a
      // matching user using the catalog client
      identity.setProfile(...);
      identity.setUserRef(...);
    }
  }),
};

return await createRouter({ logger, config, database, discovery, providerFactories });
```
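
The decision logic inside such a callback might look roughly like the sketch below. To keep it self-contained and synchronous, the catalog lookup is represented by a plain array of users; a real callback would query the catalog asynchronously. `CatalogUser`, `LoginDecision`, and `decideLogin` are hypothetical names.

```typescript
// Simplified stand-in for a catalog User entity.
interface CatalogUser {
  metadata: { namespace: string; name: string };
  spec: { profile?: { email?: string; displayName?: string } };
}

// Hypothetical result that the callback would store on the identity.
interface LoginDecision {
  userRef?: string; // "user:<namespace>/<name>" when a catalog User matched
  displayName: string;
}

// Throws to reject the login; otherwise links the email to a catalog User
// where possible and prefers the catalog as the source of profile data.
function decideLogin(
  email: string,
  workDomain: string,
  users: CatalogUser[],
): LoginDecision {
  const suffix = '@' + workDomain;
  if (email.slice(-suffix.length) !== suffix) {
    throw new Error(`Login rejected: ${email} is not a ${workDomain} address`);
  }
  // Phase one expects zero or one match; any extra matches are ignored here.
  const match = users.filter(u => u.spec.profile?.email === email)[0];
  if (match === undefined) {
    // No org data for this user: fall back to data from the auth provider.
    return { displayName: email };
  }
  return {
    userRef: `user:${match.metadata.namespace}/${match.metadata.name}`,
    displayName: match.spec.profile?.displayName ?? email,
  };
}
```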

### Phase Two

In a second phase, the following focus areas could be added:

- Multiple concurrent org hierarchies in the catalog
- Ability to match an identity to multiple users

The proposed changes and effects:

- Let users ingest more than one org hierarchy into the catalog, for example John might want to have both GitHub and LDAP reflected. They may be stored in their respective namespaces to keep them separate.
- Let the logged-in identity match more than one user. This would mean that after login, John would be clearly associated both with his GitHub and LDAP users and their respective groups. This could mean slightly more work in the backend auth setup code.
- The above will also make the CODEOWNERS situation much clearer. The CODEOWNERS processor can be given the same namespace parameter as the GitHub org ingestion, so that the `system-blah-maintainers` team lines up with the groups that correspond to those teams.
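
A minimal sketch of what matching one login to several users could look like, assuming each org hierarchy lives in its own namespace. The namespace names `ldap` and `ghe`, the `IdentityByNamespace` type, and the function name are hypothetical.

```typescript
// Simplified stand-in for an ingested User entity.
interface OrgUser {
  metadata: { namespace: string; name: string };
  spec: { memberOf: string[] };
}

// Hypothetical: one login resolved to a username per org namespace.
type IdentityByNamespace = Record<string, string>;

// Collects every matching user and the union of their groups, with each
// group qualified by the namespace of the hierarchy it came from.
function resolveMemberships(
  identity: IdentityByNamespace,
  users: OrgUser[],
): { userRefs: string[]; groupRefs: string[] } {
  const userRefs: string[] = [];
  const groupRefs: string[] = [];
  for (const user of users) {
    const { namespace, name } = user.metadata;
    if (identity[namespace] !== name) continue;
    userRefs.push(`user:${namespace}/${name}`);
    for (const group of user.spec.memberOf) {
      const ref = `group:${namespace}/${group}`;
      if (groupRefs.indexOf(ref) < 0) groupRefs.push(ref);
    }
  }
  return { userRefs, groupRefs };
}
```

If the CODEOWNERS processor wrote owners into the same `ghe` namespace, an entity owned by `group:ghe/system-blah-maintainers` would then line up with one of John's resolved groups.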

## Alternatives

One type of alternative that we discussed was to use a standard shape of `metadata.labels` entries for selecting users that match a certain identity.

For example, in John's case the login implies that the identity, specifically in the scope of Google, is `john.knowles@example.com`. This could be used to search for catalog entities that have a label `identity.backstage.io/google: john.knowles@example.com`, where `google` is the ID of the provider. If he instead logged in with their GHE, his identity in the scope of GitHub Enterprise is `john`. Then we search for users that are labeled `identity.backstage.io/ghe: john`. Interestingly, this label could be on any number of users from any number of org ingestions in any namespace, without any coupling to those ingestions.
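
The label-based selection could be sketched as a simple filter over catalog entities. `LabeledEntity` is a simplified stand-in for the catalog entity type, and the function names are hypothetical.

```typescript
// Simplified stand-in for a catalog entity with labels.
interface LabeledEntity {
  kind: string;
  metadata: { namespace: string; name: string; labels?: Record<string, string> };
}

// Builds the label key for a given provider ID, e.g. "ghe".
function identityLabelKey(providerId: string): string {
  return `identity.backstage.io/${providerId}`;
}

// Selects every User, in any namespace, labeled with the given identity.
function findUsersByIdentity(
  providerId: string,
  identityValue: string,
  entities: LabeledEntity[],
): LabeledEntity[] {
  const key = identityLabelKey(providerId);
  return entities.filter(
    e => e.kind === 'User' && e.metadata.labels?.[key] === identityValue,
  );
}
```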

However, the caveat is that the ingestions must now instead be pluggable in a sense, so that they populate all users and groups with the right set of labels. A GitHub org ingestion can fairly easily tag a GitHub user with its own identity label without customization, but it currently has no way of knowing which Google identity label it corresponds to, if any. So the need to match across domains is not necessarily addressed by this. No GSuite org ingestor currently exists, so what will John do when he logs in using Google? How will Backstage know that his Google identity actually matches the `john` user in GHE?

One way of addressing the latter case could be to ingest only the LDAP org, and to extract the `email` attribute of each user entry into the `identity.backstage.io/google` label, because we know that that's how this particular org works. Then, for those users that happen to have the custom attribute `gitUserName` set, map that to an `identity.backstage.io/ghe` label. This would need to be driven either by configuration or by injection of custom code in the LDAP org processor. But that only works if such a `gitUserName` attribute exists; if it goes the other way instead and the GHE users are bound to LDAP, we would instead want to make a reverse lookup, and that's not something that exists currently.
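
A configuration-driven version of this mapping might look roughly like the sketch below. The `LdapEntry` type, the `identityAttributes` table, and the function name are hypothetical; no such option exists in the LDAP org processor today.

```typescript
// Hypothetical, simplified LDAP entry: attribute name -> value.
type LdapEntry = Record<string, string | undefined>;

// Hypothetical config: identity provider ID -> LDAP attribute holding the
// identity in that provider's scope.
const identityAttributes: Record<string, string> = {
  google: 'email',
  ghe: 'gitUserName',
};

// Produces identity labels for those attributes that are actually set.
function identityLabelsFor(entry: LdapEntry): Record<string, string> {
  const labels: Record<string, string> = {};
  for (const providerId of Object.keys(identityAttributes)) {
    const value = entry[identityAttributes[providerId]];
    if (value !== undefined && value !== '') {
      labels[`identity.backstage.io/${providerId}`] = value;
    }
  }
  return labels;
}
```

Users without a `gitUserName` attribute simply get no `ghe` label, which is where the reverse-lookup gap mentioned above shows up.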

## Risks



There is a risk of not striking the right balance of flexibility.

This RFC also does not cover how this identity-to-ownership mapping would translate outside of the Backstage backend. In server-to-server scenarios, where you for example make requests to the catalog backend to list users, groups, or entities, it will not be clear from the data itself how ownership maps.
