Skip to content

Latest commit

 

History

History
330 lines (211 loc) · 13.6 KB

0000-rfds.md

File metadata and controls

330 lines (211 loc) · 13.6 KB
authors state
Andrew Lytvynov (andrew@goteleport.com)
draft

RFD 0 - RFDs

What

Request For Discussion (RFD) is a design document format for non-trivial technical Teleport changes. It's also a process by which these documents are proposed, discussed, approved and tracked.

Why

As the Teleport project grows, we need a way to discuss major changes (e.g. new features, major refactors, major distribution changes).

Prior to RFDs, Teleport engineers used several other discussion methods (Google Docs, brainstorm meetings, internal wiki pages, ad-hoc email/chat conversations).

RFDs formalize the process and provide several benefits:

  • discussions are retained in GitHub pull requests and commit history
  • discussions are in the open, for Teleport users to see and contribute
  • discussions are stored in one central place
  • approvals are recorded and enforced

The RFD idea is borrowed from https://oxide.computer/blog/rfd-1-requests-for-discussion/ which is in turn inspired by https://www.ietf.org/standards/rfcs/

Details

Each RFD is stored in a markdown file under https://github.com/gravitational/teleport/tree/master/rfd and has a unique number.

structure

Each RFD consists of:

  1. a header containing author name(s) and state
  2. title in the format RFD $NUMBER - $TITLE
  3. the Required Approvers which contains required set of approvers
  4. the What section - 1-3 sentence summary of what this RFD is about
  5. the Why section - a few paragraphs describing motivation for the RFD
  6. the Details section - detailed description of the proposal, including APIs, UX examples, migrations or any other relevant information

Use this RFD as an example.

process

Here's the process from and RFD idea in your head to a working implementation in the main Teleport branch.

  1. Pick the RFD number. RFD numbers are claimed by pushing a branch named rfd/$number-title.

    Use this one-liner to get the highest-numbered RFD branch that's been pushed:

    git fetch --quiet && git branch -r -v | grep origin/rfd/ | awk -F'[/,-]' '{ n=$3+0; print n }' | sort -n | tail -n 1
  2. make a branch off of master called rfd/$number-your-title

    For example, you're writing an RFD titled 'Teleport IRC Access' and end up with number 18 - your branch would be called rfd/0018-irc-access.

  3. write your RFD under /rfd/$number-your-title.md

    Our example RFD is in /rfd/0018-irc-access.md.

  4. submit a PR titled RFD $number: Your Title

    Our example RFD title: RFD 18: IRC Access

  5. iterate on the RFD based on reviewer feedback and get approvals

    Note: it's OK to use meetings or chat to discuss the RFD, but please write down the outcome in PR comments. A future reader will be grateful!

  6. merge the PR and start implementing

  7. once implemented, make another PR changing the state to implemented and updating any details that changed during implementation.

If an RFD is eventually deprecated (e.g. a feature is removed), make a PR changing the state to deprecated and optionally link to the replacement RFD (if applicable).

states

  1. draft - RFD is proposed or approved, but not yet implemented
  2. implemented - RFD is approved and implemented
  3. deprecated - RFD was approved and/or implemented at one point, but is now deprecated and should only be referenced for historic context; a superseding RFD, if one exists, may be linked in the header

The purpose of the state is to tell the reader whether they should care about this RFD at all. For example, deprecated RFDs can be skipped most of the time. implemented is relevant to Teleport users, but draft is mostly for Teleport engineers and early adopters.

Required Approvers

The purpose of the Required Approvers section is to be explicit on required and optional approvers for an RFD. For the subject matter experts that can provide high quality feedback to help refine and improve the RFD.

For example, suppose you are making a change with internal implementation changes, security relevant changes, with also product changes (new fields, flags, or behavior). You might create a Required Approvers section that looks something like the following.

# Required Approvers
* Engineering: @zmb3 && (@codingllama || @nklaassen)
* Security: (@rjones || @klizhentas)
* Product: (@xinding33 || @klizhentas)

UX

Always start the RFD with a user experience section where you start with user stories. Every other part of your design - security, scale and privacy will flow from the UX, not vice-versa.

User stories

Explore UI, CLI and API user experience by going through scenarios that users would go through while solving specific problems.

In each story, explain specific step-by-step UI, CLI and API requests/responses that the user would observe, as if you are writing a step by step guide for a user who knows as little as possible about Teleport.

If you find too many steps or concepts end users would have to learn, start again to reduce it to a minimum.

In each user story, think about failure modes - what will happen if your integration fails?

Example: Alice integrates Okta via UI

Here is an example of a UI-driven user story:

Alice is a system administrator and she would like to integrate Okta with Teleport. She does not know anything about Teleport except the basics, but she has detailed Okta knowledge.

She logs into Teleport, looks for "Integrations", quickly finds an Okta tile and clicks on it.

In the Okta tile, she is asked to add a name for her Okta tenant. She can find the tenant in the Okta's UI and the information bubble shows her how to do that.

The next step for Alice is to find and locate the SCIM bearer token. Alice needs to go back to Okta again, create Teleport API services app in the Okta catalog, copy the SCIM token and paste it back to Teleport. Teleport's UI directs her to do just that.

Alice copies the token into Teleport UI. Let's assume she makes a mistake, and the token is broken or misses the permissions.

Alice is directed to Test the integration. The test finds an error and shows her that Okta returns an error:

`Insufficient permissions when synchronizing a user". Teleport shows a detailed response from Okta service, offers to check the token permissions and try the test again.

Finally, Alice figures out the right permission set on Okta's side and Teleport test passes.

Teleport tries a test sync run and offers Alice to tweak the integration parameters. If Alice is happy with the set she clicks save.

Make failure modes a first class citizen.

Administrators and system managers spend most of their day debugging integration issues, failures and errors. Make their day pleasant by building user experiences for most common failure scenarios:

  • What if the integration fails after its setup? Can Alice learn that it's broken, then find out where to go back and troubleshoot it?
  • What if Alice needs to tweak the parameters of the integration after setup? Can she go back to the integration and test it?

Build Poka-Yoke Devices

In Manufacturing, a Poka-yoke device is anything that prevents an error within the manufacturing process or makes defects visible.

Translated to Teleport, you can build a UX that can prevent people from making a mistake.

For example, if an admin assigned to a role, and changes a mapping that will lock themselves out and leave no other admins, Teleport could prevent the error by blocking the action:

"You can't unassign yourself, because there will be no more admins left."

Make UX that reduces information overload and work

Let's take a look at Gmail. When a user clicks on an e-mail, they are offered an option - "Filter messages like this”. Instead of deleting or moving messages one by one, Gmail offers to write, test and set up a rule that also applies to all other messages.

This reduces the amount of manual, tedious work, and works well for one message or a thousand.

When possible, build UX that offers users to reduce the amount of steps and do extra work on their behalf, instead of prompting them to do work that can be automated.

Think through the Day One and Day Two user experiences

As a Day 1 user, we don't have any domain knowledge of the product, we are novices.

That's why Day 1 flow should be the first user story we think through. It does not have to be scalable, but it must be easy.

For example, as a Day 1 user, I need step by step guide on how to add one or two servers and databases without learning about RBAC, configs and other Teleport internals. On the UI, Day 1 flow is guiding user each step of the way to enroll a server, test its connection and get to success in the minimum amount of steps.

As a Day two user, I'm concerned about setting up a feature at scale. My Day two user experience is different, and I know a bit more about Teleport.

For example, I would like to spend a bit more time setting up Teleport to automatically discover all my AWS resources and add them to the cluster.

Here are two imaginary examples demonstrating how Day 1 and Day 2 CLI U.X. are different.

Example: Day 1 CLI certificates

As a day one user, I would like to issue a certificate to two services to set up mTLS in my cluster.

tbot join service-a --cluster=teleport.example.com
[1] Joining to cluster teleport.example.com...
[2] Issuing a certificate to ./tbot/certs/service-a/cert.pem and key.pem..

To test using this certificate, try:

curl https://teleport.example.com/ --cert=... --key=... --ca-cert=...

On day 1 we keep the amount of new concepts, ideas that users need to think about here to a minimum, and automate most of the steps.

This flow does not have to cover all possible scenarios, just 80% most common ones to get user to success as fast as possible.

Example Day 2 CLI certificates

The UX in the previous example won't scale for Day two, as there are many configuration options to consider, so for a day two user we can offer something more flexible at the expense of adding complexity.

tbot bootstrap service-a --cluster=teleport.example.com

[1] Generating tbot.yaml for service a in ./tbot/configuration/tbot.yaml
[2] Generating service-a role...
[3] Generating systemd unit ./tbot/certs/service-a/cert.pem and key.pem...
[4] Starting a daemon...

In this case instead of a simple one liner, we generate detailed step-by step parts and instruct users how to configure those.

Make error and info messages actionable.

Make sure errors and info give specific instructions and give enough information.

Explore common failure modes and how users can recover from them.

Here are a couple of examples of messages that need work:

Please review the access list "My-Awesome-Team", the review is due in 4 days.

This error message misses the actual link or any specific steps users need to take to review the list.

Failed to set up Okta integration - "Bad request".

This is the most frustrating error messages users can encounter - they don't see any logs, no way to re-test it or trigger the error, and all they can do is to reach out to support.

Consider Cloud UX from the start.

Cloud is a first class citizen. The feature setup can no longer rely on static teleport.yaml configuration, as this automatically excludes all cloud customers.

Upgrade UX

Consider the UX of configuration changes and their impact on Teleport upgrades.

Security

Describe the security considerations for your design doc. (Non-exhaustive list below.)

  • Explore possible attack vectors, explain how to prevent them
  • Explore DDoS and other outage-type attacks
  • If frontend, explore common web vulnerabilities
  • If introducing new attack surfaces (UI, CLI commands, API or gRPC endpoints), consider how they may be abused and how to prevent it
  • If introducing new auth{n,z}, explain their design and consequences
  • If using crypto, show that best practices were used to define it

Privacy

Describe the privacy considerations for your design doc. (Non-exhaustive list below.)

  • Consider the principles of "privacy by design" and "privacy by default"
  • Explore the kind and type of data that will be generated, accessed, or collected
  • Explore if all this data needs to be collected, and if so, if it is possible to mask of any Personal Data / Personally Identifiable Information (PII) or at least limit collection to only what is necessary
  • Explore where and how the data will be stored, how long it will be kept for, and how it will be retained/deleted
  • Explore if there are sufficient logs showing any data access or modification

Proto Specification

Include any .proto changes or additions that are necessary for your design.

Backward Compatibility

Describe the impact that your design doc has on backwards compatibility and include any migration steps. (Non-exhaustive list below.)

  • Will the change impact older clients? (tsh, tctl)
  • What impact does the change have on remote clusters?
  • Are there any backend migrations required?
  • How will changes be rolled out across future versions?

Audit Events

Include any new events that are required to audit the behavior introduced in your design doc and the criteria required to emit them.

Observability

Describe how you will know the feature is working correctly and with acceptable performance. Consider whether you should add new Prometheus metrics, distributed tracing, or emit log messages with a particular format to detect errors.

Product Usage

Describe how we can determine whether the feature is being adopted. Consider new telemetry or usage events.

Test Plan

Include any changes or additions that will need to be made to the Test Plan to appropriately test the changes in your design doc and prevent any regressions from happening in the future.