authors | state |
---|---|
Andrew Lytvynov (andrew@goteleport.com) |
draft |
Request For Discussion (RFD) is a design document format for non-trivial technical Teleport changes. It's also a process by which these documents are proposed, discussed, approved and tracked.
As the Teleport project grows, we need a way to discuss major changes (e.g. new features, major refactors, major distribution changes).
Prior to RFDs, Teleport engineers used several other discussion methods (Google Docs, brainstorm meetings, internal wiki pages, ad-hoc email/chat conversations).
RFDs formalize the process and provide several benefits:
- discussions are retained in GitHub pull requests and commit history
- discussions are in the open, for Teleport users to see and contribute
- discussions are stored in one central place
- approvals are recorded and enforced
The RFD idea is borrowed from https://oxide.computer/blog/rfd-1-requests-for-discussion/ which is in turn inspired by https://www.ietf.org/standards/rfcs/
Each RFD is stored in a markdown file under https://github.com/gravitational/teleport/tree/master/rfd and has a unique number.
Each RFD consists of:
- a header containing author name(s) and state
- title in the format
RFD $NUMBER - $TITLE
- the
Required Approvers
which contains required set of approvers - the
What
section - 1-3 sentence summary of what this RFD is about - the
Why
section - a few paragraphs describing motivation for the RFD - the
Details
section - detailed description of the proposal, including APIs, UX examples, migrations or any other relevant information
Use this RFD as an example.
Here's the process from and RFD idea in your head to a working implementation in the main Teleport branch.
-
Pick the RFD number. RFD numbers are claimed by pushing a branch named
rfd/$number-title
.Use this one-liner to get the highest-numbered RFD branch that's been pushed:
git fetch --quiet && git branch -r -v | grep origin/rfd/ | awk -F'[/,-]' '{ n=$3+0; print n }' | sort -n | tail -n 1
-
make a branch off of
master
calledrfd/$number-your-title
For example, you're writing an RFD titled 'Teleport IRC Access' and end up with number 18 - your branch would be called
rfd/0018-irc-access
. -
write your RFD under
/rfd/$number-your-title.md
Our example RFD is in
/rfd/0018-irc-access.md
. -
submit a PR titled
RFD $number: Your Title
Our example RFD title:
RFD 18: IRC Access
-
iterate on the RFD based on reviewer feedback and get approvals
Note: it's OK to use meetings or chat to discuss the RFD, but please write down the outcome in PR comments. A future reader will be grateful!
-
merge the PR and start implementing
-
once implemented, make another PR changing the
state
toimplemented
and updating any details that changed during implementation.
If an RFD is eventually deprecated (e.g. a feature is removed), make a PR
changing the state
to deprecated
and optionally link to the replacement RFD
(if applicable).
draft
- RFD is proposed or approved, but not yet implementedimplemented
- RFD is approved and implementeddeprecated
- RFD was approved and/or implemented at one point, but is now deprecated and should only be referenced for historic context; a superseding RFD, if one exists, may be linked in the header
The purpose of the state
is to tell the reader whether they should care about
this RFD at all. For example, deprecated
RFDs can be skipped most of the
time. implemented
is relevant to Teleport users, but draft
is mostly for
Teleport engineers and early adopters.
The purpose of the Required Approvers
section is to be explicit on required
and optional approvers for an RFD. For the subject matter experts that can
provide high quality feedback to help refine and improve the RFD.
For example, suppose you are making a change with internal implementation
changes, security relevant changes, with also product changes (new fields,
flags, or behavior). You might create a Required Approvers
section that looks
something like the following.
# Required Approvers
* Engineering: @zmb3 && (@codingllama || @nklaassen)
* Security: (@rjones || @klizhentas)
* Product: (@xinding33 || @klizhentas)
Always start the RFD with a user experience section where you start with user stories. Every other part of your design - security, scale and privacy will flow from the UX, not vice-versa.
Explore UI, CLI and API user experience by going through scenarios that users would go through while solving specific problems.
In each story, explain specific step-by-step UI, CLI and API requests/responses that the user would observe, as if you are writing a step by step guide for a user who knows as little as possible about Teleport.
If you find too many steps or concepts end users would have to learn, start again to reduce it to a minimum.
In each user story, think about failure modes - what will happen if your integration fails?
Example: Alice integrates Okta via UI
Here is an example of a UI-driven user story:
Alice is a system administrator and she would like to integrate Okta with Teleport. She does not know anything about Teleport except the basics, but she has detailed Okta knowledge.
She logs into Teleport, looks for "Integrations", quickly finds an Okta tile and clicks on it.
In the Okta tile, she is asked to add a name for her Okta tenant. She can find the tenant in the Okta's UI and the information bubble shows her how to do that.
The next step for Alice is to find and locate the SCIM bearer token. Alice needs to go back to Okta again, create Teleport API services app in the Okta catalog, copy the SCIM token and paste it back to Teleport. Teleport's UI directs her to do just that.
Alice copies the token into Teleport UI. Let's assume she makes a mistake, and the token is broken or misses the permissions.
Alice is directed to Test the integration. The test finds an error and shows her that Okta returns an error:
`Insufficient permissions when synchronizing a user". Teleport shows a detailed response from Okta service, offers to check the token permissions and try the test again.
Finally, Alice figures out the right permission set on Okta's side and Teleport test passes.
Teleport tries a test sync run and offers Alice to tweak the integration parameters. If Alice is happy with the set she clicks save.
Administrators and system managers spend most of their day debugging integration issues, failures and errors. Make their day pleasant by building user experiences for most common failure scenarios:
- What if the integration fails after its setup? Can Alice learn that it's broken, then find out where to go back and troubleshoot it?
- What if Alice needs to tweak the parameters of the integration after setup? Can she go back to the integration and test it?
In Manufacturing, a Poka-yoke device is anything that prevents an error within the manufacturing process or makes defects visible.
Translated to Teleport, you can build a UX that can prevent people from making a mistake.
For example, if an admin assigned to a role, and changes a mapping that will lock themselves out and leave no other admins, Teleport could prevent the error by blocking the action:
"You can't unassign yourself, because there will be no more admins left."
Let's take a look at Gmail. When a user clicks on an e-mail, they are offered an option - "Filter messages like this”. Instead of deleting or moving messages one by one, Gmail offers to write, test and set up a rule that also applies to all other messages.
This reduces the amount of manual, tedious work, and works well for one message or a thousand.
When possible, build UX that offers users to reduce the amount of steps and do extra work on their behalf, instead of prompting them to do work that can be automated.
As a Day 1 user, we don't have any domain knowledge of the product, we are novices.
That's why Day 1 flow should be the first user story we think through. It does not have to be scalable, but it must be easy.
For example, as a Day 1 user, I need step by step guide on how to add one or two servers and databases without learning about RBAC, configs and other Teleport internals. On the UI, Day 1 flow is guiding user each step of the way to enroll a server, test its connection and get to success in the minimum amount of steps.
As a Day two user, I'm concerned about setting up a feature at scale. My Day two user experience is different, and I know a bit more about Teleport.
For example, I would like to spend a bit more time setting up Teleport to automatically discover all my AWS resources and add them to the cluster.
Here are two imaginary examples demonstrating how Day 1 and Day 2 CLI U.X. are different.
Example: Day 1 CLI certificates
As a day one user, I would like to issue a certificate to two services to set up mTLS in my cluster.
tbot join service-a --cluster=teleport.example.com
[1] Joining to cluster teleport.example.com...
[2] Issuing a certificate to ./tbot/certs/service-a/cert.pem and key.pem..
To test using this certificate, try:
curl https://teleport.example.com/ --cert=... --key=... --ca-cert=...
On day 1 we keep the amount of new concepts, ideas that users need to think about here to a minimum, and automate most of the steps.
This flow does not have to cover all possible scenarios, just 80% most common ones to get user to success as fast as possible.
Example Day 2 CLI certificates
The UX in the previous example won't scale for Day two, as there are many configuration options to consider, so for a day two user we can offer something more flexible at the expense of adding complexity.
tbot bootstrap service-a --cluster=teleport.example.com
[1] Generating tbot.yaml for service a in ./tbot/configuration/tbot.yaml
[2] Generating service-a role...
[3] Generating systemd unit ./tbot/certs/service-a/cert.pem and key.pem...
[4] Starting a daemon...
In this case instead of a simple one liner, we generate detailed step-by step parts and instruct users how to configure those.
Make sure errors and info give specific instructions and give enough information.
Explore common failure modes and how users can recover from them.
Here are a couple of examples of messages that need work:
Please review the access list "My-Awesome-Team", the review is due in 4 days.
This error message misses the actual link or any specific steps users need to take to review the list.
Failed to set up Okta integration - "Bad request".
This is the most frustrating error messages users can encounter - they don't see any logs, no way to re-test it or trigger the error, and all they can do is to reach out to support.
Cloud is a first class citizen. The feature setup can no longer rely on static teleport.yaml configuration, as this automatically excludes all cloud customers.
Consider the UX of configuration changes and their impact on Teleport upgrades.
Describe the security considerations for your design doc. (Non-exhaustive list below.)
- Explore possible attack vectors, explain how to prevent them
- Explore DDoS and other outage-type attacks
- If frontend, explore common web vulnerabilities
- If introducing new attack surfaces (UI, CLI commands, API or gRPC endpoints), consider how they may be abused and how to prevent it
- If introducing new auth{n,z}, explain their design and consequences
- If using crypto, show that best practices were used to define it
Describe the privacy considerations for your design doc. (Non-exhaustive list below.)
- Consider the principles of "privacy by design" and "privacy by default"
- Explore the kind and type of data that will be generated, accessed, or collected
- Explore if all this data needs to be collected, and if so, if it is possible to mask of any Personal Data / Personally Identifiable Information (PII) or at least limit collection to only what is necessary
- Explore where and how the data will be stored, how long it will be kept for, and how it will be retained/deleted
- Explore if there are sufficient logs showing any data access or modification
Include any .proto
changes or additions that are necessary for your design.
Describe the impact that your design doc has on backwards compatibility and include any migration steps. (Non-exhaustive list below.)
- Will the change impact older clients? (tsh, tctl)
- What impact does the change have on remote clusters?
- Are there any backend migrations required?
- How will changes be rolled out across future versions?
Include any new events that are required to audit the behavior introduced in your design doc and the criteria required to emit them.
Describe how you will know the feature is working correctly and with acceptable performance. Consider whether you should add new Prometheus metrics, distributed tracing, or emit log messages with a particular format to detect errors.
Describe how we can determine whether the feature is being adopted. Consider new telemetry or usage events.
Include any changes or additions that will need to be made to the Test Plan to appropriately test the changes in your design doc and prevent any regressions from happening in the future.