-
Notifications
You must be signed in to change notification settings - Fork 662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Support for Token Rotation #1244
Add Support for Token Rotation #1244
Conversation
@@ -91,7 +91,7 @@ export class InstallProvider { | |||
} | |||
|
|||
/** | |||
* Fetches data from the installationStore for non Org Installations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvements! As this pull request takes a bit long before merging into main branch, can you submit another pull request including some token rotation unrelated changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! The correction to the comment above is the only non-token rotation change included here. Since it's not time-sensitive and it's just a comment, I think it's fine to just wait until this PR goes in to see it fixed.
packages/oauth/src/index.ts
Outdated
@@ -636,7 +739,11 @@ export type OrgInstallationQuery = InstallationQuery<true>; | |||
// of Bolt. | |||
export interface AuthorizeResult { | |||
botToken?: string; | |||
botRefreshToken?: string; | |||
botTokenExpiresAt?: number; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about having a comment like // UTC time in seconds
for the expiresAt properties?
packages/oauth/src/index.ts
Outdated
const installationUpdates: any = { ...queryResult }; // TODO :: TS | ||
client = new WebClient(undefined, this.clientOptions); | ||
|
||
// TODO :: Concurrent Calls? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concurrent calls are fine as long as no call can be skipped. Specifically, we cannot have delayed calls here as the calls can be forcibly terminated when running on a FaaS runtime (AWS Lambda, Google Cloud Functions, Azure Functions). We may want to have a flag option like processBeforeResponse in Bolt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point. Since it's only ever two calls, do you think implementing a concurrent strategy is necessary? Also, thoughts on how to move forward with the retry logic if a call fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think implementing a concurrent strategy is necessary?
Yes, I think that it's valuable for end-user experience and less 3 second timeouts. To avoid having an extra overhead, we would like to asynchronously do these refreshing API calls as it's not necessary for this incoming request handling. If the oauth.v2.access
here takes 300-500 milliseconds in average, calling twice may take 1 second in a worst case scenario. This may not be a small overhead.
I haven't checked if this works but one possible idea would be having a private function like async function refreshExpiringToken(tokensToRefresh)
and calling the method asynchronously (no await
) by default. In addition to that, we can consider FaaS applications by calling the same method with await
if processBeforeResponse
is true.
Also, thoughts on how to move forward with the retry logic if a call fails?
It depends on the error. If the error is recoverable (e.g., connection failure, timeouts), we can retry with the same set of parameters. If the response has an error code like invalid parameters or invalid auth, of course, we don't need to retry the call anymore.
packages/oauth/src/index.ts
Outdated
|
||
if (tokensToRefresh.length > 0) { | ||
const installationUpdates: any = { ...queryResult }; // TODO :: TS | ||
client = new WebClient(undefined, this.clientOptions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This client
can be const client
as it's used only inside this code block (scope).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! This is a remnant from a change I made during cleanup. Will fix.
packages/oauth/src/index.ts
Outdated
refresh_token: refreshToken, | ||
}) as OAuthV2Response; | ||
|
||
// TODO :: Sort out TS issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, as we use automatic code generator for the types, we are unable to have non-optional properties here.
When we can assume the property exists 100%, using !
can be okay but if I wrote this part, I would have if statements verifying the existence. Doing this satisfies the TS compiler plus may detect really unexpected situations (say, Slack server-side regressions / incidents).
const tokenType = refreshResp.token_type;
if (tokenType === undefined) {
throw new Error("unexpectedly token_type is missing here!");
}
const expireIn = refreshResp.expires_in;
if (expireIn === undefined) {
throw new Error("unexpectedly expire_in is missing here!");
}
const expiresAt: number = currentUTCSec + expireIn;
if (tokenType === 'bot') {
authResult.botToken = refreshResp.access_token;
authResult.botRefreshToken = refreshResp.refresh_token;
authResult.botTokenExpiresAt = expiresAt;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type is actually not using the generated code version of the response, but it brings up the greater question: should it? Are we throwing out all of our hand-rolled types and opting for the generated ones at this point?
Thanks for your suggestion for a resolution the current state of things -- I'll plan to incorporate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type is actually not using the generated code version of the response, but it brings up the greater question: should it?
Yes, the type will be updated once the properties are GAed (or at least in public beta). With the current approach, tall the properties in responses are optional. There is no way to exclude specific properties at this point. During the beta period, you can go with (=internally cast the type) an internal type in this source file.
Are we throwing out all of our hand-rolled types and opting for the generated ones at this point?
We don't have any hand written response types now. Also, we don’t have the bandwidth to guarantee the quality of it for all API responses and I have to say that the situation can be the same in the future. That's why we are auto generating the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OAuth, we do have hand rolled types right now. See https://github.com/slackapi/node-slack-sdk/blob/main/packages/oauth/src/index.ts#L705-L726
But maybe we shouldn't have these and instead be switching over to using the generated types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every time we remove types that reliably indicate what properties are present I die a little inside, but I do acknowledge the motivation outlined above. 🙂
Just let me know if you two would like me to remove it when the generated types are updated to reflect the payload changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha I agree. Lets not remove anything right now. Maybe in the future we can get more reliable with the generated response types and then we can revist this.
packages/oauth/src/index.ts
Outdated
|
||
if (authResult.botRefreshToken !== undefined && authResult.botTokenExpiresAt !== undefined) { | ||
const botTokenExpiresIn = authResult.botTokenExpiresAt - currentUTCSec; | ||
if (botTokenExpiresIn <= 7200) { // 2 hours |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are doing the same twice here. How about extracting the logic as a private function function isExpiring(expiresIn, currentUTCSeconds): boolean
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After circling back to this, I've decided to pull out the bad use of a magic number into a const
above. I don't think that creating a new function for a comparison operator is a good choice here, as it causes even more legwork for the reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. That's okay. Another factor to generally consider is that naming a chunk of logic may help us easily understand what the lines of code do but this time, having a method or not is not a big deal at all.
On another note, while checking this pull request again, I noticed that we can have two groups of tokens here.
- the tokens already expired
- the tokens that are going to be expired in 2 hours
In the case of 1. , we always must immediately call oauth.v2.access API with a refresh token for following usage.
On the other hand, in the case of 2., we basically don't want to have the refresh API call in the synchronous operation for an incoming request as it's not yet expired. As I mentioned in my comment, we may want to have a new flag option similar to processBeforeResponse
for FaaS users. If the flag is true, we have to call the refresh API synchronously.
I don't know having 1. and 2. in the tokensToRefresh
array is intentional but a bit more consideration for 2. can improve app's response time and it would be valuable for user experience and less timeouts. Thoughts? If you want to see example code, I can come up with Python draft PR in a few days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree re: the chunks of logic and their role in aiding readability, so I do understand your point. I'll take another pass at it and see if I can improve upon what's there.
Regarding the second part, you bring up excellent concerns. The use of tokensToRefresh
was definitely intentional when I wrote it because we considered the situation either refresh vs. no refresh, but I agree with you that we can do better by drilling down. Given my lack of first-hand experience with the FaaS experience, I will rely on you heavily to ensure the inclusion of an additional option is being incorporated properly.
I'll work on the above over the next day or so and would like to compare notes with your implementation so that we can end up with the best solution. Thanks for your continued input!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding 2, I think for the beta we stick with the current approach (make the listener code wait until token refresh is done, even if token isn't expired). I'd like to see if people in the real world start getting 3 second timeout issues with this change. If the answer is no, than our solution is sufficient and we don't need to optimize this for now. If timeouts do increase, then we may want to look at moving token refresh code into web-api or possibly or even adding an entry point to run token refresh after the listener is completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing thoughts, you both!
I agree that we can go with a simple approach first and see how it goes during the beta. The discussion here should be helpful when we optimize the part when it's needed. Let's go with this way for its beta releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @misscoded and thanks for the detailed write up about Token Rotation!
Two questions that come to mind are:
1. Tokens refreshed on events: When an event is received, the authorize
is run. What happens when an app doesn't subscribe to any events? For example, the app may only want to use channels.list
and chat.postMessage
. Would token rotation not work in that situation?
2. Refreshing a token: When we refresh a token, is the previous token immediately invalidated?
I can imagine a scenario where multiple events trigger at once, causing the same refresh token to be refreshed by multiple event handlers at the same time. Each one will be written to the database and there's no guarantee it would be the most "fresh" token.
In this scenario, would any of the refreshed tokens be valid? Or is there a chance that we would save an "old" token to the database?
This is a very valid point, IMO, and was my biggest push back when Steve/Kaz hit me with the redirection to event-based refresh rather than API-call-based. I can't speak in depth about the rationale beyond what I understand to be considering it an edge case and unlikely to happen, but I could be misremembering that so I'll tap @stevengill to fill in the details of the conversation where it was decided.
So, refresh tokens actually expire 5 minutes after they're first used, meaning that the same refresh token could be used multiple times back-to-back. That said, only two
In the case of multiple events coming in at once, assuming each is written to the DB, I've been assured it's been deemed safe to assume that the last one (or two) would be valid. |
I think the thinking here is that the main way to interact with client in bolt is to use a listener. Before any listener gets executed, authorize will run (which means the token can get refreshed). If they use client outside of the listener, that client will potentially have the wrong token attached. We may want to see if we can update the token on the top level app.client property (or remove this property?). Issue is the way webclient initializes, it takes in a token. We don't have a mechanism to change the token on a web client instance as far as I know. I wouldn't be surprised if people file an issue for this. Lets see how it goes. We very well may have to do some refactoring to web-api. |
packages/web-api/src/methods.ts
Outdated
@@ -506,6 +506,7 @@ export abstract class Methods extends EventEmitter<WebClientEvent> { | |||
access: bindApiCall<OAuthAccessArguments, OauthAccessResponse>(this, 'oauth.access'), | |||
v2: { | |||
access: bindApiCall<OAuthV2AccessArguments, OauthV2AccessResponse>(this, 'oauth.v2.access'), | |||
exchange: bindApiCall<OAuthV2AccessArguments, OauthV2AccessResponse>(this, 'oauth.v2.exchange'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Need to have OAuthV2ExchangeArguments
for this. The arguments are not the same with oauth.v2.access
. It has token
. Regarding the response, it's fine to use WebAPICallResult
for now.
a5ecedd
to
33d1c6d
Compare
}).catch(e => e) as OAuthV2ExchangeResponse; | ||
}); | ||
|
||
return Promise.all(refreshPromises); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent a good amount of time trying to best solution this, but would love input.
With our current tsconfig, we don't have the ability to use Promise.allSettled
(we use es2017
and would need es2020
), which is what we want in this case (to wait until all Promises are settled, despite outcome of each).
Promise.all
will fail fast if a rejection occurs, and I think under normal circumstances that would be handled fine with the additional .catch(e => e)
that's included for each individual Promise, but in the case of a failure, the WebAPI actually throws an error before that catch can happen (I think that's what I've been observing while testing. Feel free to challenge that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we dropped Node 10 support, perhaps, Promise.allSettled
should be available in all supported Node runtimes.
const currentUTCSec = Math.floor(Date.now() / 1000); // seconds | ||
const tokensToRefresh: string[] = this.checkTokenExpiry(authResult, currentUTCSec); | ||
|
||
if (tokensToRefresh.length > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled out some logic (checkTokenExpiry
and refreshExpiringTokens
) in an attempt to clean up and separate concerns, but stopped short of going beyond this point because I couldn't figure out a clean way to do so without also mutating authResult
within the confines of the helper function and not hindering readability. Happy to take someone else's attempt and incorporate it.
packages/oauth/src/index.ts
Outdated
@@ -725,6 +859,20 @@ interface OAuthV2Response extends WebAPICallResult { | |||
}; | |||
} | |||
|
|||
export interface OAuthV2ExchangeResponse extends WebAPICallResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to add this because of conflicts created in other, pre-existing parts of the code. If someone else wants to take a stab at how to resolve this in another way, again, more than happy to accept a better solution!
team: { id: string, name: string }; | ||
enterprise: { name: string, id: string } | null; | ||
is_enterprise_install: boolean; | ||
response_metadata: {}; // TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find what the actual shape of this property is, neither in the docs nor in conversations. If anyone happens to know, would like to update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't run the code yet but left a few comments
packages/oauth/src/index.ts
Outdated
* | ||
* The return value is an Array of expired or soon-to-expire access tokens. | ||
*/ | ||
private checkTokenExpiry(authResult: AuthorizeResult, currentUTCSec: number): string[] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think check
is generally a bit unclear naming. How about naming the method using detect
or extract
? The method's responsibility is to return expired/expiring tokens from the auth result.
private checkTokenExpiry(authResult: AuthorizeResult, currentUTCSec: number): string[] { | |
private detectExpiredOrExpiringTokens(authResult: AuthorizeResult, currentUTCSec: number): string[] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On another note, these new private methods are stateless (=they don't access the internal state of this object and probably they should not). You may want to place them at top level. You can call them without self.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both great suggestions, I'll make the changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I went to actually make this change, it became apparent that the second function does rely on the instance (client
, clientId
and clientSecret
). If there is a way to do as you've indicated that I'm not seeing, happy to hear it and adjust, but for the time being, I'll just make the name change and move the detection function to the top level.
}).catch(e => e) as OAuthV2ExchangeResponse; | ||
}); | ||
|
||
return Promise.all(refreshPromises); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we dropped Node 10 support, perhaps, Promise.allSettled
should be available in all supported Node runtimes.
33d1c6d
to
4c55fbf
Compare
…o 1241-add-token-rotation-support
#1241
Although the original consensus was to tie the expiry check and subsequent refresh to making a call to the API, it was at some point decided that this should occur when events are received. The current state of this draft reflects that decision.
HOW IT WORKS
On OAuth installation (the only installation flow allowed when token rotation is enabled), new access tokens are stored as they have been, but now include an accompanying
refresh_token
andexpires_in
. Therefresh_token
(does not expire) is used to "trade-in" for a newaccess_token
andrefresh_token
pair.For each token (bot|user), we use the
expires_in
value to calculate and store a corresponding future expiry timestamp (UTC, seconds).Each time an event is received,
authorize
gets run. Here we determine if token rotation is enabled. If so, we check the expiry of the token(s) and refresh those that have expired or will expire within 2 hours. We then update the returnedauthResult
and theInstallation
.NOTABLE CHANGES
AuthorizationResult
has four new keys:botRefreshToken
,botTokenExpiresAt
,userRefreshToken
, anduserTokenExpiresAt
. This was to handle apps that have both User and Bot tokens while still keeping the shape flat.Installation
has four new keys: arefreshToken
andexpiresAt
for both theuser
andbot
objects.TO CONSIDER
2 hours is the current window of time we try to get ahead of an expiry. If customization of the TTL is introduced later on, this is likely too inflexible and will need to be revisited.
Refreshing on incoming events still feels strange, but it appears to work as expected thus far. Logic for the expiry checks are run on every incoming event, which is no problem with single user testing, but I'm unsure if results would be different with events constantly pouring in across many channels.
TODO / OUTSTANDING
Retry on failure to refresh (discussion)IntroducedeleteInstallation
(discussion)Requirements (place an
x
in each[ ]
)