-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(cli): watch streams resources' CloudWatch logs to the terminal #18159
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be an amazing feature - awesome job @corymhall!
My main feedback:
- Let's make the logs show by default.
- I don't think waiting for a resource to be hotswapped to show its logs is the correct experience. Let's show all of the logs of supported resources instead.
- Let's work on the log presentation, and try to reduce the delay in surfacing them to the minimum.
- Some code needs some polishing, as far as the split of responsibilities in the APIs (see my comments inline).
public print(event: CloudWatchLogEvent): void { | ||
this.stream.write( | ||
util.format( | ||
'%s %s %s %s', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is way, way too much info on the left side. For sure you can kill the log stream from there, I think it's not important to show it. I would even change the log group name to the physical name of the Lambda, will make it clearer where the logs are coming from, and take less space.
Also, maybe think about using separators like []
to make the logs clearer? Something like this perhaps:
[HackathonExampleStack-NodeJsFunction6DD2A8DD-adO4KmEjQSuq] 3:48:32 PM: REPORT RequestId: b333acdd-f56b-403a-9b88-2a3bc07a62b9 Duration: 17.37 ms Billed Duration: 18 ms Memory Size: 128 MB Max Memory Used: 56 MB Init Duration: 185.21 ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is partially updated. Still working on the final version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general approach! A few more comments. The main thing is that I think we should try to unchain this functionality from hotswapping - I think it should be available to all cdk watch
customers, regardless of whether they use hotswapping, or not (I don't see this functionality being tied to hotswapping in any significant way).
This adds a new `--logs` flag on `cdk watch` which is set to false by default. When this flag is provided (set to true) watch will fetch any CloudWatch Logs log events for any hotswapped Lambda Function starting at the time the function was hotswapped. Once `cdk watch --logs` is run, any Lambda function that is hotswapped is added to a list of "monitored" resources. This list is maintained through full CloudFormation deployments. So for example if an application had 3 Lambda functions, after running `cdk watch --logs` none of the functions are monitored by default. 1. Change function code for Function1 Function1's CloudWatch log group is now being monitored 2. Change function code for Function2 Function2's CloudWatch log group is now being monitored 3. Change a non-hotswappable property which triggers a full CloudFormation deployment Function1 & Function2 are still being monitored re #18122
Co-authored-by: Adam Ruka <adamruka85@gmail.com>
updated to use the `print` method that we use elsewhere in the cli also switched back to using `setTimeout` instead of `setInterval` because `setInterval` sometimes caused events to be processed more than once.
@skinny85 I've updated the PR to include all hotswappable resources and it now monitors logs independent of hotswap. Some things I'm still thinking through:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is starting to look great!
We need to iterate on the names, and on the locations of the particular files, a little bit more.
To answer your questions:
- Should we or should we not filter what log groups we monitor? I think we should to some degree (we don't want to monitor VPC Flow Logs for example). Right now I am filtering for only resources that we support hotswapping for, but I don't think that is the correct level. We also might want to somehow filter out Lambda Functions that are used as custom resource providers.
Like I said in #18159 (comment), I think we should filter out the Log Groups we don't want (like the VPC flow logs), instead of keeping only the logs we deem "interesting".
Still working through the "experience" aspect of this feature. Some things I added today:
- Pause the monitor while we are performing a deployment so we don't have application logs intermixed with deployment logs. After the deployment is complete, the monitor will read and print all logs since the last time we processed logs.
I actually think it's fine if we show the deployment logs during deployment - I assume those would be mainly the logs of our Custom Resources, which might actually make diagnosing problems with them easier when they fail.
Unless that screws up the progress bar we render during deployment, in which case yeah, that's a good call 🙂.
Yeah it was messing with the progress bar during deployment. We'll still get the logs from our CustomResources, they'll just be printed after the deployment finishes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing. The experience of the feature is awesome, and the code looks perfect to me. I'd want to wait for #18277 to be merged first, and then update this PR to also use the lookup Role for new-style synthesis, but I'm giving my "ShipIt" right now, and just adding the pr/do-not-merge
label.
Awesome work @corymhall!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still good with this, just one suggestion.
* then return those and note that we are returning the default credentials. The calling | ||
* function can then decide to use them or fallback to another role. | ||
*/ | ||
export async function prepareSdkWithLookupRoleFor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance we can make this a method of SdkProvider
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had it as a method of SdkProvider
, but after discussing with @rix0rrr we decided to put it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the SdkProvider shouldn't know about CloudFormationStackArtifacts or the fact that SSM parameters hold bootstrap stack versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this feature. I just want more of it😅
* then return those and note that we are returning the default credentials. The calling | ||
* function can then decide to use them or fallback to another role. | ||
*/ | ||
export async function prepareSdkWithLookupRoleFor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the SdkProvider shouldn't know about CloudFormationStackArtifacts or the fact that SSM parameters hold bootstrap stack versions.
listStackResources(): Promise<AWS.CloudFormation.StackResourceSummary[]>; | ||
} | ||
|
||
export class LazyListStackResources implements ListStackResources { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it's more like CachedListStackResources
than Lazy*
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there ever a chance we will list the stack resources before a particular resource is created, and then never update the list to account for the new resource?
Aha I guess since we only use this for hotswapping, that will never add new resources so doesn't apply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, the PR description could do with a Cliff's Notes of the approach taken, biggest challenges, design decisions and refactorings. It's very helpful when evaluating a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, the PR description could do with a Cliff's Notes of the approach taken, biggest challenges, design decisions and refactorings. It's very helpful when evaluating a change.
That is a good suggestion! This particular file was existing functionality that was moved here, which I could have called out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it's more like
CachedListStackResources
thanLazy*
.
It is actually Lazy though (it will not make a service call unless you call listResources()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true for most methods though 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean. I think a very sensible non-lazy implementation would be to make the service call in the constructor of the class, and not in listResources()
(it would be simpler, for one thing).
try { | ||
return await this.evaluateCfnExpression(physicalNameInCfnTemplate); | ||
} catch (e) { | ||
// If we can't evaluate the resource's name CloudFormation expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this function used for? To predict physical names, before deployment happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just always list from the currently deployed stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intent here was to avoid a lookup if we could ( cc @skinny85 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is used in case we can get the physical name from the template, and in this way save a service call (not super relevant here, but very relevant for hotswapping, and this class is used in both places).
const IGNORE_LOGS_RESOURCE_TYPES = ['AWS::EC2::FlowLog', 'AWS::CloudTrail::Trail', 'AWS::CodeBuild::Project']; | ||
|
||
// Resource types that will create a CloudWatch log group with a specific name if one is not provided. | ||
// The keys are CFN resource types, and the values are the names of the physical name property of that resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to account for the prefix? Or are you guaranteeing that it will always be /aws/<servicename>/
?
Because I can assure you, AWS don't roll that way 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could update this to include the prefix to more future proof this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
export async function findCloudWatchLogGroups( | ||
sdkProvider: SdkProvider, | ||
stackArtifact: cxapi.CloudFormationStackArtifact, | ||
): Promise<{ env: cxapi.Environment, sdk: ISDK, logGroupNames: string[] }> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declare the return struct separately please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
do { | ||
const response = await logGroupsAccessSettings.sdk.cloudWatchLogs().filterLogEvents({ | ||
logGroupName: logGroupName, | ||
nextToken, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do nextToken
and startTime
interact? Don't we need to hold on to the token as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the filterLogEvents
operation returns as many log events after the startTime
as can fit in 1 MB (up to 100 events since we specified that limit). If the first 90 events took up the 1 MB then we would receive a nextToken
to retrieve the final 10. In the case where we receive the nextToken
we still want to use the same startTime
, we only want to update the startTime
for the next time we process events (the next time we tick()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting (and odd). But okay!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up removing the nextToken
since the intent is to just provide a sampling of 100 events if we have > 100.
import { ISDK } from '../aws-auth'; | ||
|
||
// how often we should read events from CloudWatchLogs | ||
const SLEEP = 2_000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need some backoff as well, eh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? Maybe I should reword this comment, it is the amount of time we wait after processing events before we start processing new events. So we:
- read all events starting with
startTime
for all log groups - print all the events
- SLEEP
- start back at 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just a little concerned about overwhelming CloudWatch Logs with filterLogEvents
calls, and getting throttled.
Typically, in a throttling event you do an exponential backoff to avoid overloading the backend service. I suppose there's an argument to be made that the SDK's retry settings would take care of this for us... but if that's your argument I'd like to see that explicitly stated somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment here to note that sleep is also how we are handling backoff. Now that we are not using nextToken
we will only make 1 request per log group. If any of the requests are throttled the SDKs builtin retry/backoff will take effect. If that doesn't resolve it, it will throw an error, wait for SLEEP
and then try again.
@@ -304,10 +310,12 @@ export class CdkToolkit { | |||
// -------------- -------- 'cdk deploy' done -------------- 'cdk deploy' done -------------- | |||
let latch: 'pre-ready' | 'open' | 'deploying' | 'queued' = 'pre-ready'; | |||
|
|||
const cloudWatchLogMonitor = options.traceLogs ? new CloudWatchLogEventMonitor() : undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So honestly, this doesn't have a lot to do with watch
, does it?
We're not monitoring changes that get triggered during watch
, or even as a result of watch
.
We're watching for activity that most likely occurs because a user clicks around in their UI.
Shouldn't this be a separate CLI command primarily, and only incidentally something that can be activated as part of cdk watch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @skinny85
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have currently any continuous running command in our CLI beyond watch
, so I don't see where it would fit besides watch
, and it slots into watch
beautifully.
If you want to create a separate command for this (cdk logs
, or something), I think that's reasonable, but that's a separate task and PR.
* Print out a cloudwatch event | ||
*/ | ||
private print(event: CloudWatchLogEvent): void { | ||
print(util.format('[%s] %s %s', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this need some spacing/alignment here to make this readable?
How long are these log group names typically?
Also, do we need the log stream name in there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this need some spacing/alignment here to make this readable?
I'm not sure, what type of spacing/alignment are you thinking? The current experience looks fine to me. I think the colors help to make it easy to read.
How long are these log group names typically?
They can be pretty long (for example /aws/lambda/AnotherHotswapTestingAppStack-Handler886CB40B-zR3TWjodpJbS
). I'm not sure how to trim it down though. Someone could just have a really long stack name and then there would be no way to differentiate the log groups in a stack.
Also, do we need the log stream name in there?
I did originally include the log stream name, but we decided it made the output even longer without really adding much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the current experience is fine. I would be even happier with substituting the name of the Log Group (which tends to be very long) with the physical name of the resource it corresponds too, but there's some complexity there, so I',m fine with it as-is.
I think the log group stream name is not needed, and, when combined with the log group name and the date, takes up 80% of a 150-character line.
private print(event: CloudWatchLogEvent): void { | ||
print(util.format('[%s] %s %s', | ||
colors.blue(event.logGroupName), | ||
colors.yellow(event.timestamp.toLocaleTimeString()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What timezone are these timestamps in? Are they correct? Wouldn't we want more control over the formatting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timestamp of the event is using UTC (expressed as the number of milliseconds after Jan 1, 1970 00:00:00 UTC) so this should just format those to the users locale. So for example someone in the US would get something like 12/11/2012, 7:00:00 PM
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, everyone will see the time printed in their local timezone, which I think is what you want here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure toLocaleTimeString()
does not do timezone conversion, and I think native d/t formatting is not a great idea.
But it will do for now, I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely see the logs in PST, so it's either toLocaleTimeString()
, or perhaps CloudWatch returns them in the correct timezone...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested on machines with different timezones and I get whatever the local time is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditionally approved on the passing of a method, and having thought about backoff.
log groups from the deployed template rather than by parsing the current template
@@ -40,7 +39,7 @@ export async function tryHotswapDeployment( | |||
account: resolvedEnv.account, | |||
region: resolvedEnv.region, | |||
partition: (await sdk.currentAccount()).partition, | |||
urlSuffix: sdk.getEndpointSuffix, | |||
urlSuffix: sdk.getEndpointSuffix(resolvedEnv.region), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope! Please, leave this as-is, and pass a lambda function there, like Rico suggested.
This is so that we avoid as many service calls as we can during hotswapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhh sorry, fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
…ws#18159) This adds a new `--logs` flag on `cdk watch` which is set to `true` by default. Watch will monitor all CloudWatch Log groups in the application and stream the log events back to the users terminal. re aws#18122 ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Is there any reason why this doesn't work during a regular deploy? I was looking into ways to make this happen, but it seems that it's only when |
This adds a new
--logs
flag oncdk watch
which is set totrue
bydefault. Watch will monitor all CloudWatch Log groups in the application
and stream the log events back to the users terminal.
re #18122
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license