
aws-ecs: FirelensOptions configFileValue Required - ECS Now Enforcing - Recent Change Not Available on 1.x #21611

Closed
cameron-dunn-sublime opened this issue Aug 15, 2022 · 15 comments
Labels: `@aws-cdk/aws-ecs` (Related to Amazon Elastic Container), `bug` (This issue is a bug), `needs-triage` (This issue or PR still needs to be triaged)

Comments

@cameron-dunn-sublime

Describe the bug

FirelensOptions.configFileValue should be optional. This was correctly updated in b79b2e47; however, that change has only been released in 2.x.

#16594 also raised this underlying issue.

It appears AWS ECS may have begun enforcing that an empty string is an invalid value. Last Thursday I could create new task definition revisions which included the empty string, but as of today I began seeing "The 'config-file-value' value can't be empty or contain only whitespace. Specify a valid 'config-file-value' and try again."

I have opened a support request within AWS for this issue because it impacts stacks we can't easily update; however, we wish to continue with regular deployments (e.g. AWS CodePipeline for upgrading images, which creates new task definition revisions).

Currently I believe any 1.x-based CDK application will be unable to create task definitions that use FirelensOptions without specifying a configFileValue. Basically, what was a mostly harmless issue in CDK is now blocking deployments within ECS.

Expected Behavior

Can omit configFileValue within a 1.x CDK application.

Current Behavior

Must specify an empty string for configFileValue, which will then error in AWS ECS.

Reproduction Steps

```typescript
const fluentBitContainerDef: FirelensLogRouterDefinitionOptions = {
    image: ecs.RepositoryImage.fromRegistry("public.ecr.aws/aws-observability/aws-for-fluent-bit:2.21.0"),
    firelensConfig: {
        type: FirelensLogRouterType.FLUENTBIT,
        options: {
            enableECSLogMetadata: true,
            configFileValue: "", // Required but shouldn't be
        },
    },
    // ...
```

Deploy a task with configFileValue: "" and see it fail in AWS ECS.

Possible Solution

  1. Release b79b2e47 to CDK 1.x
  2. Follow up with the AWS ECS team to roll back the new validation, because this check breaks creating new task definition revisions on stacks created/updated prior to (1). E.g. a stack which uses an AWS CodePipeline for updating container images can no longer do so.

Additional Information/Context

No response

CDK CLI Version

1.137.0

Framework Version

No response

Node.js Version

v17.5.0

OS

OS X

Language

Typescript

Language Version

No response

Other information

b79b2e47
#16594
aws/aws-for-fluent-bit#352

@cameron-dunn-sublime cameron-dunn-sublime added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 15, 2022
@github-actions github-actions bot added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label Aug 15, 2022
@cameron-dunn-sublime
Author

CC @PettitWesley

@PettitWesley
Contributor

PettitWesley commented Aug 15, 2022

aws/aws-for-fluent-bit#352

It appears AWS ECS may have begun enforcing that an empty string is an invalid value.

We chose to do this on purpose, and I apologize for any inconvenience it caused. The thing is, the empty string was never supposed to be valid in the API, so we consider adding the validation a fix. See the linked issue above. The feature was built assuming that the config file value is either NULL/missing or a real, valid config file. An empty string causes problems.

@cameron-dunn-sublime
Author

Thanks, I totally understand that this value really should have been omitted, but we were forced to set it to a value by CDK. When I originally set up FireLens I noticed it should not be included, but I didn't have a choice (hence the `// Required but shouldn't be` comment in my code sample).

This isn't really just an inconvenience, but a pretty major issue for me and my customers at the moment. We have some stacks which we cannot immediately do a stack update on, and now those stacks can't receive minor AWS CodePipeline updates.

Empty string causes problems.

Could you explain what the issues are? Again I totally appreciate that this value should not be included, and I am happy to work to get all of my stacks into that state, but my FireLens configuration/ECS tasks have been working fine and only adding this check has caused problems.

Maybe this check could be omitted for the time being, or an empty string could be converted to NULL/missing on the API side? Currently any customers on AWS CDK 1.x cannot omit the value, i.e. 1.x CDK stacks with ECS will be broken (if they use FireLens).

@PettitWesley
Contributor

@cameron-dunn-sublime see this: fluent/fluent-bit#5454

With an empty string import, your Fluent Bit config will not be fully read AND no error will be output.

@cameron-dunn-sublime
Author

@PettitWesley I'm not sure what config that would refer to -- I don't think I'm setting any configuration, but maybe ECS is setting something for me? Does that only apply to Fluent Bit 2.25.0+? I am still using 2.21.0 and my logs definitely are flowing as desired. I know it would be kind of gross, but if this is only an issue in later versions of Fluent Bit, maybe the check could allowlist or denylist based on any version info present in the image tag?

I actually tried upgrading to a new image recently and it failed. I didn't look into it any deeper, but perhaps this issue is why it failed. Once this is resolved on the CDK side I'll try that out again.

Again, I appreciate that this is considered a fix and the value in having the check, but this was a backwards incompatible fix. And CDK 1.x does not include your change, which means I still don't even have a route to fix this (and even with the change this will impact my already deployed environments). Your change applies cleanly onto the v1-release branch, so I believe it would be straightforward for CDK team to apply your change at least.

@PettitWesley
Contributor

@cameron-dunn-sublime Thank you for your feedback. We are trying to determine how to handle this difficult situation for all customer segments.

To understand how FireLens works, please see:

There is a generated configuration file which imports (@include) the config you supply with configFileValue. Starting in roughly 2.25.0, Fluent Bit will experience the issue linked here if there is an @include that tries to import nothing: fluent/fluent-bit#5454
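The generated-file mechanics described above can be sketched as follows. This is an illustrative fragment only; the file path and comments are my assumptions, not the exact file ECS produces:

```
# FireLens renders a base Fluent Bit config and then @includes the file
# built from configFileValue (the path below is hypothetical):
@INCLUDE /fluent-bit/etc/external.conf

# If configFileValue is "", the included file is empty; on newer Fluent Bit
# versions this hits fluent/fluent-bit#5454, and the config is silently
# not fully read.
```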

@cameron-dunn-sublime
Author

Thank you for the links and explanation! The generated config makes sense.

I'm glad we know why my existing configuration still works.

We are trying to determine how to handle this difficult situation for all customer segments.

Please keep me updated on the results of this! I look forward to seeing how a customer obsessed team resolves this.

@PettitWesley
Contributor

@cameron-dunn-sublime

I have opened up a support request within AWS for this issue because it impacts stacks we can't easily update

Can you please ask support to route this ticket to Wesley Pettit from the AWS ECS FireLens team? They should know how to do that.

Then I may be able to help unblock you specifically. No promises; just that we can take a deeper look at your specific issues here and maybe find a better temporary workaround.

@cameron-dunn-sublime
Author

On it, thank you!


I was also able to upgrade my CDK application to v2.x. The blocked stack updates still have impact, but on the CDK side I'm no longer blocked. IMHO it's still worth releasing b79b2e47 to v1.x for any others who are impacted and need a quick fix, but it no longer affects me.

@cameron-dunn-sublime
Author

Support also hasn't responded on the ticket at all (it has been open for 21 hours). If it's helpful, my email is in my GH profile; feel free to contact me there and I can also respond with the case ID.

@cameron-dunn-sublime
Author

@PettitWesley sorry, I'm not getting any response or traction on my internal ticket. Case ID 10574387121 if you can access it, I guess it's fine to share that here.

@kaizencc
Contributor

Hi @cameron-dunn-sublime, @PettitWesley.

The CDK team talked about this internally and decided not to put this in v1. Here is our reasoning:

CDK v1 has been in maintenance mode since June 1, and CDK v2 has been GA since December 1. Per our maintenance policy, "Maintenance: During this phase, AWS limits releases to address critical bug fixes and security issues only." So the question is: is this a critical bug fix? The answer is no, at least not on the CDK side.

Customers on v1 are broken, but if they must stay on v1 they have a path forward: they can use escape hatches to override the CDK requirement and achieve what they want. Even if we were to add the fix to the latest v1 release, customers would still need to upgrade their stacks to get the change, requiring manual effort.
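The escape-hatch route mentioned above can be sketched as follows. The CDK calls in the leading comment use the standard L1 (`Cfn`) override pattern, but are not taken verbatim from this thread; `stripEmptyConfigFileValue` is a hypothetical helper of mine that models the effect of the override on a rendered template fragment, so the behavior can be seen without deploying anything:

```typescript
// Sketch of the v1 escape-hatch workaround (assumption: standard CDK L1
// override pattern; property path shown is illustrative):
//
//   const cfnTaskDef = taskDef.node.defaultChild as ecs.CfnTaskDefinition;
//   cfnTaskDef.addPropertyOverride(
//     'ContainerDefinitions.0.FirelensConfiguration.Options',
//     { 'enable-ecs-log-metadata': 'true' } // omit 'config-file-value' entirely
//   );
//
// An override simply rewrites the synthesized template JSON. The hypothetical
// helper below models that rewrite on a FirelensConfiguration.Options map:
type FirelensOptions = Record<string, string>;

function stripEmptyConfigFileValue(options: FirelensOptions): FirelensOptions {
  const cleaned: FirelensOptions = { ...options };
  // Drop the key only when it is empty/whitespace, which ECS now rejects.
  if (!cleaned['config-file-value']?.trim()) {
    delete cleaned['config-file-value'];
  }
  return cleaned;
}

const before: FirelensOptions = {
  'enable-ecs-log-metadata': 'true',
  'config-file-value': '', // what CDK v1 forces you to emit
};
console.log(stripEmptyConfigFileValue(before));
```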

Really, this situation is the precise reason to incentivize users to migrate to v2. Only in v2 are they promised the latest and greatest features and fixes. It looks like that was the remedy here, which is great news for our customers and for the CDK team.

Closing this issue; please feel free to follow up if you have questions, and make sure to ping me so that I'll see it.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@kaizencc
Contributor

@PettitWesley @cameron-dunn-sublime.

Apologies for going back on the previous comment. We discussed this at length synchronously today, and decided that our rule of thumb for changes in v1-main should be "something that worked yesterday in v1 should work today in v1," at least until v1 is fully deprecated in June 2023. Under that rule, it makes sense that we shouldn't ask users to manually change their stacks or upgrade to v2.

In addition, using escape hatches is definitely a manual effort on the customer's part, whereas upgrading CDK is (hopefully) an automated task. So the fix will be added to CDK v1 in this week's release (for reference, #21710).

@cameron-dunn-sublime
Author

Thank you @kaizencc, I think that's the right call and the right way of thinking about it. I'm glad you're prioritizing the customer experience.


Just to close out the public side of this thread, I got a response from the ECS team on my internal ticket about "the ECS check which was added and breaks existing/functional task definitions":

> We did this on purpose to prevent customers from facing issues that are hard to debug and understand. And essentially we consider failing to add that validation at launch to be a miss. aws/aws-for-fluent-bit#352

I'm disappointed by the decision to knowingly make a breaking change without outreach (I'm curious whether the impact was measured at all). It doesn't make sense to break customers because of a miss at launch, especially since the issues mentioned don't exist for certain supported versions of the FireLens image. And the ECS team knew that many customers may be in this state because they used a supported AWS tool, CDK. Bad rollouts happen, and gracefully dealing with the implications of those misses should be expected.
