Proposal for secrets metadata feature #8520

JanKrivanek · 2023-03-02T12:51:18Z

related to #8401

Context

Preliminary design proposal for the Sensitivity Metadata feature
This PR is meant to be a form of design review

JanKrivanek · 2023-03-06T18:41:04Z

Tagging @michaelcfanning as he has been researching and prototyping in this area quite extensively lately

michaelcfanning · 2023-03-22T23:04:58Z

documentation/design/security-metadata.md

+ * concatentaing property with other values or flattening item values or transforming items and then passing via other property - should such be ignored or redacted as whole, or redacting just the part formed from the sensitive data?
+ * values created via [Property functions](https://learn.microsoft.com/en-us/visualstudio/msbuild/property-functions) or [Item functions](https://learn.microsoft.com/en-us/visualstudio/msbuild/item-functions).
+
+## Out of scope


Out of scope

In previous exploration, we explicitly excluded processing path information, as this is very copious in the processing stream. It is theoretically possible someone might encode a password in a source file or directory name but seems unlikely.

Yeah - this is a good point. I can imagine someone would want to redact e.g. their login (#8493) - we need to find proper balance with acceptable performance (e.g. opt-in levels of redacting)

documentation/design/security-metadata.md

michaelcfanning · 2023-03-28T14:24:28Z

documentation/design/security-metadata.md

+    * and such is indicated by user as sensitive, the generated build event needs to be redacted.
+    * and such is not indicated by user as sensitive, but the R-value is indicated as sensitive - the data holder (property/item) is marked as holding sensitive data and treated accordingly.
+    * and such is not indicated by user as sensitive, the value is passed to sensitivity indicating regex or callback (in case any of those are configured by user) and if matched - the data holder (property/item) is marked as holding sensitive data and treated accordingly.
+* No other redacting of log events will be performed. This is not a strong requirement - we can introduce another opt-in level of strict inspection of all log events. The gain is very questionable though, while the performance impact is severe (internal experiments by @michaelcfanning measured by @rokonec indicate 4-times slow-down on mid-size build). Additionally to being perf-expensive, it can possibly get easily confused - e.g.:


We should double-check with @rokonec, I'm not sure this figure is correct. I thought it was more like a 40% slowdown (and this was only for the binary logging scenario. Logging to console and file was lower). We did not complete sufficient profiling in the binary logging case to understand whether we're looking at an intimidating floor as far as slowdown costs. Our prototype also uses the built-in .NET regex engine, by far the slowest option (between things like SRM, RE2, etc.). The .NET 7 platform (and its regex engine) looks like it will greatly improve time-to-execute for this scenario.

The measured slowdown ratios for protoype with 5 secret formats were 1.03 for binlog and 1.75 for the diagnostic level of text logger. the 4x possible slow down was estimated from the increase of the patterns #.

This can likely be tackled a lot, but still feels as candidate for extra opt-in (as compared to best-efforts quick solution scanning just most likely sources of input info).

documentation/design/security-metadata.md

Forgind

Interesting! I generally like the idea. Few thoughts:
Should we also permit redacting property/item names (not just the values)
Is the idea that a Secrets would work a bit like a PropertyGroup but with preset things you can opt into redacting? (Looking at things like <GH_token /> and wondering how that works)
I'm wondering if it's ok to wait until events reach loggers before redacting. On the one hand, that's less secure. On the other hand, if we implement the redaction logic correctly in our loggers, and customers trust their own custom loggers, then everything will be reacted properly, and we can probably do that more efficiently than trying to look at every single message sent to a logger...which also makes it easy to miss things.

I'm thinking the custom plugin version might be the cleanest option but also the most difficult for a user to implement. I personally like the option you said was not recommended 🙂 A single global property doesn't have great support for differing types, but we ultimately want to redact strings in various forms, so I think we could make it work, and properties are absurdly easy to set.

JanKrivanek · 2023-04-04T17:06:04Z

@Forgind great thanks for feedback and thoughts!

Should we also permit redacting property/item names (not just the values)

I'm strongly opposed to this :-)
This night look like being ok with having sensitive data within a source repository. The build logs are then not the main concern.

Is the idea that a Secrets would work a bit like a PropertyGroup but with preset things you can opt into redacting? (Looking at things like <GH_token /> and wondering how that works)

Exactly.
Though - as mentioned - this is just longer term thinking that's likely subject to many changes (or withdrawal)

I'm wondering if it's ok to wait until events reach loggers before redacting. On the one hand, that's less secure. On the other hand, if we implement the redaction logic correctly in our loggers, and customers trust their own custom loggers, then everything will be reacted properly, and we can probably do that more efficiently than trying to look at every single message sent to a logger...which also makes it easy to miss things.

Ideally we do not look on every single message - just the build events that we know were formed together using properties or items indicated as sensitive.
On the other hand within logger we already do not have this information - so we'd either need to change logging contract or inspect all events.

I'm thinking the custom plugin version might be the cleanest option but also the most difficult for a user to implement. I personally like the option you said was not recommended 🙂 A single global property doesn't have great support for differing types, but we ultimately want to redact strings in various forms, so I think we could make it work, and properties are absurdly easy to set.

This is actually an interesting point. This might partly alleviate the problem with need to parse all properties and items prior start of redacting. It might be good idea to allow simple setting mode (fast failing if secrets indication used via items).
I'm adding this thought into the doc

Forgind · 2023-04-04T20:50:43Z

I'm strongly opposed to this :-)
This night look like being ok with having sensitive data within a source repository. The build logs are then not the main concern.

That's an interesting point. I certainly don't think you should have sensitive names within an OS repository. I'm less sure about how important it is within a closed source repo, especially as you might well have secret project names or function names, and those would also be exposed to anyone with access to the source. I'm currently ok with either solution 🙂

On the other hand within logger we already do not have this information - so we'd either need to change logging contract or inspect all events.

I think the worst case scenario is where we tell people that we have a new feature that hides secrets, but we also ensure that it isn't a breaking change. A customer logs some custom events, and we don't realize they contain secrets, and they end up un-redacted, but because the customer knows we hide secrets, they publish their log somewhere public. I don't think there's a perfect non-breaking solution here, so I think your proposal is reasonable; I'm just trying to think through places things could go wrong.

JanKrivanek · 2023-04-05T06:30:39Z

I think the worst case scenario is where we tell people that we have a new feature that hides secrets, but we also ensure that it isn't a breaking change. A customer logs some custom events, and we don't realize they contain secrets, and they end up un-redacted, but because the customer knows we hide secrets, they publish their log somewhere public. I don't think there's a perfect non-breaking solution here, so I think your proposal is reasonable; I'm just trying to think through places things could go wrong.

Yes - this will allways be just best efforts (as frankly - ideally MSBuild doesn't handle secrets, there should be other means of handling auth/autz - valuts, roles, ...). We need to be clear about is, with clear messaging of what is not supported. Current proposal of not supported possibilities of spilling sensitive data: https://github.com/dotnet/msbuild/pull/8520/files#diff-fd2a314820444c86af18122afc82a49145ddbe37cd3730b7f3e5c01f86ff02cdR108-R113
That feels it covers the menstioned scenario - or do you think we should add something?

Initial proposal for secrets metadata feature

d207240

JanKrivanek marked this pull request as draft March 2, 2023 12:52

Reflect offline feedback from @rainersigwald

7df2601

Updating info

0c4393f

michaelcfanning reviewed Mar 22, 2023

View reviewed changes