Remote Configuration Proposal [Draft] #2374

tigrannajaryan · 2021-01-14T17:20:45Z

Goal

Allow remote configuration of Collectors by feeding the configuration to the Collectors from a remote configuration source. It must be possible to specify the source of the remote configuration in the Collector's local config file via a pluggable component that can be implemented by the Collector core team or independently by third-party developers in the contrib repo.

Summary

The remote configuration implementation should be pluggable at runtime (it should be possible for the end user to choose at runtime from a set of implementations defined at compile time).

We need to define a remote configuration interface (internal API) between Collector core config and remote configuration implementations. The interface should preferably have a "watcher" style, where the implementation notifies the Collector core about the availability of a new configuration (as opposed to Collector core periodically polling the implementation and asking for a new configuration).
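To make the "watcher" style concrete, here is a minimal Go sketch. The names `ConfigSource`, `RemoteConfig`, `Watch` and `staticSource` are hypothetical and are not part of any existing Collector API; the exact internal API is still to be defined.

```go
package main

import (
	"context"
	"fmt"
)

// RemoteConfig is a hypothetical container for one received configuration.
type RemoteConfig struct {
	Body []byte // raw config bytes as delivered by the source
}

// ConfigSource is a hypothetical watcher-style interface: the implementation
// calls onChange whenever a new configuration becomes available, so the
// Collector core never has to poll.
type ConfigSource interface {
	Watch(ctx context.Context, onChange func(RemoteConfig)) error
}

// staticSource is a toy implementation that pushes a fixed list of configs.
type staticSource struct {
	configs []string
}

func (s staticSource) Watch(ctx context.Context, onChange func(RemoteConfig)) error {
	for _, c := range s.configs {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			onChange(RemoteConfig{Body: []byte(c)})
		}
	}
	return nil
}

// collect runs a source to completion and returns every config it pushed.
func collect(src ConfigSource) ([]string, error) {
	var got []string
	err := src.Watch(context.Background(), func(c RemoteConfig) {
		got = append(got, string(c.Body))
	})
	return got, err
}

func main() {
	got, err := collect(staticSource{configs: []string{"receivers: {}", "exporters: {}"}})
	fmt.Println(got, err)
}
```

A real implementation would block in `Watch` until the context is cancelled and call `onChange` from its own goroutine; the toy source returns immediately only to keep the sketch short.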

The remote configuration interface may be defined on top of the current extension concept or may be completely separated from it (try and see which approach feels better). The exact internal API to be defined.

Note that this proposal does not define the behavior of an actual implementation that performs the fetching of the remote config. We specify an internal remote configuration API in this proposal and the corresponding code that uses this API in the Collector core to be implemented. Where and how exactly the config is fetched from is a concern of a particular implementation (and there may be many such implementations in the core or contrib repos).

Remote configuration capability should be enabled via a command line switch (e.g. --remote-config=<filename.yaml>) that points to a local config file that contains the remote configuration setting (e.g. the implementation type, the endpoint to fetch the config from, etc). Note that if we choose top-level keys in this yaml file carefully we may allow this to be the same file as the current local config file (which has potential future uses such as fallback to local config without requiring 2 config files).

Operation

The Collector core will initialize the Remote configuration implementation and will provide attributes that identify the Collector, then will wait to be notified about the availability of a new config. The implementation may use these identifying attributes to fetch the configuration that is applicable to this particular Collector.

The Collector core will identify itself using the following attributes:

Static attributes defined at build time, such as Collector version, version of OS it is built for, commit hash, etc.
Dynamic attributes that the Collector will auto-detect at runtime, such as the OS version it runs on, the machine id it runs on (if available), etc.
User-defined attributes that are specified manually by the end user in the local config file used by the Collector (such as for example "environment=production").
Collector's unique instance id, specified in "service.instance.id" attribute. The Collector will attempt to obtain this from a persistent ID source (such as machine UUID), falling back to an ephemeral generated UID.
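
As an illustration of how the four groups above could be combined into one identity, here is a small Go sketch. The function name `buildIdentity`, the example keys such as `collector.version`, and the precedence (user-defined overrides auto-detected, which overrides static) are assumptions, not something this proposal specifies.

```go
package main

import (
	"fmt"
	"runtime"
)

// buildIdentity merges the attribute groups into one map. Later groups
// override earlier ones; that precedence is an assumption of this sketch.
func buildIdentity(static, detected, user map[string]string, instanceID string) map[string]string {
	out := make(map[string]string)
	for _, group := range []map[string]string{static, detected, user} {
		for k, v := range group {
			out[k] = v
		}
	}
	// The instance id always comes from the Collector itself.
	out["service.instance.id"] = instanceID
	return out
}

func main() {
	static := map[string]string{
		"collector.version": "v0.0.0-example", // hypothetical key and value
		"build.os":          runtime.GOOS,     // hypothetical key
	}
	detected := map[string]string{"host.id": "example-host"}
	user := map[string]string{"environment": "production"}

	fmt.Println(buildIdentity(static, detected, user, "example-instance-id"))
}
```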

On startup the Collector with an enabled remote configuration option should wait for the remote configuration to arrive before the Collector's regular operation begins. This behavior may be configurable locally (e.g. how long to wait for).
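
A minimal Go sketch of the startup wait, assuming the first remote config is delivered on a channel. The function name `waitForFirstConfig` is hypothetical; treating a zero timeout as "do not wait" follows the `wait_for_config` setting described later in this proposal.

```go
package main

import (
	"fmt"
	"time"
)

// waitForFirstConfig blocks until a config arrives on ch or the timeout
// elapses. A zero (or negative) timeout disables waiting: the function only
// returns a config that is already available.
func waitForFirstConfig(ch <-chan string, timeout time.Duration) (string, bool) {
	if timeout <= 0 {
		select {
		case cfg := <-ch:
			return cfg, true
		default:
			return "", false
		}
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case cfg := <-ch:
		return cfg, true
	case <-timer.C:
		return "", false
	}
}

func main() {
	ch := make(chan string, 1)
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated fetch latency
		ch <- "remote config"
	}()
	cfg, ok := waitForFirstConfig(ch, time.Second)
	fmt.Println(cfg, ok)
}
```

When the function reports `false` the caller still has to decide what to do (start with the local config, or fail); that decision is outside this sketch.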

After the Collector core receives a remote config it will attempt to reconfigure itself. If the reconfiguration fails the Collector will revert to the last known good config.
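
The revert behaviour can be sketched in Go as follows. `Reloader`, `Reconfigure` and `demoApply` are hypothetical names; `apply` stands in for the graceful shutdown, reconfiguration and restart of the Collector.

```go
package main

import (
	"errors"
	"fmt"
)

// Reloader tracks the last configuration that was applied successfully.
// The type and its fields are hypothetical names used for illustration.
type Reloader struct {
	lastGood string
	apply    func(cfg string) error // stands in for shutdown + reconfigure + restart
}

// Reconfigure tries the new config and reverts to the last known good one
// if applying it fails. The returned error always reports the rejection.
func (r *Reloader) Reconfigure(cfg string) error {
	err := r.apply(cfg)
	if err == nil {
		r.lastGood = cfg
		return nil
	}
	if r.lastGood == "" {
		return fmt.Errorf("new config rejected and no known good config to revert to: %w", err)
	}
	if rerr := r.apply(r.lastGood); rerr != nil {
		return fmt.Errorf("revert failed: %v; original error: %w", rerr, err)
	}
	return fmt.Errorf("new config rejected, reverted to last known good: %w", err)
}

// demoApply returns an apply function that rejects the literal config "bad"
// and records every accepted config in *current.
func demoApply(current *string) func(string) error {
	return func(cfg string) error {
		if cfg == "bad" {
			return errors.New("invalid config")
		}
		*current = cfg
		return nil
	}
}

func main() {
	var current string
	r := &Reloader{apply: demoApply(&current)}
	fmt.Println(r.Reconfigure("good"), current)
	fmt.Println(r.Reconfigure("bad"), current)
}
```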

The reconfiguration requires graceful shutdown, reconfiguration and restart of the Collector, so #483 and #1007 are prerequisites.

Local Config File Format that Specifies Remote Config Settings

Possible format for local file to be read via --remote-config option:

 remote_config:
   # The name of the extension that implements fetching of the
   # remote config.
   config_source: remote_config_source   
   # How long to wait for the config server to respond during
   # startup.
   # Setting to 0 will disable the waiting and will let the Collector
   # start immediately. In that case the Collector may work with only
   # the local config for a while and the remote config may be
   # applied some time after the startup is finished.
   wait_for_config: 30s
   # User defined identity attributes which identify this Collector
   # to the config server. A map of key-value pairs. Valid characters for
   # keys are: alphanumeric, underscore, dot.
   identify_self_attributes:
     environment: production
     service.namespace: onlineshop
     service.name: checkout
   # List of directories from which it is allowed to run external
   # executables. All components must honor this setting.
   allow_executable_dirs:
     - /usr/lib
     - ./
   # List of directories from which it is allowed to read data
   # All components must honor this setting.
   allow_read_dirs:
     - /var/log
   # A set of options that enable proxying of configuration requests.
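
For illustration, the settings above could map onto a Go struct roughly like this. The struct, its field names and the `Validate` method are hypothetical; the `yaml` tags are shown only to make the mapping visible and no YAML library is used. The key check assumes "alphanumeric" means ASCII letters and digits.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// RemoteConfigSettings mirrors the remote_config section sketched above.
// All names here are illustrative, not an existing Collector type.
type RemoteConfigSettings struct {
	ConfigSource           string            `yaml:"config_source"`
	WaitForConfig          time.Duration     `yaml:"wait_for_config"`
	IdentifySelfAttributes map[string]string `yaml:"identify_self_attributes"`
	AllowExecutableDirs    []string          `yaml:"allow_executable_dirs"`
	AllowReadDirs          []string          `yaml:"allow_read_dirs"`
}

// validKey accepts ASCII alphanumerics, underscore and dot.
var validKey = regexp.MustCompile(`^[A-Za-z0-9_.]+$`)

// Validate checks the constraints stated in the comments of the config file.
func (s RemoteConfigSettings) Validate() error {
	if s.ConfigSource == "" {
		return fmt.Errorf("config_source must be set")
	}
	if s.WaitForConfig < 0 {
		return fmt.Errorf("wait_for_config must not be negative")
	}
	for k := range s.IdentifySelfAttributes {
		if !validKey.MatchString(k) {
			return fmt.Errorf("invalid identity attribute key %q", k)
		}
	}
	return nil
}

func main() {
	s := RemoteConfigSettings{
		ConfigSource:  "remote_config_source",
		WaitForConfig: 30 * time.Second,
		IdentifySelfAttributes: map[string]string{
			"environment":       "production",
			"service.namespace": "onlineshop",
			"service.name":      "checkout",
		},
		AllowExecutableDirs: []string{"/usr/lib", "./"},
		AllowReadDirs:       []string{"/var/log"},
	}
	fmt.Println(s.Validate())
}
```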

Collector Identity Auto-Detection

We will reuse detectors from the resourcedetection processor. It currently detects GCP and AWS attributes and populates attributes such as "cloud.account.id", "host.id", etc.

Collector Unique Instance ID

We will try to fetch a persistent machine id when it is available using a library like this. When a persistent machine id is not available the Collector will generate a random ephemeral UUID for its UID. An ephemeral UID is not very useful for long-term remote config purposes but is still useful for uniquely identifying the Collector at least during one session. This makes it possible to tie status reports to the particular Collector instance and show the reported effective config or config errors in the UI.
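
A minimal Go sketch of the fallback, using only the standard library instead of the machine id library mentioned above. Reading `/etc/machine-id` is an assumption that holds on many Linux systems only; the function names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"os"
	"strings"
)

// newEphemeralUUID returns a random RFC 4122 version 4 UUID string.
func newEphemeralUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// instanceID returns the persistent machine id stored at path when it can be
// read, and otherwise falls back to an ephemeral random UUID. The ephemeral
// flag lets the caller report which kind of id it is using.
func instanceID(path string) (id string, ephemeral bool, err error) {
	if data, rerr := os.ReadFile(path); rerr == nil {
		if s := strings.TrimSpace(string(data)); s != "" {
			return s, false, nil
		}
	}
	id, err = newEphemeralUUID()
	return id, true, err
}

func main() {
	// Other platforms would need a different persistent source.
	id, ephemeral, err := instanceID("/etc/machine-id")
	fmt.Println(id, ephemeral, err)
}
```

Returning the `ephemeral` flag alongside the id keeps the door open for telling the backend that the UID will not survive a restart.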

In the future we may add an ability for the Collector to inform the backend the UID is ephemeral so that the UI warns the user not to use it to create a partial config.

We may also add an ability to detect duplicate UIDs in the future, if we are not confident that the persistent or ephemeral UIDs are unique enough.

Security

Remotely controlled configuration is a security risk. Via remote configuration the Collector may be compelled to collect data and send it to a destination. The Collector today is capable of collecting data both passively, by accepting it, and actively, by scraping metrics from locally and remotely running systems. In the future we also plan to introduce a log collection capability that will allow reading local files.

In order to reduce this risk we make remote configuration capability disabled by default. It has to be explicitly enabled by the user using a setting in a local configuration file.

In the future we may have more dangerous capabilities, such as planned file log collection or ability to execute external processes for metric collection.

To reduce the risks all components should be limited to executing programs only in directories specified via allow_executable_dirs setting (any external program execution is prohibited if this setting is unspecified). All components are limited to read files only from directories specified via allow_read_dirs setting (any file read is prohibited if this setting is unspecified).
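
A sketch of the directory check a component could run before reading a file or executing a program. The function name `isPathAllowed` is hypothetical. The check is purely lexical: it does not resolve symlinks, so a real implementation would likely also need something like `filepath.EvalSymlinks`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isPathAllowed reports whether target lies inside one of allowedDirs.
// An empty or nil list denies everything, matching the rule that access is
// prohibited when the setting is unspecified.
func isPathAllowed(target string, allowedDirs []string) (bool, error) {
	absTarget, err := filepath.Abs(target)
	if err != nil {
		return false, err
	}
	for _, dir := range allowedDirs {
		absDir, err := filepath.Abs(dir)
		if err != nil {
			return false, err
		}
		rel, err := filepath.Rel(absDir, absTarget)
		if err != nil {
			continue // e.g. different volumes on Windows
		}
		// A relative path that climbs out of the directory is outside it.
		if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
			continue
		}
		return true, nil
	}
	return false, nil
}

func main() {
	allowed := []string{"/var/log"}
	for _, p := range []string{"/var/log/syslog", "/var/log/../etc/passwd", "/var/logs/app.log"} {
		ok, err := isPathAllowed(p, allowed)
		fmt.Println(p, ok, err)
	}
}
```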

Component interfaces must be modified to include allow_executable_dirs and allow_read_dirs in the Start() function or in the factory creation. All existing components must be reviewed to ensure they honour these settings.

In addition we may want to consider jailing the process to a certain directory so that we do not rely on components honouring the security settings.

Risks

If the remote config source is unavailable then remote configuration capability can bring down the entire collection. There are ways to mitigate this (e.g. persist last known good configuration) but it is not foolproof and is additional work to do.

Tradeoff, Alternates and Future Possibilities

Persisting Effective Config

This is not included in the design because we do not always have a persistent filesystem available to the Collector. Instead we opted to always fetch the config on startup. This may delay the startup but removes the need to rely on the filesystem which may not be present.

Config Pipeline

Instead of specifying remote config functionality as a first-class concept, I considered making the configuration a pipeline data type (similar to how traces and metrics are a data type), and then have a receiver for this data type and an exporter for the data type and possibly processors that can modify the remote config.

While this may be a viable idea, it is not clear at this stage whether the flexibility that pipelines bring is necessary for remote configs. It may be overkill and an unnecessary complication for the end user.

This idea is discarded for now, but we may return to it in the future if we see the need for more flexibility in remote configuration processing.

Push vs Pull

This proposal suggests that the Collector be notified when a new config is available. We could instead design the Internal Remote Config API in a way that requires the Collector to poll the remote source for config changes.

This can simplify the implementation but increases the time it takes for configuration changes to become effective.
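
For comparison, a poll-based source can be adapted to the notification style with a small loop. This is a sketch under assumptions: `pollForChanges` and `demo` are hypothetical names, and change detection is a plain string comparison. The polling interval bounds how stale the config can get, which is the latency cost mentioned above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollForChanges calls fetch immediately and then once per interval, and
// invokes onChange only when the fetched config differs from the previous
// one. It returns when ctx is cancelled.
func pollForChanges(ctx context.Context, interval time.Duration, fetch func() (string, error), onChange func(string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	last := ""
	for {
		if cfg, err := fetch(); err == nil && cfg != last {
			last = cfg
			onChange(cfg)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// demo polls a fake source that serves v1 twice and then v2, and returns the
// configs that were reported as changes.
func demo() []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	versions := []string{"v1", "v1", "v2", "v2"}
	calls := 0
	fetch := func() (string, error) {
		i := calls
		if i >= len(versions) {
			i = len(versions) - 1
		}
		calls++
		if calls >= len(versions) {
			cancel() // stop polling once the fake source is exhausted
		}
		return versions[i], nil
	}
	var got []string
	pollForChanges(ctx, time.Millisecond, fetch, func(cfg string) { got = append(got, cfg) })
	return got
}

func main() {
	fmt.Println(demo())
}
```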

Proxying Config Requests

Collector can serve as a gateway for fetching the remote config for another Collector. It would require the Collector to forward the requests (possibly batching first) then wait for the config source response and return it to the requestor.

This is possible to do and is a reasonable architecture for an Agents+Collector deployment scenario. It can be designed and implemented in the future but it is out of scope for now.

Merging Local and Remote Partial Configs

Allow merging locally-specified and remotely-received configurations to form an effective configuration. This requires specifying merging rules and complicates the implementation.
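
To show what "merging rules" could mean in practice, here is one possible rule set as a Go sketch: nested maps are merged key by key, and for any other value the remote side wins. These rules and the name `mergeConfigs` are assumptions for illustration only; lists, for example, are simply replaced.

```go
package main

import "fmt"

// mergeConfigs returns a new map in which remote values override local ones.
// Nested maps are merged recursively; every other value type (including
// lists) is replaced wholesale by the remote value. Nested maps that exist
// only on one side are shared with the result, not copied.
func mergeConfigs(local, remote map[string]any) map[string]any {
	out := make(map[string]any, len(local)+len(remote))
	for k, v := range local {
		out[k] = v
	}
	for k, rv := range remote {
		lm, lok := out[k].(map[string]any)
		rm, rok := rv.(map[string]any)
		if lok && rok {
			out[k] = mergeConfigs(lm, rm)
		} else {
			out[k] = rv
		}
	}
	return out
}

func main() {
	local := map[string]any{
		"exporters": map[string]any{"logging": map[string]any{}},
		"level":     "info",
	}
	remote := map[string]any{
		"exporters": map[string]any{"otlp": map[string]any{"endpoint": "example:4317"}},
		"level":     "debug",
	}
	fmt.Println(mergeConfigs(local, remote))
}
```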


tigrannajaryan · 2021-01-14T17:21:30Z

@alolita FYI.

shilicqupt · 2021-04-28T08:40:42Z

Does remote config mean that each component has hot-reload ability? However, the Component interface only has Start and Stop funcs; maybe we need a Reload func to support hot reload. @tigrannajaryan

I found that issue #1007 mentions it. I think I can help.

portertech · 2021-09-22T16:15:32Z

A few comments:

I love the idea of a plug-able at runtime remote configuration implementation
I would prefer the remote configuration interface to be separate concept (not an extension)
I like the scope of this proposal, keeping it to the internal API
The "watcher" style would be great
The self identification attributes are wonderful, very similar to what we did w/ Sensu
As for security, in addition to local configuration to limit remote management capabilities, the self identification attributes combined with a remote configuration implementation with mTLS could provide the basis for RBAC
I believe persisting the effective local configuration is necessary, to help ensure OT agent availability (even if the configuration is out-of-date etc)

It's my understanding that this proposal is blocked by reloadable components. I would love to figure out what would be required to move this forward, I'm very keen to see remote management in OT.

portertech · 2021-09-22T17:49:39Z

Notes from the 09/22/2021 9:00a - 10:00a PST SIG:

Parser provider and config source is an active path forward (full reload, all components)

tigrannajaryan · 2021-10-04T11:29:15Z

See a wider-scope proposal for general agent management here: #4165 (which configuration management is part of).

tigrannajaryan · 2021-10-12T18:26:11Z

@bogdandrutu @Aneurysm9 and I discussed how the Collector configuration can be extended to support the remote configuration needs. Please see the proposal here #4190 (and we can also discuss it in the first Workgroup meeting).

ymotongpoo · 2021-10-14T01:33:49Z

Question about the precondition: is this proposal assuming that the collector binary is fixed and doesn't handle any mechanism for collector component management? I am asking with an example case in mind where the user tries to change the processor configs to filter out some metrics, which requires additional filters.

tigrannajaryan · 2021-10-14T13:35:11Z

Question about the precondition: is this proposal assuming that the collector binary is fixed and doesn't handle any mechanism for collector component management? I am asking with an example case in mind where the user tries to change the processor configs to filter out some metrics, which requires additional filters.

Not necessarily "fixed" but compatible from the config file format perspective. So a particular config that is written for Collector v0.37.0 should likely work fine for v0.38.0 (assuming no breaking changes were made), even if a new component was added in v0.38.0. If a component was modified to accept new config settings (e.g. a new filter type in the filter processor) then typically it is done in a backwards compatible manner so that when the new config setting is unspecified the behavior mimics that of the old version.

tigrannajaryan · 2021-10-18T17:45:32Z

Reminder that the Agent Management Workgroup meeting is tomorrow, on Tue at 11am PT.

Please join the Slack channel if you are interested in the topic: https://cloud-native.slack.com/archives/C02J58HR58R

bertysentry · 2021-12-15T19:16:48Z

Personally, I think that remote configuration introduces serious potential security breaches for a monitoring/observability tool, and they are not limited to triggering the execution of an arbitrary command line. For example, an attacker could:

stop logs pipelines so that the attackers actions are not visible
declare a new exporter to send trace data to the attacker, to analyze the internals of an application, maybe extract credentials or tokens from span attributes
enable components that have known vulnerabilities (like one would enable log4j)

Remote configuration must be limited to authorized users or, as @portertech suggested, mTLS should be strongly encouraged.

tigrannajaryan · 2021-12-15T19:36:31Z

@bertysentry I think TLS is absolutely required and mTLS is highly desirable, but it doesn't help with the threat model that I am envisioning, particularly with the threat of a compromised remote location that is the source of the configuration. Encrypting the connection does not prevent malicious actors from injecting a bad configuration at the compromised server. The Collector should assume the source of the remote config is not trustworthy (a zero trust model). There is more on this topic in the OpAMP spec: https://github.com/open-telemetry/opamp-spec#security

bertysentry · 2021-12-15T20:02:02Z

Oh, I didn't know OpAMP, it looks awesome. So, this remote configuration for the OpenTelemetry Collector would be compliant with OpAMP?

the threat model that I am envisioning, particularly with a threat of a compromised remote location that is the source of the configuration

In your scenario, I fear that you won't be able to differentiate a compromised server from a non-compromised one. There are then only 2 options to mitigate the risk:

Limit what the collector (agent) can do (like: not execute arbitrary commands), but it's not enough as I explained in my examples earlier
Configuration files must be digitally signed, manually, against a CA that is accessible to all agents so they can check the received configuration is legit. The problem is that editing the configuration becomes an extremely tedious process.

Alternatively, you could rely on git: each collector (agent) clones the central server repository, which contains the configuration files. Periodically, the agent fetches and pulls changes on the selected branch.

The good thing with git is that protocols and security are already completely specified and tooled. It works kinda like a blockchain making sure that an agent doesn't get crap from a malicious server. Also, with git you could imagine an admin working on a configuration on an agent, and then commit and push to the central server so other agents get the same changes.

This is all theoretical, but it could solve some of the issues 😉

tigrannajaryan · 2021-12-15T22:51:40Z

So, this remote configuration for the OpenTelemetry Collector would be compliant with OpAMP?

Yes, this is the current best understanding of the agent management workgroup. This is work in progress right now. Feel free to join the WG if you are interested: open-telemetry/community#860

In your scenario, I fear that you won't be able to differentiate a compromised server from a non-compromised one.

Exactly. We have to assume the server may be compromised and protect the Collector and the machine where the Collector is installed. I think this must be part of our threat model.

Limit what the collector (agent) can do (like: not execute arbitrary commands), but it's not enough as I explained in my examples earlier

I think some limitations are inevitable; we just need to decide what compromise is acceptable, and perhaps the acceptable compromise is different for different users. The OpAMP spec, for example, says that execution should be subject to limitations defined in a local config file that cannot be overridden remotely. So a user may opt in to full arbitrary execution if they trust their server, while other users may disable the functionality.

Configuration files must be digitally signed, manually, against a CA that is accessible to all agents so they can check the received configuration is legit. The problem is that editing the configuration becomes an extremely tedious process.

This is indeed what OpAMP recommends for remotely downloadable code, such as the agent's executable or remotely downloadable executable addons. This is achievable for bits that are centrally published and can be centrally signed by a CA, which typically is the case for executable files; e.g. we can sign all Otel Collector executables during the release process and make the Otel Collector's OpAMP implementation verify the signature of executables it downloads from the OpAMP server.

Unfortunately, I don't think it is a reasonable process for configurations, as you rightfully point out. Configurations are usually much more fluid, and requiring them to be signed by an independent trusted CA that is outside the control of the OpAMP server is tedious enough to make it virtually unusable IMO. We need configurations to be under the control of the OpAMP server: the server needs to be able to compose the configuration and send it to the agents. This means the server also needs to sign them, which means we cannot trust the signature.

bertysentry · 2021-12-15T23:51:01Z

I just realized there's an excellent implementation of Git in Golang (go-git on GitHub)! You guys have to use it for distributed config management. 😅

bogdandrutu · 2022-06-08T21:55:38Z

This is legacy, long live https://github.com/open-telemetry/opamp-go


andrewhsu added area:config release:after-ga priority:p2 Medium enhancement New feature or request labels Jan 20, 2021

suleymanakbas91 mentioned this issue Mar 29, 2021

PoC OpenTelemetry - general setup kyma-project/kyma#10873

Closed

tigrannajaryan mentioned this issue Jul 13, 2021

config remote reload without rebuild the whole pipeline #3560

Closed

tigrannajaryan mentioned this issue Oct 4, 2021

Agent Management Proposal #4165

Closed

This was referenced Oct 6, 2021

Agent Management Workgroup formation open-telemetry/community#860

Closed

Config sources to support full remote config and individual config value substitution #4190

Closed

tigrannajaryan mentioned this issue Oct 26, 2021

Make Collector instance ID available to config.MapProvider #4272

Closed

tigrannajaryan mentioned this issue Dec 13, 2021

[extension/subprocess] Create subprocess extension (#6467) open-telemetry/opentelemetry-collector-contrib#6512

Closed

bertysentry mentioned this issue Dec 15, 2021

Review prometheusexecreceiver from security perspective open-telemetry/opentelemetry-collector-contrib#6722

Closed

bogdandrutu closed this as completed Jun 8, 2022

hughesjj added a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023

Break out CI_COMMIT_SHA as it seems to not be properly populated in g…

ae9297d

…itlab ci (open-telemetry#2374)
