Introduce datasources in package to configure inputs and streams #212

ruflin · 2020-02-11T08:41:53Z

In elastic/beats#15940 datasources, inputs and streams are introduced into the agent config. To make it possible to configure these in the UI and through the API, some changes to the manifest definitions of a package and datasets are needed.

Package manifest

Each package must specify the datasources it supports with the supported inputs inside. So far all the packages only support one datasource but I want to keep the door open for this to potentially change in the future. It also makes it possible to have the manifest config of a datasource be identical to the config which ends up in the agent config.

The package manifest datasource definition looks as follows (nginx example):

datasources:
  -
    # Do we need a name for the data source?
    name: nginx

    # List of inputs this datasource supports
    inputs:
      -
        # An id can be given, in case the type used here is not unique
        # This is for selection in the stream
        # id: nginx
        type: metrics/nginx

        # Common configuration options for this input
        vars:
          - name: hosts
            description: Nginx hosts
            default:
              ["http://127.0.0.1"]
            # All the config options that are required should be shown in the UI
            required: true
          - name: period
            description: "Collection period. Valid values: 10s, 5m, 2h"
            default: "10s"
          - name: username
            type: text
          - name: password
            # This is the html input type?
            type: password

      -
        type: log

        # Common configuration options for this input
        vars:

      -
        type: syslog

        # Common configuration options for this input
        vars:

Inside the datasource, each supported input is specified together with the variables that are common to all streams using that input. In the UI I expect that we show the required configs by default and put all the others under "Advanced" or similar.

Dataset manifest

With the datasources and inputs defined on the package level, each dataset can specify which inputs it supports. Most datasets will only support one input for now. For the nginx metrics this looks as follows:

inputs:
  - type: "metrics/nginx"

    # Only variables that are not already specified as part of the input have to be listed here
    vars:
      # All variables are specified in the input already

As an example of supporting multiple inputs, here are the nginx error logs:

inputs:
  - type: log
    vars:
      - name: paths
        required: true
        default:
          - /var/log/nginx/error.log*
        os.darwin:
          - /usr/local/var/log/nginx/error.log*
        os.windows:
          - c:/programdata/nginx/logs/error.log*

  - type: syslog
    vars:
      # Are udp and tcp syslog input two different inputs?
      - name: protocol.udp.host
        required: true
        default:
          - "localhost:9000"

Both the log and the syslog input are supported (not the case today, just an example). All additional variables for this dataset are also specified on the dataset level. The ones already specified on the input level in the package don't have to be specified again.
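To illustrate the split between package-level and dataset-level variables, here is a minimal sketch of a metrics dataset that adds one variable on top of the ones inherited from the package. The variable `server_status_path` and its default are assumptions chosen for illustration; they are not part of this PR.

```yaml
# Dataset manifest (sketch): hosts, period, username and password are
# inherited from the package-level metrics/nginx input and are not repeated.
inputs:
  - type: metrics/nginx
    vars:
      # Hypothetical dataset-specific variable, for illustration only
      - name: server_status_path
        default: "server-status"
```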

Stream definition

Now that the dataset has its supported inputs and variables defined, the stream can be defined. The stream defines which input it uses from the dataset and its configuration variables. Here is an example for nginx metrics:

input: metrics/nginx
metricsets: ["stubstatus"]
period: {{period}}
enabled: true

hosts: {{hosts}}

{{#if username}}
username: "{{username}}"
{{/if}}
{{#if password}}
password: "{{password}}"
{{/if}}

When the stream config is created, the variables from the datasource inputs and the local variables from the dataset are filled in.
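As a sketch of what this fill-in step could produce, rendering the nginx metrics template above with only the package-level defaults might give the config below. It assumes that list values such as `hosts` are serialized as YAML sequences and that variables without a value (`username`, `password`) simply drop their `{{#if}}` blocks; the PR does not specify either detail.

```yaml
# Possible rendered stream config (assumption: defaults taken from the
# package manifest, no username or password set)
input: metrics/nginx
metricsets: ["stubstatus"]
period: 10s
enabled: true

hosts: ["http://127.0.0.1"]
```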

A stream definition could also support multiple inputs as seen in the following example:


{{#if input == log}}
input: log

paths:
{{#each paths}}
  - "{{this}}"
{{/each}}
exclude_files: [".gz$"]

processors:
  - add_locale: ~
{{/if}}

{{#if input == syslog}}
input: syslog

{{/if}}

Further changes

Rename agent/input to agent/stream as a stream is configured there.

ruflin · 2020-02-11T08:43:23Z

@skh Could you give some feedback on whether the above info in the manifests and stream is enough to build a full config?
@jen-huang Will this info be sufficient to build the UI on it?
@ph Applying what we discussed in the config.

skh · 2020-02-11T11:24:09Z

Could you give some feedback on whether the above info in the manifests and stream is enough to build a full config?

Where can I find out what a full, correct agent config must look like?

ruflin · 2020-02-11T11:48:59Z

@skh elastic/beats#15940

ph · 2020-02-11T16:14:10Z

@ruflin The format sounds good. A few things:

Make sure you rename the metrics types so that the service comes before metrics, like this: docker/metrics.
As we discussed over Zoom, for simplification let's make sure we do not have a field defined at the input level that can be redefined at the stream level.
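
Applied to the nginx example, the first point could look like the sketch below. The exact name `nginx/metrics` is an assumption derived from the `docker/metrics` pattern, not something confirmed in this thread.

```yaml
inputs:
  -
    # Service first, then the kind of data (was: metrics/nginx)
    type: nginx/metrics
```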

jen-huang · 2020-02-11T19:43:18Z

dev/package-examples/nginx-1.2.0/dataset/access/manifest.yml

+        # Should we define this as array? How will the UI best make sense of it?
+        type: textarea
+        default:
+          - /var/log/nginx/access.log*


Should we define this as array?

Does this mean that users can have multiple paths defined? If so, how about adding a boolean field called multi that tells the UI that there can be multiple of the specified type:

type: text
multi: true

For example, if we want to have multiple paths as default value, we can do like:

default:
  - /var/log/nginx/access.log*
  - /some/path/log/nginx/access.log*

or

default: ["/var/log/nginx/access.log*", "/some/path/log/nginx/access.log*"]

(not sure which yaml format we prefer but would be great to have it consistent across these examples 🙂)

Yes, there can be multiple default paths. Changed it to your suggestion.

jen-huang · 2020-02-11T19:44:39Z

dev/package-examples/nginx-1.2.0/dataset/access/manifest.yml

+        os.darwin:
+          - /usr/local/var/log/nginx/access.log*
+        os.windows:
+          - c:/programdata/nginx/logs/*access.log*


Bit confused about the format here, are these default values, but platform-specific? If so, should they be nested under default?

Yes, good point. They are platform-specific defaults. As we should not mix array and object, I wonder what the YAML should look like:

default:
  all:
    - /var/log/nginx/access.log*
  # I suggest to use ECS fields for this config options here: https://github.com/elastic/ecs/blob/master/schemas/os.yml
  # This would need to be based on a predefined definition on what can be filtered on
  os.darwin:
    - /usr/local/var/log/nginx/access.log*
  os.windows:
    - c:/programdata/nginx/logs/*access.log*

I don't like the all. Perhaps we find a better name?

I'm not fond of all either, and it seems silly to do default.default 🙂 Thinking about this again, what if we just nest default under each platform? I don't know if we'll want to add other platform-specific configuration for vars, but this schema could support that too since it essentially mirrors all the top-level vars properties. As an example, maybe paths is not required for windows for whatever reason:

- name: paths
  required: true
  default:
    - /var/log/nginx/access.log*
  os.darwin:
    default:
      - /usr/local/var/log/nginx/access.log*
  os.windows:
    required: false
    default:
      - c:/programdata/nginx/logs/*access.log*

LGTM. So in summary: all configs from the top level can also be nested under each platform. Let's try it.

For now, I suggest we ignore platform, at least on the UI side.

jen-huang · 2020-02-11T19:48:11Z

dev/package-examples/nginx-1.2.0/manifest.yml

+datasources:
+  -
+    # Do we need a name for the data source?
+    name: nginx


I think name would be useful to populate the id in agent config, but we can just use the package name too

You mean prefixing the id? Because the id needs to be unique.

Yep, I meant prefixing

jen-huang · 2020-02-11T20:21:25Z

dev/package-examples/nginx-1.2.0/manifest.yml

+          - name: hosts
+            description: Nginx hosts
+            default:
+              ["http://127.0.0.1"]
+            # All the config options that are required should be shown in the UI
+            required: true


I would like to see type defined explicitly for all vars. And if we like the multi suggestion in my previous comment, these should be added:

type: text
multi: true

Agree. Adding.

jen-huang · 2020-02-11T20:25:00Z

dev/package-examples/nginx-1.2.0/manifest.yml

+          - name: period
+            description: "Collection period. Valid values: 10s, 5m, 2h"
+            default: "10s"


Add type: duration or type: text?

+1 for a duration type, we used integer previously and it was not that great. :)

I like duration if that works for the UI.

jen-huang · 2020-02-11T20:32:38Z

dev/package-examples/nginx-1.2.0/manifest.yml

+            type: text
+          - name: password
+            # This is the html input type?
+            type: password


I think we need to decide all possible values for type field. They do not need to be HTML types, though. Here are the possible types I currently see (or can foresee being useful):

text -> maps to UI text input
number -> maps to UI number input
boolean -> maps to UI checkbox
duration -> maps to UI text input
  this would let us do client-side validation but maybe overkill for now?
url -> maps to UI text input
  this would let us do client-side validation but maybe overkill for now?

In the case of password vars, I would use text for them as they'll show up as simple text inputs in the UI.

I like the above proposed types. I was thinking of password instead of text to not have to show clear text passwords directly on the screen when users put them in.

BTW: If we have these predefined values for type we can also validate it already on the package side to make sure nothing odd shows up in the package itself.

For the URL type we have to be careful if and when we do this: in some modules we use a compound scheme (not sure if this is the exact term?) like "http+npipe://./pipe/custom". See https://github.com/elastic/beats/blob/d75261469b5c2e675d7d331b86bfc8599d797ddc/metricbeat/mb/parse/url_test.go for a few examples.

I see url support as a stretch goal. In case of doubt we can always fall back to just string. Let's keep it simple for now.

We can add granular text types (duration, url, etc) but it doesn't mean that we need to do validation right now. Validation can be added later on from all fronts (UI, agent, package) once we align on rules.

Also ++ on keeping the password type. This gives the UI information to decide when to show clear text inputs and when to obfuscate.

As soon as we have started the UI implementation and are happy with the format, we should open a PR here with the exact specs.
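
Putting the types discussed in this thread together, the package-level vars for the nginx metrics input might end up looking like the sketch below. It is an assumption about the final shape, combining the `multi` flag, the `duration` type and the `password` type; the exact spec is still to be written.

```yaml
vars:
  - name: hosts
    description: Nginx hosts
    type: text
    # More than one host can be given
    multi: true
    required: true
    default:
      - "http://127.0.0.1"
  - name: period
    description: "Collection period. Valid values: 10s, 5m, 2h"
    # Rendered as a text input; validation can be added later
    type: duration
    default: "10s"
  - name: username
    type: text
  - name: password
    # Lets the UI obfuscate the value while typing
    type: password
```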

jen-huang · 2020-02-14T23:29:32Z

dev/package-examples/nginx-1.2.0/dataset/access/manifest.yml

-      - c:/programdata/nginx/logs/*access.log*
+# List of supported inputs
+inputs:
+  - type: log


@hbharding suggested today that it would be great to have descriptions in the UI. Could we add a description prop for every input, under its type? I think these descriptions should be a sentence or two at most. Not sure how we will localize these, though.

+1. Added it, though not here but on the package level, where the overall inputs are defined. As a single input is shown for multiple datasets, I guess it makes more sense there.

jen-huang · 2020-02-14T23:32:41Z

dev/package-examples/nginx-1.2.0/dataset/access/manifest.yml

+inputs:
+  - type: log
+    vars:
+      - name: paths


Similar to my previous comment, it would be great to have a description prop per var too. These would be really short descriptions that tell the user what each var is used for.
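
A sketch of what a per-var description could look like for the access log paths; the description wording is made up for illustration, and the `type` and `multi` fields follow the suggestions from the earlier threads.

```yaml
inputs:
  - type: log
    vars:
      - name: paths
        # Hypothetical wording, one short sentence for the UI
        description: Paths to the nginx access log files
        type: text
        multi: true
        required: true
        default:
          - /var/log/nginx/access.log*
```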

ruflin

@jen-huang @ph I did one more push addressing the most recent reviews. I suggest we get this PR in, and then I adjust all the other packages we have and make the format fully available in our API. Having the format should already allow the UI work to get started.

ruflin · 2020-02-17T07:58:44Z

dev/package-examples/nginx-1.2.0/dataset/access/manifest.yml

-      - c:/programdata/nginx/logs/*access.log*
+# List of supported inputs
+inputs:
+  - type: log


+1. Added it, though not here but on the package level, where the overall inputs are defined. As a single input is shown for multiple datasets, I guess it makes more sense there.

ruflin · 2020-02-17T07:59:09Z

dev/package-examples/nginx-1.2.0/dataset/access/manifest.yml

+inputs:
+  - type: log
+    vars:
+      - name: paths


ruflin · 2020-02-17T08:00:34Z

dev/package-examples/nginx-1.2.0/manifest.yml

+            type: text
+          - name: password
+            # This is the html input type?
+            type: password


As soon as we have started the UI implementation and are happy with the format, we should open a PR here with the exact specs.

ph · 2020-02-17T14:06:44Z

@ruflin I am ok to merge this and iterate.

ruflin · 2020-02-18T11:52:48Z

I merged this change as it seems we are mostly happy with what we have here and will follow up with more changes for further discussions.

In elastic#212 the datasource structure was introduced in the manifests but not yet exposed through the APIs. This PR changes that by exposing all the fields. So far it only exposes the fields; no validation is done yet and needs to be added in the future. A test package was added to see the output of datasource configs and potential changes to it.

ruflin self-assigned this Feb 11, 2020

jen-huang reviewed Feb 11, 2020

ruflin added 2 commits February 12, 2020 09:04

add review feedback

2498005

switch type naming

6805935

jen-huang reviewed Feb 14, 2020

adjust for PR review

7c8160c

ruflin commented Feb 17, 2020

ph approved these changes Feb 17, 2020

ruflin merged commit e965931 into elastic:master Feb 18, 2020

ruflin deleted the datasources branch February 18, 2020 11:52

ruflin mentioned this pull request Feb 18, 2020

Add code to expose datasource configs in API #216

Merged
