
[Change Proposal] support specifying alternate data_stream.dataset #145

Open
leehinman opened this issue Mar 2, 2021 · 7 comments
Labels
discuss Issue needs discussion

Comments

@leehinman
Contributor

For REST APIs we would really like to have a single meta package where we can define an input multiple times and, in each instance of that input, redirect the data to a different data_stream.dataset. For example, if the REST API contained both Apache Access and AWS CloudTrail data, two inputs would be defined. In addition to the variables needed to connect to the REST API and collect the data, each input would also have an option to select the data_stream.dataset from a list of available data_stream.datasets. This would allow you to send the Apache Access data from the REST API to the apache.access dataset and take advantage of the ingest node processing available there. It would also allow you to take advantage of the Apache dashboards.

This new field would signal to Kibana to display a list of available data_stream.datasets for the user to pick from. Also, if a dataset is selected, Kibana would need to "install" the package that provides that dataset so that its ingest pipelines and dashboards are available.

This could be useful for kafka and syslog as well.
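For illustration, a policy configured this way might end up looking roughly like the following. This is only a sketch; none of these exact field names are defined by the spec today, and the dataset values are just examples:

inputs:
  - type: httpjson
    # connection details for the REST API, entered once per input instance
    vars:
      url: https://api.example.com/export
    streams:
      # each stream is the same input configured again, but redirected
      # to an existing data_stream.dataset owned by another package
      - data_stream:
          dataset: apache.access
      - data_stream:
          dataset: aws.cloudtrail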

@leehinman added the discuss (Issue needs discussion) label on Mar 2, 2021
@ycombinator
Contributor

Hmmm, I hope I'm understanding the use case correctly, but if I am, I wonder if the recently merged input groups change to the package spec might be of use here. The PR that implemented it (and the related discussion) is #137, but you can also quickly see the idea in action in this sample package:

data_streams:
- ec2_logs
- ec2_metrics
Note that this change hasn't been fully rolled out yet; the rollout is being tracked in #144.

@leehinman
Contributor Author

Maybe. One thing I'm proposing is that the apache pipeline stays in the apache package (no duplicate in the meta package). In the REST API meta package, the user chooses "apache.access" from a list, the data is then sent to that data stream, and so the apache pipeline gets run on that data. There is also no duplicating of dashboards. I don't understand the "input group" change well enough to tell whether we get that behavior.

@ruflin
Member

ruflin commented Mar 3, 2021

I think the two efforts are not related. If I understand @leehinman correctly, the above request mainly applies to "generic" inputs where multiple data sources come in at the same time. I think this is also related to the discussion around specifying the same input multiple times. So instead of having to set the dataset manually, the UI should have a drop-down to select a dataset that already exists?

@leehinman
Contributor Author

Yes, this is definitely related to adding the same input multiple times, and I think #110 will address that part.

The other part is how we allow users to pick the input type without a large amount of duplication. For example, Apache access logs could be coming from the log, httpjson, kafka or syslog input. We could add each input to every package, but then you get an interface like the one in elastic/integrations#545. For that screenshot, the logic to pull the data from the REST API and populate the message field was duplicated across all 4 packages. It also means that the user has to enter the information to connect to the REST API in each package, which is a lot of duplication and a pain when the password needs to be updated.

I'm hoping we can come up with a solution for inputs like httpjson, kafka, syslog & Windows Event Logs where multiple types of data could be in the datastore that is accessed by the input. From a configuration standpoint it would be nice to configure the basic connection information for the input once; for the REST API that might be hostname, port, username & password. Then, for each kind of data, we would have some way of getting just the data we want from the datastore: for the REST API that would be a search, for kafka a topic, etc. And then each kind of data should be mapped to a data_stream.dataset. If the data_stream.dataset is normally set up (pipeline/dashboards/fields) by another package, we need to track that dependency. The reason for sending to a known data_stream.dataset is so we don't have to duplicate the dashboards & pipelines.
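To make that concrete, the configuration might look roughly like this; the field names are purely illustrative, not anything that exists in the spec today:

inputs:
  - type: httpjson
    # basic connection information, configured once
    vars:
      hostname: logs.example.com
      port: 443
      username: ingest
      password: ${SECRET}
    streams:
      # per kind of data: a selector that pulls just that data
      # (a search here, a topic for kafka) plus the dataset it maps to
      - vars:
          search: 'source="apache:access"'
        data_stream:
          dataset: apache.access    # pipeline/dashboards owned by the apache package
      - vars:
          search: 'source="aws:cloudtrail"'
        data_stream:
          dataset: aws.cloudtrail   # pipeline/dashboards owned by the aws package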

@ruflin
Member

ruflin commented Mar 9, 2021

@sorantis Can you chime in here?

@leehinman
Contributor Author

@mukeshelastic any chance you could comment on how you think we should handle packaging for third-party REST API, kafka, syslog, etc.?

@sorantis

@leehinman the input group provides the ability to combine related data streams together (as can be seen from the many examples in the granularity doc). This way the integration developer can combine all log-related data streams in one group called Logs (or multiple groups, should there be a need to separate Operational Logs from Security Logs).
Following the proposed structure for integration packages, all these different inputs can either be combined under an input group or each can represent an integration policy template.

There's an example of this new structure based on the AWS package. In this example a data stream is assigned explicitly to an input.
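For reference, in that structure a data stream declares which input it uses in its own manifest, roughly along these lines (abbreviated and illustrative, modeled on the package-spec data stream manifest format):

# aws/data_stream/ec2_logs/manifest.yml
title: AWS EC2 logs
type: logs
streams:
  - input: s3
    title: EC2 logs from S3
    description: Collect EC2 logs from an S3 bucket
    vars:
      - name: queue_url
        type: text
        title: SQS queue URL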

cc @ycombinator @mtojek @kaiyan-sheng
