
[Change Proposal] support specifying alternate data_stream.dataset #145

Open
leehinman opened this issue Mar 2, 2021 · 7 comments
Labels
discuss Issue needs discussion

Comments

@leehinman
Contributor

For REST APIs we would really like to have a single meta package where we can define an input multiple times and, in each instance of that input, redirect the data to a different data_stream.dataset. For example, if the REST API contained both Apache Access and AWS CloudTrail data, two inputs would be defined. In addition to the variables needed to connect to the REST API and collect the data, each input would also have an option to select the data_stream.dataset from a list of available data_stream.datasets. This would allow you to send the Apache Access data from the REST API to the apache.access dataset and take advantage of the ingest node processing available there. It would also allow you to take advantage of the Apache dashboards.

This new field would signal to Kibana to display a list of available data_stream.datasets for the user to pick from. Also, if a dataset is selected, Kibana would need to "install" the package that provides that dataset so that its ingest pipelines and dashboards are available.

This could be useful for kafka and syslog as well.
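For illustration, a policy configured this way might end up looking roughly like the following. This is only a sketch; none of these exact field names are defined by the spec today, and the dataset values are just examples:

inputs:
  - type: httpjson
    # connection details for the REST API, entered once per input instance
    vars:
      url: https://api.example.com/export
    streams:
      # each stream is the same input configured again, but redirected
      # to an existing data_stream.dataset owned by another package
      - data_stream:
          dataset: apache.access
      - data_stream:
          dataset: aws.cloudtrail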

@leehinman added the discuss (Issue needs discussion) label on Mar 2, 2021
@ycombinator
Contributor

Hmmm, I hope I'm understanding the use case correctly, but if I am, I wonder if the recently merged input groups change to the package spec might be of use here. The PR that implemented it (and the related discussion) is #137, but you can also quickly see the idea in action in this sample package:

data_streams:
- ec2_logs
- ec2_metrics
Note that this change hasn't been fully rolled out yet; the rollout is being tracked in #144.

@leehinman
Contributor Author

Maybe. One thing I'm proposing is that the apache pipeline stays in the apache package (no duplicate in the meta package). In the REST API meta package, the user chooses "apache.access" from a list, the data is then sent to that data stream, and so the apache pipeline gets run on that data. There is also no duplicating of dashboards. I don't understand the "input group" change well enough to tell whether we get that behavior.

@ruflin
Member

ruflin commented Mar 3, 2021

I think the two efforts are not related. If I understand @leehinman correctly, the above request mainly applies to "generic" inputs where multiple data sources come in at the same time. I think this is also related to the discussion around specifying the same input multiple times. So instead of having to set the dataset manually, the UI should have a drop-down to select a dataset that already exists?

@leehinman
Contributor Author

Yes, this is definitely related to adding the same input multiple times, and I think #110 will address that part.

The other part is how we allow users to pick the input type without a large amount of duplication. For example, Apache access logs could be coming from the log, httpjson, kafka or syslog input. We could add each input to every package, but then you get an interface like the one in elastic/integrations#545. For that screenshot, the logic to pull the data from the REST API and populate the message field was duplicated across all 4 packages. It also means that the user has to enter the information to connect to the REST API in each package, which is a lot of duplication and a pain when the password needs to be updated.

I'm hoping we can come up with a solution for inputs like httpjson, kafka, syslog & Windows Event Logs where multiple types of data could be in the datastore that is accessed by the input. From a configuration standpoint it would be nice to configure the basic connection information for the input once; for the REST API that might be hostname, port, username & password. Then, for each kind of data, we would have some way of getting just the data we want from the datastore: for the REST API that would be a search, for kafka a topic, etc. And then each kind of data should be mapped to a data_stream.dataset. If the data_stream.dataset is normally set up (pipeline/dashboards/fields) by another package, we need to track that dependency. The reason for sending to a known data_stream.dataset is so we don't have to duplicate the dashboards & pipelines.
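To make that concrete, the configuration might look roughly like this; the field names are purely illustrative, not anything that exists in the spec today:

inputs:
  - type: httpjson
    # basic connection information, configured once
    vars:
      hostname: logs.example.com
      port: 443
      username: ingest
      password: ${SECRET}
    streams:
      # per kind of data: a selector that pulls just that data
      # (a search here, a topic for kafka) plus the dataset it maps to
      - vars:
          search: 'source="apache:access"'
        data_stream:
          dataset: apache.access    # pipeline/dashboards owned by the apache package
      - vars:
          search: 'source="aws:cloudtrail"'
        data_stream:
          dataset: aws.cloudtrail   # pipeline/dashboards owned by the aws package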

@ruflin
Member

ruflin commented Mar 9, 2021

@sorantis Can you chime in here?

@leehinman
Contributor Author

@mukeshelastic any chance you could comment on how you think we should handle packaging for third-party REST API, kafka, syslog, etc.?

@sorantis

@leehinman the input group provides the ability to combine related data streams together (as can be seen from the many examples in the granularity doc). This way the integration developer can combine all log-related data streams in one group called Logs (or multiple groups, should there be a need to separate Operational Logs from Security Logs).
Following the proposed structure for integration packages, all these different inputs can either be combined under an input group or each can represent an integration policy template.

There's an example of this new structure based on the AWS package. In this example a data stream is assigned explicitly to an input.
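For reference, in that structure a data stream declares which input it uses in its own manifest, roughly along these lines (abbreviated and illustrative, modeled on the package-spec data stream manifest format):

# aws/data_stream/ec2_logs/manifest.yml
title: AWS EC2 logs
type: logs
streams:
  - input: s3
    title: EC2 logs from S3
    description: Collect EC2 logs from an S3 bucket
    vars:
      - name: queue_url
        type: text
        title: SQS queue URL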

cc @ycombinator @mtojek @kaiyan-sheng
