[Infrastructure Monitoring] Better data generation #119491

Closed
1 task done
jasonrhodes opened this issue Nov 23, 2021 · 12 comments
Labels
Team:Infra Monitoring UI - DEPRECATED (DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services)

Comments

@jasonrhodes
Member

jasonrhodes commented Nov 23, 2021

Epic for organizing work on how to generate data for development and testing. We will flesh this out over time.

jasonrhodes added the Team:Infra Monitoring UI - DEPRECATED label Nov 23, 2021
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@matschaffer
Contributor

matschaffer commented Nov 25, 2021

Next steps (given sync with @miltonhultgren):

  • Get this PR green & merge
  • Write up doc of plan/use-case (@miltonhultgren)
  • Figure out next problem space that needs a generator (could be logs, metrics or SM)
    • open question: how to make time series data more "interesting" - just a flat line right now
    • open question: where do we handle mappings for logs & metrics?
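
On the first open question, one possible direction is to layer a periodic wave and seeded noise on top of a baseline instead of emitting a constant. The sketch below is only an illustration under that assumption: `generateSeries`, `SeriesOptions` and `mulberry32` are hypothetical names rather than anything synthtrace provides, and the parameter values are arbitrary.

```typescript
// Hypothetical sketch: none of these names exist in synthtrace.
interface SeriesPoint {
  timestamp: number; // epoch milliseconds
  value: number;
}

interface SeriesOptions {
  from: number; // epoch ms, inclusive
  to: number; // epoch ms, exclusive
  intervalMs: number; // spacing between points
  baseline: number; // value the series oscillates around
  amplitude: number; // height of the periodic wave
  periodMs: number; // length of one full wave
  noise: number; // maximum random deviation per point
  seed: number; // same seed gives the same series
}

// Small seeded PRNG (mulberry32) so generated test data is reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Baseline + sine wave + bounded random jitter, one point per interval.
function generateSeries(opts: SeriesOptions): SeriesPoint[] {
  const random = mulberry32(opts.seed);
  const points: SeriesPoint[] = [];
  for (let ts = opts.from; ts < opts.to; ts += opts.intervalMs) {
    const phase = (2 * Math.PI * (ts - opts.from)) / opts.periodMs;
    const wave = opts.amplitude * Math.sin(phase);
    const jitter = (random() * 2 - 1) * opts.noise;
    points.push({ timestamp: ts, value: opts.baseline + wave + jitter });
  }
  return points;
}

// One hour of per-minute points with a 20-minute wave (arbitrary values).
const seriesStart = Date.UTC(2021, 10, 25);
const seriesOptions: SeriesOptions = {
  from: seriesStart,
  to: seriesStart + 60 * 60 * 1000,
  intervalMs: 60 * 1000,
  baseline: 0.5,
  amplitude: 0.2,
  periodMs: 20 * 60 * 1000,
  noise: 0.05,
  seed: 42,
};
const cpuSeries = generateSeries(seriesOptions);
```

Seeding the noise keeps the series stable between runs, which matters if a test asserts on aggregated values.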

@matschaffer
Contributor

Thinking that metrics UI could be a good next target given the efforts around alerting on high cardinality group-bys - cc @Zacqary

@miltonhultgren
Contributor

One more thought from syncing with @matschaffer: a good POC would be to rewrite one of the Stack Monitoring E2E/integration tests using synthtrace-generated data.

@miltonhultgren
Contributor

Just a ping on more problems that could benefit from data generation tooling: #119658

@miltonhultgren
Contributor

miltonhultgren commented Nov 29, 2021

I got into this topic after struggling to write tests for our API because of "missing" test data. As I wrote to Jason:

We do have a bunch of archives, but it is not clear what data they contain or how relevant that data is to the thing I want to test, and they are often quite narrow. At the same time, it feels hard to wire up Metricbeat to something and use es_archiver to create a new archive on the fly, and there is also the concern of how much bloat we can put into a git repo.
Once an archive exists it is also not easy to change: you need to regenerate it. Similarly, if all you want is a small tweak to the mapping, you still have to copy and paste the whole thing.
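
To illustrate what a "small tweak" could look like without copying a whole mapping, here is a minimal sketch that deep-merges an override into a base mapping. The helper `withOverrides` and the field names are hypothetical, chosen only for this example; es_archiver offers nothing like this as far as this thread describes.

```typescript
// Hypothetical helper: apply a small override to a base mapping without copying it.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isJsonObject(value: Json | undefined): value is JsonObject {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns a new object; nested objects are merged, everything else is replaced.
function withOverrides(base: JsonObject, overrides: JsonObject): JsonObject {
  const result: JsonObject = { ...base };
  for (const key of Object.keys(overrides)) {
    const current = result[key];
    const override = overrides[key];
    if (override === undefined) continue;
    result[key] =
      isJsonObject(current) && isJsonObject(override)
        ? withOverrides(current, override)
        : override;
  }
  return result;
}

// Illustrative base mapping (field names are placeholders).
const baseMapping: JsonObject = {
  properties: {
    '@timestamp': { type: 'date' },
    host: { properties: { name: { type: 'keyword' } } },
  },
};

// Only the changed field is spelled out; the rest comes from the base.
const tweakedMapping = withOverrides(baseMapping, {
  properties: { host: { properties: { name: { type: 'text' } } } },
});
```

The base mapping is left untouched, so several variants can be derived from it in the same test file.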

Additionally, es_archiver currently doesn't support data streams, although this will likely get fixed in time (#69061).

While there have been many attempts to solve parts of this problem, we haven't really landed on something that we can build a roadmap around.

I would like a tool that makes it easy to generate data with different mappings, and that we can use in tests instead of relying on es_archiver.
A one-line goal I thought of is: "generate metrics X, Y and Z, based on log events between date A and date B with this (regular|irregular) frequency, and use these mappings when creating and storing them".

What the current tools seem to have in common is the idea of defining a time range in which "events" should happen, with some frequency and the possibility of spikes (by having overlapping time ranges with different event frequencies), and then some layer that turns these "events" into Elasticsearch documents.
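
That shared shape can be sketched roughly as follows. This is an illustration of the idea only, not the API of synthtrace or any other tool mentioned here: `EventWindow`, `expandWindows` and `toDocuments` are hypothetical names, and the document fields are placeholders.

```typescript
// Hypothetical sketch of the "time ranges + frequency + document layer" idea.
interface EventWindow {
  from: number; // epoch ms, inclusive
  to: number; // epoch ms, exclusive
  intervalMs: number; // one event every intervalMs
}

interface GeneratedEvent {
  timestamp: number; // epoch ms
}

type EsDocument = Record<string, unknown>;

// Layer 1: expand every window into events. Overlapping windows add their
// events together, which raises the event rate where they overlap (a spike).
function expandWindows(windows: EventWindow[]): GeneratedEvent[] {
  const events: GeneratedEvent[] = [];
  for (const w of windows) {
    for (let ts = w.from; ts < w.to; ts += w.intervalMs) {
      events.push({ timestamp: ts });
    }
  }
  return events.sort((a, b) => a.timestamp - b.timestamp);
}

// Layer 2: turn events into Elasticsearch-style documents.
function toDocuments(
  events: GeneratedEvent[],
  build: (event: GeneratedEvent) => EsDocument
): EsDocument[] {
  return events.map(build);
}

const windowStart = Date.UTC(2021, 10, 29, 12, 0, 0);
const oneMinute = 60 * 1000;

const logEvents = expandWindows([
  // Steady background: one event per minute for ten minutes.
  { from: windowStart, to: windowStart + 10 * oneMinute, intervalMs: oneMinute },
  // Overlapping spike: one event every ten seconds for two minutes.
  { from: windowStart + 4 * oneMinute, to: windowStart + 6 * oneMinute, intervalMs: 10 * 1000 },
]);

const logDocs = toDocuments(logEvents, (event) => ({
  '@timestamp': new Date(event.timestamp).toISOString(),
  message: 'synthetic log line', // placeholder content
}));
```

Keeping the two layers separate means the same event timeline could feed different document builders, for example one per schema version.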

One thing that most tools miss, though, is a connection to the underlying mappings of those documents. Synthtrace, for example, loads the index and mappings with es_archiver before inserting its documents.
@weltenwort built a tool that focuses mostly on that part of the problem, which might also help us generate data for different schema versions, since the data is generated from the mappings of that version. Synthtrace does have some "hooks" where we could perhaps inject this kind of code (bootstrap and the document generator class itself).
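
As a rough illustration of generating data from a mapping, the sketch below walks the `properties` of an Elasticsearch-style mapping and fills in a value per field type. It is not @weltenwort's tool and not a synthtrace hook: all function names are hypothetical, the field names are placeholders, and only a handful of field types are handled.

```typescript
// Hypothetical sketch: derive a document from an Elasticsearch-style mapping.
interface MappingField {
  type?: string; // leaf fields carry a type
  properties?: Record<string, MappingField>; // object fields carry sub-fields
}

interface IndexMapping {
  properties: Record<string, MappingField>;
}

type FieldValue = string | number | boolean | GeneratedObject;
interface GeneratedObject {
  [key: string]: FieldValue;
}

// Pick a simple deterministic value for the few field types handled here.
function valueForType(type: string, timestamp: number, index: number): FieldValue {
  switch (type) {
    case 'date':
      return new Date(timestamp).toISOString();
    case 'keyword':
      return `value-${index}`;
    case 'text':
      return `generated text ${index}`;
    case 'boolean':
      return index % 2 === 0;
    case 'long':
    case 'integer':
      return index;
    case 'float':
    case 'double':
      return index / 10;
    default:
      throw new Error(`Unsupported field type: ${type}`);
  }
}

// Recurse into object fields and generate a value for every leaf field.
function generateFields(
  properties: Record<string, MappingField>,
  timestamp: number,
  index: number
): GeneratedObject {
  const doc: GeneratedObject = {};
  for (const name of Object.keys(properties)) {
    const field = properties[name];
    if (!field) continue;
    if (field.properties) {
      doc[name] = generateFields(field.properties, timestamp, index);
    } else if (field.type) {
      doc[name] = valueForType(field.type, timestamp, index);
    }
  }
  return doc;
}

function generateDocument(mapping: IndexMapping, timestamp: number, index: number): GeneratedObject {
  return generateFields(mapping.properties, timestamp, index);
}

// Illustrative mapping; the field names are placeholders.
const exampleMapping: IndexMapping = {
  properties: {
    '@timestamp': { type: 'date' },
    host: { properties: { name: { type: 'keyword' } } },
    system: { properties: { load: { type: 'float' } } },
  },
};

const exampleDoc = generateDocument(exampleMapping, Date.UTC(2021, 10, 29), 3);
```

Because the document shape is read from the mapping, pointing the same generator at a different schema version would change the output without changing the generator.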

My hope is that by defining our problems better we can set some goals to reach. The Synthtrace route seems promising so far, especially since we'll likely want to use APM data in our own tests in the future as well.

What could the next steps be? What have we missed so far?

@elastic/infra-monitoring-ui

@jasonrhodes
Member Author

Can we do an hour-long sync to present an overall set of findings here and try to jumpstart this effort in a good direction? I want us to invest in this, but as you are all saying (I think), we need clear goals, and we need to choose the ones that will have the highest ROI for us.

@matschaffer
Contributor

I wouldn't mind showing off the stack monitoring simulation stuff so far. It's basic but it looks promising.

@miltonhultgren
Contributor

@jasonrhodes Let's book something!

My vote is for focusing on "data generation from mapping", since much of the work in Stack Monitoring would benefit from being able to take one of the three different mappings we have (which are moving towards a single mapping), generate data from it, run tests and check that things work.
Beyond that, all of our initiatives around curated views installed from integrations will likely ship with mappings. Being able to grab one of those mappings, generate a bunch of data, install the curated view Saved Object and run tests would be easier than setting up Fleet for each test.

@jasonrhodes
Member Author

Please book an hour for after the new year. It can be during my "meeting block" / "focus time" if that works, but it will likely be before that anyway in order to work with other calendars. @matschaffer, can you weigh in on some good times to target and work with @miltonhultgren to get an hour on the calendar? Thanks, all!

@smith
Contributor

smith commented Jun 16, 2022

Closing this for now. If we put effort into improving data generation while creating or updating tests, perhaps we can evolve towards the right solution.

smith closed this as completed Jun 16, 2022
5 participants