LogsDB - Rally elastic/logs dataset generation #111009

Closed
salvatore-campagna opened this issue Jul 18, 2024 · 6 comments
Assignees
Labels
>enhancement :StorageEngine/Logs You know, for Logs Team:StorageEngine >test Issues or PRs that are addressing/adding tests

Comments

@salvatore-campagna
Contributor

salvatore-campagna commented Jul 18, 2024

Description

We need to generate a suitable dataset to be used in our logging experiments with LogsDB. The dataset should be crafted so as to resemble a real-world logging use case.

The elastic/logs Rally track includes a "Data Generation" stage whose parameters can be tweaked to generate a dataset with specific characteristics, along with a set of queries whose data access pattern can also be configured. The goal of this issue is to find suitable parameters for this data generation stage and to generate a dataset that resembles a real-world logging use case as closely as possible.

Ideally we would like to work with a very large dataset, so that the disks are filled up enough to benchmark querying in a real-world scenario.

Once we have such a dataset we will use it to run experiments and understand how our deployments react when resources are scaled up or down. More precisely, we want to see how query latency is affected and whether we get any errors when running queries (especially out-of-memory errors).

Hardware expected to be used for such profile: https://instances.vantage.sh/aws/ec2/im4gn.4xlarge

@salvatore-campagna salvatore-campagna added >enhancement >test Issues or PRs that are addressing/adding tests :StorageEngine/Logs You know, for Logs labels Jul 18, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@salvatore-campagna
Contributor Author

salvatore-campagna commented Jul 18, 2024

Ideally we should generate 30 days of data so as to have a storage footprint on the order of 10-20 TB (which means a lot more JSON data, roughly 10x). On the search side we would need to query 1 day of data. With a test configured like this we could evaluate how we scale.

A possible configuration for the Data Generation Stage would be the following:

  • raw_data_volume_per_day: 500 GB, so as to have 15 TB in one month
  • max_generated_corpus_size: 15 TB
  • data_generation_clients: 16
  • max_total_download_gb: 2 TB
  • start_date: June 1st, 2024
  • end_date: June 30th, 2024
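
As a quick sanity check on the numbers above, here is a minimal sketch of the volume arithmetic. The variable names are illustrative, the 10x JSON expansion factor is the rough estimate mentioned in this thread rather than a measured value, and decimal units (1 TB = 1000 GB) are assumed.

```python
from datetime import date

# Values taken from the proposed configuration.
start_date = date(2024, 6, 1)
end_date = date(2024, 6, 30)
raw_gb_per_day = 500

# Assumption: rough raw-to-JSON expansion factor quoted in this thread.
json_expansion_factor = 10

# Both endpoints are included, so June 1st through June 30th is 30 days.
days = (end_date - start_date).days + 1

# Assumption: decimal units, 1 TB = 1000 GB.
raw_total_tb = raw_gb_per_day * days / 1000
json_total_tb = raw_total_tb * json_expansion_factor

print(f"days={days}, raw={raw_total_tb} TB, json={json_total_tb} TB")
```

With these inputs the raw volume matches the 15 TB used for max_generated_corpus_size.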

When it comes to queries we will configure the following:

  • query_min_date: June 29th, 2024
  • query_max_date: June 30th, 2024
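
The two lists above could be collected into a single parameters file, for example to pass to Rally through its --track-params option. The sketch below only serializes the values proposed here. The exact value formats the track expects (unit suffixes such as "500GB", ISO dates, expressing 2 TB as 2000 for max_total_download_gb) are assumptions and should be checked against the elastic/logs track README before use.

```python
import json


def build_track_params() -> dict:
    """Collect the proposed data generation and query parameters.

    Parameter names come from this issue; the value formats are assumptions.
    """
    return {
        # Data generation stage
        "raw_data_volume_per_day": "500GB",   # assumed unit-suffix format
        "max_generated_corpus_size": "15TB",  # assumed unit-suffix format
        "data_generation_clients": 16,
        "max_total_download_gb": 2000,        # 2 TB, assuming decimal GB
        "start_date": "2024-06-01",           # assumed ISO date format
        "end_date": "2024-06-30",
        # Query window: the last day of generated data
        "query_min_date": "2024-06-29",
        "query_max_date": "2024-06-30",
    }


params = build_track_params()
params_json = json.dumps(params, indent=2)
print(params_json)
```

Keeping the query window inside the generated date range (June 29th to June 30th within June 1st to June 30th) is what makes the "query 1 day out of 30" setup described above.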

@salvatore-campagna salvatore-campagna added the Team:Performance Meta label for performance team label Jul 18, 2024
@elasticsearchmachine elasticsearchmachine removed the Team:Performance Meta label for performance team label Jul 18, 2024
@salvatore-campagna
Contributor Author

salvatore-campagna commented Jul 18, 2024

@elastic/es-perf I tagged you since I would like feedback about this test setup for the elastic/logs track to be used for an experiment. Our main concern is around the amount of data. 15 TB of raw data means a JSON dataset that is around 150 TB. I suspect indexing time might be quite long.

@salvatore-campagna salvatore-campagna self-assigned this Jul 18, 2024
@pjbertels

What type of logs do you need? I found this https://github.com/logpai/loghub but the volumes are not what you are asking for.

@salvatore-campagna
Contributor Author

salvatore-campagna commented Jul 22, 2024

https://github.com/logpai/loghub

The instance we would like to use has 7.5 TB of storage, so those look way too small to me. Also, it has 64 GB of memory, which means plenty of capacity to cache large chunks.

@salvatore-campagna
Contributor Author

Closing this in favor of elastic/rally-tracks#631

3 participants