LogsDB - Rally `elastic/logs` dataset generation #111009

salvatore-campagna · 2024-07-18T07:46:15Z

Description

We need to generate a suitable dataset to be used in our logging experiments with LogsDB. The dataset should be crafted so as to resemble a real-world logging use case.

The elastic/logs Rally track includes a "Data Generation" stage with parameters that can be tweaked to generate a dataset with specific characteristics, plus a set of queries whose data access pattern can also be configured. The goal of this issue is to find suitable parameters for this data generation stage and generate a dataset that resembles a real-world logging use case as closely as possible.

Ideally we would like to work with a very large dataset, so that the disks are full enough to benchmark querying in a real-world scenario.

Once we have such a dataset we will use it to run experiments and understand how our deployments react when resources are scaled up or down and, more precisely, how query latency is affected and whether we see any errors when running queries (especially out-of-memory errors).

Hardware expected to be used for such profile: https://instances.vantage.sh/aws/ec2/im4gn.4xlarge


elasticsearchmachine · 2024-07-18T07:46:39Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

salvatore-campagna · 2024-07-18T11:01:30Z

Ideally we should generate 30 days of data so as to have a storage footprint on the order of 10-20 TB (this means a lot more JSON data, roughly x10). On the search side we would need to query 1 day of data. With a test configured like this we could evaluate how we scale.

A possible configuration for the Data Generation Stage would be the following:

raw_data_volume_per_day: 500 GB, so as to have 15 TB in one month
max_generated_corpus_size: 15 TB
data_generation_clients: 16
max_total_download_gb: 2 TB
start_date: June 1st, 2024
end_date: June 30th, 2024
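
As a quick sanity check of the numbers above, here is a minimal sketch that recomputes the monthly raw volume from the daily volume and the date range. It assumes decimal units (1 TB = 1000 GB) and an inclusive date range; the function name `total_raw_tb` is only illustrative and is not part of Rally or the track.

```python
from datetime import date

# Values taken from the proposed configuration above.
RAW_GB_PER_DAY = 500
START = date(2024, 6, 1)
END = date(2024, 6, 30)


def total_raw_tb(start, end, gb_per_day):
    """Return (days, total TB) for an inclusive date range, assuming 1 TB = 1000 GB."""
    days = (end - start).days + 1  # +1 because both start and end days are included
    return days, days * gb_per_day / 1000


days, total_tb = total_raw_tb(START, END, RAW_GB_PER_DAY)
print(f"{days} days x {RAW_GB_PER_DAY} GB/day = {total_tb} TB")
```

With these inputs the range covers 30 days, which matches the 15 TB value proposed for max_generated_corpus_size.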

When it comes to queries we will configure the following:

query_min_date: June 29th, 2024
query_max_date: June 30th, 2024
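
The data generation and query settings above could be collected into a single JSON document, for example to pass to Rally through its `--track-params` option. The sketch below uses the parameter names from this comment; the value formats (unit suffixes such as "500GB", ISO dates, and expressing 2 TB as 2000 for max_total_download_gb) are assumptions and should be checked against the elastic/logs track documentation before use.

```python
import json

# Parameter names come from the proposal above.
# Value formats are ASSUMPTIONS (unit suffixes, ISO dates, GB as a plain number).
track_params = {
    "raw_data_volume_per_day": "500GB",
    "max_generated_corpus_size": "15TB",
    "data_generation_clients": 16,
    "max_total_download_gb": 2000,  # 2 TB expressed in GB, assuming 1 TB = 1000 GB
    "start_date": "2024-06-01",
    "end_date": "2024-06-30",
    "query_min_date": "2024-06-29",
    "query_max_date": "2024-06-30",
}

# Serialize so the settings can be saved to a file and reused across runs.
params_json = json.dumps(track_params, indent=2)
print(params_json)
```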

salvatore-campagna · 2024-07-18T11:46:05Z

@elastic/es-perf I tagged you since I would like feedback about this test setup for the elastic/logs track to be used for an experiment. Our main concern is around the amount of data. 15 TB of raw data means a JSON dataset that is around 150 TB. I suspect indexing time might be quite long.
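
To make the indexing-time concern concrete, here is a rough back-of-the-envelope sketch. The x10 expansion factor comes from the comments above; the sustained indexing throughput is a purely hypothetical placeholder, not a measurement, and should be replaced with a number observed on the target hardware.

```python
def json_size_tb(raw_tb, expansion=10):
    """Estimated JSON corpus size, using the x10 expansion mentioned in this thread."""
    return raw_tb * expansion


def indexing_hours(json_tb, mb_per_s):
    """Hours needed to ingest json_tb at a sustained rate of mb_per_s (1 TB = 1,000,000 MB)."""
    return json_tb * 1_000_000 / mb_per_s / 3600


corpus_tb = json_size_tb(15)
# 500 MB/s is a HYPOTHETICAL placeholder throughput, not a benchmark result.
hours = indexing_hours(corpus_tb, mb_per_s=500)
print(f"{corpus_tb} TB of JSON at 500 MB/s takes about {hours:.1f} hours")
```

Even under this optimistic placeholder rate the estimate is on the order of days, which is why the amount of data is the main concern.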

pjbertels · 2024-07-19T13:17:23Z

What type of logs do you need? I found this https://github.com/logpai/loghub but the volumes are not what you are asking for.

salvatore-campagna · 2024-07-22T08:32:02Z

https://github.com/logpai/loghub

The instance we would like to use has 7.5 TB of storage... so those look way too small to me. Also, it has 64 GB of memory, which means plenty of capacity to cache large chunks.

salvatore-campagna · 2024-07-25T10:46:48Z

Closing this in favor of elastic/rally-tracks#631

salvatore-campagna added the >enhancement, >test, and :StorageEngine/Logs labels Jul 18, 2024

elasticsearchmachine added the Team:StorageEngine label Jul 18, 2024

salvatore-campagna added the Team:Performance label Jul 18, 2024

elasticsearchmachine removed the Team:Performance label Jul 18, 2024

salvatore-campagna self-assigned this Jul 18, 2024

salvatore-campagna closed this as completed Jul 25, 2024
