LogsDB - Rally elastic/logs dataset generation #111009

Comments
Pinging @elastic/es-storage-engine (Team:StorageEngine)

Ideally we should generate 30 days of data so as to have a storage footprint on the order of 10-20 TB (this means a lot more JSON data, roughly 10x). For the search side we would need to query 1 day of data. With a test configured like this we could evaluate how we scale. A possible configuration for the Data Generation stage would be the following:
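What follows is only a rough sketch: the parameter names (`start_date`, `end_date`, `raw_data_volume_per_day`) come from the elastic/logs track documentation, but the concrete values are illustrative assumptions for a ~15 TB raw corpus, not a vetted configuration.

```jsonc
// Hypothetical track-params sketch for the data generation stage.
// Values are assumptions (30 days at ~500 GB/day, i.e. ~15 TB raw);
// check the track README for the authoritative parameter list and formats.
{
  "start_date": "2023-01-01",
  "end_date": "2023-01-31",
  "raw_data_volume_per_day": "500GB"
}
```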
When it comes to queries, we will configure the following:
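Again only a sketch: `query_min_date` and `query_max_date` are query-side parameters documented by the track, and the values below are assumptions meant to restrict queries to a single day out of the 30-day corpus.

```jsonc
// Hypothetical sketch: target 1 day of data out of the 30-day corpus.
// Names and values are assumptions; verify against the track README.
{
  "query_min_date": "2023-01-15",
  "query_max_date": "2023-01-15"
}
```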
@elastic/es-perf I tagged you since I would like feedback about this test setup.

What type of logs do you need? I found this, https://github.com/logpai/loghub, but the volumes are not what you are asking for.

The instance we would like to use has 7.5 TB of storage, so those look way too small to me. Also, it has 64 GB of memory, which means plenty of room to cache large chunks of data.

Closing this in favor of elastic/rally-tracks#631

Description
We need to generate a suitable dataset to be used in our logging experiments with LogsDB. The dataset should be crafted so as to resemble a real-world logging use case.
The elastic/logs Rally track includes a "Data Generation" stage whose parameters can be tweaked to generate a dataset with specific characteristics, plus a set of queries whose data access pattern can also be configured. The goal behind this issue is to find suitable parameters for this data generation stage and generate a dataset that resembles a real-world logging use case as closely as possible. Ideally we would like to work with a very large dataset, so that disks are filled up enough to benchmark querying in a real-world scenario.
Once we have such a dataset we will use it to run experiments and understand how our deployments react when resources are scaled up or down: more precisely, how query latency is affected and whether we hit any errors when running queries (especially out-of-memory errors).
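For illustration, the shape of the time-bounded searches this implies is a simple range query over one day of data. The example below is hand-written against a hypothetical `logs-*` target, not taken from the track's query workflows.

```jsonc
// Illustrative only: fetch 1 day of data from a hypothetical logs-* target.
// GET /logs-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-01-15T00:00:00Z",
        "lt": "2023-01-16T00:00:00Z"
      }
    }
  }
}
```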
Hardware expected to be used for this profile: https://instances.vantage.sh/aws/ec2/im4gn.4xlarge