
Fix code for custom dataset usage #395

Merged · 1 commit merged into zilliztech:main on Nov 1, 2024

Conversation

acanadil
Contributor

I recently tried this repo for benchmarking my local Milvus deployment. After a basic run to check that everything was OK, I wanted to run the benchmark on a dataset I was already using: Glove. I prepared and adapted all the required files in the specific format your README.md describes.

I ran the following command:

vectordbbench milvusivfflat --uri http://MY_MILVUS_ENDPOINT:19530/ --case-type PerformanceCustomDataset --custom-dataset-dir datasets/glove/ --custom-dataset-size 1183514 --custom-dataset-dim 100 --custom-dataset-file-count 1 --custom-dataset-name Glove --custom-case-description TestDescription --custom-dataset-metric-type L2 --custom-case-load-timeout 36000 --custom-case-optimize-timeout 36000

This returned the following error:

2024-10-31 15:39:16,604 | INFO: Task:
TaskConfig(db=<DB.Milvus: 'Milvus'>, db_config=MilvusConfig(db_label='2024-10-31T15:39:16.559028', version='', note='', uri=SecretStr('**********')), db_case_config=IVFFlatConfig(index=<IndexType.IVFFlat: 'IVF_FLAT'>, metric_type=None, nlist=2048, nprobe=256), case_config=CaseConfig(case_id=<CaseType.PerformanceCustomDataset: 101>, custom_case={}, k=100, concurrency_search_config=ConcurrencySearchConfig(num_concurrency=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], concurrency_duration=30)), stages=['drop_old', 'load', 'search_serial'])
 (cli.py:494) (82)
2024-10-31 15:39:16,605 | INFO: generated uuid for the tasks: 19976088627240ab854308c4ab247d4e (interface.py:66) (82)
Traceback (most recent call last):
  File "/usr/local/bin/vectordbbench", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/clients/milvus/cli.py", line 95, in MilvusIVFFlat
    run(
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/cli/cli.py", line 496, in run
    benchMarkRunner.run([task])
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/interface.py", line 73, in run
    self.running_task = Assembler.assemble_all(run_id, task_label, tasks, self.dataset_source)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/assembler.py", line 40, in assemble_all
    runners = [cls.assemble(run_id, task, source) for task in tasks]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/assembler.py", line 40, in <listcomp>
    runners = [cls.assemble(run_id, task, source) for task in tasks]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/assembler.py", line 17, in assemble
    c = c_cls(task.case_config.custom_case)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/cases.py", line 57, in case_cls
    return type2case.get(self)(**custom_configs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: PerformanceCustomDataset.__init__() missing 5 required positional arguments: 'name', 'description', 'load_timeout', 'optimize_timeout', and 'dataset_config'

It turns out that `custom_configs` in `return type2case.get(self)(**custom_configs)` (which corresponds to `task.case_config.custom_case`) is an empty dictionary. The cause is in the cli.py file, on line 471, where the `task` variable is created.

That line sets `custom_case` to an empty dictionary, since there is no "custom_case" key in the arguments. In fact, there is an unused `get_custom_case_config()` function that takes the `parameters` variable and, if `case_type` is a custom dataset, collects 'name', 'description', 'load_timeout', 'optimize_timeout', and the other dataset configuration values into a dictionary, which is exactly what is needed here.

def get_custom_case_config(parameters: dict) -> dict:
    custom_case_config = {}
    if parameters["case_type"] == "PerformanceCustomDataset":
        custom_case_config = {
            "name": parameters["custom_case_name"],
            "description": parameters["custom_case_description"],
            "load_timeout": parameters["custom_case_load_timeout"],
            "optimize_timeout": parameters["custom_case_optimize_timeout"],
            "dataset_config": {
                "name": parameters["custom_dataset_name"],
                "dir": parameters["custom_dataset_dir"],
                "size": parameters["custom_dataset_size"],
                "dim": parameters["custom_dataset_dim"],
                "metric_type": parameters["custom_dataset_metric_type"],
                "file_count": parameters["custom_dataset_file_count"],
                "use_shuffled": parameters["custom_dataset_use_shuffled"],
                "with_gt": parameters["custom_dataset_with_gt"],
            }
        }
    return custom_case_config
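To make the failure mode concrete, here is a minimal, self-contained sketch (the dataclass below is a stand-in I wrote for this illustration, not the real class in cases.py): expanding an empty dict with `**` reproduces the `TypeError` from the traceback, while a dict shaped like the one `get_custom_case_config()` assembles satisfies all five required arguments.

```python
from dataclasses import dataclass


@dataclass
class PerformanceCustomDataset:
    """Stand-in for the real case class; it requires five arguments."""
    name: str
    description: str
    load_timeout: int
    optimize_timeout: int
    dataset_config: dict


def build_case(custom_configs: dict) -> PerformanceCustomDataset:
    # Mirrors `type2case.get(self)(**custom_configs)` in cases.py
    return PerformanceCustomDataset(**custom_configs)


# What the CLI effectively did: custom_case was {} -> TypeError
try:
    build_case({})
except TypeError as e:
    print("empty dict fails:", e)

# With the dict assembled by get_custom_case_config(), construction succeeds
case = build_case({
    "name": "Custom",
    "description": "TestDescription",
    "load_timeout": 36000,
    "optimize_timeout": 36000,
    "dataset_config": {
        "name": "Glove",
        "dir": "datasets/glove/",
        "size": 1183514,
        "dim": 100,
        "metric_type": "L2",
        "file_count": 1,
    },
})
print("constructed case:", case.name)
```

The fix in the PR is then to pass the result of `get_custom_case_config(parameters)` as `custom_case` when building the task in cli.py, instead of the empty default.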

Changing this in the code makes the tool work fine:

2024-10-31 15:43:23,681 | INFO: Task:
TaskConfig(db=<DB.Milvus: 'Milvus'>, db_config=MilvusConfig(db_label='2024-10-31T15:43:23.635802', version='', note='', uri=SecretStr('**********')), db_case_config=IVFFlatConfig(index=<IndexType.IVFFlat: 'IVF_FLAT'>, metric_type=None, nlist=2048, nprobe=256), case_config=CaseConfig(case_id=<CaseType.PerformanceCustomDataset: 101>, custom_case={'name': 'Custom', 'description': 'TestDescription', 'load_timeout': 36000, 'optimize_timeout': 36000, 'dataset_config': {'name': 'Glove', 'dir': '/app/datasets/glove/', 'size': '1183514', 'dim': '100', 'metric_type': 'L2', 'file_count': '1', 'use_shuffled': False, 'with_gt': True}}, k=100, concurrency_search_config=ConcurrencySearchConfig(num_concurrency=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], concurrency_duration=30)), stages=['drop_old', 'load', 'search_serial'])
 (cli.py:495) (387)
2024-10-31 15:43:23,681 | INFO: generated uuid for the tasks: bc4de2b3b77a4517bc66377556e9831a (interface.py:66) (387)
2024-10-31 15:43:23,950 | INFO | DB             | CaseType     Dataset               Filter | task_label (task_runner.py:338)
2024-10-31 15:43:23,950 | INFO | -----------    | ------------ -------------------- ------- | -------    (task_runner.py:338)
2024-10-31 15:43:23,950 | INFO | Milvus-2024-10-31T15:43:23.635802 | Performance  Glove-Custom-1M         None | bc4de2b3b77a4517bc66377556e9831a (task_runner.py:338)
2024-10-31 15:43:23,950 | INFO: task submitted: id=bc4de2b3b77a4517bc66377556e9831a, bc4de2b3b77a4517bc66377556e9831a, case number: 1 (interface.py:231) (387)
2024-10-31 15:43:25,094 | INFO: [1/1] start case: {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Glove', 'size': 1183514, 'dim': 100, 'metric_type': <MetricType.L2: 'L2'>}}, 'db': 'Milvus-2024-10-31T15:43:23.635802'}, drop_old=True (interface.py:164) (454)
2024-10-31 15:43:25,095 | INFO: Starting run (task_runner.py:100) (454)
2024-10-31 15:43:25,360 | INFO: Milvus client drop_old collection: VectorDBBenchCollection (milvus.py:45) (454)
2024-10-31 15:43:25,382 | INFO: Milvus create collection: VectorDBBenchCollection (milvus.py:55) (454)
2024-10-31 15:43:26,180 | INFO: Read the entire file into memory: test.parquet (dataset.py:229) (454)
2024-10-31 15:43:26,219 | INFO: Read the entire file into memory: neighbors.parquet (dataset.py:229) (454)
2024-10-31 15:43:26,226 | INFO: Start performance case (task_runner.py:158) (454)

@sre-ci-robot
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: acanadil, alwayslove2013
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alwayslove2013 alwayslove2013 merged commit 20c513d into zilliztech:main Nov 1, 2024