
Fix code for custom dataset usage #395

Merged · 1 commit merged into zilliztech:main on Nov 1, 2024

Conversation

acanadil
Contributor

I recently tried this repo for benchmarking my local Milvus deployment. After a basic run to check that everything was OK, I wanted to run the benchmark on a dataset I was already using: Glove. I prepared and adapted all the required files in the specific format your README.md describes.

I ran the following command:

vectordbbench milvusivfflat --uri http://MY_MILVUS_ENDPOINT:19530/ --case-type PerformanceCustomDataset --custom-dataset-dir datasets/glove/ --custom-dataset-size 1183514 --custom-dataset-dim 100 --custom-dataset-file-count 1 --custom-dataset-name Glove --custom-case-description TestDescription --custom-dataset-metric-type L2 --custom-case-load-timeout 36000 --custom-case-optimize-timeout 36000

This returned the following error:

2024-10-31 15:39:16,604 | INFO: Task:
TaskConfig(db=<DB.Milvus: 'Milvus'>, db_config=MilvusConfig(db_label='2024-10-31T15:39:16.559028', version='', note='', uri=SecretStr('**********')), db_case_config=IVFFlatConfig(index=<IndexType.IVFFlat: 'IVF_FLAT'>, metric_type=None, nlist=2048, nprobe=256), case_config=CaseConfig(case_id=<CaseType.PerformanceCustomDataset: 101>, custom_case={}, k=100, concurrency_search_config=ConcurrencySearchConfig(num_concurrency=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], concurrency_duration=30)), stages=['drop_old', 'load', 'search_serial'])
 (cli.py:494) (82)
2024-10-31 15:39:16,605 | INFO: generated uuid for the tasks: 19976088627240ab854308c4ab247d4e (interface.py:66) (82)
Traceback (most recent call last):
  File "/usr/local/bin/vectordbbench", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/clients/milvus/cli.py", line 95, in MilvusIVFFlat
    run(
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/cli/cli.py", line 496, in run
    benchMarkRunner.run([task])
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/interface.py", line 73, in run
    self.running_task = Assembler.assemble_all(run_id, task_label, tasks, self.dataset_source)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/assembler.py", line 40, in assemble_all
    runners = [cls.assemble(run_id, task, source) for task in tasks]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/assembler.py", line 40, in <listcomp>
    runners = [cls.assemble(run_id, task, source) for task in tasks]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/assembler.py", line 17, in assemble
    c = c_cls(task.case_config.custom_case)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vectordb_bench/backend/cases.py", line 57, in case_cls
    return type2case.get(self)(**custom_configs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: PerformanceCustomDataset.__init__() missing 5 required positional arguments: 'name', 'description', 'load_timeout', 'optimize_timeout', and 'dataset_config'

It turns out that `custom_configs` in `return type2case.get(self)(**custom_configs)` (which corresponds to `task.case_config.custom_case`) is an empty dictionary. The cause is in the cli.py file, on line 471, where the `task` variable is created.

That line sets `custom_case` to an empty dictionary, since there is no "custom_case" key in the arguments. In fact, there is an unused `get_custom_case_config()` function that takes the `parameters` variable and, if `case_type` is a custom dataset, collects 'name', 'description', 'load_timeout', 'optimize_timeout', and the other dataset configuration values into a dictionary, which is exactly what is needed here.

def get_custom_case_config(parameters: dict) -> dict:
    custom_case_config = {}
    if parameters["case_type"] == "PerformanceCustomDataset":
        custom_case_config = {
            "name": parameters["custom_case_name"],
            "description": parameters["custom_case_description"],
            "load_timeout": parameters["custom_case_load_timeout"],
            "optimize_timeout": parameters["custom_case_optimize_timeout"],
            "dataset_config": {
                "name": parameters["custom_dataset_name"],
                "dir": parameters["custom_dataset_dir"],
                "size": parameters["custom_dataset_size"],
                "dim": parameters["custom_dataset_dim"],
                "metric_type": parameters["custom_dataset_metric_type"],
                "file_count": parameters["custom_dataset_file_count"],
                "use_shuffled": parameters["custom_dataset_use_shuffled"],
                "with_gt": parameters["custom_dataset_with_gt"],
            }
        }
    return custom_case_config
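To make the failure mode concrete, here is a minimal, self-contained sketch (the dataclass below is a stand-in I wrote for this illustration, not the real class in cases.py): expanding an empty dict with `**` reproduces the `TypeError` from the traceback, while a dict shaped like the one `get_custom_case_config()` assembles satisfies all five required arguments.

```python
from dataclasses import dataclass


@dataclass
class PerformanceCustomDataset:
    """Stand-in for the real case class; it requires five arguments."""
    name: str
    description: str
    load_timeout: int
    optimize_timeout: int
    dataset_config: dict


def build_case(custom_configs: dict) -> PerformanceCustomDataset:
    # Mirrors `type2case.get(self)(**custom_configs)` in cases.py
    return PerformanceCustomDataset(**custom_configs)


# What the CLI effectively did: custom_case was {} -> TypeError
try:
    build_case({})
except TypeError as e:
    print("empty dict fails:", e)

# With the dict assembled by get_custom_case_config(), construction succeeds
case = build_case({
    "name": "Custom",
    "description": "TestDescription",
    "load_timeout": 36000,
    "optimize_timeout": 36000,
    "dataset_config": {
        "name": "Glove",
        "dir": "datasets/glove/",
        "size": 1183514,
        "dim": 100,
        "metric_type": "L2",
        "file_count": 1,
    },
})
print("constructed case:", case.name)
```

The fix in the PR is then to pass the result of `get_custom_case_config(parameters)` as `custom_case` when building the task in cli.py, instead of the empty default.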

Changing this in the code makes the tool work fine:

2024-10-31 15:43:23,681 | INFO: Task:
TaskConfig(db=<DB.Milvus: 'Milvus'>, db_config=MilvusConfig(db_label='2024-10-31T15:43:23.635802', version='', note='', uri=SecretStr('**********')), db_case_config=IVFFlatConfig(index=<IndexType.IVFFlat: 'IVF_FLAT'>, metric_type=None, nlist=2048, nprobe=256), case_config=CaseConfig(case_id=<CaseType.PerformanceCustomDataset: 101>, custom_case={'name': 'Custom', 'description': 'TestDescription', 'load_timeout': 36000, 'optimize_timeout': 36000, 'dataset_config': {'name': 'Glove', 'dir': '/app/datasets/glove/', 'size': '1183514', 'dim': '100', 'metric_type': 'L2', 'file_count': '1', 'use_shuffled': False, 'with_gt': True}}, k=100, concurrency_search_config=ConcurrencySearchConfig(num_concurrency=[1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100], concurrency_duration=30)), stages=['drop_old', 'load', 'search_serial'])
 (cli.py:495) (387)
2024-10-31 15:43:23,681 | INFO: generated uuid for the tasks: bc4de2b3b77a4517bc66377556e9831a (interface.py:66) (387)
2024-10-31 15:43:23,950 | INFO | DB             | CaseType     Dataset               Filter | task_label (task_runner.py:338)
2024-10-31 15:43:23,950 | INFO | -----------    | ------------ -------------------- ------- | -------    (task_runner.py:338)
2024-10-31 15:43:23,950 | INFO | Milvus-2024-10-31T15:43:23.635802 | Performance  Glove-Custom-1M         None | bc4de2b3b77a4517bc66377556e9831a (task_runner.py:338)
2024-10-31 15:43:23,950 | INFO: task submitted: id=bc4de2b3b77a4517bc66377556e9831a, bc4de2b3b77a4517bc66377556e9831a, case number: 1 (interface.py:231) (387)
2024-10-31 15:43:25,094 | INFO: [1/1] start case: {'label': <CaseLabel.Performance: 2>, 'dataset': {'data': {'name': 'Glove', 'size': 1183514, 'dim': 100, 'metric_type': <MetricType.L2: 'L2'>}}, 'db': 'Milvus-2024-10-31T15:43:23.635802'}, drop_old=True (interface.py:164) (454)
2024-10-31 15:43:25,095 | INFO: Starting run (task_runner.py:100) (454)
2024-10-31 15:43:25,360 | INFO: Milvus client drop_old collection: VectorDBBenchCollection (milvus.py:45) (454)
2024-10-31 15:43:25,382 | INFO: Milvus create collection: VectorDBBenchCollection (milvus.py:55) (454)
2024-10-31 15:43:26,180 | INFO: Read the entire file into memory: test.parquet (dataset.py:229) (454)
2024-10-31 15:43:26,219 | INFO: Read the entire file into memory: neighbors.parquet (dataset.py:229) (454)
2024-10-31 15:43:26,226 | INFO: Start performance case (task_runner.py:158) (454)

@sre-ci-robot
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: acanadil, alwayslove2013
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alwayslove2013 alwayslove2013 merged commit 20c513d into zilliztech:main Nov 1, 2024