Fix merging of CLI args and Yaml configs in vdb_upload example (#1813)
Currently, CLI args always take precedence over Yaml config values. However, since most of the CLI args have a default value, in practice the Yaml config values are always ignored.

* Differentiate the CLI args the user explicitly specified on the command line from those that only carry default values the user didn't specify.
* Move default values out of the code blocks into global dicts.
* Fix a typo that caused Yaml schema definitions to be ignored.
* Resulting code is 100 lines shorter
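
As a minimal sketch of the first bullet: Click records where each parameter value came from, and `Context.get_parameter_source` exposes that, which is the call the updated `run.py` in this commit relies on. The command name `show_explicit`, its two options, their default values, and the helper `explicit_args` are hypothetical and exist only for illustration.

```python
import click
from click.core import ParameterSource


@click.command()
@click.option("--batch_size", type=int, default=128)            # hypothetical option
@click.option("--model_name", type=str, default="example-model")  # hypothetical option
@click.pass_context
def show_explicit(ctx: click.Context, **kwargs):
    """Return only the options the user actually typed on the command line."""
    # Anything whose source is DEFAULT was filled in by Click, not by the user.
    return {
        key: value
        for key, value in kwargs.items()
        if ctx.get_parameter_source(key) is not ParameterSource.DEFAULT
    }


def explicit_args(argv):
    # standalone_mode=False makes main() return the callback's return value
    # instead of calling sys.exit().
    return show_explicit.main(args=argv, standalone_mode=False)
```

Note that an option passed explicitly with a value equal to its default still counts as explicit, since the check looks at the source of the value rather than the value itself.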

Precedence order:
1. Explicit CLI args
2. Yaml config (if there is one)
3. Default CLI args
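
The precedence order above can be sketched as a layered dictionary merge, applied from lowest to highest priority. This is a simplified illustration over flat dicts, not the `build_config` implementation from this commit; the function name `merge_by_precedence` and the sample keys are assumptions.

```python
def merge_by_precedence(default_cli_args, yaml_config, explicit_cli_args):
    """Merge three flat config layers; later layers override earlier ones.

    Hypothetical helper: the real configs in this example are nested, so this
    only demonstrates the ordering, not the actual merge logic.
    """
    merged = dict(default_cli_args)    # 3. default CLI args (lowest priority)
    merged.update(yaml_config or {})   # 2. Yaml config, if there is one
    merged.update(explicit_cli_args)   # 1. explicit CLI args (highest priority)
    return merged


# Sample values chosen purely for illustration.
defaults = {"batch_size": 128, "model_name": "default-model"}
from_yaml = {"batch_size": 256}
from_cli = {"model_name": "cli-model"}
result = merge_by_precedence(defaults, from_yaml, from_cli)
```

Here `batch_size` comes from the Yaml layer because the user did not pass it, while `model_name` comes from the command line.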

Closes #1752

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1813
dagardner-nv committed Jul 24, 2024
1 parent aff2f7a commit ad915cb
Showing 5 changed files with 335 additions and 436 deletions.
4 changes: 2 additions & 2 deletions examples/llm/vdb_upload/langchain.py
@@ -20,15 +20,15 @@
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores.milvus import Milvus

-from examples.llm.vdb_upload.vdb_utils import build_rss_urls
+from examples.llm.vdb_upload.vdb_utils import DEFAULT_RSS_URLS
 from morpheus.utils.logging_timer import log_time

 logger = logging.getLogger(__name__)


 def chain(model_name, save_cache):
     with log_time(msg="Seeding with chain took {duration} ms. {rate_per_sec} docs/sec", log_fn=logger.debug) as log:
-        loader = RSSFeedLoader(urls=build_rss_urls())
+        loader = RSSFeedLoader(urls=DEFAULT_RSS_URLS.copy())

         documents = loader.load()

25 changes: 14 additions & 11 deletions examples/llm/vdb_upload/run.py
@@ -16,8 +16,7 @@
 import os

 import click
-from vdb_upload.vdb_utils import build_cli_configs
-from vdb_upload.vdb_utils import build_final_config
+from vdb_upload.vdb_utils import build_config
 from vdb_upload.vdb_utils import is_valid_service

 logger = logging.getLogger(__name__)
@@ -144,7 +143,8 @@ def run():
     default="http://localhost:19530",
     help="URI for connecting to Vector Database server.",
 )
-def pipeline(**kwargs):
+@click.pass_context
+def pipeline(ctx: click.Context, **kwargs):
     """
     Configure and run the data processing pipeline based on the specified command-line options.
@@ -154,25 +154,28 @@ def pipeline(**kwargs):
     Parameters
     ----------
+    ctx: click.Context
+        Click context object.
     **kwargs : dict
         Keyword arguments containing command-line options.
     Returns
     -------
     The result of the internal pipeline function call.
     """

     vdb_config_path = kwargs.pop('vdb_config_path', None)
-    cli_source_conf, cli_embed_conf, cli_pipe_conf, cli_tok_conf, cli_vdb_conf = build_cli_configs(**kwargs)
-    final_config = build_final_config(vdb_config_path,
-                                      cli_source_conf,
-                                      cli_embed_conf,
-                                      cli_pipe_conf,
-                                      cli_tok_conf,
-                                      cli_vdb_conf)

+    # When a config file is provided, only merge the explicit flags set by the user
+    explicit_cli_args = {}
+    for (key, value) in kwargs.items():
+        if ctx.get_parameter_source(key) is not click.core.ParameterSource.DEFAULT:
+            explicit_cli_args[key] = value

+    config = build_config(vdb_conf_path=vdb_config_path, explicit_cli_args=explicit_cli_args, implicit_cli_args=kwargs)
     # Call the internal pipeline function with the final config dictionary
     from .pipeline import pipeline as _pipeline
-    return _pipeline(**final_config)
+    return _pipeline(**config)


 @run.command()
3 changes: 1 addition & 2 deletions examples/llm/vdb_upload/vdb_config.yaml
@@ -31,7 +31,7 @@ vdb_pipeline:

   sources:
     - type: rss
-      name: "rss_cve"
+      name: "rss"
       config:
         batch_size: 128 # Number of rss feeds per batch
         cache_dir: "./.cache/http"
@@ -75,7 +75,6 @@ vdb_pipeline:
         output_batch_size: 2048 # Number of chunked documents per output batch
         request_timeout_sec: 2.0
         run_indefinitely: true
-        stop_after_rec: 0
         strip_markup: true
         web_scraper_config:
           chunk_overlap: 51
