Support data asset input (#2040)
# Description
In local mode, we first try the configured `document_nodes_file`; if it
is not a valid local path, we fall back to `documents_folder`. In cloud
mode, `document_nodes_file` is used whenever it is configured, without
checking that the path exists, because we cannot verify locally whether
the value refers to a data asset. In cloud mode, both local files and
data assets can be used.
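
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in `run.py`: the two validators mirror the helpers added to `common.py` in this commit, while `choose_input` and its return values are hypothetical names introduced here (the real script inlines this logic and reports failures through `parser.error`).

```python
from pathlib import Path


def local_path_exists(path: str) -> bool:
    # Local mode: the path must exist on disk.
    return Path(path).exists()


def non_padding_path(path: str) -> bool:
    # Cloud mode: only reject unfilled placeholders such as "<your-node-file-path>".
    return not (path.startswith("<") and path.endswith(">"))


def choose_input(document_nodes_file: str, documents_folder: str, cloud: bool) -> str:
    """Hypothetical helper: return which input would be used, or raise ValueError."""
    validate = non_padding_path if cloud else local_path_exists
    if document_nodes_file and validate(document_nodes_file):
        return "nodes_file"  # the document split step is skipped
    if documents_folder and validate(documents_folder):
        return "documents_folder"  # documents are split into nodes first
    raise ValueError(
        "Either 'documents_folder' or 'document_nodes_file' should be specified correctly."
    )
```

In cloud mode, any value that is not a `<...>` placeholder passes, so a mistyped data asset path is only detected once the run starts.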

**Local Tests:**

1. happy path

     - skip split 

![image](https://github.com/microsoft/promptflow/assets/75061414/a937b8fc-4843-474f-a506-d3d1de6a1053)

     - has split: 
`documents_folder = "..\..\docs\reference\tools-reference"`

![image](https://github.com/microsoft/promptflow/assets/75061414/ca0d3a2c-14a8-424a-a927-8cbbc5a73537)

2. invalid file path and folder path

![image](https://github.com/microsoft/promptflow/assets/75061414/c0ee9255-f853-49e2-af09-fc0b61410d56)


**Cloud Tests:**
1. happy path (local and asset)
   - local path
     - skip split:
https://ml.azure.com/runs/tough_guava_3znwhjb9sc?wsid=/subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourcegroups/promptflow/workspaces/yaopfeus&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
     - has split:
https://ml.azure.com/runs/happy_flower_kr0y3yy56z?wsid=/subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourcegroups/promptflow/workspaces/yaopfeus&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
   - data asset path
     - skip split:
https://ml.azure.com/runs/frosty_raisin_tsrhm41knv?wsid=/subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourcegroups/promptflow/workspaces/yaopfeus&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
     - has split:
https://ml.azure.com/runs/neat_engine_stv9pnc0tg?wsid=/subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourcegroups/promptflow/workspaces/yaopfeus&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
2. invalid local file path and folder path
   - invalid path but not default path

![image](https://github.com/microsoft/promptflow/assets/75061414/3bb7fa53-e91d-4e0e-86d7-07503073460a)
   - default path

![image](https://github.com/microsoft/promptflow/assets/75061414/e14a0d7b-126a-41c6-914b-4d716da791e2)



3. invalid data asset path

https://ml.azure.com/runs/lucid_brake_dvwr49x28v?wsid=/subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourcegroups/promptflow/workspaces/yaopfeus&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#

![image](https://github.com/microsoft/promptflow/assets/75061414/54d77749-6e3c-4cec-83a9-5e73e6bb26e1)


# All Promptflow Contribution checklist:
- [ ] **The pull request does not introduce [breaking changes].**
- [ ] **CHANGELOG is updated for new features, bug fixes or other
significant changes.**
- [ ] **I have read the [contribution guidelines](../CONTRIBUTING.md).**
- [ ] **Create an issue and link to the pull request to get dedicated
review from promptflow team. Learn more: [suggested
workflow](../CONTRIBUTING.md#suggested-workflow).**

## General Guidelines and Best Practices
- [ ] Title of the pull request is clear and informative.
- [ ] There are a small number of commits, each of which has an
informative message. This means that previously merged commits do not
appear in the history of the PR. For more information on cleaning up the
commits in your PR, [see this
page](https://github.com/Azure/azure-powershell/blob/master/documentation/development-docs/cleaning-up-commits.md).

### Testing Guidelines
- [ ] Pull request includes test coverage for the included changes.

---------

Co-authored-by: cs_lucky <si.chen@microsoft.com>
chenslucky and cs_lucky authored Feb 19, 2024
1 parent 776b712 commit 5726662
Showing 3 changed files with 15 additions and 4 deletions.
1 change: 1 addition & 0 deletions examples/gen_test_data/config.ini.example
@@ -11,6 +11,7 @@ document_chunk_size = 512
 document_chunk_overlap = 100
 ; However, if you wish to bypass the document split process, simply provide the 'document_nodes_file', which is a JSONL file.
 ; When both `documents_folder` and `document_nodes_file` are configured, will use 'document_nodes_file' and ignore 'documents_folder'.
+; For cloud mode, both local files and data assets can be used.
 ; document_nodes_file = "<your-node-file-path>"

 ; Test data gen flow configs
8 changes: 8 additions & 0 deletions examples/gen_test_data/gen_test_data/common.py
@@ -185,3 +185,11 @@ def convert_to_abs_path(file_path: str) -> str:
         return abs
     else:
         return file_path
+
+
+def local_path_exists(path):
+    return Path(path).exists()
+
+
+def non_padding_path(path):
+    return not (path.startswith("<") and path.endswith(">"))
10 changes: 6 additions & 4 deletions examples/gen_test_data/gen_test_data/run.py
@@ -16,7 +16,7 @@

 from common import clean_data, count_non_blank_lines, \
     split_document, copy_flow_folder_and_set_node_inputs, \
-    print_progress, convert_to_abs_path  # noqa: E402
+    print_progress, convert_to_abs_path, non_padding_path, local_path_exists  # noqa: E402
 from constants import TEXT_CHUNK, DETAILS_FILE_NAME  # noqa: E402

 logger = get_logger("data.gen")
@@ -276,10 +276,11 @@ def get_ml_client(subscription_id: str, resource_group: str, workspace_name: str
     documents_folder = convert_to_abs_path(args.documents_folder)
     flow_folder = convert_to_abs_path(args.flow_folder)
     output_folder = convert_to_abs_path(args.output_folder)
+    validate_path_func = non_padding_path if args.cloud else local_path_exists

-    if document_nodes_file and Path(document_nodes_file).is_file():
+    if document_nodes_file and validate_path_func(document_nodes_file):
         should_skip_split_documents = True
-    elif not documents_folder or not Path(documents_folder).is_dir():
+    elif not documents_folder or not validate_path_func(documents_folder):
         parser.error(
             "Either 'documents_folder' or 'document_nodes_file' should be specified correctly.\n"
             f"documents_folder: '{documents_folder}'\ndocument_nodes_file: '{document_nodes_file}'"
@@ -295,7 +296,8 @@ def get_ml_client(subscription_id: str, resource_group: str, workspace_name: str
             "Skip step 1 'Split documents to document nodes' as received document nodes from "
             f"input file path '{document_nodes_file}'."
         )
-        logger.info(f"Collected {count_non_blank_lines(document_nodes_file)} document nodes.")
+        if Path(document_nodes_file).is_file():
+            logger.info(f"Collected {count_non_blank_lines(document_nodes_file)} document nodes.")

     copy_flow_folder_and_set_node_inputs(copied_flow_folder, args.flow_folder, args.node_inputs_override)

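
The last hunk guards the node count with `is_file()`. In cloud mode, `document_nodes_file` may name a data asset rather than a local file, so there may be nothing on disk to count. A minimal sketch of that guard follows; the body of `count_non_blank_lines` is an assumption (the real helper lives in `common.py` and is not shown in this diff), and `count_nodes_if_local` is a hypothetical wrapper used only for illustration.

```python
from pathlib import Path
from typing import Optional


def count_non_blank_lines(file_path: str) -> int:
    # Assumed implementation: count lines that contain non-whitespace characters.
    with open(file_path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def count_nodes_if_local(document_nodes_file: str) -> Optional[int]:
    # Hypothetical wrapper: count nodes only when the path is a local file.
    # A data asset reference is not a local file, so the count is skipped.
    if Path(document_nodes_file).is_file():
        return count_non_blank_lines(document_nodes_file)
    return None
```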
