Commit

Chunking API: Added missing descriptions for chunk_by_api, chunk_api_key, and chunking_endpoint (#210)
Paul-Cornell authored Sep 5, 2024
1 parent 085bee8 commit 3378c42
Showing 3 changed files with 8 additions and 8 deletions.
@@ -3,9 +3,9 @@ A common chunking configuration is a critical element in the data processing pip

## Configs

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunk_api_key`
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunk_api_key`: If `chunk_by_api` is set to `True`, requests that are sent to the Unstructured API will use this Unstructured API key to make authenticated calls.

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunk_by_api`: Default: `False`.
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunk_by_api`: Default: `False`. If set to `True`, uses Unstructured API services to run chunking. If set to `False`, runs chunking locally.

* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunk_combine_text_under_n_chars`: Combine consecutive chunks when the first does not exceed length `n` and the second will fit without exceeding the hard-maximum length. Only operative for the `by_title` chunking strategy.

@@ -23,7 +23,7 @@ A common chunking configuration is a critical element in the data processing pip

* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunk_overlap_all`: Applies overlap to chunks formed from whole elements as well as those formed by text-splitting oversized elements. The overlap length is taken from the `chunk_overlap` value.

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunking_endpoint`
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`chunking_endpoint`: If `chunk_by_api` is set to `True`, chunking requests are sent to this Unstructured API URL. By default, this URL is the Unstructured Serverless API URL: `https://api.unstructuredapp.io/general/v0/general`.

* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`chunking_strategy`: One of `basic` or `by_title`. When omitted, no chunking is performed. The `basic` strategy maximally fills each chunk with whole elements, up to the size limits specified by `max_characters` and `new_after_n_chars`. A single element that exceeds this length is divided into two or more chunks using text-splitting. A `Table` element is never combined with any other element and appears as a chunk of its own or, when splitting is required, as a sequence of `TableChunk` elements. The `by_title` behaviors are the same except that section and optionally page boundaries are respected such that two consecutive elements from different sections appear in separate chunks.
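
The `basic` strategy described above can be illustrated with a minimal sketch. This is not the Unstructured implementation: it works on plain strings rather than elements, and it ignores `new_after_n_chars`, overlap, and `Table` handling. The function name `basic_chunks` and the `separator` parameter are hypothetical and exist only for this illustration.

```python
from typing import List


def basic_chunks(
    elements: List[str],
    max_characters: int = 500,
    separator: str = "\n\n",  # hypothetical joiner, not a documented option
) -> List[str]:
    """Greedy illustration of `basic` chunking on plain strings."""
    chunks: List[str] = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # An oversized element is never combined with its neighbors:
            # flush the pending chunk, then split the element by length.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            # The whole element still fits, so keep filling this chunk.
            current = candidate
        else:
            # The element does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks
```

For example, `basic_chunks(["aaaa", "bbbb", "cccccccccccc"], max_characters=10, separator=" ")` combines the first two strings into one chunk and splits the third, which is longer than the limit, into two pieces.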

@@ -4,7 +4,7 @@ A standard partition configuration is a collection of parameters designed to ove

* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`additional_partition_args`: A JSON string representation of any values to pass through to the `partition` function.

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`encoding`: The encoding method used to decode the text input. If None, utf-8 will be used.
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`encoding`: The encoding method used to decode the text input. By default, UTF-8 will be used.

* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`ocr_languages`: The languages present in the document, for use in partitioning, OCR, or both. Multiple languages indicate that the text could be in any of the specified languages.

@@ -16,7 +16,7 @@ A standard partition configuration is a collection of parameters designed to ove

## Configs for the Process

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`api_key`: api key needed to access the Unstructured api.
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`api_key`: If `partition_by_api` is set to `True`, requests that are sent to the Unstructured API will use this Unstructured API key to make authenticated calls.

* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`fields_include`: Fields to include in the output JSON. By default, the following fields are included: `element_id`, `text`, `type`, `metadata`, and `embeddings`.

@@ -28,6 +28,6 @@ A standard partition configuration is a collection of parameters designed to ove

* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`metadata_include`: If provided, only the specified fields are preserved in the `metadata` output.

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`partition_by_api`: Default: `False`. If set to `True`, uses Unstructured API services to run partitioning.
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`partition_by_api`: Default: `False`. If set to `True`, uses Unstructured API services to run partitioning. If set to `False`, runs partitioning locally.

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`partition_endpoint`: If using Unstructured API services, requests are sent to this API URL.
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;,<Icon icon="v"/><Icon icon="1"/>&nbsp;&nbsp;`partition_endpoint`: If `partition_by_api` is set to `True`, partitioning requests are sent to this Unstructured API URL.
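
How `partition_by_api`, `api_key`, and `partition_endpoint` relate to each other can be sketched as follows. The `PartitionSettings` class and its methods are hypothetical and are not part of the Unstructured library. Treating a missing `api_key` as an error is also an assumption made for the illustration: the descriptions above say only that the key is used for authenticated calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionSettings:  # hypothetical container, for illustration only
    partition_by_api: bool = False            # documented default: False
    api_key: Optional[str] = None             # used only when partition_by_api is True
    partition_endpoint: Optional[str] = None  # used only when partition_by_api is True

    def mode(self) -> str:
        """Return where partitioning would run under these settings."""
        return "api" if self.partition_by_api else "local"

    def validate(self) -> None:
        # Assumption: API mode without a key is rejected early.
        if self.partition_by_api and not self.api_key:
            raise ValueError("api_key is required when partition_by_api is True")
```

With the defaults, `PartitionSettings().mode()` returns `"local"`, and the key and endpoint are not consulted.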
@@ -2,7 +2,7 @@ A common process configuration plays a pivotal role in overseeing the entire ing

## Configs

-* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`disable_parallelism`: `True` if the `INGEST_DISABLE_PARALLELISM` environment variable is set to `true` (case-insensitive), otherwise `False` (the default).
+* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`disable_parallelism`: `True` if the `INGEST_DISABLE_PARALLELISM` environment variable is set to `True` (case-insensitive), otherwise `False` (the default).

* <Icon icon="v"/><Icon icon="2"/>&nbsp;&nbsp;`download_only`: Default: `False`. If set to `True`, downloads any files that are not already present in the connector's specified download directory (`download_dir`), or `work_dir` if `download_dir` is not specified, or the default file path for `work_dir` if `work_dir` is not specified.
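
The case-insensitive environment variable check described for `disable_parallelism` could look roughly like the sketch below. The helper name `disable_parallelism_default` and its `environ` parameter are hypothetical. Only the variable name `INGEST_DISABLE_PARALLELISM` comes from the text above.

```python
import os
from typing import Mapping, Optional


def disable_parallelism_default(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if INGEST_DISABLE_PARALLELISM equals "true", ignoring case."""
    env = os.environ if environ is None else environ
    # An unset variable falls back to "false", matching the documented default.
    return env.get("INGEST_DISABLE_PARALLELISM", "false").lower() == "true"
```

Passing a mapping explicitly, such as `disable_parallelism_default({"INGEST_DISABLE_PARALLELISM": "TRUE"})`, makes the behavior easy to check without changing the real process environment.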
