diff --git a/snippets/ingest-configuration-shared/chunking-configuration.mdx b/snippets/ingest-configuration-shared/chunking-configuration.mdx
index 06450b7c..29cacb57 100644
--- a/snippets/ingest-configuration-shared/chunking-configuration.mdx
+++ b/snippets/ingest-configuration-shared/chunking-configuration.mdx
@@ -3,9 +3,9 @@ A common chunking configuration is a critical element in the data processing pip
 
 ## Configs
 
-*   `chunk_api_key`
+*   `chunk_api_key`: If `chunk_by_api` is set to `True`, requests that are sent to the Unstructured API will use this Unstructured API key to make authenticated calls.
 
-*   `chunk_by_api`: Default: `False`.
+*   `chunk_by_api`: Default: `False`. If set to `True`, uses Unstructured API services to run chunking. If set to `False`, runs chunking locally.
 
 *   `chunk_combine_text_under_n_chars`: Combine consecutive chunks when the first does not exceed length `n` and the second will fit without exceeding the hard-maximum length. Only operative for the `by_title` chunking strategy.
 
@@ -23,7 +23,7 @@ A common chunking configuration is a critical element in the data processing pip
 
 *   `chunk_overlap_all`: Applies overlap to chunks formed from whole elements as well as those formed by text-splitting oversized elements. The overlap length is taken from the `chunk_overlap` value.
 
-*   `chunking_endpoint`
+*   `chunking_endpoint`: If `chunk_by_api` is set to `True`, chunking requests are sent to this Unstructured API URL. By default, this URL is the Unstructured Serverless API URL: `https://api.unstructuredapp.io/general/v0/general`.
 
 *   `chunking_strategy`: One of `basic` or `by_title`. When omitted, no chunking is performed. The `basic` strategy maximally fills each chunk with whole elements, up to the size limits specified by `max_characters` and `new_after_n_chars`. A single element that exceeds this length is divided into two or more chunks using text-splitting.
 A `Table` element is never combined with any other element and appears as a chunk of its own or, if splitting is required, as a sequence of `TableChunk` elements. The `by_title` behaviors are the same except that section and optionally page boundaries are respected such that two consecutive elements from different sections appear in separate chunks.
diff --git a/snippets/ingest-configuration-shared/partition-configuration.mdx b/snippets/ingest-configuration-shared/partition-configuration.mdx
index 9151b7ef..0fce0485 100644
--- a/snippets/ingest-configuration-shared/partition-configuration.mdx
+++ b/snippets/ingest-configuration-shared/partition-configuration.mdx
@@ -4,7 +4,7 @@ A standard partition configuration is a collection of parameters designed to ove
 
 *   `additional_partition_args`: A JSON string representation of any values to pass through to the `partition` function.
 
-*   `encoding`: The encoding method used to decode the text input. If None, utf-8 will be used.
+*   `encoding`: The encoding method used to decode the text input. By default, UTF-8 will be used.
 
 *   `ocr_languages`: The languages present in the document, for use in partitioning, OCR, or both. Multiple languages indicate that the text could be in any of the specified languages.
 
@@ -16,7 +16,7 @@ A standard partition configuration is a collection of parameters designed to ove
 
 ## Configs for the Process
 
-*   `api_key`: api key needed to access the Unstructured api.
+*   `api_key`: If `partition_by_api` is set to `True`, requests that are sent to the Unstructured API will use this Unstructured API key to make authenticated calls.
 
 *   `fields_include`: Fields to include in the output JSON. By default, the following fields are included: `element_id`, `text`, `type`, `metadata`, and `embeddings`.
@@ -28,6 +28,6 @@ A standard partition configuration is a collection of parameters designed to ove
 
 *   `metadata_include`: If provided, only the specified fields are preserved in the `metadata` output.
 
-*   `partition_by_api`: Default: `False`. If set to `True`, uses Unstructured API services to run partitioning.
+*   `partition_by_api`: Default: `False`. If set to `True`, uses Unstructured API services to run partitioning. If set to `False`, runs partitioning locally.
 
-*   `partition_endpoint`: If using Unstructured API services, requests are sent to this API URL.
\ No newline at end of file
+*   `partition_endpoint`: If `partition_by_api` is set to `True`, partitioning requests are sent to this Unstructured API URL.
\ No newline at end of file
diff --git a/snippets/ingest-configuration-shared/processor-configuration.mdx b/snippets/ingest-configuration-shared/processor-configuration.mdx
index ad824cfd..2139c09b 100644
--- a/snippets/ingest-configuration-shared/processor-configuration.mdx
+++ b/snippets/ingest-configuration-shared/processor-configuration.mdx
@@ -2,7 +2,7 @@ A common process configuration plays a pivotal role in overseeing the entire ing
 
 ## Configs
 
-*   `disable_parallelism`: `True` if the `INGEST_DISABLE_PARALLELISM` environment variable is set to `true` (case-insensitive), otherwise `False` (the default).
+*   `disable_parallelism`: `True` if the `INGEST_DISABLE_PARALLELISM` environment variable is set to `True` (case-insensitive), otherwise `False` (the default).
 
 *   `download_only`: Default: `False`. If set to `True`, downloads any files that are not already present in the connector's specified download directory (`download_dir`), or `work_dir` if `download_dir` is not specified, or the default file path for `work_dir` if `work_dir` is not specified.
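
Reviewer note: the option semantics described in this diff can be sketched in Python as follows. This is only an illustration of the documented behavior, not the library's implementation. `ChunkSettings`, `describe_chunking_target`, and `parallelism_disabled` are hypothetical names introduced here; the default endpoint URL and the `INGEST_DISABLE_PARALLELISM` variable are taken from the documentation text itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default taken from the `chunking_endpoint` description in the docs text.
DEFAULT_CHUNKING_ENDPOINT = "https://api.unstructuredapp.io/general/v0/general"


@dataclass
class ChunkSettings:
    """Hypothetical container mirroring the documented chunking options."""

    chunk_by_api: bool = False            # documented default: False
    chunk_api_key: Optional[str] = None   # only used when chunk_by_api is True
    chunking_endpoint: str = DEFAULT_CHUNKING_ENDPOINT


def describe_chunking_target(settings: ChunkSettings) -> str:
    """Return where chunking would run under the documented semantics."""
    if not settings.chunk_by_api:
        # chunk_by_api=False: chunking runs locally; key and endpoint are unused.
        return "local"
    # chunk_by_api=True: requests go to the configured Unstructured API URL.
    return f"api:{settings.chunking_endpoint}"


def parallelism_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only when INGEST_DISABLE_PARALLELISM equals 'true', ignoring case."""
    return env.get("INGEST_DISABLE_PARALLELISM", "").lower() == "true"
```

Under these assumptions, `describe_chunking_target(ChunkSettings())` reports local chunking, while setting `chunk_by_api=True` without overriding `chunking_endpoint` falls back to the documented Serverless API URL. The same pattern would apply to `partition_by_api`, `api_key`, and `partition_endpoint`, although the diff does not state a default for `partition_endpoint`.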