diff --git a/docs/getting-started/5-output-rails/README.md b/docs/getting-started/5-output-rails/README.md
index 36a2025a3..c8f0be042 100644
--- a/docs/getting-started/5-output-rails/README.md
+++ b/docs/getting-started/5-output-rails/README.md
@@ -30,12 +30,12 @@ NeMo Guardrails comes with a built-in [output self-checking rail](../../user-gui
Activating the `self check output` rail is similar to the `self check input` rail:
-1. Activate the `self check output` rail in *config.yml*.
-2. Add a `self_check_output` prompt in *prompts.yml*.
+1. Activate the `self check output` rail in `config.yml`.
+2. Add a `self_check_output` prompt in `prompts.yml`.
-### Activate the rail
+### Activate the Rail
-To activate the rail, include the `self check output` flow name in the output rails section of the *config.yml* file:
+To activate the rail, include the `self check output` flow name in the output rails section of the `config.yml` file:
```yaml
output:
@@ -43,9 +43,10 @@ output:
- self check output
```
-For reference, the full `rails` section in `config.yml` should look like the following:
+For reference, update the full `rails` section in `config.yml` to look like the following:
```yaml
+rails:
input:
flows:
- self check input
@@ -66,7 +67,7 @@ define subflow self check output
stop
```
-### Add a prompt
+### Add a Prompt
The self-check output rail needs a prompt to perform the check.
@@ -130,7 +131,7 @@ Summary: 3 LLM call(s) took 1.89 seconds and used 504 tokens.
print(info.llm_calls[2].prompt)
```
-```
+```text
Your task is to check if the bot message below complies with the company policy.
Company policy for the bot:
@@ -160,15 +161,68 @@ As we can see, the LLM did generate the message containing the word "idiot", how
The following figure depicts the process:
-
-

-
+```{image} ../../_static/puml/output_rails_fig_1.png
+```
+
+## Streaming Output
+
+By default, the output from the rail is synchronous.
+You can enable streaming to receive the response in chunks as it is generated, reducing the time to the first response.
+
+1. Modify the `rails` field in the `config.yml` file and add the `streaming` field to enable streaming:
+
+ ```{code-block} yaml
+ :emphasize-lines: 9-11,13
+
+ rails:
+ input:
+ flows:
+ - self check input
+
+ output:
+ flows:
+ - self check output
+ streaming:
+ chunk_size: 200
+ context_size: 50
+
+ streaming: True
+ ```
+
+1. Call the `stream_async` method and handle the chunked response:
+
+ ```python
+ from nemoguardrails import RailsConfig, LLMRails
+
+ config = RailsConfig.from_path("./config")
+
+ rails = LLMRails(config)
+
+ messages = [{"role": "user", "content": "How many days of vacation does a 10-year employee receive?"}]
+
+ async for chunk in rails.stream_async(messages=messages):
+ print(f"CHUNK: {chunk}")
+ ```
+
+ *Partial Output*
+
+ ```output
+ CHUNK: According
+ CHUNK: to
+ CHUNK: the
+ CHUNK: employee
+ CHUNK: handbook,
+ ...
+ ```
+
+For reference information about the related `config.yml` file fields,
+refer to [](../../user-guides/configuration-guide.md#output-rails).
## Custom Output Rail
Build a custom output rail with a list of proprietary words that we want to make sure do not appear in the output.
-1. Create a *config/actions.py* file with the following content, which defines an action:
+1. Create a `config/actions.py` file with the following content, which defines an action:
```python
from typing import Optional
diff --git a/docs/index.rst b/docs/index.rst
deleted file mode 100644
index 0cdb28b6c..000000000
--- a/docs/index.rst
+++ /dev/null
@@ -1,102 +0,0 @@
-NVIDIA NeMo Guardrails
-====================================================
-
-.. toctree::
- :caption: NVIDIA NeMo Guardrails
- :name: NVIDIA NeMo Guardrails
- :maxdepth: 1
-
- introduction.md
- documentation.md
- getting-started/installation-guide
-
-.. toctree::
- :caption: Getting Started
- :name: Getting Started
- :maxdepth: 2
-
- getting-started/1-hello-world/README
- getting-started/2-core-colang-concepts/README
- getting-started/3-demo-use-case/README
- getting-started/4-input-rails/README
- getting-started/5-output-rails/README
- getting-started/6-topical-rails/README
- getting-started/7-rag/README
-
-.. toctree::
- :caption: Colang 2.0
- :name: Colang 2.0
- :maxdepth: 2
-
- colang-2/overview
- colang-2/whats-changed
- colang-2/getting-started/index
- colang-2/language-reference/index
-
-.. toctree::
- :caption: User Guides
- :name: User Guides
- :maxdepth: 2
-
- user-guides/configuration-guide
- user-guides/guardrails-library
- user-guides/guardrails-process
- user-guides/colang-language-syntax-guide
- user-guides/llm-support
- user-guides/python-api
- user-guides/cli
- user-guides/server-guide
- user-guides/langchain/index
- user-guides/detailed-logging/index
- user-guides/jailbreak-detection-heuristics/index
- user-guides/llm/index
- user-guides/multi-config-api/index
- user-guides/migration-guide
-
-.. toctree::
- :caption: Security
- :name: Security
- :maxdepth: 2
-
- security/guidelines
- security/red-teaming
-
-.. toctree::
- :caption: Evaluation
- :name: Evaluation
- :maxdepth: 2
-
- evaluation/README
- evaluation/llm-vulnerability-scanning
-
-.. toctree::
- :caption: Advanced User Guides
- :name: Advanced User Guides
- :maxdepth: 2
-
- user-guides/advanced/generation-options
- user-guides/advanced/prompt-customization
- user-guides/advanced/embedding-search-providers
- user-guides/advanced/using-docker
- user-guides/advanced/streaming
- user-guides/advanced/align-score-deployment
- user-guides/advanced/extract-user-provided-values
- user-guides/advanced/bot-message-instructions
- user-guides/advanced/event-based-api
- user-guides/advanced/llama-guard-deployment
- user-guides/advanced/nested-async-loop
- user-guides/advanced/vertexai-setup
- user-guides/advanced/nemoguard-contentsafety-deployment
- user-guides/advanced/nemoguard-topiccontrol-deployment
- user-guides/advanced/jailbreak-detection-heuristics-deployment
- user-guides/advanced/safeguarding-ai-virtual-assistant-blueprint
-
-.. toctree::
- :caption: Other
- :name: Other
- :maxdepth: 2
-
- architecture/index
- glossary
- faqs
- changes
diff --git a/docs/project.json b/docs/project.json
index caf937f91..6f93bccec 100644
--- a/docs/project.json
+++ b/docs/project.json
@@ -1 +1 @@
-{ "name": "nemo-guardrails-toolkit", "version": "0.11.1" }
+{ "name": "nemo-guardrails-toolkit", "version": "0.12.0" }
diff --git a/docs/user-guides/configuration-guide.md b/docs/user-guides/configuration-guide.md
index 8481854be..b2c409654 100644
--- a/docs/user-guides/configuration-guide.md
+++ b/docs/user-guides/configuration-guide.md
@@ -84,7 +84,7 @@ The meaning of the attributes is as follows:
You can use any LLM provider that is supported by LangChain, e.g., `ai21`, `aleph_alpha`, `anthropic`, `anyscale`, `azure`, `cohere`, `huggingface_endpoint`, `huggingface_hub`, `openai`, `self_hosted`, `self_hosted_hugging_face`. Check out the LangChain official documentation for the full list.
-In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both Nvidia hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
+In addition to the above LangChain providers, connecting to [Nvidia NIMs](https://docs.nvidia.com/nim/index.html) is supported using the engine `nvidia_ai_endpoints` or synonymously `nim`, for both Nvidia hosted NIMs (accessible through an Nvidia AI Enterprise license) and for locally downloaded and self-hosted NIM containers.
```{note}
To use any of the providers, you must install additional packages; when you first try to use a configuration with a new provider, you typically receive an error from LangChain that instructs which packages you should install.
@@ -104,6 +104,7 @@ NIMs can be self hosted, using downloadable containers, or Nvidia hosted and acc
NeMo Guardrails supports connecting to NIMs as follows:
##### Self-hosted NIMs
+
To connect to self-hosted NIMs, set the engine to `nim`. Also make sure the model name matches one of the model names the hosted NIM supports (you can get a list of supported models by sending a GET request to the `v1/models` endpoint).
```yaml
@@ -663,6 +664,86 @@ Output rails process a bot message. The message to be processed is available in
You can deactivate output rails temporarily for the next bot message, by setting the `$skip_output_rails` context variable to `True`.
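+
+As a hedged sketch, one way to set this context variable from the message-based API is a `context` role message; the surrounding messages below are illustrative:
+
+```python
+messages = [
+    # Context variables set this way apply to the next bot message only.
+    {"role": "context", "content": {"skip_output_rails": True}},
+    {"role": "user", "content": "A message whose response should skip output rails."},
+]
+
+response = rails.generate(messages=messages)
+```
+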
+#### Streaming Output Configuration
+
+By default, the response from an output rail is synchronous.
+You can enable streaming to begin receiving responses from the output rail sooner.
+
+You must set the top-level `streaming: True` field in your `config.yml` file.
+
+Then add the `streaming` field and its configuration parameters to the `rails.output` section:
+
+```yaml
+rails:
+ output:
+    flows:
+      - rail name
+ streaming:
+ chunk_size: 200
+ context_size: 50
+ stream_first: True
+
+streaming: True
+```
+
+When streaming is enabled, the toolkit applies output rails to chunks of tokens.
+If a rail blocks a chunk of tokens, the toolkit returns a string in the following format:
+
+```output
+{"event": "ABORT", "data": {"reason": "Blocked by rails.}}
+```
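+
+The following is a minimal sketch of how a client might watch for this blocked-content event while consuming the stream; it assumes the `stream_async` usage shown in the getting-started guide, and the string check is illustrative rather than an official helper:
+
+```python
+import json
+
+async def print_stream(rails, messages):
+    async for chunk in rails.stream_async(messages=messages):
+        # A blocked chunk arrives as the JSON string shown above.
+        if chunk.startswith('{"event": "ABORT"'):
+            details = json.loads(chunk)
+            print(f"Blocked: {details['data']['reason']}")
+            return
+        print(chunk, end="")
+```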
+
+The following table describes the subfields for the `streaming` field:
+
+```{list-table}
+:header-rows: 1
+
+* - Field
+ - Description
+ - Default Value
+
+* - streaming.chunk_size
+ - Specifies the number of tokens for each chunk.
+ The toolkit applies output guardrails on each chunk of tokens.
+
+ Larger values provide more meaningful information for the rail to assess,
+ but can add latency while accumulating tokens for a full chunk.
+    The added latency is especially noticeable if you specify `stream_first: False`.
+ - `200`
+
+* - streaming.context_size
+ - Specifies the number of tokens to keep from the previous chunk to provide context and continuity in processing.
+
+ Larger values provide continuity across chunks with minimal impact on latency.
+ Small values might fail to detect cross-chunk violations.
+ Specifying approximately 25% of `chunk_size` provides a good compromise.
+ - `50`
+
+* - streaming.stream_first
+  - When set to `False`, the toolkit applies the output rails to each chunk before streaming it to the client, so chunks of blocked content are never streamed.
+
+    By default, the toolkit streams the chunks as soon as possible, before applying output rails to them.
+ - `True`
+```
+
+The following table shows how the input length, chunk size, and context size determine the number of rails invocations.
+
+```{csv-table}
+:header: Input Length, Chunk Size, Context Size, Rails Invocations
+
+512,256,64,3
+600,256,64,3
+256,256,64,1
+1024,256,64,5
+1024,256,32,5
+1024,128,32,11
+512,128,32,5
+```
+
+Refer to [](../getting-started/5-output-rails/README.md#streaming-output) for a code sample.
+
### Retrieval Rails
Retrieval rails process the retrieved chunks, i.e., the `$relevant_chunks` variable.
diff --git a/docs/versions1.json b/docs/versions1.json
index 348caf8f4..c2e197536 100644
--- a/docs/versions1.json
+++ b/docs/versions1.json
@@ -1,7 +1,7 @@
[
{
"preferred": true,
- "version": "0.11.1",
- "url": "../0.11.1"
+ "version": "0.12.0",
+ "url": "../0.12.0"
}
]