diff --git a/docs/execution/best-practices.md b/docs/execution/best-practices.md
new file mode 100644
index 00000000..0f4480f3
--- /dev/null
+++ b/docs/execution/best-practices.md
@@ -0,0 +1,27 @@
+# Best Practices for DocETL
+
+This guide outlines key best practices for using DocETL effectively, focusing on the most important aspects of pipeline creation, execution, and optimization.
+
+## Pipeline Design
+
+1. **Start Simple**: Begin with a basic pipeline and add complexity only as needed.
+2. **Modular Design**: Break down complex tasks into smaller, manageable operations.
+3. **Optimize Incrementally**: Optimize one operation at a time so that the pipeline stays stable and each improvement can be verified.
+
+## Schema and Prompt Design
+
+1. **Keep Schemas Simple**: Use simple output schemas whenever possible. Complex nested structures can be difficult for LLMs to produce consistently.
+2. **Clear and Concise Prompts**: Write clear, concise prompts for LLM operations and include relevant context from the input data. State the quantities you expect (e.g., 2-3 insights, one summary) to guide the LLM.
+3. **Take Advantage of Jinja Templating**: Use Jinja templating to generate prompts dynamically and provide context to the LLM. If statements, loops, and other Jinja features can all be used to customize prompts.
+4. **Validate Outputs**: Use the `validate` field to check the quality and correctness of processed data. The field holds Python statements that are evaluated against the output; if one or more statements fail, the LLM call can optionally be retried.
+
+## Handling Large Documents and Entity Resolution
+
+1. **Chunk Large Inputs**: For documents exceeding token limits, consider using the optimizer to automatically chunk inputs.
+2. **Use Resolve Operations**: Run resolve operations before reduce operations when the data contains similar entities. Write the comparison prompts carefully, because optimizer-synthesized prompts are often too generic to guide the LLM well.
+
+## Optimization and Execution
+
+1. **Use the Optimizer**: Leverage DocETL's optimizer for complex pipelines or when dealing with large documents.
+2. **Leverage Caching**: Take advantage of DocETL's caching mechanism to avoid redundant computations.
+3. **Monitor Resource Usage**: Keep an eye on API costs and processing time, especially when optimizing.
diff --git a/mkdocs.yml b/mkdocs.yml
index 8d35f59f..6d3f8c2b 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -18,8 +18,8 @@ nav:
       - Installation: installation.md
       - Tutorial: tutorial.md
   - Core Concepts:
-      - Operators: concepts/operators.md
-      - Schemas: concepts/schemas.md
+      - Operators & Validation: concepts/operators.md
+      - Output Schemas: concepts/schemas.md
       - Pipelines: concepts/pipelines.md
       - Optimization: concepts/optimization.md
   - LLM-Powered Operators:
@@ -36,6 +36,7 @@ nav:
   - Execution:
       - Running Pipelines: execution/running-pipelines.md
       - Optimizing Pipelines: execution/optimizing-pipelines.md
+      - Best Practices: execution/best-practices.md
   # - Advanced Usage:
   #     - User-Defined Functions: advanced/custom-operators.md
   #     - Extending Optimizer Agents: advanced/extending-agents.md
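
Note on the **Validate Outputs** item in the new guide: the behavior it describes (Python statements checked against each output, with an optional retry when any of them fail) can be illustrated with a small, self-contained sketch. This is not DocETL's implementation. The names `passes_validation`, `run_with_retries`, and `fake_llm`, the `max_retries` parameter, and the use of `eval` with a restricted namespace are all assumptions made for illustration only.

```python
from typing import Any, Callable, Dict, List, Optional


def passes_validation(output: Dict[str, Any], statements: List[str]) -> bool:
    """Return True only if every statement is truthy for this output.

    Hypothetical helper: each statement is a Python expression that may
    refer to the produced record as `output`.
    """
    # Restricted namespace: only `len` and `output` are visible (an assumption).
    namespace = {"__builtins__": {}, "len": len, "output": output}
    return all(bool(eval(stmt, namespace)) for stmt in statements)


def run_with_retries(
    call_llm: Callable[[int], Dict[str, Any]],
    statements: List[str],
    max_retries: int = 2,
) -> Optional[Dict[str, Any]]:
    """Call `call_llm` until its output passes validation or retries run out."""
    for attempt in range(max_retries + 1):
        output = call_llm(attempt)
        if passes_validation(output, statements):
            return output
    return None  # every attempt failed validation


def fake_llm(attempt: int) -> Dict[str, Any]:
    """Stand-in for an LLM call: too few insights first, enough afterwards."""
    if attempt == 0:
        return {"insights": ["only one"]}
    return {"insights": ["first", "second"]}


# Example rule matching the guide's advice to state quantities (2-3 insights).
rules = ['2 <= len(output["insights"]) <= 3']
result = run_with_retries(fake_llm, rules)
```

Here the first stubbed response fails the quantity check and the second passes, so `result` holds the two-insight record. Keeping each statement a single boolean expression over `output` makes failures easy to trace back to one rule.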