ContextGem - Effortless LLM extraction from documents
====================================================================================================
Copyright (c) 2025 Shcherbak AI AS
All rights reserved
Developed by Sergii Shcherbak
This software is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# ==== Documentation Content ====
# ==== motivation ====
Why ContextGem?
***************
ContextGem is an LLM framework designed to strike the right balance
between ease of use, customizability, and accuracy for structured data
and insights extraction from documents.
ContextGem offers the **easiest and fastest way** to build LLM
extraction workflows for document analysis through powerful
abstractions of the most time-consuming parts.
⏱️ Development Overhead of Other Frameworks
===========================================
Most popular LLM frameworks for extracting structured data from
documents require extensive boilerplate code to extract even basic
information. As a developer using these frameworks, you're typically
expected to:
📝 Prompt Engineering
* Write custom prompts from scratch for each extraction scenario
* Maintain different prompt templates for different extraction
workflows
* Adapt prompts manually when extraction requirements change
🔧 Technical Implementation
* Define your own data models and implement validation logic
* Implement complex chaining for multi-LLM workflows
* Implement nested context extraction logic (*e.g. document > sections
> paragraphs > entities*)
* Configure text segmentation logic for correct reference mapping
* Configure concurrent I/O processing logic to speed up complex
extraction workflows
**Result:** All of this overhead significantly increases development
time and complexity.
💡 The ContextGem Solution
==========================
ContextGem addresses these challenges by providing a flexible,
intuitive framework that extracts structured data and insights from
documents with minimal effort. The most complex and time-consuming
parts are handled with **powerful abstractions**, eliminating
boilerplate code and reducing development overhead.
With ContextGem, you benefit from a "batteries included" approach,
coupled with simple, intuitive syntax.
ContextGem and Other Open-Source LLM Frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----+-----------------------------------------------+------------+----------------------+
| | Key built-in abstractions | **Context | Other frameworks* |
| | | Gem** | |
|=====|===============================================|============|======================|
| 💎 | **Automated dynamic prompts** Automatically | 🟢 | ◯ |
| | constructs comprehensive prompts for your | | |
| | specific extraction needs. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Automated data modelling and validators** | 🟢 | ◯ |
| | Automatically creates data models and | | |
| | validation logic. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Precise granular reference mapping | 🟢 | ◯ |
| | (paragraphs & sentences)** Automatically | | |
| | maps extracted data to the relevant parts of | | |
| | the document, which will always match in the | | |
| | source document, with customizable | | |
| | granularity. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Justifications (reasoning backing the | 🟢 | ◯ |
| | extraction)** Automatically provides | | |
| | justifications for each extraction, with | | |
| | customizable granularity. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Neural segmentation (SaT)** Automatically | 🟢 | ◯ |
| | segments the document into paragraphs and | | |
| | sentences using state-of-the-art SaT models, | | |
| | compatible with many languages. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Multilingual support (I/O without | 🟢 | ◯ |
| | prompting)** Supports multiple languages in | | |
| | input and output without additional | | |
| | prompting. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Single, unified extraction pipeline | 🟢 | 🟡 |
| | (declarative, reusable, fully serializable)** | | |
| | Allows to define a complete extraction | | |
| | workflow in a single, unified, reusable | | |
| | pipeline, using simple declarative syntax. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Grouped LLMs with role-specific tasks** | 🟢 | 🟡 |
| | Allows to easily group LLMs with different | | |
| | roles to process role- specific tasks in the | | |
| | pipeline. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Nested context extraction** Automatically | 🟢 | 🟡 |
| | manages nested context based on the pipeline | | |
| | definition (e.g. document > aspects > sub- | | |
| | aspects > concepts). | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Unified, fully serializable results storage | 🟢 | 🟡 |
| | model (document)** All extraction results | | |
| | are stored on the document object, including | | |
| | aspects, sub-aspects, and concepts. This | | |
| | object is fully serializable, and all the | | |
| | extraction results can be restored, with just | | |
| | one line of code. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Extraction task calibration with examples** | 🟢 | 🟡 |
| | Allows to easily define and attach output | | |
| | examples that guide the LLM's extraction | | |
| | behavior, without manually modifying prompts. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Built-in concurrent I/O processing** | 🟢 | 🟡 |
| | Automatically manages concurrent I/O | | |
| | processing to speed up complex extraction | | |
| | workflows, with a simple switch | | |
| | ("use_concurrency=True"). | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Automated usage & costs tracking** | 🟢 | 🟡 |
| | Automatically tracks usage (calls, tokens, | | |
| | costs) of all LLM calls. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Fallback and retry logic** Built-in retry | 🟢 | 🟢 |
| | logic and easily attachable fallback LLMs. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Multiple LLM providers** Compatible with a | 🟢 | 🟢 |
| | wide range of commercial and locally hosted | | |
| | LLMs. | | |
+-----+-----------------------------------------------+------------+----------------------+
🟢 - fully supported - no additional setup required
🟡 - partially supported - requires additional setup
◯ - not supported - requires custom logic
* See ContextGem and other frameworks for specific implementation
examples comparing ContextGem with other popular open-source LLM
frameworks. (Comparison as of 24 March 2025.)
🎯 Focused Approach
===================
ContextGem is intentionally optimized for **in-depth single-document
analysis** to deliver maximum extraction accuracy and precision. While
this focused approach enables superior results for individual
documents, ContextGem currently does not support cross-document
querying or corpus-wide information retrieval. For these use cases,
traditional RAG (Retrieval-Augmented Generation) systems over document
collections (e.g. LlamaIndex) remain more appropriate.
# ==== vs_other_frameworks ====
ContextGem and other frameworks
*******************************
Thanks to its powerful abstractions, ContextGem is the **easiest and
fastest way** to build LLM extraction workflows for document analysis.
✏️ Basic Example
================
Below is a basic example of an extraction workflow - *extraction of
anomalies from a document* - implemented side-by-side in ContextGem
and other frameworks. (All implementations are self-contained.
Comparison as of 24 March 2025.)
Even implementing this basic extraction workflow requires
significantly more effort in other frameworks:
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 📝 **Prompt engineering**: Crafting comprehensive prompts that guide
the LLM effectively
* 🔄 **Output parsing logic**: Setting up parsers to handle the LLM's
response
* 📄 **Reference mapping**: Writing custom logic for mapping
references in the source document
In contrast, ContextGem handles all these complexities automatically.
Users simply describe what to extract in natural language, provide
basic configuration parameters, and the framework takes care of the
rest.
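To make the "reference mapping" burden concrete, below is a minimal, illustrative sketch of the kind of custom logic a developer typically has to write when a framework relies on the LLM reciting its source sentence. (The function name "find_reference_span" is hypothetical and not taken from any framework; whitespace-tolerant matching is used because a model rarely reproduces the sentence byte-for-byte.)

```python
import re


def find_reference_span(source_text: str, recited_sentence: str) -> "tuple[int, int] | None":
    """Locate an LLM-recited sentence in the source document.

    Returns (start, end) character offsets into the original text,
    or None when the recitation does not match - a common failure mode
    that frameworks with automatic reference mapping avoid.
    """
    tokens = recited_sentence.split()
    if not tokens:
        return None
    # Tolerate arbitrary whitespace between tokens, since the model's
    # recitation may normalize spaces or line breaks.
    pattern = r"\s+".join(re.escape(t) for t in tokens)
    match = re.search(pattern, source_text)
    return (match.start(), match.end()) if match else None


text = "The term is 1 year.\nThe purple elephant danced   gracefully on the moon."
span = find_reference_span(text, "The purple elephant danced gracefully on the moon.")
```

Even this sketch only covers near-verbatim recitations; hallucinated or paraphrased references still silently return None and need further handling.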
-[ **ContextGem** ]-
⚡ Fastest way
ContextGem is the fastest and easiest way to implement an LLM
extraction workflow. All the boilerplate code is handled behind the
scenes.
**Major time savers:**
* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that
requires minimal code
* 📝 **Automatic prompt engineering**: ContextGem automatically
constructs a prompt tailored to the extraction task
* 🔄 **Automatic model definition**: ContextGem automatically defines
the Pydantic model for structured output
* 🧩 **Automatic output parsing**: ContextGem automatically parses the
LLM's response
* 🔍 **Automatic reference tracking**: Precise references are
automatically extracted and mapped to the original document
* 📏 **Flexible reference granularity**: References can be tracked at
different levels (paragraphs, sentences)
Anomaly extraction example (ContextGem)
# Quick Start Example - Extracting anomalies from a document, with source references and justifications
import os

from contextgem import Document, DocumentLLM, StringConcept

# Sample document text (shortened for brevity)
doc = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # 💎 anomaly
        "This agreement is governed by the laws of Norway...\n"
    ),
)

# Attach a document-level concept
doc.concepts = [
    StringConcept(
        name="Anomalies",  # in longer contexts, this concept is hard to capture with RAG
        description="Anomalies in the document",
        add_references=True,
        reference_depth="sentences",
        add_justifications=True,
        justification_depth="brief",
        # see the docs for more configuration options
    )
    # add more concepts to the document, if needed
    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
]
# Or use `doc.add_concepts([...])`

# Define an LLM for extracting information from the document
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or another provider/LLM
    api_key=os.environ.get(
        "CONTEXTGEM_OPENAI_API_KEY"
    ),  # your API key for the LLM provider
    # see the docs for more configuration options
)

# Extract information from the document
doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`

# Access extracted information in the document object
print(
    doc.concepts[0].extracted_items
)  # extracted items with references & justifications
# or `doc.get_concept_by_name("Anomalies").extracted_items`
-[ LangChain ]-
LangChain is a popular and versatile framework for building LLM
applications through composable components. It offers excellent
flexibility and a rich ecosystem of integrations. While powerful,
feature-rich, and widely adopted in the industry, it requires more
manual configuration and setup work for structured data extraction
tasks compared to ContextGem's streamlined approach.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
Anomaly extraction example (LangChain)
# LangChain implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Optional

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


# Pydantic models must be manually defined
class Anomaly(BaseModel):
    """An anomaly found in the document."""

    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    reference: str = Field(
        description="The sentence containing the anomaly"
    )  # LLM reciting a reference is error-prone and unreliable


class AnomaliesList(BaseModel):
    """List of anomalies found in the document."""

    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )


def extract_anomalies_with_langchain(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using LangChain.

    Args:
        document_text: The text content of the document
        api_key: OpenAI API key (defaults to environment variable)

    Returns:
        List of extracted anomalies with justifications and references
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=openai_api_key, temperature=0)

    # Create a parser for structured output
    parser = PydanticOutputParser(pydantic_object=AnomaliesList)

    # Prompt must be manually drafted
    # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
    template = dedent(
        """
        You are an expert document analyzer. Your task is to identify any anomalies in the document.
        Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
        with the rest of the document's context and purpose.

        Document:
        {document_text}

        Identify all anomalies in the document. For each anomaly, provide:
        1. The anomalous text
        2. A brief justification explaining why it's an anomaly
        3. The complete sentence containing the anomaly for reference

        {format_instructions}
        """
    )
    prompt = PromptTemplate(
        template=template,
        input_variables=["document_text"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Create a runnable chain
    chain = (
        {"document_text": lambda x: x}
        | RunnablePassthrough.assign()
        | prompt
        | llm
        | RunnableLambda(lambda x: parser.parse(x.content))
    )

    # Run the chain and extract anomalies
    parsed_output = chain.invoke(document_text)
    return parsed_output.anomalies


# Example usage
# Sample document text (shortened for brevity)
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
    "This agreement is governed by the laws of Norway...\n"
)

# Extract anomalies
anomalies = extract_anomalies_with_langchain(document_text)

# Print results
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
-[ LlamaIndex ]-
LlamaIndex is a powerful and versatile framework for building LLM
applications with data, particularly excelling at RAG workflows and
document retrieval. It offers a comprehensive set of tools for data
indexing and querying. While highly effective for its intended use
cases, for structured data extraction tasks (non-RAG setup), it
requires more manual configuration and setup work compared to
ContextGem's streamlined approach.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
Anomaly extraction example (LlamaIndex)
# LlamaIndex implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Optional

from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.openai import OpenAI
from pydantic import BaseModel, Field


# Pydantic models must be manually defined
class Anomaly(BaseModel):
    """An anomaly found in the document."""

    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    reference: str = Field(
        description="The sentence containing the anomaly"
    )  # LLM reciting a reference is error-prone and unreliable


class AnomaliesList(BaseModel):
    """List of anomalies found in the document."""

    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )


def extract_anomalies_with_llama_index(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using LlamaIndex.

    Args:
        document_text: The text content of the document
        api_key: OpenAI API key (defaults to environment variable)

    Returns:
        List of extracted anomalies with justifications and references
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)

    # Prompt must be manually drafted
    # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
    prompt_template = dedent(
        """
        You are an expert document analyzer. Your task is to identify any anomalies in the document.
        Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
        with the rest of the document's context and purpose.

        Document:
        {document_text}

        Identify all anomalies in the document. For each anomaly, provide:
        1. The anomalous text
        2. A brief justification explaining why it's an anomaly
        3. The complete sentence containing the anomaly for reference
        """
    )

    # Use PydanticOutputParser to directly parse the LLM output into our structured format
    program = LLMTextCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_cls=AnomaliesList),
        prompt_template_str=prompt_template,
        llm=llm,
        verbose=True,
    )

    # Execute the program
    try:
        result = program(document_text=document_text)
        return result.anomalies
    except Exception as e:
        print(f"Error parsing LLM response: {e}")
        return []


# Example usage
# Sample document text (shortened for brevity)
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
    "This agreement is governed by the laws of Norway...\n"
)

# Extract anomalies
anomalies = extract_anomalies_with_llama_index(document_text)

# Print results
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
-[ LlamaIndex (RAG) ]-
LlamaIndex with RAG setup is a powerful and sophisticated framework
for document retrieval and analysis, offering exceptional capabilities
for knowledge-intensive applications. Its comprehensive architecture
excels at handling complex document interactions and information
retrieval tasks across large document collections. While it provides
robust and versatile capabilities for building advanced document-based
applications, it does require more manual configuration and
specialized setup for structured extraction tasks compared to
ContextGem's streamlined and intuitive approach.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Complex reference mapping**: Getting precise references
correctly requires additional config, such as setting up a sentence
splitter, CitationQueryEngine, adjusting chunk sizes, etc.
Anomaly extraction example (LlamaIndex RAG)
# LlamaIndex (RAG) implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Any, Optional

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.base.response.schema import RESPONSE_TYPE
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.response_synthesizers.base import BaseSynthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.llms.openai import OpenAI
from pydantic import BaseModel, Field


# Pydantic models must be manually defined
class Anomaly(BaseModel):
    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    # This field will hold the citation info (e.g., node references)
    source_id: Optional[str] = Field(
        description="Automatically added source reference", default=None
    )


class AnomaliesList(BaseModel):
    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )


# Custom synthesizer that instructs the LLM to extract anomalies in JSON format.
class AnomalyExtractorSynthesizer(BaseSynthesizer):
    def __init__(self, llm=None, nodes=None):
        super().__init__()
        self._llm = llm or Settings.llm
        # Nodes are still provided in case additional context is needed.
        self._nodes = nodes or []

    def _get_prompts(self) -> dict[str, Any]:
        return {}

    def _update_prompts(self, prompts: dict[str, Any]):
        return

    async def aget_response(
        self, query_str: str, text_chunks: list[str], **kwargs: Any
    ) -> RESPONSE_TYPE:
        return self.get_response(query_str, text_chunks, **kwargs)

    def get_response(
        self, query_str: str, text_chunks: list[str], **kwargs: Any
    ) -> str:
        all_text = "\n".join(text_chunks)

        # Prompt must be manually drafted
        # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
        prompt_str = dedent(
            """
            You are an expert document analyzer. Your task is to identify anomalies in the document.
            Anomalies are statements or phrases that seem out of place or inconsistent with the document's context.

            Document:
            {all_text}

            For each anomaly, provide:
            1. The anomalous text (only the specific phrase).
            2. A brief justification for why it is an anomaly.

            Format your answer as a JSON object:
            {{
                "anomalies": [
                    {{
                        "text": "anomalous text",
                        "justification": "reason for anomaly",
                    }}
                ]
            }}
            """
        )
        print(prompt_str)

        output_parser = PydanticOutputParser(output_cls=AnomaliesList)
        response = self._llm.complete(prompt_str.format(all_text=all_text))
        try:
            parsed_response = output_parser.parse(response.text)
            self._last_anomalies = parsed_response
            return parsed_response.model_dump_json()
        except Exception as e:
            print(f"Error parsing LLM response: {e}")
            print(f"Raw response: {response.text}")
            return "{}"


def extract_anomalies_with_citations(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using LlamaIndex with citation support.

    Args:
        document_text: The content of the document.
        api_key: OpenAI API key (if not provided, read from environment variable).

    Returns:
        List of extracted anomalies with automatically added source references.
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)
    Settings.llm = llm

    # Create a Document and split it into nodes
    doc = Document(text=document_text)
    splitter = SentenceSplitter(
        paragraph_separator="\n",
        chunk_size=100,
        chunk_overlap=0,
    )
    nodes = splitter.get_nodes_from_documents([doc])
    print(f"Document split into {len(nodes)} nodes")

    # Build a vector index and retriever using all nodes.
    index = VectorStoreIndex(nodes)
    retriever = VectorIndexRetriever(index=index, similarity_top_k=len(nodes))

    # Create a custom synthesizer.
    synthesizer = AnomalyExtractorSynthesizer(llm=llm, nodes=nodes)

    # Initialize CitationQueryEngine by passing the expected components.
    citation_query_engine = CitationQueryEngine(
        retriever=retriever,
        llm=llm,
        response_synthesizer=synthesizer,
        citation_chunk_size=100,  # Adjust as needed
        citation_chunk_overlap=10,  # Adjust as needed
    )

    try:
        response = citation_query_engine.query(
            "Extract all anomalies from this document"
        )
        # If the synthesizer stored the anomalies, attach the citation info
        if hasattr(synthesizer, "_last_anomalies"):
            anomalies = synthesizer._last_anomalies.anomalies
            formatted_citations = (
                response.get_formatted_sources()
                if hasattr(response, "get_formatted_sources")
                else None
            )
            for anomaly in anomalies:
                anomaly.source_id = formatted_citations
            return anomalies
        return []
    except Exception as e:
        print(f"Error querying document: {e}")
        return []


# Example usage
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # anomaly
    "This agreement is governed by the laws of Norway...\n"
)

anomalies = extract_anomalies_with_citations(document_text)
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
-[ Instructor ]-
Instructor is a popular framework that specializes in structured data
extraction with LLMs using Pydantic. It offers excellent type safety
and validation capabilities, making it a solid choice for many
extraction tasks. While powerful for structured outputs, Instructor
requires more manual setup for document analysis workflows.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
Anomaly extraction example (Instructor)
# Instructor implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Optional
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
# Pydantic models must be manually defined
class Anomaly(BaseModel):
    """An anomaly found in the document."""

    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    source_text: str = Field(
        description="The sentence containing the anomaly"
    )  # Relying on the LLM to recite a reference verbatim is error-prone and unreliable


class AnomaliesList(BaseModel):
    """List of anomalies found in the document."""

    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )
def extract_anomalies_with_instructor(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using Instructor.

    Args:
        document_text: The text content of the document
        api_key: OpenAI API key (defaults to environment variable)

    Returns:
        List of extracted anomalies with justifications and references
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    client = OpenAI(api_key=openai_api_key)
    instructor_client = instructor.from_openai(client)

    # The prompt must be manually drafted.
    # This is a basic example, shortened for brevity; a production prompt
    # would need further refinement for better accuracy.
    prompt = dedent(
        f"""
        You are an expert document analyzer. Your task is to identify any anomalies in the document.
        Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
        with the rest of the document's context and purpose.

        Document:
        {document_text}

        Identify all anomalies in the document. For each anomaly, provide:
        1. The anomalous text - just the specific anomalous phrase
        2. A brief justification explaining why it's an anomaly
        3. The exact complete sentence containing the anomaly for reference

        Only identify real anomalies that truly don't belong in this type of document.
        """
    )

    # Extract structured data using Instructor
    response = instructor_client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=AnomaliesList,
        messages=[
            {"role": "system", "content": "You are an expert document analyzer."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.anomalies
# Example usage
# Sample document text (shortened for brevity)
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
    "This agreement is governed by the laws of Norway...\n"
)

# Extract anomalies
anomalies = extract_anomalies_with_instructor(document_text)

# Print results
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
🔬 Advanced Example
===================
As use cases grow more complex, the development overhead of
alternative frameworks becomes increasingly evident, while
ContextGem's abstractions deliver substantial time savings. As
extraction steps stack up, the implementation with other frameworks
quickly becomes *non-scalable*:
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts for
each extraction step
* 🔧 **Manual model definition**: Defining Pydantic validation models
for each element of extraction
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
* 📄 **Complex pipeline configuration**: Writing custom logic for
pipeline configuration and extraction components
* 📊 **Manual usage and cost tracking**: Implementing tracking
callbacks, which quickly increases in complexity when multiple LLMs
are used in the pipeline
* 🔄 **Complex concurrency setup**: Implementing complex concurrency
logic with asyncio
* 📝 **Embedding examples in prompts**: Writing output examples
directly in the custom prompts
* 📋 **Manual result aggregation**: Writing custom code to collect and
organize results
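To make the usage and cost tracking bullet concrete, here is the kind of accumulator that typically has to be hand-rolled when a framework offers no built-in tracking (a minimal sketch; the class, the model names, and the per-million-token prices are illustrative assumptions, not real rates or any framework's API):

```python
# Sketch of the usage and cost tracking that has to be hand-rolled in
# other frameworks. Model names and prices below are illustrative
# assumptions, not real rates.
PRICES_PER_1M_TOKENS = {
    "model-small": {"input": 0.15, "output": 0.60},
    "model-large": {"input": 2.50, "output": 10.00},
}


class CostTracker:
    """Accumulates token usage and cost across calls to multiple LLMs."""

    def __init__(self) -> None:
        self.usage: dict[str, dict[str, float]] = {}

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        price = PRICES_PER_1M_TOKENS[model]
        entry = self.usage.setdefault(
            model, {"input_tokens": 0, "output_tokens": 0, "cost": 0.0}
        )
        entry["input_tokens"] += input_tokens
        entry["output_tokens"] += output_tokens
        entry["cost"] += (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000

    @property
    def total_cost(self) -> float:
        return sum(entry["cost"] for entry in self.usage.values())
```

In a real pipeline, `record()` would be wired into a callback on every LLM response, one more piece of plumbing per provider and per model.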
Below is a more advanced example of an extraction workflow - *using an
extraction pipeline for multiple documents, with concurrency and cost
tracking* - implemented side-by-side in ContextGem and other
frameworks. (All implementations are self-contained. Comparison as of
24 March 2025.)
-[ **ContextGem** ]-
⚡ Fastest way
ContextGem is the fastest and easiest way to implement an LLM
extraction workflow. All the boilerplate code is handled behind the
scenes.
**Major time savers:**
* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that
requires minimal code
* 🔄 **Automatic model definition**: ContextGem automatically defines
the Pydantic model for structured output
* 📝 **Automatic prompt engineering**: ContextGem automatically
constructs a prompt tailored to the extraction task
* 🧩 **Automatic output parsing**: ContextGem automatically parses the
LLM's response
* 🔍 **Automatic reference tracking**: Precise references are
automatically extracted and mapped to the original document
* 📏 **Flexible reference granularity**: References can be tracked at
different levels (paragraphs, sentences)
* 📄 **Easy pipeline definition**: Simple, declarative syntax for
defining the extraction pipeline involving multiple LLMs, in a few
lines of code
* 💰 **Automated usage and cost tracking**: Built-in token counting
and cost calculation without additional setup
* 🔄 **Built-in concurrency**: Concurrent execution of extraction
steps with a simple switch "use_concurrency=True"
* 📊 **Easy example definition**: Output examples can be easily
defined without modifying any prompts
* 📋 **Built-in result aggregation**: Results are automatically
collected and organized in a unified storage model (document)
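For contrast, the concurrency that the "use_concurrency=True" switch provides out of the box is roughly what a hand-written asyncio helper like the following implements (a minimal sketch; `run_bounded` and its parameters are hypothetical names, not ContextGem's or any framework's API):

```python
# Hand-written asyncio boilerplate for running extraction steps with a
# bounded number in flight -- the kind of code a built-in concurrency
# switch replaces. Names here are hypothetical.
import asyncio
from collections.abc import Awaitable, Callable


async def run_bounded(
    factories: list[Callable[[], Awaitable]], max_concurrency: int = 5
) -> list:
    """Run coroutine factories, keeping at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _bounded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(_bounded(f) for f in factories))
```

A semaphore caps how many extraction steps run at once, while `asyncio.gather` preserves result order; error handling, retries, and rate-limit backoff would all still have to be added on top.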
Extraction pipeline example (ContextGem)
# Advanced Usage Example - analyzing multiple documents with a single pipeline,
# with different LLMs, concurrency and cost tracking
import os