ContextGem - Effortless LLM extraction from documents
====================================================================================================
Copyright (c) 2025 Shcherbak AI AS
All rights reserved
Developed by Sergii Shcherbak
This software is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# ==== Documentation Content ====
# ==== motivation ====
Why ContextGem?
***************
ContextGem is an LLM framework designed to strike the right balance
between ease of use, customizability, and accuracy for structured data
and insights extraction from documents.
ContextGem offers the **easiest and fastest way** to build LLM
extraction workflows for document analysis through powerful
abstractions of the most time-consuming parts.
⏱️ Development Overhead of Other Frameworks
===========================================
Most popular LLM frameworks for extracting structured data from
documents require extensive boilerplate code to extract even basic
information. As a developer using these frameworks, you're typically
expected to:
📝 Prompt Engineering
* Write custom prompts from scratch for each extraction scenario
* Maintain different prompt templates for different extraction
workflows
* Adapt prompts manually when extraction requirements change
🔧 Technical Implementation
* Define your own data models and implement validation logic
* Implement complex chaining for multi-LLM workflows
* Implement nested context extraction logic (*e.g. document > sections
> paragraphs > entities*)
* Configure text segmentation logic for correct reference mapping
* Configure concurrent I/O processing logic to speed up complex
extraction workflows
**Result:** All of this overhead significantly increases development
time and complexity.
💡 The ContextGem Solution
==========================
ContextGem addresses these challenges by providing a flexible,
intuitive framework that extracts structured data and insights from
documents with minimal effort. The most complex and time-consuming
parts are handled with **powerful abstractions**, eliminating
boilerplate code and reducing development overhead.
With ContextGem, you benefit from a "batteries included" approach,
coupled with simple, intuitive syntax.
ContextGem and Other Open-Source LLM Frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----+-----------------------------------------------+------------+----------------------+
| | Key built-in abstractions | **Context | Other frameworks* |
| | | Gem** | |
|=====|===============================================|============|======================|
| 💎 | **Automated dynamic prompts** Automatically | 🟢 | ◯ |
| | constructs comprehensive prompts for your | | |
| | specific extraction needs. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Automated data modelling and validators** | 🟢 | ◯ |
| | Automatically creates data models and | | |
| | validation logic. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Precise granular reference mapping | 🟢 | ◯ |
| | (paragraphs & sentences)** Automatically | | |
| | maps extracted data to the relevant parts of | | |
| | the document, which will always match in the | | |
| | source document, with customizable | | |
| | granularity. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Justifications (reasoning backing the | 🟢 | ◯ |
| | extraction)** Automatically provides | | |
| | justifications for each extraction, with | | |
| | customizable granularity. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Neural segmentation (SaT)** Automatically | 🟢 | ◯ |
| | segments the document into paragraphs and | | |
| | sentences using state-of-the-art SaT models, | | |
| | compatible with many languages. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Multilingual support (I/O without | 🟢 | ◯ |
| | prompting)** Supports multiple languages in | | |
| | input and output without additional | | |
| | prompting. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Single, unified extraction pipeline | 🟢 | 🟡 |
| | (declarative, reusable, fully serializable)** | | |
| | Allows to define a complete extraction | | |
| | workflow in a single, unified, reusable | | |
| | pipeline, using simple declarative syntax. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Grouped LLMs with role-specific tasks** | 🟢 | 🟡 |
| | Allows to easily group LLMs with different | | |
| | roles to process role- specific tasks in the | | |
| | pipeline. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Nested context extraction** Automatically | 🟢 | 🟡 |
| | manages nested context based on the pipeline | | |
| | definition (e.g. document > aspects > sub- | | |
| | aspects > concepts). | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Unified, fully serializable results storage | 🟢 | 🟡 |
| | model (document)** All extraction results | | |
| | are stored on the document object, including | | |
| | aspects, sub-aspects, and concepts. This | | |
| | object is fully serializable, and all the | | |
| | extraction results can be restored, with just | | |
| | one line of code. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Extraction task calibration with examples** | 🟢 | 🟡 |
| | Allows to easily define and attach output | | |
| | examples that guide the LLM's extraction | | |
| | behavior, without manually modifying prompts. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Built-in concurrent I/O processing** | 🟢 | 🟡 |
| | Automatically manages concurrent I/O | | |
| | processing to speed up complex extraction | | |
| | workflows, with a simple switch | | |
| | ("use_concurrency=True"). | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Automated usage & costs tracking** | 🟢 | 🟡 |
| | Automatically tracks usage (calls, tokens, | | |
| | costs) of all LLM calls. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Fallback and retry logic** Built-in retry | 🟢 | 🟢 |
| | logic and easily attachable fallback LLMs. | | |
+-----+-----------------------------------------------+------------+----------------------+
| 💎 | **Multiple LLM providers** Compatible with a | 🟢 | 🟢 |
| | wide range of commercial and locally hosted | | |
| | LLMs. | | |
+-----+-----------------------------------------------+------------+----------------------+
🟢 - fully supported - no additional setup required
🟡 - partially supported - requires additional setup
◯ - not supported - requires custom logic
* See ContextGem and other frameworks for specific implementation
examples comparing ContextGem with other popular open-source LLM
frameworks. (Comparison as of 24 March 2025.)
🎯 Focused Approach
===================
ContextGem is intentionally optimized for **in-depth single-document
analysis** to deliver maximum extraction accuracy and precision. While
this focused approach enables superior results for individual
documents, ContextGem currently does not support cross-document
querying or corpus-wide information retrieval. For these use cases,
traditional RAG (Retrieval-Augmented Generation) systems over document
collections (e.g. LlamaIndex) remain more appropriate.
# ==== vs_other_frameworks ====
ContextGem and other frameworks
*******************************
Thanks to its powerful abstractions, ContextGem is the **easiest and
fastest way** to build LLM extraction workflows for document analysis.
✏️ Basic Example
================
Below is a basic example of an extraction workflow - *extraction of
anomalies from a document* - implemented side-by-side in ContextGem
and other frameworks. (All implementations are self-contained.
Comparison as of 24 March 2025.)
Even implementing this basic extraction workflow requires
significantly more effort in other frameworks:
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 📝 **Prompt engineering**: Crafting comprehensive prompts that guide
the LLM effectively
* 🔄 **Output parsing logic**: Setting up parsers to handle the LLM's
response
* 📄 **Reference mapping**: Writing custom logic for mapping
references in the source document
In contrast, ContextGem handles all these complexities automatically.
Users simply describe what to extract in natural language, provide
basic configuration parameters, and the framework takes care of the
rest.
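To make the "reference mapping" burden concrete, below is a minimal, illustrative sketch of the kind of custom logic a developer typically has to write when a framework relies on the LLM reciting its source sentence. (The function name "find_reference_span" is hypothetical and not taken from any framework; whitespace-tolerant matching is used because a model rarely reproduces the sentence byte-for-byte.)

```python
import re


def find_reference_span(source_text: str, recited_sentence: str) -> "tuple[int, int] | None":
    """Locate an LLM-recited sentence in the source document.

    Returns (start, end) character offsets into the original text,
    or None when the recitation does not match - a common failure mode
    that frameworks with automatic reference mapping avoid.
    """
    tokens = recited_sentence.split()
    if not tokens:
        return None
    # Tolerate arbitrary whitespace between tokens, since the model's
    # recitation may normalize spaces or line breaks.
    pattern = r"\s+".join(re.escape(t) for t in tokens)
    match = re.search(pattern, source_text)
    return (match.start(), match.end()) if match else None


text = "The term is 1 year.\nThe purple elephant danced   gracefully on the moon."
span = find_reference_span(text, "The purple elephant danced gracefully on the moon.")
```

Even this sketch only covers near-verbatim recitations; hallucinated or paraphrased references still silently return None and need further handling.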
-[ **ContextGem** ]-
⚡ Fastest way
ContextGem is the fastest and easiest way to implement an LLM
extraction workflow. All the boilerplate code is handled behind the
scenes.
**Major time savers:**
* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that
requires minimal code
* 📝 **Automatic prompt engineering**: ContextGem automatically
constructs a prompt tailored to the extraction task
* 🔄 **Automatic model definition**: ContextGem automatically defines
the Pydantic model for structured output
* 🧩 **Automatic output parsing**: ContextGem automatically parses the
LLM's response
* 🔍 **Automatic reference tracking**: Precise references are
automatically extracted and mapped to the original document
* 📏 **Flexible reference granularity**: References can be tracked at
different levels (paragraphs, sentences)
Anomaly extraction example (ContextGem)
# Quick Start Example - Extracting anomalies from a document, with source references and justifications
import os

from contextgem import Document, DocumentLLM, StringConcept

# Sample document text (shortened for brevity)
doc = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # 💎 anomaly
        "This agreement is governed by the laws of Norway...\n"
    ),
)

# Attach a document-level concept
doc.concepts = [
    StringConcept(
        name="Anomalies",  # in longer contexts, this concept is hard to capture with RAG
        description="Anomalies in the document",
        add_references=True,
        reference_depth="sentences",
        add_justifications=True,
        justification_depth="brief",
        # see the docs for more configuration options
    )
    # add more concepts to the document, if needed
    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
]
# Or use `doc.add_concepts([...])`

# Define an LLM for extracting information from the document
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or another provider/LLM
    api_key=os.environ.get(
        "CONTEXTGEM_OPENAI_API_KEY"
    ),  # your API key for the LLM provider
    # see the docs for more configuration options
)

# Extract information from the document
doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`

# Access extracted information in the document object
print(
    doc.concepts[0].extracted_items
)  # extracted items with references & justifications
# or `doc.get_concept_by_name("Anomalies").extracted_items`
-[ LangChain ]-
LangChain is a popular and versatile framework for building LLM
applications through composable components. It offers excellent
flexibility and a rich ecosystem of integrations. While powerful,
feature-rich, and widely adopted in the industry, it requires more
manual configuration and setup work for structured data extraction
tasks compared to ContextGem's streamlined approach.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
Anomaly extraction example (LangChain)
# LangChain implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Optional

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


# Pydantic models must be manually defined
class Anomaly(BaseModel):
    """An anomaly found in the document."""

    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    reference: str = Field(
        description="The sentence containing the anomaly"
    )  # LLM reciting a reference is error-prone and unreliable


class AnomaliesList(BaseModel):
    """List of anomalies found in the document."""

    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )


def extract_anomalies_with_langchain(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using LangChain.

    Args:
        document_text: The text content of the document
        api_key: OpenAI API key (defaults to environment variable)

    Returns:
        List of extracted anomalies with justifications and references
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=openai_api_key, temperature=0)

    # Create a parser for structured output
    parser = PydanticOutputParser(pydantic_object=AnomaliesList)

    # Prompt must be manually drafted
    # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
    template = dedent(
        """
        You are an expert document analyzer. Your task is to identify any anomalies in the document.
        Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
        with the rest of the document's context and purpose.

        Document:
        {document_text}

        Identify all anomalies in the document. For each anomaly, provide:
        1. The anomalous text
        2. A brief justification explaining why it's an anomaly
        3. The complete sentence containing the anomaly for reference

        {format_instructions}
        """
    )
    prompt = PromptTemplate(
        template=template,
        input_variables=["document_text"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Create a runnable chain
    chain = (
        {"document_text": lambda x: x}
        | RunnablePassthrough.assign()
        | prompt
        | llm
        | RunnableLambda(lambda x: parser.parse(x.content))
    )

    # Run the chain and extract anomalies
    parsed_output = chain.invoke(document_text)
    return parsed_output.anomalies


# Example usage
# Sample document text (shortened for brevity)
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
    "This agreement is governed by the laws of Norway...\n"
)

# Extract anomalies
anomalies = extract_anomalies_with_langchain(document_text)

# Print results
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
-[ LlamaIndex ]-
LlamaIndex is a powerful and versatile framework for building LLM
applications with data, particularly excelling at RAG workflows and
document retrieval. It offers a comprehensive set of tools for data
indexing and querying. While highly effective for its intended use
cases, for structured data extraction tasks (non-RAG setup), it
requires more manual configuration and setup work compared to
ContextGem's streamlined approach.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
Anomaly extraction example (LlamaIndex)
# LlamaIndex implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Optional

from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.llms.openai import OpenAI
from pydantic import BaseModel, Field


# Pydantic models must be manually defined
class Anomaly(BaseModel):
    """An anomaly found in the document."""

    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    reference: str = Field(
        description="The sentence containing the anomaly"
    )  # LLM reciting a reference is error-prone and unreliable


class AnomaliesList(BaseModel):
    """List of anomalies found in the document."""

    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )


def extract_anomalies_with_llama_index(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using LlamaIndex.

    Args:
        document_text: The text content of the document
        api_key: OpenAI API key (defaults to environment variable)

    Returns:
        List of extracted anomalies with justifications and references
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)

    # Prompt must be manually drafted
    # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
    prompt_template = dedent(
        """
        You are an expert document analyzer. Your task is to identify any anomalies in the document.
        Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
        with the rest of the document's context and purpose.

        Document:
        {document_text}

        Identify all anomalies in the document. For each anomaly, provide:
        1. The anomalous text
        2. A brief justification explaining why it's an anomaly
        3. The complete sentence containing the anomaly for reference
        """
    )

    # Use PydanticOutputParser to directly parse the LLM output into our structured format
    program = LLMTextCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_cls=AnomaliesList),
        prompt_template_str=prompt_template,
        llm=llm,
        verbose=True,
    )

    # Execute the program
    try:
        result = program(document_text=document_text)
        return result.anomalies
    except Exception as e:
        print(f"Error parsing LLM response: {e}")
        return []


# Example usage
# Sample document text (shortened for brevity)
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
    "This agreement is governed by the laws of Norway...\n"
)

# Extract anomalies
anomalies = extract_anomalies_with_llama_index(document_text)

# Print results
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
-[ LlamaIndex (RAG) ]-
LlamaIndex with RAG setup is a powerful and sophisticated framework
for document retrieval and analysis, offering exceptional capabilities
for knowledge-intensive applications. Its comprehensive architecture
excels at handling complex document interactions and information
retrieval tasks across large document collections. While it provides
robust and versatile capabilities for building advanced document-based
applications, it does require more manual configuration and
specialized setup for structured extraction tasks compared to
ContextGem's streamlined and intuitive approach.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Complex reference mapping**: Getting precise references
correctly requires additional config, such as setting up a sentence
splitter, CitationQueryEngine, adjusting chunk sizes, etc.
Anomaly extraction example (LlamaIndex RAG)
# LlamaIndex (RAG) implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Any, Optional

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.base.response.schema import RESPONSE_TYPE
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.response_synthesizers.base import BaseSynthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.llms.openai import OpenAI
from pydantic import BaseModel, Field


# Pydantic models must be manually defined
class Anomaly(BaseModel):
    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    # This field will hold the citation info (e.g., node references)
    source_id: Optional[str] = Field(
        description="Automatically added source reference", default=None
    )


class AnomaliesList(BaseModel):
    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )


# Custom synthesizer that instructs the LLM to extract anomalies in JSON format.
class AnomalyExtractorSynthesizer(BaseSynthesizer):
    def __init__(self, llm=None, nodes=None):
        super().__init__()
        self._llm = llm or Settings.llm
        # Nodes are still provided in case additional context is needed.
        self._nodes = nodes or []

    def _get_prompts(self) -> dict[str, Any]:
        return {}

    def _update_prompts(self, prompts: dict[str, Any]):
        return

    async def aget_response(
        self, query_str: str, text_chunks: list[str], **kwargs: Any
    ) -> RESPONSE_TYPE:
        return self.get_response(query_str, text_chunks, **kwargs)

    def get_response(
        self, query_str: str, text_chunks: list[str], **kwargs: Any
    ) -> str:
        all_text = "\n".join(text_chunks)

        # Prompt must be manually drafted
        # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
        prompt_str = dedent(
            """
            You are an expert document analyzer. Your task is to identify anomalies in the document.
            Anomalies are statements or phrases that seem out of place or inconsistent with the document's context.

            Document:
            {all_text}

            For each anomaly, provide:
            1. The anomalous text (only the specific phrase).
            2. A brief justification for why it is an anomaly.

            Format your answer as a JSON object:
            {{
                "anomalies": [
                    {{
                        "text": "anomalous text",
                        "justification": "reason for anomaly",
                    }}
                ]
            }}
            """
        )
        print(prompt_str)

        output_parser = PydanticOutputParser(output_cls=AnomaliesList)
        response = self._llm.complete(prompt_str.format(all_text=all_text))
        try:
            parsed_response = output_parser.parse(response.text)
            self._last_anomalies = parsed_response
            return parsed_response.model_dump_json()
        except Exception as e:
            print(f"Error parsing LLM response: {e}")
            print(f"Raw response: {response.text}")
            return "{}"


def extract_anomalies_with_citations(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using LlamaIndex with citation support.

    Args:
        document_text: The content of the document.
        api_key: OpenAI API key (if not provided, read from environment variable).

    Returns:
        List of extracted anomalies with automatically added source references.
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)
    Settings.llm = llm

    # Create a Document and split it into nodes
    doc = Document(text=document_text)
    splitter = SentenceSplitter(
        paragraph_separator="\n",
        chunk_size=100,
        chunk_overlap=0,
    )
    nodes = splitter.get_nodes_from_documents([doc])
    print(f"Document split into {len(nodes)} nodes")

    # Build a vector index and retriever using all nodes.
    index = VectorStoreIndex(nodes)
    retriever = VectorIndexRetriever(index=index, similarity_top_k=len(nodes))

    # Create a custom synthesizer.
    synthesizer = AnomalyExtractorSynthesizer(llm=llm, nodes=nodes)

    # Initialize CitationQueryEngine by passing the expected components.
    citation_query_engine = CitationQueryEngine(
        retriever=retriever,
        llm=llm,
        response_synthesizer=synthesizer,
        citation_chunk_size=100,  # Adjust as needed
        citation_chunk_overlap=10,  # Adjust as needed
    )

    try:
        response = citation_query_engine.query(
            "Extract all anomalies from this document"
        )
        # If the synthesizer stored the anomalies, attach the citation info
        if hasattr(synthesizer, "_last_anomalies"):
            anomalies = synthesizer._last_anomalies.anomalies
            formatted_citations = (
                response.get_formatted_sources()
                if hasattr(response, "get_formatted_sources")
                else None
            )
            for anomaly in anomalies:
                anomaly.source_id = formatted_citations
            return anomalies
        return []
    except Exception as e:
        print(f"Error querying document: {e}")
        return []


# Example usage
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # anomaly
    "This agreement is governed by the laws of Norway...\n"
)

anomalies = extract_anomalies_with_citations(document_text)
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
-[ Instructor ]-
Instructor is a popular framework that specializes in structured data
extraction with LLMs using Pydantic. It offers excellent type safety
and validation capabilities, making it a solid choice for many
extraction tasks. While powerful for structured outputs, Instructor
requires more manual setup for document analysis workflows.
**Development overhead:**
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic
validation models for structured output
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
Anomaly extraction example (Instructor)
# Instructor implementation for extracting anomalies from a document, with source references and justifications
import os
from textwrap import dedent
from typing import Optional
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
# Pydantic models must be manually defined
class Anomaly(BaseModel):
    """An anomaly found in the document."""

    text: str = Field(description="The anomalous text found in the document")
    justification: str = Field(
        description="Brief justification for why this is an anomaly"
    )
    source_text: str = Field(
        description="The sentence containing the anomaly"
    )  # Relying on the LLM to recite a reference verbatim is error-prone and unreliable


class AnomaliesList(BaseModel):
    """List of anomalies found in the document."""

    anomalies: list[Anomaly] = Field(
        description="List of anomalies found in the document"
    )
def extract_anomalies_with_instructor(
    document_text: str, api_key: Optional[str] = None
) -> list[Anomaly]:
    """
    Extract anomalies from a document using Instructor.

    Args:
        document_text: The text content of the document
        api_key: OpenAI API key (defaults to environment variable)

    Returns:
        List of extracted anomalies with justifications and references
    """
    openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
    client = OpenAI(api_key=openai_api_key)
    instructor_client = instructor.from_openai(client)

    # The prompt must be manually drafted.
    # This is a basic example, shortened for brevity; a production prompt
    # would need further refinement for better accuracy.
    prompt = dedent(
        f"""
        You are an expert document analyzer. Your task is to identify any anomalies in the document.
        Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
        with the rest of the document's context and purpose.

        Document:
        {document_text}

        Identify all anomalies in the document. For each anomaly, provide:
        1. The anomalous text - just the specific anomalous phrase
        2. A brief justification explaining why it's an anomaly
        3. The exact complete sentence containing the anomaly for reference

        Only identify real anomalies that truly don't belong in this type of document.
        """
    )

    # Extract structured data using Instructor
    response = instructor_client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=AnomaliesList,
        messages=[
            {"role": "system", "content": "You are an expert document analyzer."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.anomalies
# Example usage
# Sample document text (shortened for brevity)
document_text = (
    "Consultancy Agreement\n"
    "This agreement between Company A (Supplier) and Company B (Customer)...\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The Supplier shall provide consultancy services as described in Annex 2...\n"
    "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
    "This agreement is governed by the laws of Norway...\n"
)

# Extract anomalies
anomalies = extract_anomalies_with_instructor(document_text)

# Print results
for anomaly in anomalies:
    print(f"Anomaly: {anomaly}")
🔬 Advanced Example
===================
As use cases grow more complex, the development overhead of
alternative frameworks becomes increasingly evident, while
ContextGem's abstractions deliver substantial time savings. As
extraction steps stack up, the implementation with other frameworks
quickly becomes *non-scalable*:
* 📝 **Manual prompt engineering**: Crafting comprehensive prompts for
each extraction step
* 🔧 **Manual model definition**: Defining Pydantic validation models
for each element of extraction
* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
response
* 🔍 **Manual reference mapping**: Writing custom logic for mapping
references
* 📄 **Complex pipeline configuration**: Writing custom logic for
pipeline configuration and extraction components
* 📊 **Manual usage and cost tracking**: Implementing tracking
callbacks, which quickly increases in complexity when multiple LLMs
are used in the pipeline
* 🔄 **Complex concurrency setup**: Implementing complex concurrency
logic with asyncio
* 📝 **Embedding examples in prompts**: Writing output examples
directly in the custom prompts
* 📋 **Manual result aggregation**: Writing custom code to collect and
organize results
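To make the usage and cost tracking bullet concrete, here is the kind of accumulator that typically has to be hand-rolled when a framework offers no built-in tracking (a minimal sketch; the class, the model names, and the per-million-token prices are illustrative assumptions, not real rates or any framework's API):

```python
# Sketch of the usage and cost tracking that has to be hand-rolled in
# other frameworks. Model names and prices below are illustrative
# assumptions, not real rates.
PRICES_PER_1M_TOKENS = {
    "model-small": {"input": 0.15, "output": 0.60},
    "model-large": {"input": 2.50, "output": 10.00},
}


class CostTracker:
    """Accumulates token usage and cost across calls to multiple LLMs."""

    def __init__(self) -> None:
        self.usage: dict[str, dict[str, float]] = {}

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        price = PRICES_PER_1M_TOKENS[model]
        entry = self.usage.setdefault(
            model, {"input_tokens": 0, "output_tokens": 0, "cost": 0.0}
        )
        entry["input_tokens"] += input_tokens
        entry["output_tokens"] += output_tokens
        entry["cost"] += (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000

    @property
    def total_cost(self) -> float:
        return sum(entry["cost"] for entry in self.usage.values())
```

In a real pipeline, `record()` would be wired into a callback on every LLM response, one more piece of plumbing per provider and per model.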
Below is a more advanced example of an extraction workflow - *using an
extraction pipeline for multiple documents, with concurrency and cost
tracking* - implemented side-by-side in ContextGem and other
frameworks. (All implementations are self-contained. Comparison as of
24 March 2025.)
-[ **ContextGem** ]-
⚡ Fastest way
ContextGem is the fastest and easiest way to implement an LLM
extraction workflow. All the boilerplate code is handled behind the
scenes.
**Major time savers:**
* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that
requires minimal code
* 🔄 **Automatic model definition**: ContextGem automatically defines
the Pydantic model for structured output
* 📝 **Automatic prompt engineering**: ContextGem automatically
constructs a prompt tailored to the extraction task
* 🧩 **Automatic output parsing**: ContextGem automatically parses the
LLM's response
* 🔍 **Automatic reference tracking**: Precise references are
automatically extracted and mapped to the original document
* 📏 **Flexible reference granularity**: References can be tracked at
different levels (paragraphs, sentences)
* 📄 **Easy pipeline definition**: Simple, declarative syntax for
defining the extraction pipeline involving multiple LLMs, in a few
lines of code
* 💰 **Automated usage and cost tracking**: Built-in token counting
and cost calculation without additional setup
* 🔄 **Built-in concurrency**: Concurrent execution of extraction
steps with a simple switch "use_concurrency=True"
* 📊 **Easy example definition**: Output examples can be easily
defined without modifying any prompts
* 📋 **Built-in result aggregation**: Results are automatically
collected and organized in a unified storage model (document)
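For contrast, the concurrency that the "use_concurrency=True" switch provides out of the box is roughly what a hand-written asyncio helper like the following implements (a minimal sketch; `run_bounded` and its parameters are hypothetical names, not ContextGem's or any framework's API):

```python
# Hand-written asyncio boilerplate for running extraction steps with a
# bounded number in flight -- the kind of code a built-in concurrency
# switch replaces. Names here are hypothetical.
import asyncio
from collections.abc import Awaitable, Callable


async def run_bounded(
    factories: list[Callable[[], Awaitable]], max_concurrency: int = 5
) -> list:
    """Run coroutine factories, keeping at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _bounded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(_bounded(f) for f in factories))
```

A semaphore caps how many extraction steps run at once, while `asyncio.gather` preserves result order; error handling, retries, and rate-limit backoff would all still have to be added on top.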
Extraction pipeline example (ContextGem)
# Advanced Usage Example - analyzing multiple documents with a single pipeline,
# with different LLMs, concurrency and cost tracking
import os