intsig-textin/langchain-xparse

langchain-xparse

LangChain integration with the xParse Pipeline API for document parsing, chunking, and embedding. Only the parse, chunk, and embed stages are supported; the extract stage is not supported by this loader.

Installation

From PyPI:

pip install langchain-xparse

Local editable install:

pip install -e .

Configuration

Set your TextIn credentials (available from the TextIn Workspace) as environment variables:

export XPARSE_APP_ID="your-app-id"
export XPARSE_SECRET_CODE="your-secret-code"
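
Before constructing a loader, it can help to fail fast when a credential is missing. The following is a minimal sketch and not part of langchain-xparse: `missing_credentials` is a hypothetical helper, and it assumes only the two environment variable names shown above.

```python
import os

# Environment variable names taken from the export lines above.
REQUIRED_VARS = ("XPARSE_APP_ID", "XPARSE_SECRET_CODE")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of langchain-xparse.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Credentials found in the environment.")
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to exercise in tests.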

Or pass them when creating the loader:

from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="doc.pdf",
    app_id="your-app-id",
    secret_code="your-secret-code",
)

Usage

Basic (parse only)

from langchain_xparse import XParseLoader

loader = XParseLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)  # source, category, element_id, filename, page_number, ...

Lazy load

for doc in loader.lazy_load():
    print(doc.page_content[:50])  # replace with your own processing

Async

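# "async for" must run inside a coroutine (or a notebook/REPL with top-level await).
async for doc in loader.alazy_load():
    print(doc.page_content[:50])  # replace with your own processing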
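
In a regular script, an `async for` loop has to live inside a coroutine that is driven by an event loop. The sketch below shows that pattern with a stand-in async generator so it runs without credentials; `fake_alazy_load` and `collect` are hypothetical names, and the stand-in yields plain strings rather than real documents.

```python
import asyncio


async def fake_alazy_load():
    """Stand-in for loader.alazy_load(); yields plain strings, not Documents."""
    for text in ("first chunk", "second chunk"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text


async def collect(async_iterable):
    """Drain an async iterator into a list."""
    return [item async for item in async_iterable]


if __name__ == "__main__":
    print(asyncio.run(collect(fake_alazy_load())))
```

With a real loader, `collect(loader.alazy_load())` would be the equivalent call inside `asyncio.run`.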

Convenience params (parse + chunk, or parse + chunk + embed)

loader = XParseLoader(
    file_path="doc.pdf",
    parse_provider="textin",
    chunk_strategy="by_title",
    chunk_max_characters=500,
    chunk_overlap=50,
)
# Or with embed:
loader = XParseLoader(
    file_path="doc.pdf",
    parse_provider="textin",
    chunk_strategy="basic",
    chunk_max_characters=1000,
    embed_provider="qwen",
    embed_model_name="text-embedding-v4",
)
docs = loader.load()

Custom stages (advanced)

loader = XParseLoader(
    file_path="doc.pdf",
    stages=[
        {"type": "parse", "config": {"provider": "textin"}},
        {"type": "chunk", "config": {"strategy": "by_page", "max_characters": 800}},
    ],
)
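
Because this loader supports only the parse, chunk, and embed stages, a stages list can be checked before it is passed in. This is a sketch of a hypothetical pre-flight check (`check_stages` is not part of langchain-xparse, and the loader's own validation may behave differently); it assumes each stage is a dict with a "type" key, as in the example above.

```python
# Stage types the README says this loader supports; "extract" is excluded.
SUPPORTED_STAGE_TYPES = ("parse", "chunk", "embed")


def check_stages(stages):
    """Raise ValueError if any stage has an unsupported or missing type.

    Hypothetical pre-flight check; the loader's own validation may differ.
    """
    for index, stage in enumerate(stages):
        stage_type = stage.get("type")
        if stage_type not in SUPPORTED_STAGE_TYPES:
            raise ValueError(
                f"stage {index}: unsupported type {stage_type!r}; "
                f"expected one of {SUPPORTED_STAGE_TYPES}"
            )
    return stages


stages = check_stages([
    {"type": "parse", "config": {"provider": "textin"}},
    {"type": "chunk", "config": {"strategy": "by_page", "max_characters": 800}},
])
```

The validated list can then be passed as `stages=stages` when constructing the loader.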

Multiple files

loader = XParseLoader(file_path=["a.pdf", "b.pdf"])
for doc in loader.lazy_load():
    print(doc.metadata.get("source"), doc.page_content[:50])
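
When several files are loaded in one pass, the resulting documents can be regrouped per file afterwards. The sketch below assumes that `metadata["source"]` identifies the originating file, which the README does not state explicitly; `group_by_source` is a hypothetical helper, and `SimpleNamespace` objects stand in for the documents a loader would yield.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_source(docs):
    """Map each metadata["source"] value to the list of page_content strings."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source")].append(doc.page_content)
    return dict(grouped)


# Stand-ins for the Document objects a loader would yield.
sample_docs = [
    SimpleNamespace(page_content="intro", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="summary", metadata={"source": "b.pdf"}),
    SimpleNamespace(page_content="details", metadata={"source": "a.pdf"}),
]
grouped = group_by_source(sample_docs)
```

With a real loader, `group_by_source(loader.lazy_load())` would consume the iterator once and keep the per-file order in which documents arrived.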

File-like object

When passing a file-like object instead of a path, you must set metadata_filename:

with open("doc.pdf", "rb") as f:
    loader = XParseLoader(file=f, metadata_filename="doc.pdf")
    docs = loader.load()
