LangChain integration with xParse Pipeline API for document parsing, chunking and embedding. Supports parse / chunk / embed stages only (extract is not supported in this loader).
From PyPI:

pip install langchain-xparse

Local editable install:

pip install -e .

Set your TextIn credentials (from Textin Workspace):

export XPARSE_APP_ID="your-app-id"
export XPARSE_SECRET_CODE="your-secret-code"

Or pass them when creating the loader:

from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="doc.pdf",
    app_id="your-app-id",
    secret_code="your-secret-code",
)

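The two ways of supplying credentials can be checked before a loader is built. The sketch below is a hypothetical pre-flight helper, not part of langchain-xparse: `resolve_credentials` is an invented name, and the rule that explicit arguments take precedence over the environment variables is an assumption rather than documented loader behavior.

```python
import os


def resolve_credentials(app_id=None, secret_code=None):
    """Hypothetical helper: return (app_id, secret_code) or raise if either is missing.

    Assumption: explicit arguments win over XPARSE_APP_ID / XPARSE_SECRET_CODE.
    """
    app_id = app_id or os.environ.get("XPARSE_APP_ID")
    secret_code = secret_code or os.environ.get("XPARSE_SECRET_CODE")

    # Collect the names of whichever values are still unset.
    missing = [
        name
        for name, value in (
            ("XPARSE_APP_ID", app_id),
            ("XPARSE_SECRET_CODE", secret_code),
        )
        if not value
    ]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return app_id, secret_code


# Assumed usage with the loader shown above:
# app_id, secret_code = resolve_credentials()
# loader = XParseLoader(file_path="doc.pdf", app_id=app_id, secret_code=secret_code)
```

Failing early with a clear message can be easier to debug than an authentication error returned by the remote API.
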
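# Load all documents at once
loader = XParseLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)  # source, category, element_id, filename, page_number, ...

# Lazy loading: stream documents one at a time
for doc in loader.lazy_load():
    ...  # process(doc)

# Async lazy loading: "async for" must run inside a coroutine
import asyncio

async def main():
    async for doc in loader.alazy_load():
        ...  # process(doc)

asyncio.run(main())

# Parse and chunk:
loader = XParseLoader(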
    file_path="doc.pdf",
    parse_provider="textin",
    chunk_strategy="by_title",
    chunk_max_characters=500,
    chunk_overlap=50,
)
# Or with embed:
loader = XParseLoader(
    file_path="doc.pdf",
    parse_provider="textin",
    chunk_strategy="basic",
    chunk_max_characters=1000,
    embed_provider="qwen",
    embed_model_name="text-embedding-v4",
)
docs = loader.load()

# Or define the pipeline stages explicitly:
loader = XParseLoader(
    file_path="doc.pdf",
    stages=[
        {"type": "parse", "config": {"provider": "textin"}},
        {"type": "chunk", "config": {"strategy": "by_page", "max_characters": 800}},
    ],
)

# Multiple files:
loader = XParseLoader(file_path=["a.pdf", "b.pdf"])
for doc in loader.lazy_load():
    print(doc.metadata.get("source"), doc.page_content[:50])

When passing a file-like object instead of a path, you must set metadata_filename:

with open("doc.pdf", "rb") as f:
    loader = XParseLoader(file=f, metadata_filename="doc.pdf")
    # Load while the file is still open
    docs = loader.load()
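
The file-like case also covers content that never touches disk, such as bytes downloaded from a URL. An in-memory buffer carries no file name of its own, which is one plausible reason metadata_filename is required. The sketch below is illustrative only: `in_memory_loader_kwargs` is a hypothetical helper, and it assumes the loader accepts an `io.BytesIO` wherever it accepts an open binary file. Only the `file` and `metadata_filename` argument names come from the example above.

```python
import io


def in_memory_loader_kwargs(data: bytes, filename: str) -> dict:
    """Hypothetical helper: wrap raw bytes as keyword arguments for a file-like load."""
    if not filename:
        # The loader requires a name for file-like input, so refuse to guess one.
        raise ValueError("a filename is required for file-like input")
    return {"file": io.BytesIO(data), "metadata_filename": filename}


kwargs = in_memory_loader_kwargs(b"%PDF-1.4 example bytes", "report.pdf")

# Assumed usage (requires langchain-xparse and valid credentials):
# loader = XParseLoader(**kwargs)
# docs = loader.load()
```
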