feat(memory): document loaders, text splitter, and ingestion pipeline #558

Merged
bug-ops merged 5 commits into main from m30-document-loaders
Feb 18, 2026

@bug-ops bug-ops commented Feb 18, 2026

Summary

  • Add DocumentLoader trait with TextLoader (txt/md) and feature-gated PdfLoader in zeph-memory
  • Add TextSplitter with configurable chunk size, overlap, and sentence-aware splitting
  • Add IngestionPipeline for load -> split -> embed -> store via Qdrant
  • Include file size guard (50 MiB default) and path canonicalization for security
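To make the chunk size and overlap settings concrete, here is a minimal character-based sketch. It is not the `TextSplitter` from this PR: `SplitterConfig` and `split_chars` are hypothetical names, and sentence-aware splitting is left out for brevity.

```rust
/// Hypothetical configuration; field names are illustrative, not zeph-memory's API.
#[derive(Debug, Clone, Copy)]
pub struct SplitterConfig {
    /// Maximum number of characters per chunk.
    pub chunk_size: usize,
    /// Number of characters repeated from the end of the previous chunk.
    pub overlap: usize,
}

/// Split `text` into character-based chunks, each sharing `overlap`
/// characters with the chunk before it.
pub fn split_chars(text: &str, cfg: SplitterConfig) -> Vec<String> {
    assert!(cfg.chunk_size > 0, "chunk_size must be positive");
    assert!(cfg.overlap < cfg.chunk_size, "overlap must be smaller than chunk_size");

    // Work on chars rather than bytes so multi-byte UTF-8 is never cut in half.
    let chars: Vec<char> = text.chars().collect();
    let step = cfg.chunk_size - cfg.overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + cfg.chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the last chunk reached the end of the input
        }
        start += step;
    }
    chunks
}

fn main() {
    let cfg = SplitterConfig { chunk_size: 10, overlap: 3 };
    for (i, chunk) in split_chars("The quick brown fox jumps.", cfg).iter().enumerate() {
        println!("{i}: {chunk:?}");
    }
}
```

Requiring `overlap < chunk_size` guarantees that each step advances by at least one character, so the loop always terminates.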

Issues

Closes #469, Closes #470, Closes #471, Closes #472
Refs #478

Test plan

  • TextLoader: plain text, markdown, empty file, unknown extension, canonical paths
  • TextSplitter: sentence-aware, char-based, overlap, edge cases (13 tests)
  • PdfLoader feature gate compiles with --features pdf
  • clippy, fmt, nextest (1647 tests pass)

@github-actions github-actions bot added the enhancement, documentation, memory, rust, dependencies, and size/XL labels and removed the enhancement label Feb 18, 2026
@bug-ops bug-ops force-pushed the m30-document-loaders branch 2 times, most recently from e5dd76a to aca6e87 Compare February 18, 2026 19:43
@github-actions github-actions bot added the ci label Feb 18, 2026
…line

Introduce document loading subsystem in zeph-memory with DocumentLoader
trait, TextLoader (txt/md), TextSplitter with sentence-aware chunking,
and IngestionPipeline that integrates with Qdrant vector store.
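The load -> split -> embed -> store flow can be pictured with a small synchronous sketch. Every name here (`Loader`, `Embedder`, `Store`, `Pipeline`, and the toy implementations) is a hypothetical stand-in, not the crate's API. The real pipeline talks to Qdrant and is probably async, and this version splits into fixed-size chunks with no overlap.

```rust
/// All traits and types below are hypothetical stand-ins for illustration.
pub trait Loader {
    fn load(&self, source: &str) -> Result<String, String>;
}
pub trait Embedder {
    fn embed(&self, chunk: &str) -> Vec<f32>;
}
pub trait Store {
    fn upsert(&mut self, id: usize, vector: Vec<f32>, text: String);
}

pub struct Pipeline<L, E, S> {
    pub loader: L,
    pub embedder: E,
    pub store: S,
    pub chunk_size: usize,
}

impl<L: Loader, E: Embedder, S: Store> Pipeline<L, E, S> {
    /// Load the source, split it into fixed-size chunks, embed each chunk,
    /// and store it. Returns the number of chunks stored.
    pub fn ingest(&mut self, source: &str) -> Result<usize, String> {
        if self.chunk_size == 0 {
            return Err("chunk_size must be positive".to_string());
        }
        let text = self.loader.load(source)?;
        let chars: Vec<char> = text.chars().collect();
        let mut count = 0;
        for (id, piece) in chars.chunks(self.chunk_size).enumerate() {
            let chunk: String = piece.iter().collect();
            let vector = self.embedder.embed(&chunk);
            self.store.upsert(id, vector, chunk);
            count += 1;
        }
        Ok(count)
    }
}

/// Toy loader: treats the `source` argument as the document text itself.
pub struct InlineLoader;
impl Loader for InlineLoader {
    fn load(&self, source: &str) -> Result<String, String> {
        Ok(source.to_string())
    }
}

/// Toy embedder: a one-dimensional "embedding" holding the chunk's byte length.
pub struct LenEmbedder;
impl Embedder for LenEmbedder {
    fn embed(&self, chunk: &str) -> Vec<f32> {
        vec![chunk.len() as f32]
    }
}

/// Toy store: keeps points in memory instead of a vector database.
#[derive(Default)]
pub struct MemoryStore {
    pub points: Vec<(usize, Vec<f32>, String)>,
}
impl Store for MemoryStore {
    fn upsert(&mut self, id: usize, vector: Vec<f32>, text: String) {
        self.points.push((id, vector, text));
    }
}

fn main() {
    let mut pipeline = Pipeline {
        loader: InlineLoader,
        embedder: LenEmbedder,
        store: MemoryStore::default(),
        chunk_size: 16,
    };
    let stored = pipeline.ingest("Document loaders feed the ingestion pipeline.").unwrap();
    println!("stored {stored} chunks");
}
```

Keeping the three stages behind traits lets a test swap the vector store for an in-memory one, which is the main reason to structure a sketch this way.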

Add feature-gated PdfLoader behind `pdf` feature using pdf-extract.
Include file size guard (50 MiB) and path canonicalization for security.

Closes #469, #470, #471, #472
Refs #478

Match TextLoader pattern with per-instance max_file_size field
defaulting to DEFAULT_MAX_FILE_SIZE (50 MiB).
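As a rough illustration of a per-instance size guard combined with path canonicalization, the sketch below uses only the standard library. The `max_file_size` field and `DEFAULT_MAX_FILE_SIZE` constant are named after the commit message. The `GuardedTextLoader` type, its `load` signature, and the error handling are assumptions, not the code merged here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// 50 MiB, the default stated in this PR.
pub const DEFAULT_MAX_FILE_SIZE: u64 = 50 * 1024 * 1024;

/// Hypothetical loader type; only the field and constant names come from the PR text.
pub struct GuardedTextLoader {
    pub max_file_size: u64,
}

impl Default for GuardedTextLoader {
    fn default() -> Self {
        Self { max_file_size: DEFAULT_MAX_FILE_SIZE }
    }
}

impl GuardedTextLoader {
    /// Resolve the path, reject files over the limit, then read the file as UTF-8.
    pub fn load(&self, path: &Path) -> io::Result<(PathBuf, String)> {
        // canonicalize resolves symlinks and `..` components,
        // and fails if the path does not exist.
        let canonical = fs::canonicalize(path)?;
        let size = fs::metadata(&canonical)?.len();
        if size > self.max_file_size {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("file is {size} bytes, limit is {} bytes", self.max_file_size),
            ));
        }
        let text = fs::read_to_string(&canonical)?;
        Ok((canonical, text))
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("guarded_loader_demo.txt");
    fs::write(&path, "hello from the loader")?;
    let (resolved, text) = GuardedTextLoader::default().load(&path)?;
    println!("{} -> {} bytes", resolved.display(), text.len());
    fs::remove_file(&path)?;
    Ok(())
}
```

The size is checked from metadata before reading, so a file that grows between the check and the read could still exceed the limit. A stricter variant would also cap the number of bytes actually read.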

…or ingestion pipeline

Add 4 property-based tests for TextSplitter: never panics on arbitrary
input, chunks cover all content, indices are sequential, no empty chunks.

Add 5 integration tests using testcontainers Qdrant for IngestionPipeline:
single/empty/multi-chunk ingest, load_and_ingest from file, payload
verification.

Add docs/src/guide/document-loaders.md covering DocumentLoader trait,
TextLoader, PdfLoader, TextSplitter, and IngestionPipeline.

Update architecture/crates.md, feature-flags.md, SUMMARY.md,
zeph-memory README, and root README with pdf feature flag.

Integration tests require Docker (testcontainers) and fail in the
snapshot check job which lacks Docker. Limit insta test to --lib --bins
since all snapshots are in inline cfg(test) modules.
@bug-ops bug-ops force-pushed the m30-document-loaders branch from c1d7c70 to f37b390 Compare February 18, 2026 20:03
@bug-ops bug-ops enabled auto-merge (squash) February 18, 2026 20:04
@bug-ops bug-ops merged commit d933e5b into main Feb 18, 2026
20 checks passed
@bug-ops bug-ops deleted the m30-document-loaders branch February 18, 2026 20:14

Labels

ci, dependencies, documentation, enhancement, memory, rust, size/XL