-
Notifications
You must be signed in to change notification settings - Fork 845
[MEDI] Allow Pipeline and Reader to work with any Source #7090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…gestionDocumentReader
…umber of breaking changes
…ut for MarkItDownMcpReader
…ke TSource and transfrom it into TChunk(s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the data ingestion pipeline to be generic over source types, allowing it to work with any source type beyond just FileInfo and Stream. The key change introduces IIngestionDocumentReader<TSource> interface and converts IngestionPipeline to IngestionPipeline<TSource, TChunk>.
Key changes:
- Introduces
IIngestionDocumentReader<TSource>interface to enable generic source type support - Refactors
IngestionPipeline<T>toIngestionPipeline<TSource, TChunk>with two type parameters - Extracts file system-specific operations into
FileSystemIngestionExtensionsclass - Moves media type detection logic to
MediaTypeProviderutility class
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| IIngestionDocumentReader.cs | New generic interface for document readers with any source type |
| IngestionDocumentReader.cs | Updated abstract class to implement new interface and use MediaTypeProvider |
| IngestionPipeline.cs | Converted to generic over source type; file system operations moved to extensions |
| FileSystemIngestionExtensions.cs | New extension class with FileInfo-specific processing methods |
| MediaTypeProvider.cs | New utility class for media type detection, extracted from IngestionDocumentReader |
| MarkdownReader.cs, MarkItDownReader.cs, MarkItDownMcpReader.cs | Updated to make mediaType parameter nullable |
| Test files | Updated to use generic type parameters and new test cases |
Comments suppressed due to low confidence (1)
src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentReader.cs:1
- The
.markdownand.mdextensions mapping is duplicated in bothIngestionDocumentReader.cs(line 41) andMediaTypeProvider.cs(line 41). SinceMediaTypeProvideris now the centralized location for media type mappings and is being linked into this project, this duplication should be removed.
// Licensed to the .NET Foundation under one or more agreements.
The idea is to introduce a new interface, called
IIngestionDocumentReader, where the generic type parameter specifies the source. Source can be anything (FileInfo, Stream but also int or Guid or custom user type). It's up to the reader to get the document for given source (parse a file or read it from DB or somewhere else).The
IngestionDocumentReaderclass remains, we are still opinionated and believe that readers should implementFileInfoandStreamsupport. It implements the new interface.But we also know that it's not enough for all the scenarios, so Pipline no longer accepts the old reader class, but something that implements the new interface.
Because of that, the pipeline now needs to specify two generic type arguments instead of one. This is a breaking change.
Moreover, pipeline itself no longer defines methods for processing multiple files or directory. This functionality was moved to extension methods.
fixes #7082
Microsoft Reviewers: Open in CodeFlow