@adamsitnik adamsitnik commented Nov 27, 2025

The idea is to introduce a new interface, IIngestionDocumentReader<TSource>, whose generic type parameter specifies the source. The source can be anything: FileInfo or Stream, but also int, Guid, or a custom user type. It's up to the reader to obtain the document for the given source, whether by parsing a file, reading it from a database, or fetching it from somewhere else.

public interface IIngestionDocumentReader<TSource>
{
    Task<IngestionDocument> ReadAsync(TSource source, string identifier, string? mediaType = null, CancellationToken cancellationToken = default);
}
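
To illustrate a non-file source, here is a minimal sketch of a reader keyed by Guid. It is not code from this PR: InMemoryGuidReader, the dictionary-backed store, and the Text property are hypothetical, and IngestionDocument is reduced to a small stand-in so the snippet compiles on its own (the real class has a different, richer shape).

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the library's IngestionDocument (assumption: simplified for this sketch).
public sealed class IngestionDocument
{
    public IngestionDocument(string identifier) => Identifier = identifier;

    public string Identifier { get; }

    // Hypothetical property used only by this sketch to carry the content.
    public string? Text { get; init; }
}

// The interface as proposed in this PR, repeated so the sketch is self-contained.
public interface IIngestionDocumentReader<TSource>
{
    Task<IngestionDocument> ReadAsync(TSource source, string identifier, string? mediaType = null, CancellationToken cancellationToken = default);
}

// Hypothetical reader whose source is a key (for example a database id) rather than a file.
public sealed class InMemoryGuidReader : IIngestionDocumentReader<Guid>
{
    private readonly IReadOnlyDictionary<Guid, string> _store;

    public InMemoryGuidReader(IReadOnlyDictionary<Guid, string> store) => _store = store;

    public Task<IngestionDocument> ReadAsync(Guid source, string identifier, string? mediaType = null, CancellationToken cancellationToken = default)
    {
        cancellationToken.ThrowIfCancellationRequested();

        // Look the content up by key; a real reader might query a database here.
        if (!_store.TryGetValue(source, out string? text))
        {
            throw new KeyNotFoundException($"No document stored for key {source}.");
        }

        return Task.FromResult(new IngestionDocument(identifier) { Text = text });
    }
}
```

A caller would construct the reader with its store and pass a Guid wherever a FileInfo or Stream was required before.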

The IngestionDocumentReader class remains: we are still opinionated and believe that readers should support FileInfo and Stream. It now implements the new interface.
But we also know that this is not enough for all scenarios, so the pipeline no longer accepts the old reader class; it accepts anything that implements the new interface.

Because of that, the pipeline now takes two generic type arguments instead of one (IngestionPipeline<T> becomes IngestionPipeline<TSource, TChunk>). This is a breaking change.
Moreover, the pipeline itself no longer defines methods for processing multiple files or a directory. This functionality was moved to extension methods.
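
The split between a source-agnostic pipeline and file-system helpers can be sketched roughly as follows. Everything here is an assumption made for illustration: ISourcePipeline<TSource>, ProcessDirectoryAsync, and RecordingPipeline are hypothetical names, not the signatures introduced by this PR.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the pipeline: it only knows how to process one TSource.
public interface ISourcePipeline<TSource>
{
    Task ProcessAsync(TSource source, string identifier, CancellationToken cancellationToken = default);
}

// File-system convenience lives outside the pipeline, as an extension method
// that is only available when TSource is FileInfo.
public static class FileSystemSketchExtensions
{
    public static async Task<int> ProcessDirectoryAsync(
        this ISourcePipeline<FileInfo> pipeline,
        DirectoryInfo directory,
        string searchPattern = "*.*",
        SearchOption searchOption = SearchOption.TopDirectoryOnly,
        CancellationToken cancellationToken = default)
    {
        int processed = 0;
        foreach (FileInfo file in directory.EnumerateFiles(searchPattern, searchOption))
        {
            cancellationToken.ThrowIfCancellationRequested();

            // The full path doubles as the document identifier in this sketch.
            await pipeline.ProcessAsync(file, file.FullName, cancellationToken).ConfigureAwait(false);
            processed++;
        }

        return processed;
    }
}

// Hypothetical pipeline that just records which identifiers it was asked to process.
public sealed class RecordingPipeline : ISourcePipeline<FileInfo>
{
    public List<string> Identifiers { get; } = new();

    public Task ProcessAsync(FileInfo source, string identifier, CancellationToken cancellationToken = default)
    {
        Identifiers.Add(identifier);
        return Task.CompletedTask;
    }
}
```

With this shape, a pipeline over Guid or int sources never sees directory-related members, while a FileInfo pipeline keeps the convenience of processing a whole directory.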

fixes #7082

Copilot AI left a comment

Pull request overview

This PR refactors the data ingestion pipeline to be generic over source types, allowing it to work with any source type beyond just FileInfo and Stream. The key change introduces IIngestionDocumentReader<TSource> interface and converts IngestionPipeline to IngestionPipeline<TSource, TChunk>.

Key changes:

  • Introduces IIngestionDocumentReader<TSource> interface to enable generic source type support
  • Refactors IngestionPipeline<T> to IngestionPipeline<TSource, TChunk> with two type parameters
  • Extracts file system-specific operations into FileSystemIngestionExtensions class
  • Moves media type detection logic to MediaTypeProvider utility class

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| IIngestionDocumentReader.cs | New generic interface for document readers with any source type |
| IngestionDocumentReader.cs | Updated abstract class to implement new interface and use MediaTypeProvider |
| IngestionPipeline.cs | Converted to generic over source type; file system operations moved to extensions |
| FileSystemIngestionExtensions.cs | New extension class with FileInfo-specific processing methods |
| MediaTypeProvider.cs | New utility class for media type detection, extracted from IngestionDocumentReader |
| MarkdownReader.cs, MarkItDownReader.cs, MarkItDownMcpReader.cs | Updated to make mediaType parameter nullable |
| Test files | Updated to use generic type parameters and new test cases |

Comments suppressed due to low confidence (1)

src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentReader.cs:1

  • The .markdown and .md extensions mapping is duplicated in both IngestionDocumentReader.cs (line 41) and MediaTypeProvider.cs (line 41). Since MediaTypeProvider is now the centralized location for media type mappings and is being linked into this project, this duplication should be removed.
// Licensed to the .NET Foundation under one or more agreements.

@adamsitnik adamsitnik requested a review from a team as a code owner November 27, 2025 16:37
Development

Successfully merging this pull request may close these issues.

Using IngestionPipeline for content not originating from the file system