stephentoub
Member

@stephentoub stephentoub commented Mar 9, 2025

Necessary, according to @roji:

  • Adds a non-generic IEmbeddingGenerator, with GetService then moving down to the non-generic from the generic. This also simplifies the GetService/GetRequiredService extension methods, which no longer need to be generic on TInput/TEmbedding.
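
As a rough illustration of that first item, the sketch below shows one way service lookup could sit on a non-generic base interface so that callers and extension methods need no TInput/TEmbedding. Every type name here (IEmbeddingGeneratorSketch, SketchMetadata, DemoGenerator) is a hypothetical stand-in, not the library's actual declaration, and the return type of GenerateAsync is simplified.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical non-generic base: service lookup needs no type arguments.
public interface IEmbeddingGeneratorSketch : IDisposable
{
    object? GetService(Type serviceType, object? serviceKey = null);
}

// Hypothetical generic interface: only generation depends on TInput/TEmbedding.
public interface IEmbeddingGeneratorSketch<in TInput, TEmbedding> : IEmbeddingGeneratorSketch
{
    Task<IReadOnlyList<TEmbedding>> GenerateAsync(
        IEnumerable<TInput> values, CancellationToken cancellationToken = default);
}

public static class EmbeddingGeneratorSketchExtensions
{
    // Generic only on the requested service type, not on TInput/TEmbedding.
    public static TService? GetService<TService>(
        this IEmbeddingGeneratorSketch generator, object? serviceKey = null)
        => generator.GetService(typeof(TService), serviceKey) is TService service ? service : default;
}

// Hypothetical metadata type and generator, used only to exercise the sketch.
public sealed record SketchMetadata(string ProviderName);

public sealed class DemoGenerator : IEmbeddingGeneratorSketch<string, float[]>
{
    private readonly SketchMetadata _metadata = new("demo");

    public Task<IReadOnlyList<float[]>> GenerateAsync(
        IEnumerable<string> values, CancellationToken cancellationToken = default)
    {
        // Stand-in "embedding": a one-element vector holding the input length.
        var results = new List<float[]>();
        foreach (string value in values)
        {
            results.Add(new float[] { value.Length });
        }
        return Task.FromResult<IReadOnlyList<float[]>>(results);
    }

    public object? GetService(Type serviceType, object? serviceKey = null)
    {
        if (serviceKey is not null) return null;
        if (serviceType == typeof(SketchMetadata)) return _metadata;
        return serviceType.IsInstanceOfType(this) ? this : null;
    }

    public void Dispose() { }
}
```

With this shape, code that only holds the non-generic interface can still ask for metadata, for example `generator.GetService<SketchMetadata>()`.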

Optional (we should discuss which of these we actually want to take):

  • Deletes GeneratedEmbeddings and makes GenerateAsync return an IAsyncEnumerable<TEmbedding> instead of GeneratedEmbeddings<TEmbedding>. The positive aspect of this is it allows for embeddings to be streamed as they're generated (though the main embedding generation services today don't stream), and also enables TEmbedding to be covariant (though it's not clear how much benefit that actually yields). The negative is we lose a place to hang metadata that's about the whole batch, in particular usage data. This moves that usage data to instead be on the Embedding, which is helpful for cases where such information is available per embedding, but isn't ideal for cases where it's about the whole batch. This PR deals with that by just putting such information onto one of the generated embeddings. We'll need to discuss the tradeoff. The alternative is we keep GeneratedEmbeddings but make it an IAsyncEnumerable instead of an IList. That gives a place to hang such batch information, but only if it's all available at the start of the streaming; however, it's also a bit harder to consume, as you need to await the initial call and then await foreach the embeddings.
  • Adds a SupportedInputMediaTypes set to EmbeddingGeneratorMetadata, so that a generator can be explicit about what inputs it supports.
  • Adds an optional MediaType parameter to TextContent. This would allow, for example, tagging code by language.
  • Makes the DataContent's ReadOnlyMemory constructor require a media type to be supplied. Ideally we'd also require this for non-data urls, but that's a bit trickier, as we can't distinguish between data and non-data urls in the type system. We could check for it dynamically. (This one wasn't actually mentioned by @roji.)
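
To make the streaming tradeoff in the first optional item concrete, here is a minimal sketch of batch-level usage being attached to one of the streamed embeddings. It assumes the usage lands on the last embedding and uses input length as a stand-in token count; EmbeddingSketch and StreamingSketch are hypothetical names, not types from the library.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical embedding type that can optionally carry usage data.
public sealed record EmbeddingSketch(float[] Vector, int? InputTokenCount = null);

public static class StreamingSketch
{
    // Streams one embedding per input; batch-wide usage is only known at the end,
    // so this sketch attaches it to the final embedding (an assumption).
    public static async IAsyncEnumerable<EmbeddingSketch> GenerateAsync(IReadOnlyList<string> inputs)
    {
        int totalTokens = 0;
        for (int i = 0; i < inputs.Count; i++)
        {
            await Task.Yield(); // stand-in for asynchronous work
            totalTokens += inputs[i].Length; // stand-in for a real token count
            bool isLast = i == inputs.Count - 1;
            yield return new EmbeddingSketch(
                new float[] { inputs[i].Length },
                isLast ? totalTokens : (int?)null);
        }
    }

    // A consumer has to enumerate every embedding to recover the batch usage.
    public static async Task<int> SumUsageAsync(IAsyncEnumerable<EmbeddingSketch> embeddings)
    {
        int total = 0;
        await foreach (EmbeddingSketch embedding in embeddings)
        {
            total += embedding.InputTokenCount ?? 0;
        }
        return total;
    }
}
```

The consumer side shows the cost described above: there is no single place to read the batch total, so the caller sums whatever each embedding reports.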

Not addressed:

  • Removing IDisposable from IEmbeddingGenerator (and IChatClient), or adding IAsyncDisposable. Not having these implementations makes middleware awkward, as all middleware components need to expose the interface. It also makes it harder to know you might need to dispose the result of a method like AsEmbeddingGenerator. To my knowledge, we don't know of cases where async disposal is necessary.
  • Add back a Metadata property. I still think GetService<EmbeddingGeneratorMetadata>() is sufficient, but we can discuss it further. We might want to consider an analyzer that would make diagnostic suggestions about supporting it in GetService and filling in as much data as possible.
  • Splitting off a UrlContent from DataContent, to differentiate in a strongly typed way cases where data is included or not.

cc: @roji, @SteveSandersonMS

@github-actions github-actions bot added the area-ai Microsoft.Extensions.AI libraries label Mar 9, 2025
@roji
Member

roji commented Mar 10, 2025

Adds an optional MediaType parameter to TextContent. This would allow, for example, tagging code by language.

Specifically for language, IIRC MIME types don't officially support that. Maybe text/plain;lang=en-US is OK, but I don't think it's standard (that's why HTTP has a separate Content-Language header?). If we wanted to, we could add a Language property to TextContent, though before that we'd probably want to see evidence of actual embedding generation services that accept a language parameter (I'd expect them to maybe auto-detect?). @westey-m, did you encounter any that do?

@stephentoub
Member Author

stephentoub commented Mar 10, 2025

Specifically for language, IIRC MIME types don't officially support that

I was referring to programming languages, e.g. for code generated by the LLM as part of a code interpreter tool, with media types like text/x-python, text/x-csharp, text/html, or application/typescript.

@stephentoub stephentoub force-pushed the addressembeddingfeedback branch from 6fd8d15 to e964967 Compare March 10, 2025 17:53
@stephentoub stephentoub marked this pull request as ready for review March 10, 2025 17:54
@stephentoub stephentoub requested a review from a team as a code owner March 10, 2025 17:54
@stephentoub
Member Author

@roji, @westey-m, @SteveSandersonMS, I've pushed a revision based on our discussion. This now only includes:

  • Lowering GetService down to a non-generic IEmbeddingGenerator
  • Splitting DataContent into DataContent and UriContent. The former requires bytes, the latter is about remote content.
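
The split can be pictured with the sketch below: one type always carries bytes and can render them as a data: URI, the other only points at remote content. The type names (ContentSketch, InMemoryContentSketch, RemoteContentSketch) and the choice to require a media type on both are assumptions made for illustration, not the PR's actual declarations.

```csharp
#nullable enable
using System;

// Hypothetical shared base: both kinds of content carry a media type.
public abstract class ContentSketch
{
    protected ContentSketch(string mediaType)
        => MediaType = mediaType ?? throw new ArgumentNullException(nameof(mediaType));

    public string MediaType { get; }
}

// Stand-in for content whose bytes are held in memory.
public sealed class InMemoryContentSketch : ContentSketch
{
    public InMemoryContentSketch(ReadOnlyMemory<byte> data, string mediaType)
        : base(mediaType) => Data = data;

    public ReadOnlyMemory<byte> Data { get; }

    // Builds a base64 data: URI from the bytes and media type.
    public string ToDataUri()
        => string.Concat("data:", MediaType, ";base64,", Convert.ToBase64String(Data.ToArray()));
}

// Stand-in for content that lives elsewhere and is referenced by URI.
public sealed class RemoteContentSketch : ContentSketch
{
    public RemoteContentSketch(Uri uri, string mediaType)
        : base(mediaType) => Uri = uri ?? throw new ArgumentNullException(nameof(uri));

    public Uri Uri { get; }
}
```

A consumer can then switch on the concrete type to decide whether it has bytes to upload or only a link to pass along, rather than inspecting a URI string at run time.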

@rogerbarreto, FYI

@shyamnamboodiripad, @peterwald, I don't think this should affect you much if at all, but FYI.

@stephentoub
Member Author

cc: @luisquintanilla, since you were previously asking about DataContent representing both in-memory data and remote content

Member

@SteveSandersonMS SteveSandersonMS left a comment

Nice!

Member

@roji roji left a comment

LGTM (see comments)!

@stephentoub
Member Author

Addressed feedback:

  • Renamed MediaTypeStartsWith to HasTopLevelMediaType and changed it to compare just the top-level type portion of the media type, e.g. content.HasTopLevelMediaType("image").
  • Changed Add{Keyed}EmbeddingGenerator to register for both the generic and non-generic interfaces.
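
For the first item, a minimal sketch of comparing only the top-level type might look like the following. MediaTypeSketch is a hypothetical helper written as a static method over a string; the case-insensitive comparison is an assumption, and this is not the library's implementation.

```csharp
using System;

public static class MediaTypeSketch
{
    // Returns true when the part of mediaType before the first '/' equals topLevelType.
    public static bool HasTopLevelMediaType(string mediaType, string topLevelType)
    {
        ReadOnlySpan<char> span = mediaType.AsSpan();
        int slash = span.IndexOf('/');

        // With no '/', treat the whole string as the top-level type.
        ReadOnlySpan<char> topLevel = slash < 0 ? span : span.Slice(0, slash);

        return topLevel.Equals(topLevelType.AsSpan(), StringComparison.OrdinalIgnoreCase);
    }
}
```

One difference from a plain prefix check is that a value such as "imagery/x" starts with "image" but does not have "image" as its top-level type, so this comparison rejects it.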

@stephentoub stephentoub merged commit 7f051d4 into dotnet:main Mar 10, 2025
6 checks passed
@stephentoub stephentoub deleted the addressembeddingfeedback branch March 10, 2025 22:03
Debug.Assert(Data is not null, "Expected Data to be initialized.");
_uri = string.Concat("data:", MediaType, ";base64,", Convert.ToBase64String(Data.GetValueOrDefault()
Debug.Assert(_data is not null, "Expected _data to be initialized.");
_uri = string.Concat("data:", MediaType, ";base64,", Convert.ToBase64String(_data.GetValueOrDefault()
Contributor

@rogerbarreto rogerbarreto Mar 10, 2025

Once the _dataUri is generated from the _data, since it's readonly, wouldn't it be nice to cache it, and vice versa?

Suggested change
_uri = string.Concat("data:", MediaType, ";base64,", Convert.ToBase64String(_data.GetValueOrDefault()
_uri = string.Concat("data:", MediaType, ";base64,", Convert.ToBase64String(_data.GetValueOrDefault()

Member Author

Where would _dataUri end up being used again?

RussKie added a commit that referenced this pull request Mar 10, 2025
* The .NET side code for the `ScenarioRunResult` was recently changed (#5998) to include `ChatResponse` (which can contain multiple `ChatMessage`s) in place of a single `ChatMessage`. Unfortunately, we missed updating the TypeScript reporting code to account for this. (#6061)

This change fixes the problem by updating the deserialization code in TypeScript to match what .NET code serializes.

* Automatically add 'untriaged' label to new issues without milestones (#6060)

* Address M.E.VectorData feedback for IEmbeddingGenerator (#6058)

* Move GetService down to a non-generic IEmbeddingGenerator interface

* Separate UriContent from DataContent

* Address feedback

---------

Co-authored-by: Shyam N <shyamnamboodiripad@users.noreply.github.com>
Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>
Co-authored-by: Stephen Toub <stoub@microsoft.com>
joperezr pushed a commit to joperezr/extensions that referenced this pull request Mar 11, 2025
* Move GetService down to a non-generic IEmbeddingGenerator interface

* Separate UriContent from DataContent

* Address feedback
joperezr pushed a commit to joperezr/extensions that referenced this pull request Mar 11, 2025
Address M.E.VectorData feedback for IEmbeddingGenerator (dotnet#6058)

* Move GetService down to a non-generic IEmbeddingGenerator interface

* Separate UriContent from DataContent
@github-actions github-actions bot locked and limited conversation to collaborators Apr 10, 2025