Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove or relax the length limitation of Uri #96544

Open
hez2010 opened this issue Jan 5, 2024 · 6 comments
Open

Remove or relax the length limitation of Uri #96544

hez2010 opened this issue Jan 5, 2024 · 6 comments

Comments

@hez2010
Copy link
Contributor

hez2010 commented Jan 5, 2024

The data is a valid scheme which can carry some data other than path in a Uri. For example, we can save an image as base64 data in a Uri:

var uri = new Uri("data:image/jpeg;base64,...");

However, Uri in .NET has a maximum length limitation of 0xFFF0 which makes it unable to save some large data (especially, base64 with length more than 65496 (0xFFF0 - "data:image/jpeg;base64,".Length)).

It's common that a HTML contains an image tag which use base64 encoded data for its src attribute:

<image src="data:image/jpeg;base64,..." />

However, we are unable to get the src as a Uri if the length of src is too large, for example, an 1 mb image.

new Uri("data:image/jpeg;base64," + Enumerable.Repeat("a", 70000).Aggregate((s, n) => $"{s}{n}"))

The above code will throw a UriFormatException with Invalid URI: The Uri string is too long..

While the same code is perfect valid in JavaScript: new URL("data:image/jpeg;base64," + "a".repeat(70000)). The limitation in .NET here also makes it hard to do interop with JavaScript URL types (such as Blazor), because we cannot simply project the Uri type to the JavaScript URL type.

In the real-world scenario, I hit this issue while using semantic-kernel with GPT-4 Vision, which uses base64 image data in its image_url field. I have to resize the images to make them less than the length limitation.

I would like to suggest remove this limitation, or relax it, for example, relax the limitation from 0xFFF0 to 0x7FFFFFF0.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 5, 2024
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 5, 2024
@vcsjones vcsjones added area-System.Net and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jan 5, 2024
@ghost
Copy link

ghost commented Jan 5, 2024

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

The data is a valid scheme which can carry some data other than path in a Uri. For example, we can save an image as base64 data in a Uri:

var uri = new Uri("data:image/jpeg;base64,dGVzdA==");

However, Uri in .NET has a maximum length limitation of 0xFFF0 which makes it unable to save some large data (especially, base64 with length more than 65496 (0xFFF0 - "data:image/jpeg;base64,".Length)).

It's common that a HTML contains an image tag which use base64 encoded data for its src attribute:

<image src="data:image/jpeg;base64,dGVzdA==" />

However, we are unable to get the src as a Uri if the length of src is too large, for example, an 1 mb image.

new Uri("data:image/jpeg;base64," + Enumerable.Repeat("a", 70000).Aggregate((s, n) => $"{s}{n}"))

The above code will throw a UriFormatException with Invalid URI: The Uri string is too long..

While the same code is perfect valid in JavaScript: new URL("data:image/jpeg;base64," + "a".repeat(70000)). The limitation in .NET here also makes it hard to do interop with JavaScript URL types (such as Blazor), because we cannot simply project the Uri type to the JavaScript URL type.

In the real-world scenario, I hit this issue while using semantic-kernel with GPT-4 Vision, which uses base64 image data in its image_url field. I have to resize the images to make them less than the length limitation.

I would like to suggest remove this limitation, or relax it, for example, relax the limitation from 0xFFF0 to 0x7FFFFFF0.

Author: hez2010
Assignees: -
Labels:

area-System.Net, untriaged

Milestone: -

@colejohnson66
Copy link

if (length >= c_MaxUriBufferSize)
return ParsingError.SizeLimit;

What's the purpose of there even being a cap on the length? RFC 3986 doesn't mention anything about a limit needing to exist, and there's no comment as to why.

@MihaZupan
Copy link
Member

MihaZupan commented Jan 5, 2024

The same discussion from a few years ago: Uri rejects otherwise valid strings with length >= 65520 (#1857)

Essentially:

  • Uri is practically useless for data: uris, as it provides no meaningful validation, and exposes no utility methods for extracting the data from the format
  • Having a length limit makes sense in most cases (a 64 KB url is huge for HTTP), but obviously unfortunate for data:
  • This isn't a random constant. The limitation (mainly?) came from the fact that Uri internally stores offsets as ushorts (hence the ~2^16 length limit). Lifting it would mean more work to make sure other parts of Uri play nice

Ideally, we would have a dedicated type for dealing with data Uris - #85164 (comment).
Also related: #95838

But we don't have a good answer for cases where you're stuck with using Uri by existing APIs, like in your case, so I'm hesitant to close this as wont-fix as we did with #1857.

Moving to Future for now to let others comment/upvote if they run into the same problem

@dersia
Copy link

dersia commented Mar 12, 2024

@MihaZupan this would be really important for the Azure Ai sdk and semantic kernel and I would love for this to move forward. can we do something to move this forward?

I thought about your suggestion to use a new type like proposed in #85164 but the problem that I see with that is for apis/sdks like Azure SDK that are built from open-api-specs using tools like autorest. the type in the spec just says url and what would you map it to? System.Uri or System.DataUri.

so from that point of view I think it makes much mir sense to move this proposal forward and make System.Uri to work with DataUri's. and in fact it does work with DataUri's already, it is just that because of the size limitations it won't work with bigger DataUri's.

also adding a new type would bring up a weird state where you can use System.Uri as long as you stay within bound and have to switch to System.DataUri when you exceed this limit. so there would be no clear recommendation which one to use + this isn't breaking.

//Edit:
having said the above, I would change the api to add an ctor overload that takes a ROM<bytes> that would than do the base64 encoding and create a valid datauri out of bytes.

//Edit2:
I looked into the DataUri spec again and as a result I would also suggest to add a factory-method to allow correct construction of non base64 DataUri's. in this case I would change the byte-based constructor to be also a factory method. I will add alternate api change suggestions.

namespace System;

public class Uri
{
    // new apis
    // data:content/type;foo=bar;base64,R0lGODdh
    public Uri(ReadOnlyMemory<byte> data, string contentType, Dictionary<string, string>? additionalProperties = null);
    // data:content/type;foo=bar;utf8,hello world
    public static Uri AsDataUri(string data, string contentType, Dictionary<string, string>? additionalProperties = null);
}

alternate and preferred design:

namespace System;

public class Uri
{
    // new apis
    // data:content/type;foo=bar;base64,R0lGODdh
    public static Uri AsDataUri(ReadOnlyMemory<byte> data, string contentType, Dictionary<string, string>? additionalProperties = null);
    // data:content/type;foo=bar;utf8,hello world
    public static Uri AsDataUri(string data, string contentType, Dictionary<string, string>? additionalProperties = null);
}

github-merge-queue bot referenced this issue in microsoft/semantic-kernel Jun 10, 2024
### Motivation and Context

⚠️ Breaking changes on non-experimental types **ImageContent**

Resolves #5625
Resolves #5295

For a brief time this changes will keep the content below as
experimental.

- BinaryContent
- AudioContent
- ImageContent 
- FunctionCallContent
- FunctionResultContent

Changes:

### **BinaryContent** 
- Removed providers for lazy loading content, simplifying its usage and
APIs.
- Removed `Stream` constructor to avoid IDisposable resource consumption
or bad practices.
- Added `Uri` dedicated for Referenced Uri information
- Added `DataUri` property which can be set or get dynamically (auto
generated if you created the
content using byte array with a mime type) 
Setting a `DataUri` will automatically update the `MimeType` property
and add any extra metadata that may be available in the data scheme
definition.
- Added a required `mimeType` property to the ByteArray constructor, to
encourage passing the mimeType when creating BinaryContent directly or
from specializations.
- Added `Data` property which can be set or get dynamically (auto
generated if you created the content using a data uri format)
Setting a Data on an existing BinaryContent will also reflect on the
getter of `DataUri` for the given content.
- Added DataUri and Base64 validation when setting DataUri on the
contents.
- When using DataUri parameters those merge with the current content
metadata.
i.e:
`data:image/jpeg;parameter1=value1;parameter2=value2;base64,binary==`

### ⚠️ **ImageContent** Fixes bugs and inconsistency behavior:
- Setting the Data of an image doesn't change the current data uri and
vice versa, allowing the sema image content instance to have different
binary data to representations.
- When an Image could not have DataUri and Uri in the same instance,
this limits scenarios where you have the image data but want to have a
reference to where that content is from.
- Wasn't possible to create an Image from a data uri greater than the
size limit of .Net System.Uri type here:

[https://github.com/dotnet/runtime/issues/96544](https://github.com/dotnet/runtime/issues/96544).

### **FunctionResultContent**
- Update `Id` property to `CallId`.
@mikepizzo
Copy link

Note: this is not just for data: uris.

We have scenarios in which users create really long queries (for example, return instances that match this set of ids, where the number of ids can be in the thousands).

In OData, we have added a pattern for adding /$query to the resource path and using POST to pass the query string in the body of the request to work around limitations with HTTP stacks. However, internally we still build full URIs for the request (for example, in order to generate a "nextLink") and run into this limitation.

See, for example, OData#1293

@hez2010
Copy link
Contributor Author

hez2010 commented Aug 6, 2024

Yeah. This issue has to be fixed as internally we are creating an instance of Uri even if you are using the string overload in HttpRequestMessage.
@MihaZupan We have no way to workaround this issue if there's any usage to HttpClient.

LudoCorporateShark referenced this issue in LudoCorporateShark/semantic-kernel Aug 25, 2024
### Motivation and Context

⚠️ Breaking changes on non-experimental types **ImageContent**

Resolves microsoft#5625
Resolves microsoft#5295

For a brief time this changes will keep the content below as
experimental.

- BinaryContent
- AudioContent
- ImageContent 
- FunctionCallContent
- FunctionResultContent

Changes:

### **BinaryContent** 
- Removed providers for lazy loading content, simplifying its usage and
APIs.
- Removed `Stream` constructor to avoid IDisposable resource consumption
or bad practices.
- Added `Uri` dedicated for Referenced Uri information
- Added `DataUri` property which can be set or get dynamically (auto
generated if you created the
content using byte array with a mime type) 
Setting a `DataUri` will automatically update the `MimeType` property
and add any extra metadata that may be available in the data scheme
definition.
- Added a required `mimeType` property to the ByteArray constructor, to
encourage passing the mimeType when creating BinaryContent directly or
from specializations.
- Added `Data` property which can be set or get dynamically (auto
generated if you created the content using a data uri format)
Setting a Data on an existing BinaryContent will also reflect on the
getter of `DataUri` for the given content.
- Added DataUri and Base64 validation when setting DataUri on the
contents.
- When using DataUri parameters those merge with the current content
metadata.
i.e:
`data:image/jpeg;parameter1=value1;parameter2=value2;base64,binary==`

### ⚠️ **ImageContent** Fixes bugs and inconsistency behavior:
- Setting the Data of an image doesn't change the current data uri and
vice versa, allowing the sema image content instance to have different
binary data to representations.
- When an Image could not have DataUri and Uri in the same instance,
this limits scenarios where you have the image data but want to have a
reference to where that content is from.
- Wasn't possible to create an Image from a data uri greater than the
size limit of .Net System.Uri type here:

[https://github.com/dotnet/runtime/issues/96544](https://github.com/dotnet/runtime/issues/96544).

### **FunctionResultContent**
- Update `Id` property to `CallId`.
@stephentoub stephentoub modified the milestones: Future, 10.0.0 Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants