Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor file upload for content fragment #5953

Closed
wants to merge 7 commits into from

Conversation

flvndvd
Copy link
Contributor

@flvndvd flvndvd commented Jun 28, 2024

Description

Summary

While working on vision, we identified a race condition in the process of uploading content fragment files. Currently, we create the conversation/post the message with the content fragment before uploading the associated file, which causes issues. As we plan to use URLs for vision and need to resize images (increasing upload time), the existing logic needs an refactoring.

Key Changes

This PR revamps the current logic, separating files from content fragments. While files will still be used within content fragments, they can now exist independently. The goal is to achieve this separation without storing every file in a database. The main changes include:

  1. New Endpoint /files: This endpoint must be called first, allowing the client to obtain a file ID. The endpoint returns the file ID and a signed (using JWT) upload URL valid for 30 seconds.
  2. New Endpoint /files/:fileId: This endpoint supports both file upload and retrieval. For uploads, the client must provide the token to verify that the file ID was generated by our system. While it currently doesn't process textual files, it does resize images. Once the file is uploaded, the endpoint returns a download URL, which can be used in the url field of a content fragment.

The useFileUploaderService has been adjusted to accommodate these changes. Previously, no special handling was done until files were submitted. With the new approach, files are uploaded to our cloud storage as soon as they are uploaded to the system, unblocking the send button only after the upload and resizing are complete. This avoids impacting the chat experience.

Impact on Production

  • All files must now be uploaded using the new architecture to be used in content fragments.
  • Text extraction remains on the front-end for now but will soon be moved to the new pattern.
  • Legacy logic for uploading raw files post-content fragment creation is removed.

Trade-offs and Future Considerations

This solution has some trade-offs, such as the inability to recompute file IDs and the loss of the current GCS structure of conversation/message/content. There is also the potential for "ghost" files to remain in the system if they are added to the input bar but never sent. However, these drawbacks are acceptable given the increased flexibility needed for future file handling enhancements.

Note on Public API

These changes do not yet apply to the public API but will be rolled out once vision becomes generally available.

Image Resizing

Based on experiments with vision, we plan to resize the largest side of images down to 768px.

Follow up work

  • Implement file deletion when files are being removed from the input bar

Demos

ImageUpload

PDFUpload

Risk

Worst case, file upload is broken.

⚠️ Rolling back is not fully safe, as it will disrupt the file downloads for files created while this code was in production.

Deploy Plan

@flvndvd flvndvd marked this pull request as ready for review June 28, 2024 23:15
Comment on lines +35 to +36
"image/jpeg",
"image/png",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this, which gates vision before shipping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems non homogeneous. We should standardize on content type or file extension. Do browsers accept both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this was for test purpose, will be reworked in the final PR.

@flvndvd flvndvd self-assigned this Jun 29, 2024
@flvndvd flvndvd requested a review from spolu June 29, 2024 09:05
Copy link
Contributor

@spolu spolu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

I thiiiink this PR needs to be split if we want to make forward progress as it includes things that needs to be discussed separately:

  • A new file upload endpoint whose details needs to be discussed in details: do we support streaming? do we support transformation? what API do we want for this?
  • Move content fragment to the new file upload. This one is quite straightforward and will be much easier to review once the file upload endpoint is cut and solid
  • Introduce image attachments which will require further separate discussions

Happy to have all of them here, but I'm not sure it's the most efficent :)

Comment on lines +35 to +36
"image/jpeg",
"image/png",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems non homogeneous. We should standardize on content type or file extension. Do browsers accept both?

@@ -122,26 +122,6 @@ export async function submitMessage({
);
})
);

for (const [i, mcfRes] of contentFragmentsRes.entries()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why stop looking for errors?

@@ -110,7 +110,7 @@ export async function submitMessage({
body: JSON.stringify({
title: contentFragment.title,
content: contentFragment.content,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is content still pushed here?

* Text files handling.
*/

export function isSupportedTextMimeType(file: formidable.File): boolean {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid these type of functions which are just a wrapper on another function but also only used once makes understanding the code harder not simpler IMHO

}

switch (req.method) {
case "POST": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint needs to be aggressively rate-limited

@@ -141,13 +145,65 @@ export function useFileUploaderService() {
const results = await Promise.all(previewPromises);

setFileBlobs([...fileBlobs, ...results]);

for (const fileBlob of results) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a Promise.all here or at least have some form of parallelization as we def don't want to wait for a sequential upload?

});

if (!uploadResult.ok) {
throw new FileBlobUploadError("failed_to_upload_file", fileBlob.file);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No try catch of our own errors

});
}

setIsProcessingFiles(false);
} catch (err) {
if (err instanceof FileBlobUploadError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use a Result

await resizeAndUploadToFileStorage(owner, fileTokenPayload, file);
} else if (isSupportedTextMimeType(file)) {
// TODO:(2026-06-28 flav) Move logic to extract text from PDF here.
await uploadToFileStorage(owner, fileTokenPayload, file);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're building a new file API we should make sure to support something forward looking. This downloads the entire file in memory before uploading right?

If not: since we're cutting a new API we want to make sure it supports proper streaming?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It creates interesting questions on how we want to handle that: we can't stream and transform at the same time. Which hints that we likely want a file status API and an asyncrhonous processing mechanism run server side to resize images or extract PDFs.


try {
const form = new IncomingForm();
const [, files] = await form.parse(req);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this materialize the file in memory? We'd want to stream to GCS instead here no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the file is not in memory but written on the disk. It's then exposed as a stream that we can handle. We could avoid saving anything on the disk by simply streaming straight away to GCS but this means that we can't do any pre-processing before saving.

@flvndvd flvndvd closed this Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants