Refactor file upload for content fragment #5953

flvndvd · 2024-06-28T22:38:22Z

Description

Summary

While working on vision, we identified a race condition in the process of uploading content fragment files. Currently, we create the conversation/post the message with the content fragment before uploading the associated file, which causes issues. As we plan to use URLs for vision and need to resize images (increasing upload time), the existing logic needs an refactoring.

Key Changes

This PR revamps the current logic, separating files from content fragments. While files will still be used within content fragments, they can now exist independently. The goal is to achieve this separation without storing every file in a database. The main changes include:

New Endpoint /files: This endpoint must be called first, allowing the client to obtain a file ID. The endpoint returns the file ID and a signed (using JWT) upload URL valid for 30 seconds.
New Endpoint /files/:fileId: This endpoint supports both file upload and retrieval. For uploads, the client must provide the token to verify that the file ID was generated by our system. While it currently doesn't process textual files, it does resize images. Once the file is uploaded, the endpoint returns a download URL, which can be used in the url field of a content fragment.

The useFileUploaderService has been adjusted to accommodate these changes. Previously, no special handling was done until files were submitted. With the new approach, files are uploaded to our cloud storage as soon as they are uploaded to the system, unblocking the send button only after the upload and resizing are complete. This avoids impacting the chat experience.

Impact on Production

All files must now be uploaded using the new architecture to be used in content fragments.
Text extraction remains on the front-end for now but will soon be moved to the new pattern.
Legacy logic for uploading raw files post-content fragment creation is removed.

Trade-offs and Future Considerations

This solution has some trade-offs, such as the inability to recompute file IDs and the loss of the current GCS structure of conversation/message/content. There is also the potential for "ghost" files to remain in the system if they are added to the input bar but never sent. However, these drawbacks are acceptable given the increased flexibility needed for future file handling enhancements.

Note on Public API

These changes do not yet apply to the public API but will be rolled out once vision becomes generally available.

Image Resizing

Based on experiments with vision, we plan to resize the largest side of images down to 768px.

Follow up work

Implement file deletion when files are being removed from the input bar

Demos

Risk

Worst case, file upload is broken.

⚠️ Rolling back is not fully safe, as it will disrupt the file downloads for files created while this code was in production.

Deploy Plan

flvndvd · 2024-06-29T08:18:51Z

front/components/assistant/conversation/input_bar/InputBarContainer.tsx

+  "image/jpeg",
+  "image/png",


I'll remove this, which gates vision before shipping.

This seems non homogeneous. We should standardize on content type or file extension. Do browsers accept both?

Yes this was for test purpose, will be reworked in the final PR.

spolu

Left some comments.

I thiiiink this PR needs to be split if we want to make forward progress as it includes things that needs to be discussed separately:

A new file upload endpoint whose details needs to be discussed in details: do we support streaming? do we support transformation? what API do we want for this?
Move content fragment to the new file upload. This one is quite straightforward and will be much easier to review once the file upload endpoint is cut and solid
Introduce image attachments which will require further separate discussions

Happy to have all of them here, but I'm not sure it's the most efficent :)

spolu · 2024-07-01T07:55:47Z

front/components/assistant/conversation/input_bar/InputBarContainer.tsx

+  "image/jpeg",
+  "image/png",


This seems non homogeneous. We should standardize on content type or file extension. Do browsers accept both?

spolu · 2024-07-01T07:56:39Z

front/components/assistant/conversation/lib.ts

@@ -122,26 +122,6 @@ export async function submitMessage({
        );
      })
    );
-
-    for (const [i, mcfRes] of contentFragmentsRes.entries()) {


Why stop looking for errors?

spolu · 2024-07-01T07:57:58Z

front/components/assistant/conversation/lib.ts

@@ -110,7 +110,7 @@ export async function submitMessage({
            body: JSON.stringify({
              title: contentFragment.title,
              content: contentFragment.content,


why is content still pushed here?

spolu · 2024-07-01T08:06:13Z

front/lib/api/files.ts

+ * Text files handling.
+ */
+
+export function isSupportedTextMimeType(file: formidable.File): boolean {


Can we avoid these type of functions which are just a wrapper on another function but also only used once makes understanding the code harder not simpler IMHO

spolu · 2024-07-01T08:07:51Z

front/pages/api/w/[wId]/files/index.ts

+  }
+
+  switch (req.method) {
+    case "POST": {


This endpoint needs to be aggressively rate-limited

spolu · 2024-07-01T08:13:15Z

front/hooks/useFileUploaderService.ts

@@ -141,13 +145,65 @@ export function useFileUploaderService() {
      const results = await Promise.all(previewPromises);

      setFileBlobs([...fileBlobs, ...results]);
+
+      for (const fileBlob of results) {


Should be a Promise.all here or at least have some form of parallelization as we def don't want to wait for a sequential upload?

spolu · 2024-07-01T08:14:03Z

front/hooks/useFileUploaderService.ts

+        });
+
+        if (!uploadResult.ok) {
+          throw new FileBlobUploadError("failed_to_upload_file", fileBlob.file);


No try catch of our own errors

spolu · 2024-07-01T08:14:53Z

front/hooks/useFileUploaderService.ts

+        });
+      }
+
+      setIsProcessingFiles(false);
    } catch (err) {
      if (err instanceof FileBlobUploadError) {


Should use a Result

spolu · 2024-07-01T08:20:54Z

front/pages/api/w/[wId]/files/[fileId].ts

+          await resizeAndUploadToFileStorage(owner, fileTokenPayload, file);
+        } else if (isSupportedTextMimeType(file)) {
+          // TODO:(2026-06-28 flav) Move logic to extract text from PDF here.
+          await uploadToFileStorage(owner, fileTokenPayload, file);


If we're building a new file API we should make sure to support something forward looking. This downloads the entire file in memory before uploading right?

If not: since we're cutting a new API we want to make sure it supports proper streaming?

It creates interesting questions on how we want to handle that: we can't stream and transform at the same time. Which hints that we likely want a file status API and an asyncrhonous processing mechanism run server side to resize images or extract PDFs.

spolu · 2024-07-01T08:33:45Z

front/pages/api/w/[wId]/files/[fileId].ts

+
+      try {
+        const form = new IncomingForm();
+        const [, files] = await form.parse(req);


Will this materialize the file in memory? We'd want to stream to GCS instead here no?

Here the file is not in memory but written on the disk. It's then exposed as a stream that we can handle. We could avoid saving anything on the disk by simply streaming straight away to GCS but this means that we can't do any pre-processing before saving.

flvndvd added 3 commits June 29, 2024 00:26

Add base logic to support file upload

2fb5624

✨

600e525

Fix pdf display

6f6d9f2

flvndvd marked this pull request as ready for review June 28, 2024 23:15

Store file name in file metadata on cloud storage

c7736de

flvndvd commented Jun 29, 2024

View reviewed changes

flvndvd added 3 commits June 29, 2024 10:36

Implement file size limit

11b1214

Delete file from cloud storage

0efcafe

✨

da2f6bc

flvndvd self-assigned this Jun 29, 2024

flvndvd requested a review from spolu June 29, 2024 09:05

spolu reviewed Jul 1, 2024

View reviewed changes

flvndvd mentioned this pull request Jul 2, 2024

Introduce file upload api #5977

Merged

flvndvd closed this Jul 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor file upload for content fragment #5953

Refactor file upload for content fragment #5953

flvndvd commented Jun 28, 2024 •

edited

Loading

flvndvd Jun 29, 2024

spolu Jul 1, 2024

flvndvd Jul 2, 2024

spolu left a comment

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

spolu Jul 1, 2024

flvndvd Jul 1, 2024

Refactor file upload for content fragment #5953

Refactor file upload for content fragment #5953

Conversation

flvndvd commented Jun 28, 2024 • edited Loading

Description

Summary

Key Changes

Impact on Production

Trade-offs and Future Considerations

Note on Public API

Image Resizing

Follow up work

Demos

Risk

Deploy Plan

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spolu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

flvndvd commented Jun 28, 2024 •

edited

Loading