Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce file upload api #5977

Merged
merged 20 commits into from
Jul 2, 2024
Merged

Introduce file upload api #5977

merged 20 commits into from
Jul 2, 2024

Conversation

flvndvd
Copy link
Contributor

@flvndvd flvndvd commented Jul 1, 2024

Description

This PR goes one step further than the original attempt at introducing a file API in #5953.

Summary

While working on vision, we identified a race condition in the process of uploading content fragment files. Currently, we create the conversation or post the message with the content fragment before uploading the associated file, which causes issues. As we plan to use URLs for vision and need to resize images (increasing upload time), the existing logic requires refactoring.

Key Changes

This PR establishes a new set of endpoints aimed at facilitating file uploads, laying the groundwork for future expansion to support various file types and use cases. The key additions include:

  1. /files Endpoint:

    • Creates a file entry in the database and generates a file ID.
    • Ensures the file adheres to size and content type requirements.
    • Enforces an agressive rate limit.
    • Returns a file ID and an upload URL valid for 60 seconds to prevent incomplete uploads.
  2. /files/:fileId Endpoint:

    • Supports both file uploads and retrievals.
    • Allows a 60-second window for uploading files.
    • Streams the file directly to cloud storage, applies pre-processing if necessary, and stores the processed file in the cloud.
    • Currently, the endpoint resizes images and plans to incorporate PDF text extraction in a future update.
    • Returns a download URL after the file is uploaded and pre-processed.

Pre-processing is currently handled within this endpoint due to the following reasons:

  • Only small files (a few megabytes) are supported, making quick pre-processing feasible.
  • Simplifies the client experience by providing ready-to-use files upon API response.

In the future, this approach may need to be revised for larger files (a few gigabytes).

The concept of use cases is introduced, although not yet implemented. Future updates may apply different pre-processing based on the use case. The only supported use case at present is conversation, aligning with existing content fragment logic.

Memory and Performance Considerations

  • The service minimizes memory usage by avoiding loading entire files into memory or writing them to disk, leveraging cloud storage instead. The exception is image files, which require full memory loading for resizing using Sharp.

Known Issues and Updates

  • The current version of google-cloud/storage has a known issue with streams on Node.js > 21.x. Consequently, this PR includes a version bump. The update primarily changes a few types without affecting behavior.

Risk

This endpoint is not yet used.

Deploy Plan

@@ -45,5 +47,8 @@ export abstract class BaseResource<M extends Model> {
return new this(this.model, blob.get());
}

abstract delete(transaction?: Transaction): Promise<Result<undefined, Error>>;
abstract delete(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored so delete always takes an Authenticator instance.

@flvndvd flvndvd marked this pull request as ready for review July 2, 2024 08:17
Copy link
Contributor

@spolu spolu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, left mostly aesthetic comments

// Cloud storage logic.

getPublicUrl(auth: Authenticator): string {
// TODO(2024-07-01 flav) Remove once we introduce AuthenticatorWithWorkspace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this todo really is not needed :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep around since this check really bothers me and clutters a lot the code.

@@ -28,7 +30,7 @@ export abstract class BaseResource<M extends Model> {
this.id = blob.id;
}

static async fetchById<T extends BaseResource<M>, M extends Model>(
static async fetchByInternalId<T extends BaseResource<M>, M extends Model>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rename it I would call it fetchByModelId to be even more explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point.


static async fetchById(
auth: Authenticator,
id: FileId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we generaly use fileId for sId arguments

static async makeNew(
blob: Omit<CreationAttributes<FileModel>, "status" | "sId">
) {
const fileId = makeDustFileId();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not introduce new terminology, this is an sId right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}
}

function makeDustFileId(): FileId {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's modify generateModelSId to take an optional prefix instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to do it at first, but was not sure we were okay with it 🔥 !


res.status(200).json({ file: fileRes.toJSON(auth) });
return;
} catch (error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catch should be around external functions only. Let's narrow the catch to the relevant call. If that can come from maybeApplyPreProcessing then this catch is not valid

});
}

const preProcessingRes = await maybeApplyPreProcessing(auth, fileRes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fine of "...Res" for resources. Should be just "file" no since ...Res is already used everywhere for results

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree but file was taken by formiddable, will update.

});
}

const fileRes = await FileResource.fetchById(auth, fileId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call it "file"


export async function maybeApplyPreProcessing(
auth: Authenticator,
fileRes: FileResource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file

auth: Authenticator,
fileRes: FileResource
): Promise<Result<undefined, Error>> {
const processing = processingPerContentType[fileRes.contentType];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we introduce the useCase logic here even if there is only one possible value for useCase? 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor

@spolu spolu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM \o/

@flvndvd flvndvd merged commit 9c592c4 into main Jul 2, 2024
4 checks passed
@flvndvd flvndvd deleted the flav/file-upload-api branch July 2, 2024 11:00
albandum pushed a commit that referenced this pull request Jul 2, 2024
* Checkout file upload endpoint related code

* Limit file upload to 1.

* Tmp

* Implement FileResource

* ✨

* Fix google-cloud/storage issue

* Add unique sId index

* Fix migration

* 👕

* Make file upload rate limit more aggressive

* Enforce upload window

* Ensure the uploaded file matches the requirements

* ✂️

* Use fromidable to adhere to file requirements

* s/fetchByInternalId/fetchByModelId

* Move try/catch + move markAsFailed for preprocessing

* s/fileRes/file

* Remove types for FileId

* 🙈

* Implement use cases for pre-processing
@flvndvd flvndvd mentioned this pull request Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants