Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a non-S3 file-system based option for temporary file storage #64

Merged
merged 14 commits into from
Dec 11, 2024

Conversation

jkwatson
Copy link
Collaborator

No description provided.

Comment on lines -79 to +77
.bucketName(
Optional.ofNullable(System.getenv("S3_RAG_DOCUMENT_BUCKET")).orElse("rag-files"))
.bucketName(Optional.ofNullable(System.getenv("S3_RAG_DOCUMENT_BUCKET")).orElse(""))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if this is a breaking change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we verified it works as desired. 👍🏽

@@ -62,6 +60,8 @@ public S3RagFileUploader(

@Override
public void uploadFile(MultipartFile file, String s3Path) {
log.info("Uploading file to S3: {}", s3Path);
System.out.println("Uploading file to S3: " + s3Path);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes!

Comment on lines 76 to 78
source_file = Path(settings.rag_databases_dir, "file_storage", document_key)
target_file = Path(temp_dir, original_filename)
shutil.copy(source_file, target_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we can just return source_file. I don't think we depend on the specific file path for anything else anymore

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. is that true? I thought the docling stuff needed the directory to write the temporary markdown file to

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah...we definitely need the temp version returned, since it has the original file name, unlike the saved one.

Comment on lines 66 to 67
# This is a hook to configure the pytest session
def pytest_configure(config: Any) -> None:
settings.rag_databases_dir = "/tmp/databases"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason why we're doing this hard-coded path instead of something like https://docs.pytest.org/en/stable/how-to/tmp_path.html?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pure python ignorance is all

"""
Download document from S3
"""
return s3.download(temp_dir, bucket_name, document_key, original_filename)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we should just copy the content of s3.download? Instead of having two ways to deal with S3, since we want to incentivize DocumentStorage

@ewilliams-cloudera ewilliams-cloudera merged commit b9eb72d into main Dec 11, 2024
3 checks passed
@ewilliams-cloudera ewilliams-cloudera deleted the mob/main branch December 11, 2024 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants