S3 Directory Document Loading Component #2818
Merged
This PR introduces a new component that lets a user load multiple documents from an S3 bucket. It takes two optional parameters, Server URL and Prefix. The component mirrors the functionality of the filesystem directory document loading component.
When connecting to a MinIO bucket or a locally hosted S3 bucket, the Server URL must be provided.
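As a rough illustration of how an optional Server URL can be handled, here is a minimal Python sketch. It is not the code from this PR: the helper name `build_s3_client_kwargs`, the region value, and the `http://localhost:9000` address are assumptions made for the example. The returned keys follow the `region_name` and `endpoint_url` parameter names accepted by `boto3.client("s3", ...)`.

```python
from typing import Any, Dict, Optional


def build_s3_client_kwargs(region: str, server_url: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for an S3 client (hypothetical helper).

    When server_url is empty, no endpoint override is set, so the client
    would fall back to the default AWS endpoint for the region.
    """
    kwargs: Dict[str, Any] = {"region_name": region}
    if server_url and server_url.strip():
        # Custom endpoint, e.g. a MinIO server or a local S3-compatible service.
        kwargs["endpoint_url"] = server_url.strip()
    return kwargs


# Assumed local MinIO address, for illustration only.
minio_kwargs = build_s3_client_kwargs("us-east-1", "http://localhost:9000")

# No Server URL: the default (global) S3 endpoint would be used.
aws_kwargs = build_s3_client_kwargs("us-east-1")
```

With boto3 installed, the result could be passed as `boto3.client("s3", **minio_kwargs)`. Some MinIO setups may also need path-style addressing, which this sketch does not configure.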
To load only the entries under a specific directory, use the Prefix option. The hierarchy in S3 is flat, so if directoryB sits inside directoryA, you would specify directoryA/directoryB to load only the contents of directoryB. Loading is recursive by default. Another option can be added to limit that if needed.
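The prefix behaviour described above can be sketched in a few lines of Python. This is an illustration of prefix matching over a flat key list, not the component's implementation: `select_keys`, the sample keys, and the `recursive` flag are hypothetical (the flag stands in for the possible future option mentioned above).

```python
from typing import Iterable, List


def select_keys(keys: Iterable[str], prefix: str = "", recursive: bool = True) -> List[str]:
    """Return the object keys that fall under a prefix (hypothetical helper)."""
    selected: List[str] = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # outside the requested "directory"
        if key.endswith("/"):
            continue  # zero-byte folder placeholder, not a document
        remainder = key[len(prefix):].lstrip("/")
        if not recursive and "/" in remainder:
            continue  # nested deeper than one level below the prefix
        selected.append(key)
    return selected


# Sample bucket contents, invented for this example.
keys = [
    "root.pdf",
    "directoryA/a.pdf",
    "directoryA/directoryB/b.pdf",
    "directoryA/directoryB/deep/c.pdf",
]

everything = select_keys(keys)                             # no prefix: entire bucket
only_b = select_keys(keys, "directoryA/directoryB/")       # recursive by default
top_of_a = select_keys(keys, "directoryA/", recursive=False)
```

Because a prefix is a plain string match, `directoryA/directoryB` without a trailing slash would also match a sibling such as `directoryA/directoryB2/`. Ending the prefix with `/` avoids that.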
Tested with a MinIO bucket containing PDF files in different directories. Verified with no prefix (downloads the entire bucket), a prefix containing another directory, and a prefix with no directory.
In addition, this was tested with a global S3 bucket (when the Server URL is not provided).