Add code for reusable load from files component #290 (#296)
This PR adds the code for the load from files component, related to #290

---------

Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
Co-authored-by: Sharon Grundmann <sharon.grundmann@ml6.eu>
7 people authored Jul 26, 2023
1 parent fde67f7 commit 0abc597
Showing 6 changed files with 691 additions and 0 deletions.
23 changes: 23 additions & 0 deletions components/load_from_files/Dockerfile
@@ -0,0 +1,23 @@
FROM --platform=linux/amd64 python:3.8-slim

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]
36 changes: 36 additions & 0 deletions components/load_from_files/README.md
@@ -0,0 +1,36 @@
# Load from files

## Description
This component is based on the `DaskLoadComponent` and loads a dataset from the files within a directory.
It supports datasets that
- Consist of files in a local data directory
- Contain compressed files such as .zip, gzip, tar or tar.gz within the data directory
- Are hosted at remote locations such as an AWS S3 bucket, Azure Blob Storage or Google Cloud Storage

It returns a dataframe with two columns
- file_filename (the file name, as a string)
- file_content (the corresponding file content, as bytes)
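
To make the output shape concrete, here is a minimal sketch of how files and archive members could be turned into `file_filename` / `file_content` records. It is not the component's actual implementation: `iter_files` and `load_records` are hypothetical helpers, the sketch uses only the Python standard library, and it handles local paths with zip and tar archives only (no remote storage, no Dask).

```python
import tarfile
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_files(directory: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, content) pairs for a local file or directory (hypothetical helper)."""
    root = Path(directory)
    # Accept either a single file (e.g. ./data.zip) or a directory of files.
    paths = [root] if root.is_file() else sorted(p for p in root.rglob("*") if p.is_file())
    for path in paths:
        if zipfile.is_zipfile(path):
            # Expand zip archives into their member files.
            with zipfile.ZipFile(path) as archive:
                for info in archive.infolist():
                    if not info.is_dir():
                        yield Path(info.filename).name, archive.read(info)
        elif tarfile.is_tarfile(path):
            # Covers plain tar as well as tar.gz archives.
            with tarfile.open(path) as archive:
                for member in archive.getmembers():
                    if member.isfile():
                        handle = archive.extractfile(member)
                        if handle is not None:
                            yield Path(member.name).name, handle.read()
        else:
            # Regular file: return its raw bytes.
            yield path.name, path.read_bytes()


def load_records(directory: str) -> list:
    """Collect records keyed by the two column names described above."""
    return [
        {"file_filename": name, "file_content": content}
        for name, content in iter_files(directory)
    ]
```

Calling `load_records("./data.zip")` on a zip archive would return one record per archived file, with the content kept as raw bytes.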

The following example shows how to use this component in a pipeline
on a local directory with zip files

```python
from fondant.pipeline import Pipeline, ComponentOp

my_pipeline = Pipeline(
    pipeline_name="my_pipeline",
    base_path="./",  # TODO: update this
    pipeline_description="This is my pipeline",
)

load_from_files = ComponentOp(
    component_dir="components/load_from_files",
    arguments={
        "directory_uri": "./data.zip",  # change this to your
        # directory_uri, remote or local
    },
    output_partition_size="10MB",
)

my_pipeline.add_op(load_from_files, dependencies=[])
```
16 changes: 16 additions & 0 deletions components/load_from_files/fondant_component.yaml
@@ -0,0 +1,16 @@
name: Load from files
description: Component that loads a dataset from files
image: ghcr.io/ml6team/load_from_files:dev

produces:
  file:
    fields:
      filename:
        type: string
      content:
        type: binary

args:
  directory_uri:
    description: Local or remote path to the directory containing the files
    type: str
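
The column names in the README (`file_filename`, `file_content`) appear to follow a `<subset>_<field>` pattern derived from the `produces` section of this spec. The sketch below illustrates that mapping under this assumption; `PRODUCES` is a hand-written copy of the spec and `column_names` is a hypothetical helper, not part of the Fondant API.

```python
# Hand-written copy of the `produces` section of the spec (assumption: kept in sync manually).
PRODUCES = {
    "file": {
        "fields": {
            "filename": {"type": "string"},
            "content": {"type": "binary"},
        }
    }
}


def column_names(produces: dict) -> list:
    """Build '<subset>_<field>' column names from a produces mapping (hypothetical helper)."""
    return [
        f"{subset}_{field}"
        for subset, spec in produces.items()
        for field in spec["fields"]
    ]


print(column_names(PRODUCES))  # ['file_filename', 'file_content']
```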
3 changes: 3 additions & 0 deletions components/load_from_files/requirements.txt
@@ -0,0 +1,3 @@
fsspec==2023.6.0
pandas==2.0.3
dask==2023.5.0