Create a reusable file loader component #290

Closed
RobbeSneyders opened this issue Jul 12, 2023 · 3 comments

@RobbeSneyders
Member

It would be useful to have a reusable file loader component which can load data from files in a directory.

It should translate the following directory structure:

|- data_directory
   |- filename_1
   |- filename_2
   |- filename_3

To the following Dask dataframe:

| id         | file_content   |
|------------|----------------|
| filename_1 | file 1 content |
| filename_2 | file 2 content |
| filename_3 | file 3 content |
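
The mapping from directory to dataframe could be sketched roughly as follows. This is only an illustration, not the component's actual code: the function names `read_directory`, `to_dask_dataframe` and `demo` are hypothetical, subdirectories are simply skipped, and the Dask conversion assumes `pandas` and `dask` are installed.

```python
from pathlib import Path
import tempfile


def read_directory(path: str) -> list[dict]:
    """Return one record per regular file directly inside `path`."""
    records = []
    for file_path in sorted(Path(path).iterdir()):
        if file_path.is_file():  # subdirectories are skipped in this sketch
            records.append(
                {"id": file_path.name, "file_content": file_path.read_bytes()}
            )
    return records


def to_dask_dataframe(records: list[dict], npartitions: int = 1):
    """Convert the records to a Dask dataframe indexed by `id`."""
    # Imported lazily so that `read_directory` stays usable without Dask.
    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame(records, columns=["id", "file_content"]).set_index("id")
    return dd.from_pandas(pdf, npartitions=npartitions)


def demo() -> list[dict]:
    """Recreate the example directory in a temp folder and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in (1, 2, 3):
            (Path(tmp) / f"filename_{i}").write_bytes(f"file {i} content".encode())
        return read_directory(tmp)
```

Calling `to_dask_dataframe(demo())` would then give a dataframe with the file name as index and the raw bytes in `file_content`. Reading everything eagerly on one machine is a simplification; a real loader would probably want to spread the reads over partitions.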

A minimal component specification would look like:

name: Load from files
description: Component that loads a dataset from files
image: ghcr.io/ml6team/load_from_files:dev

produces:
  file:
    fields:
      content:
        type: binary

args:
  path:
    description: Local or remote path to the directory containing the files
    type: str
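
To make the link between this specification and the dataframe above explicit, here is a small sketch that derives the expected column names from the `produces` section. The `<subset>_<field>` naming is inferred from the `file_content` column in the example table and is an assumption, and `expected_columns` is a hypothetical helper, not part of Fondant.

```python
# The `produces` section of the specification, written as a Python dict
# so that the sketch does not need a YAML parser.
PRODUCES = {"file": {"fields": {"content": {"type": "binary"}}}}


def expected_columns(produces: dict) -> dict[str, str]:
    """Map each assumed `<subset>_<field>` column name to its declared type."""
    columns = {}
    for subset_name, subset in produces.items():
        for field_name, field in subset["fields"].items():
            columns[f"{subset_name}_{field_name}"] = field["type"]
    return columns
```

For the specification above, `expected_columns(PRODUCES)` yields a single `file_content` column of type `binary`, next to the `id` index.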

Some notes:

  • We can make the component "generic" as a next step (meaning it can handle dynamic component specifications)
  • Remote paths should work out of the box if the component installs fondant[aws,azure,gcp].
  • We can add additional functionality later (e.g. handling subdirectories, taking additional arguments, ...)
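
As a rough illustration of the remote path note, a loader built on `fsspec` could treat local and remote directories the same way. Everything below is an assumption for illustration: the scheme-to-extra mapping is a guess at which extra would pull in which `fsspec` backend, and `required_extra` and `read_any_directory` are hypothetical names.

```python
from urllib.parse import urlparse

# Assumed mapping from URL scheme to the extra that would install the
# matching fsspec backend (s3fs, gcsfs, adlfs). Not taken from Fondant.
SCHEME_TO_EXTRA = {
    "s3": "aws",
    "gs": "gcp",
    "gcs": "gcp",
    "abfs": "azure",
    "az": "azure",
}


def required_extra(path: str):
    """Return the extra a path would need, or None for local paths."""
    return SCHEME_TO_EXTRA.get(urlparse(path).scheme)


def read_any_directory(path: str) -> dict:
    """Read every file directly under a local or remote directory."""
    # Imported lazily: fsspec is only needed when this function is called.
    from fsspec.core import url_to_fs

    fs, root = url_to_fs(path)  # picks the filesystem from the URL scheme
    contents = {}
    for entry in fs.ls(root, detail=False):
        if fs.isfile(entry):  # subdirectories are skipped in this sketch
            contents[entry.rsplit("/", 1)[-1]] = fs.cat_file(entry)
    return contents
```

With the right backend installed, the same `read_any_directory` call would work for a local folder and for a bucket path, with credentials picked up from the environment by the backend.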
@RobbeSneyders RobbeSneyders converted this from a draft issue Jul 12, 2023
@satishjasthi
Contributor

@RobbeSneyders I'll take up this issue

@satishjasthi
Contributor

@RobbeSneyders, what options are we considering for downloading files from a remote path?
Should it support downloading files from cloud platforms like GCP, AWS and Azure? If so, should I add a component that handles the auth part, fetches the directory data and then parses it?
Or should the current version of the code just download the remote directory as a zip or tar archive, unpack it and then parse it?

@satishjasthi
Contributor

satishjasthi commented Jul 17, 2023

@RobbeSneyders I have raised a PR for this issue.

GeorgesLorre added a commit that referenced this issue Jul 26, 2023
This PR contains code for load from files component related to #290

---------

Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
Co-authored-by: Sharon Grundmann <sharon.grundmann@ml6.eu>
@github-project-automation github-project-automation bot moved this from Ready for development to Done in Fondant development Aug 28, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
This PR contains code for load from files component related to #290
