-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce repartitioning #309
Merged
Merged
Changes from all commits
Commits
Show all changes
37 commits
Select commit
Hold shift + click to select a range
2683b93
optimize image download component
PhilippeMoussalli 78a60c2
Merge branch 'main' into improve-imagedownload
PhilippeMoussalli 1b7521c
add dask diagnostics to fondant
PhilippeMoussalli 1c53aa1
add repartitioning strategy to fondant
PhilippeMoussalli 897f5f8
remove changes from component
PhilippeMoussalli 0c48233
Merge branch 'main' into improve-imagedownload
PhilippeMoussalli 2edcd91
change save path diagnostics
PhilippeMoussalli d63eed7
add client to dataIO
PhilippeMoussalli 51cadfe
remove progressbar
PhilippeMoussalli 678b3e1
move client
PhilippeMoussalli 44d5493
move dataframe visualization
PhilippeMoussalli 592f3ae
add bokeh to dependencies
PhilippeMoussalli 5c56ddf
add bokeh to dependencies
PhilippeMoussalli 8e575c7
add bokeh to dependencies
PhilippeMoussalli 0d1e76b
move client to execute
PhilippeMoussalli 43132f5
silence logs
PhilippeMoussalli 7f38c31
remove log supression
PhilippeMoussalli d764ad4
add log silencing to cluster
PhilippeMoussalli d50cc6b
add worker log filter
PhilippeMoussalli 9e1eb31
supress logs
PhilippeMoussalli f344ea8
supress logs
PhilippeMoussalli 08b74cc
add output_partition_size as an argument and revert diagnosis
PhilippeMoussalli df19ad3
remove output_partition from user arguments
PhilippeMoussalli 09ef925
enable more control over partitioning
PhilippeMoussalli 6b1e63b
fix test
PhilippeMoussalli 57a8af4
Merge branch 'main' into introduce-repartioning
PhilippeMoussalli 8715369
correct load partitioning
PhilippeMoussalli c15e24c
address PR feedback
PhilippeMoussalli 9ff1e68
Adjust argument passing
PhilippeMoussalli 4c8d482
debug
PhilippeMoussalli 5bf3a50
Further debugging
PhilippeMoussalli 18e1f79
Further debugging
PhilippeMoussalli b2bdecc
Revert debug and improve logging
PhilippeMoussalli 951a786
Merge branch 'main' into introduce-repartioning
PhilippeMoussalli 3394fcf
fix indentation
PhilippeMoussalli 02561ab
fix tests
PhilippeMoussalli 99746dc
add docs
PhilippeMoussalli File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -41,6 +41,13 @@ def transform(self, dataframe: dd.DataFrame) -> dd.DataFrame: | |
raise NotImplementedError | ||
|
||
|
||
class DaskWriteComponent(BaseComponent): | ||
"""Component that accepts a Dask DataFrame and writes its contents.""" | ||
|
||
def write(self, dataframe: dd.DataFrame) -> None: | ||
raise NotImplementedError | ||
|
||
|
||
Comment on lines
+44
to
+50
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was just moved in the file? I think both orders can be logical (Dask -> Pandas) or (Read -> Transform -> Write) |
||
class PandasTransformComponent(BaseComponent): | ||
"""Component that transforms the incoming dataset partition per partition as a pandas | ||
DataFrame. | ||
|
@@ -57,12 +64,5 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame: | |
raise NotImplementedError | ||
|
||
|
||
class DaskWriteComponent(BaseComponent): | ||
"""Component that accepts a Dask DataFrame and writes its contents.""" | ||
|
||
def write(self, dataframe: dd.DataFrame) -> None: | ||
raise NotImplementedError | ||
|
||
|
||
Component = t.TypeVar("Component", bound=BaseComponent) | ||
"""Component type which can represents any of the subclasses of BaseComponent""" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason the input partitions are specified in terms of rows and the output partitions in terms of size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the output partition size ensures that the written partitions are small and can be easily loaded by the next component.
The input is defined by rows to allow you to easily iterate on it if you run into out of memory issues, for example if you run into OOM issues when retrieving 100 images from URLs. It's more intuitive to set that number to lower (10 rows for example) then to change the size of the input partitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya I do feel using num rows is way more intuitive than size