Introduce repartitioning #309
Conversation
src/fondant/data_io.py
Outdated
        f"available number of workers is {n_workers}.",
    )
    if n_partitions < n_workers:
        dataframe = dataframe.repartition(npartitions=n_partitions)
Hmm, we first get `n_partitions = dataframe.npartitions` and then we repartition using the same number? Can you explain?
Whoops, it should be `dataframe = dataframe.repartition(npartitions=n_workers)`. Good catch!
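To make the intent of the fix concrete, here is a minimal, hypothetical sketch of the load-side rule (the helper name and standalone form are ours, not Fondant's actual code): when a loaded dataframe has fewer partitions than there are Dask workers, some workers would sit idle, so the partition count is scaled up to the worker count; otherwise it is left alone.

```python
def target_npartitions(n_partitions: int, n_workers: int) -> int:
    """Return the partition count a loaded dataframe should end up with.

    Fewer partitions than workers leaves workers idle, so scale up to
    n_workers; otherwise keep the current partition count unchanged.
    """
    if n_partitions < n_workers:
        return n_workers
    return n_partitions


# Hypothetical numbers: 2 partitions on 8 workers -> repartition to 8.
print(target_npartitions(2, 8))   # 8
print(target_npartitions(16, 8))  # 16
```

The real code would then call `dataframe.repartition(npartitions=...)` with the returned value.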
Thanks @PhilippeMoussalli! Some comments.
@@ -12,7 +12,7 @@
 RUN pip3 install --no-cache-dir -r requirements.txt

 # Install Fondant
 # This is split from other requirements to leverage caching
-ARG FONDANT_VERSION=main
+ARG FONDANT_VERSION=09ef9254fef5d382d7d60d97b66fa2ac1e0df7e0
Should be reverted before merging.
# pipeline.add_op(laion_retrieval_op, dependencies=generate_prompts_op)
# pipeline.add_op(download_images_op, dependencies=laion_retrieval_op)
# pipeline.add_op(caption_images_op, dependencies=download_images_op)
# pipeline.add_op(segment_images_op, dependencies=caption_images_op)
# pipeline.add_op(write_to_hub_controlnet, dependencies=segment_images_op)
Should be reverted before merging.
class DaskWriteComponent(BaseComponent):
    """Component that accepts a Dask DataFrame and writes its contents."""

    def write(self, dataframe: dd.DataFrame) -> None:
        raise NotImplementedError
This was just moved in the file? I think both orders can be logical: (Dask -> Pandas) or (Read -> Transform -> Write).
src/fondant/component_spec.py
Outdated
{
    "name": "output_partition_size",
    "description": "The size of the output partition size, defaults"
    " to 250MB. Set to `disable` to disable the automatic partitioning",
    "type": "String",
    "default": "250MB",
},
I don't think it's the output partitioning we need to make dynamic, as this will only impact the following component. I think the user should be able to overwrite the input partitioning, so the partitions can be made small at the start and still fit in memory when the data grows.
And I think it would be ideal if the user could specify it in rows instead of MB, but not sure if that's possible.
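Specifying the partition size in rows is possible in principle: Dask has no `repartition(rows=...)` keyword, but a partition count can be derived from the total row count. A hedged sketch under that assumption (the helper name is ours, and it assumes rows end up roughly evenly distributed):

```python
import math


def npartitions_for_rows(total_rows: int, rows_per_partition: int) -> int:
    """Derive a Dask partition count so that each partition holds at
    most `rows_per_partition` rows, assuming an even distribution."""
    return max(1, math.ceil(total_rows / rows_per_partition))


# With Dask this could then be applied as (not executed here):
#   n = npartitions_for_rows(len(dataframe), input_partition_rows)
#   dataframe = dataframe.repartition(npartitions=n)
print(npartitions_for_rows(1000, 100))  # 10
print(npartitions_for_rows(1001, 100))  # 11
```

Note that `len(dataframe)` triggers a computation in Dask, so this trades one extra pass over the data for row-based control.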
src/fondant/data_io.py
Outdated
def __init__(self, *, manifest: Manifest, component_spec: ComponentSpec):
    super().__init__(manifest=manifest, component_spec=component_spec)
No need to override this if you're just calling super.
src/fondant/data_io.py
Outdated
    f"The number of partitions of the input dataframe is {n_partitions}. The "
    f"available number of workers is {n_workers}.",
)
if n_partitions < n_workers:
This would then become:
-if n_partitions < n_workers:
+if input_partition_size:
+    dataframe.repartition(partition_size=input_partition_size)
+elif n_partitions < n_workers:
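The suggested control flow can be sketched as a small decision helper (pure logic with names we chose for illustration; the real code would pass the returned kwargs to `dataframe.repartition`): a user-supplied input partition size takes precedence, and only in its absence does the automatic worker-count rule apply.

```python
from typing import Optional


def load_repartition_kwargs(
    n_partitions: int,
    n_workers: int,
    input_partition_size: Optional[str] = None,
) -> Optional[dict]:
    """Return kwargs for dataframe.repartition(), or None to skip.

    A user-supplied size overrides the automatic behavior; otherwise
    fall back to matching the worker count when partitions are scarce.
    """
    if input_partition_size:
        return {"partition_size": input_partition_size}
    if n_partitions < n_workers:
        return {"npartitions": n_workers}
    return None


print(load_repartition_kwargs(4, 8, "250MB"))  # {'partition_size': '250MB'}
print(load_repartition_kwargs(4, 8))           # {'npartitions': 8}
print(load_repartition_kwargs(16, 8))          # None
```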
src/fondant/data_io.py
Outdated
@@ -159,6 +210,8 @@ def _write_subset(

    schema = {field.name: field.type.value for field in subset_spec.fields.values()}

    dataframe = self.partition_written_dataframe(dataframe)
Why is this done on a subset level here? I would do it once on the dataframe level.
src/fondant/executor.py
Outdated
@@ -39,13 +39,15 @@ def __init__(
    input_manifest_path: t.Union[str, Path],
    output_manifest_path: t.Union[str, Path],
    metadata: t.Dict[str, t.Any],
-   user_arguments: t.Dict[str, Argument],
+   user_arguments: t.Dict[str, t.Any],
+   output_partition_size: t.Optional[str] = "250MB",
Shouldn't this be added to the parser as well, so it's extracted by `from_args`?
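Registering the argument on the parser so `from_args` picks it up might look roughly like this (a sketch using plain `argparse`; the Executor's actual parser setup may differ):

```python
import argparse

# Hypothetical parser setup mirroring the constructor's keyword argument.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--output_partition_size",
    type=str,
    default="250MB",
    help="Size of the written partitions, or 'disable' to turn "
    "automatic partitioning off.",
)

args = parser.parse_args(["--output_partition_size", "10MB"])
print(args.output_partition_size)  # 10MB
```

With the flag registered, a `from_args`-style constructor can simply read `args.output_partition_size` instead of requiring a separate code path.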
src/fondant/pipeline.py
Outdated
    parameters.
    """

def _validate_partition_size_arg(file_size):
Can we add this as a `type` when registering it on the parser in the `Executor` class?
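Passing the validator as an argparse `type` could look roughly like this (the regex, function name, and error message are our assumptions, not the PR's actual implementation):

```python
import argparse
import re


def partition_size(value: str) -> str:
    """argparse `type` callable: accept 'disable' or a size like '250MB'."""
    if value == "disable" or re.fullmatch(r"\d+\s*[KMGT]?B", value, re.IGNORECASE):
        return value
    raise argparse.ArgumentTypeError(
        f"Invalid partition size '{value}'; expected e.g. '250MB' or 'disable'.",
    )


parser = argparse.ArgumentParser()
parser.add_argument("--output_partition_size", type=partition_size, default="250MB")
print(parser.parse_args(["--output_partition_size", "disable"]).output_partition_size)
```

The advantage over a standalone `_validate_partition_size_arg` is that argparse reports invalid values as a normal usage error at parse time, before any component code runs.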
Force-pushed from 8c5a51b to 83f0d3a
Force-pushed from 83f0d3a to 99746dc
input_partition_rows=100,
output_partition_size="10MB",
Any reason the input partitions are specified in terms of rows and the output partitions in terms of size?
The output partition size ensures that the written partitions are small and can be easily loaded by the next component.
The input is defined in rows to let you iterate easily when you run into out-of-memory issues, for example when retrieving 100 images from URLs. It's more intuitive to lower that number (to 10 rows, for example) than to change the size of the input partitions.
Yeah, I do feel using the number of rows is much more intuitive than size.
Nice work! Looking forward to seeing this in action.
PR that introduces the partitioning strategy discussed in #288:

1) The automatic behavior is as follows for all component types (Dask, pandas):
   * The written dataframe is repartitioned to 250MB.
   * The loaded dataframe is repartitioned depending on the current number of partitions and workers.
2) The behavior above can be overwritten by the end user in case they want to implement their own custom logic; this is done on the ComponentOp level via additional flag parameters that can be passed. See the docs added with this PR for more details.

I will handle adding the diagnostic tools and optimizing the downloader component in a separate PR.
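The automatic write behavior and its `disable` escape hatch can be summarized as a pure-logic sketch (names are ours; the real code would forward the returned kwargs to `dataframe.repartition` before writing):

```python
from typing import Optional


def write_repartition_kwargs(output_partition_size: str = "250MB") -> Optional[dict]:
    """Kwargs for dataframe.repartition() before writing, or None when
    the user disables automatic output partitioning."""
    if output_partition_size == "disable":
        return None
    return {"partition_size": output_partition_size}


print(write_repartition_kwargs())           # {'partition_size': '250MB'}
print(write_repartition_kwargs("disable"))  # None
```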