[WIP] Optimize image download component #288
Conversation
Thanks @PhilippeMoussalli!
Some questions that can hopefully help me build a better understanding 🙂
Did you run any benchmarks or use one of the profiling options provided by Dask to validate the changes?
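For reference, a minimal sketch of how such a profile could be captured with Dask's local diagnostics (requires bokeh and psutil; the toy dataframe, the dt sampling interval, and the output filename are placeholders, not part of this PR):

import dask.dataframe as dd
import pandas as pd
from dask.diagnostics import Profiler, ResourceProfiler, visualize

# Toy dataframe standing in for the component's Dask dataframe.
dataframe = dd.from_pandas(pd.DataFrame({"id": range(1000)}), npartitions=4)

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    dataframe.drop_duplicates().compute()  # trigger execution under the profilers

# Writes an interactive HTML report with task and CPU/memory timelines.
visualize([prof, rprof], filename="profile.html", show=False)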
@@ -123,8 +124,9 @@ def transform(
        max_aspect_ratio=max_aspect_ratio,
    )

-   # Remove duplicates from laion retrieval
+   # Remove duplicates from laion retrieval (global)
    dataframe = dataframe.drop_duplicates()
Shouldn't we do this in the laion retrieval components instead?
Ideally yes, but I think the issue there is that we are working on a per-partition basis since it's a pandas component, so we are unable to execute drop_duplicates on the whole dataset as far as I know.
Makes sense, but I don't think drop_duplicates is very performant unless we sort on the index first, which is probably something we should do after retrieving from LAION, since we're changing the ids.
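A rough sketch of that idea, assuming an "id" column to sort on (the column name and the helper are illustrative, not the component's actual schema):

import dask.dataframe as dd

def deduplicate_on_sorted_id(dataframe: dd.DataFrame) -> dd.DataFrame:
    # set_index shuffles the data globally so that equal ids land in the
    # same, sorted partition (expensive, but done once).
    dataframe = dataframe.set_index("id", drop=False)
    # With duplicates co-located, they can be dropped per partition
    # without another shuffle.
    return dataframe.map_partitions(
        lambda df: df[~df.index.duplicated(keep="first")]
    )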
Would that sorting be global then, requiring repartitioning so that the duplicate ids are all clustered in one partition? Or is it per-partition?
Global indeed, so they are clustered per partition. I think the drop_duplicates method requires a reshuffle step otherwise anyway. Have you generated a task graph of this component? That might actually be something interesting to log for each component.
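For reference, a task graph can be rendered straight from the lazy dataframe (a small sketch; requires graphviz, and the toy dataframe and filename are just placeholders):

import dask.dataframe as dd
import pandas as pd

# Toy dataframe just to have a graph to render.
dataframe = dd.from_pandas(pd.DataFrame({"id": [1, 2, 2, 3]}), npartitions=2)
dataframe = dataframe.drop_duplicates()

# Writes the task graph to an image file; this could be logged per component.
dataframe.visualize(filename="task_graph.png", optimize_graph=True)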
I was trying out the download images component for Datacomp and I had to remove drop_duplicates to get it to work.
Why is that? Can you elaborate?
For me it resulted in TypeError: unhashable type: 'numpy.ndarray'
That might be because you have a numpy array in your dataframe; arrays are considered mutable and not hashable. drop_duplicates() probably expects the elements to be hashable, and by default it operates on all the columns. You can specify that you only want to deduplicate on an id column by using the subset argument, as mentioned here.
I will also make the changes in this PR.
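Something along these lines, as a sketch (the "id" column name and the toy frame are assumptions):

import dask.dataframe as dd
import numpy as np
import pandas as pd

# Toy frame mimicking the situation: an unhashable array column next to an id.
pdf = pd.DataFrame({"id": ["a", "a", "b"], "image": [np.zeros(2)] * 3})
dataframe = dd.from_pandas(pdf, npartitions=2)

# Deduplicating on the id column only avoids hashing the unhashable array column.
dataframe = dataframe.drop_duplicates(subset=["id"])
print(dataframe.compute())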
dataframe = dataframe.drop_duplicates()
dataframe = dataframe.repartition(npartitions=os.cpu_count())
What if our data is larger than RAM x os.cpu_count()?
Maybe we should partition here based on size as well. If the size is sufficiently large, npartitions will be larger than os.cpu_count(). If the size is small, the optimization is less important.
Good point. I still think we should somehow consider both, especially if the operation per row takes a long time but the dataset is inherently small because it only consists of text data.
If we can consider both, that would be ideal, but I'm not sure if we can.
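One possible way to take both into account, as a sketch (the 250MB target, the helper name, and the use of memory_usage as a size estimate are assumptions, not an agreed design):

import math
import os

import dask.dataframe as dd

TARGET_PARTITION_BYTES = 250 * 1024 * 1024  # assumed 250MB target per partition

def choose_npartitions(dataframe: dd.DataFrame) -> int:
    # Estimated in-memory size of the whole dataframe (triggers a computation).
    total_bytes = int(dataframe.memory_usage(deep=True).sum().compute())
    size_based = math.ceil(total_bytes / TARGET_PARTITION_BYTES)
    # Never go below the number of available cores, so small datasets still parallelize.
    return max(size_based, os.cpu_count() or 1)

# Usage: dataframe = dataframe.repartition(npartitions=choose_npartitions(dataframe))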
I have an updated workflow that I think can mitigate the issue of having large data (also added it to the doc). That way we only repartition if we need to, but the repartitions should still be smaller than RAM x os.cpu_count().
There is still the issue of single partitions being larger than RAM; I proposed a way in which this can be tackled but would need to investigate more. Not sure if there is a clear-cut solution for that one.
Can we actually know if n_partitions < n_workers at graph construction time?
We can make the first repartition based on memory dynamic, where we use 250MB by default, but the user can define a lower number if they know that their component explodes the memory usage per row.
"Can we actually know if n_partitions < n_workers at graph construction time?"
Yes, maybe it's not clear, but this block would then be part of the transform component; I've updated the plot.
import os

import dask.dataframe as dd

def read_and_repartition(dataset_path: str) -> dd.DataFrame:
    df = dd.read_parquet(dataset_path)  # path of the previously written dataset
    n_partitions = df.npartitions
    n_workers = os.cpu_count()
    if n_partitions < n_workers:
        # make sure every worker gets at least one partition to process
        df = df.repartition(npartitions=n_workers)
    return df  # dataframe returned to the user
"We can make the first repartition based on memory dynamic, where we use 250MB by default, but the user can define a lower number if they know that their component explodes the memory usage per row."
I think this approach could work, but it would involve quite a bit of trial and error, since you would probably only adjust it after your pipeline fails.
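As a sketch of what that dynamic, user-overridable variant could look like (the parameter name and default value are assumptions about the eventual knob):

import dask.dataframe as dd

def repartition_written_dataframe(
    dataframe: dd.DataFrame,
    partition_size: str = "250MB",  # default; users could lower this if their rows explode in memory
) -> dd.DataFrame:
    # Dask splits or merges partitions so each one is roughly partition_size.
    return dataframe.repartition(partition_size=partition_size)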
Ok, the split between the load and transform component was not clear to me before. I thought this was all happening in the transform component. I think this makes sense indeed!
Can you update this PR to reflect this flow?
@@ -145,7 +147,7 @@ def transform(

    # Remove images that could not be fetched
    dataframe = dataframe.dropna()

    dataframe = dataframe.repartition(partition_size="250MB")
What if the memory of a single partition after downloading the images is larger than RAM? Will it not lead to memory issues before this repartitioning is executed?
My understanding is that the repartitioning should happen on the fly as the dataset is being created and not at the end, but I might be wrong. I would need to check this.
I think you might be right: we still have all the partitions, and they only get repartitioned at the end, so they might be larger than RAM before that point. The highlighted part is the partition that might be troublesome. Thanks to @shayorshay for helping with the interpretation.
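One way to check whether individual partitions actually blow up is to measure per-partition memory usage, for example (a sketch; dataframe stands for the component's Dask dataframe as in the diff above):

# One value per partition, in bytes; large outliers indicate partitions
# that may not fit in RAM on a single worker.
partition_bytes = dataframe.map_partitions(
    lambda df: df.memory_usage(deep=True).sum()
).compute()
print("largest partition:", partition_bytes.max(), "bytes")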
The only benchmarking I did so far was with Docker Desktop: I checked that more of the CPU cores were actively used. I didn't introduce the Dask profiling yet.
Closing this PR in favor of #309
Closed in favor of #309
PR that introduces the partitioning strategy discussed in #288:
1) The automatic behavior is as follows for all component types (Dask, pandas):
* The written dataframe is repartitioned to 250MB.
* The loaded dataframe is repartitioned depending on the current number of partitions and workers.
2) The behavior above can be overwritten by the end user in case they want to implement their own custom logic; this is done on the ComponentOp level as additional flag parameters that can be passed. See the docs added with this PR for more details.
I will handle adding the diagnostic tools and optimizing the downloader component in a separate PR.
PR that optimizes the image download component as proposed here:
https://docs.google.com/document/d/1Nv9gLe1uiD9mFt62LLJ1Z13cyIihNokOtY20jk9GEpY/edit#heading=h.9ucv3iu9w7zy
This implementation currently only optimizes this component, but the goal is to move the optimization outside the components and into Fondant. Once this approach is validated for different scenarios, we can proceed to do so.