Updates to examples #77

dberenbaum · 2024-07-17T17:00:45Z

Updating or dropping old examples (WIP). Let me know if you have a different idea or prefer to drop any particular example. Just trying to do the simplest updates possible for now.

In a follow-up PR, I would like to:

- reorganize examples into directories
- fix imports to not use lib
- either add tests, docstrings, etc. for lib code or move it out of lib and directly into examples
- move all datasets into gs://datachain-demo

cloudflare-workers-and-pages · 2024-07-17T17:02:49Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`c33dfcd`
Status:	✅ Deploy successful!
Preview URL:	https://91e41a04.datachain-documentation.pages.dev
Branch Preview URL:	https://examples-updates.datachain-documentation.pages.dev

dberenbaum · 2024-07-17T17:05:06Z

src/datachain/lib/udf.py

@@ -198,7 +198,7 @@ def __call__(self, *rows, cache, download_cb):
                        flat.extend(flatten(obj))
                    else:
                        flat.append(obj)
-                res.append(flat)
+                res.append(tuple(flat))


There is a check below that reshapes the output if a tuple was returned, so we need to keep it as a tuple rather than a list
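
A minimal sketch of why the container type matters here. The flattening loop mirrors the diff above; the `isinstance` conditions and the `reshape` function are hypothetical stand-ins for the real check in udf.py, which is not shown in this thread.

```python
def flatten(obj):
    """Recursively yield the leaf values of nested lists/tuples."""
    for item in obj:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item


def flatten_rows(rows):
    """Flatten each row and keep it as a tuple, as in the change above."""
    res = []
    for row in rows:
        flat = []
        for obj in row:
            if isinstance(obj, (list, tuple)):  # assumed condition
                flat.extend(flatten(obj))
            else:
                flat.append(obj)
        res.append(tuple(flat))  # a list here would hide the row from reshape()
    return res


def reshape(row):
    """Hypothetical downstream check: only tuples are split into columns."""
    if isinstance(row, tuple):
        return list(row)  # one value per output column
    return [row]  # anything else is treated as a single value
```

With this stand-in, `reshape((1, 2))` spreads the values across two columns, while `reshape([1, 2])` wraps the whole list as one value, which is the kind of mismatch the comment describes.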

codecov · 2024-07-17T17:09:35Z

The author of this PR, dberenbaum, is not an activated member of this organization on Codecov.
Please activate this user on Codecov to display this PR comment.
Coverage data is still being uploaded to Codecov.io for purposes of overall coverage calculations.
Please don't hesitate to email us at support@codecov.io with any questions.

dberenbaum · 2024-07-18T16:54:52Z

This one should be ready for review. I tried to update to the new API while keeping the changes as light as I could. If you think this isn't worth it and want to drop any/all of these, I'm fine with that.

shcheklein · 2024-07-18T19:44:38Z

I think this should be good to go. Any blockers, @dberenbaum?

mattseddon · 2024-07-19T02:34:47Z

examples/blip2_image_desc_lib.py

@@ -1,8 +1,8 @@
 # pip install torch
 import torch

-from datachain.lib.hf_image_to_text import BLIP2describe
-from datachain.query import C, DatasetQuery
+from datachain.lib.dc import C, DataChain


[N] Use top-level imports (from datachain) where possible

mattseddon · 2024-07-19T02:35:39Z

examples/blip2_image_desc_lib.py

@@ -15,21 +15,19 @@


 if __name__ == "__main__":


[Q] Do we still need if __name__ == "__main__"? I thought we got rid of this requirement

mattseddon · 2024-07-19T02:39:57Z

examples/udfs/stateful_similarity.py

@@ -1,79 +0,0 @@
-"""


[C] Could use the example from https://github.com/iterative/dvcx/pull/1640 but name it similarity_search or similar as it is not stateful

I dropped it because we have other examples that use similarity search and it seems out of place in this directory which is otherwise about introducing basic udf syntax.

mattseddon · 2024-07-19T02:45:23Z

IMO it's important to have all examples working to move forward. You can debate the minutiae after the fact.

PR is related to

mattseddon · 2024-07-19T03:17:27Z

examples/hf_pipeline.py

@@ -1,8 +1,9 @@
 # pip install torch
+# NOTE: also need to install ffmpeg binary


[I] I got an error because I don't have scipy installed (no longer a dependency)

mattseddon · 2024-07-19T03:35:12Z

[Q] Can any lib files be dropped as per #66?

mattseddon · 2024-07-19T04:05:32Z

examples/iptc_exif_xmp_lib.py

@@ -1,15 +1,18 @@
-from datachain.lib.iptc_exif_xmp import GetMetadata
-from datachain.query import C, DatasetQuery
+from datachain.lib.dc import C, DataChain


[F] Warns about defusedxml dependency:

warnings.warn("XMP data cannot be read without defusedxml dependency")

mattseddon · 2024-07-19T05:11:08Z

examples/iptc_exif_xmp_lib.py

-        .filter(C.name.glob("*.jpg"))
+    (
+        DataChain.from_storage(source, type="image")
+        .filter(C("name").glob("*.jpg"))
        .limit(10000)


[Q] Limit to 100?

also end up with output like this:

Processed: 85022 rows [00:03, 25291.45 rows/s]
Download: 28.7MB [02:03, 244kB/s]
Processed: 100 rows [02:03, 1.23s/ rows]
                   file xmp exif iptc error
                 source
0   gs://dvcx-datalakes  {}   {}   {}
1   gs://dvcx-datalakes  {}   {}   {}
2   gs://dvcx-datalakes  {}   {}   {}
3   gs://dvcx-datalakes  {}   {}   {}
4   gs://dvcx-datalakes  {}   {}   {}
5   gs://dvcx-datalakes  {}   {}   {}
6   gs://dvcx-datalakes  {}   {}   {}
7   gs://dvcx-datalakes  {}   {}   {}
8   gs://dvcx-datalakes  {}   {}   {}
9   gs://dvcx-datalakes  {}   {}   {}
10  gs://dvcx-datalakes  {}   {}   {}
11  gs://dvcx-datalakes  {}   {}   {}
12  gs://dvcx-datalakes  {}   {}   {}
13  gs://dvcx-datalakes  {}   {}   {}
14  gs://dvcx-datalakes  {}   {}   {}
15  gs://dvcx-datalakes  {}   {}   {}
16  gs://dvcx-datalakes  {}   {}   {}
17  gs://dvcx-datalakes  {}   {}   {}
18  gs://dvcx-datalakes  {}   {}   {}
19  gs://dvcx-datalakes  {}   {}   {}
[Limited by 20 rows]

Maybe bin this one?

Yeah, unfortunately it's slow and most of the images have no metadata. Added parallelism and filtered for the non-empty rows in the latest update.

mattseddon · 2024-07-19T05:37:01Z

examples/llava2_image_desc_lib.py

@@ -1,8 +1,8 @@
 # pip install torch


[F] I get ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`

If we wanted to make it easy to run all the examples we could add an [examples] optional dependencies section to the pyproject.toml and ask users to install those in the README or at the top of each example (would be the kitchen sink install).
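
Until such an extras section exists, a small pre-flight check at the top of an example could at least report what is missing. This is only a sketch: the package list is taken from the errors reported in this thread (scipy, defusedxml, accelerate), and the helper name `missing_packages` is hypothetical, not part of datachain.

```python
from importlib.util import find_spec

# Packages reviewers reported missing while running the examples in this thread.
EXAMPLE_EXTRAS = ("scipy", "defusedxml", "accelerate")


def missing_packages(names=EXAMPLE_EXTRAS):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing optional packages: pip install " + " ".join(missing))
```

The check uses `find_spec` rather than importing each package, so it stays fast and has no side effects from heavy imports such as torch-adjacent libraries.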

dmpetrov

Amazing PR, thank you for doing this!

dmpetrov · 2024-07-19T05:47:51Z

examples/blip2_image_desc_lib.py

@@ -1,8 +1,8 @@
 # pip install torch
 import torch

-from datachain.lib.hf_image_to_text import BLIP2describe
-from datachain.query import C, DatasetQuery
+from datachain.lib.dc import C, DataChain


Yes! Let's use top-level imports in all examples.

dmpetrov · 2024-07-19T05:52:22Z

examples/blip2_image_desc_lib.py

@@ -15,21 +15,19 @@


 if __name__ == "__main__":


We need to be able to run the script as is, without the datachain query command. So the main guard makes sense, or you can just skip it, but please do not use the query command.
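
A minimal sketch of the guard pattern under discussion, independent of DataChain itself. The function `run_example` and its output are hypothetical placeholders for an example's pipeline.

```python
def run_example(limit=5):
    """Placeholder for the example's pipeline; returns the lines it would print."""
    return [f"row {i}" for i in range(limit)]


# Runs only when the file is executed directly (python script.py),
# not when the module is imported by another script or a test.
if __name__ == "__main__":
    for line in run_example():
        print(line)
```

Keeping the work in a function and calling it from the guard lets the example run as a plain script while staying importable without side effects.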

dmpetrov · 2024-07-19T05:55:14Z

examples/blip2_image_desc_lib.py

@@ -1,8 +1,8 @@
 # pip install torch
 import torch

-from datachain.lib.hf_image_to_text import BLIP2describe
-from datachain.query import C, DatasetQuery
+from datachain.lib.dc import C, DataChain


Also, I'd suggest importing Column instead of C. It looks more readable.

dmpetrov · 2024-07-19T05:58:48Z

src/datachain/lib/unstructured.py

-        text = "\n\n".join([str(el) for el in elements])
-        df = convert_to_dataframe(elements)
-        return (df.to_json(), title, text, "")
+def partition_object(file):


I'd suggest using this code inline in the examples. The module is too small to be worth keeping in lib.

We should do this more with other modules. It's great that our UDFs are expressive enough to minimize the amount of code and show users how it actually works. No need for lib 🙂
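
As a rough illustration of how little code the inlined version needs, the text-joining step from the removed lines fits in a single function. The name `elements_to_text` is hypothetical, and the partitioning step that produces the elements in the real example is left out.

```python
def elements_to_text(elements):
    """Join parsed document elements into one string, separated by blank lines."""
    return "\n\n".join(str(el) for el in elements)
```

Any iterable works as input as long as its items have a meaningful `str()` form.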

dmpetrov · 2024-07-19T05:59:22Z

src/datachain/lib/udf.py

@@ -198,7 +198,7 @@ def __call__(self, *rows, cache, download_cb):
                        flat.extend(flatten(obj))
                    else:
                        flat.append(obj)
-                res.append(flat)
+                res.append(tuple(flat))


sounds good! thank you for noticing this.

dmpetrov · 2024-07-19T06:00:56Z

src/datachain/lib/hf_pipeline.py

-        self.kwargs = kwargs
-
-    def raw_processor(self, obj):
+    def process(self, file):


The same as unstructured comment below - it's better to use this code inline and remove this file from lib.

PS: It's ok to keep as is to move faster.

dberenbaum · 2024-07-19T13:31:10Z

Thanks for the comments, @mattseddon and @dmpetrov! I'm leaving some of the comments to be addressed in a follow-up (see the checklist in the description). I wanted to keep this first PR manageable and then follow up, because it would become very hard to track all the changes in a single PR.

> IMO it's important to have all examples working to move forward.

👍 I think we should make it p1 after release.

Dave Berenbaum added 11 commits July 15, 2024 07:36

update hf image to text

7002cc2

Merge branch 'main' into examples_updates

9b9226c

update hf pipeline

637eeff

update iptc_exif_xmp

e54429c

Merge branch 'main' into examples_updates

569d15d

update gpt4 vision example

86478db

update openimage-detect example

cde6ab9

update unstructured example

05ff646

update wds examples

b6927ec

merge main

a9e796a

Merge branch 'main' into examples_updates

e1a6111

dberenbaum requested review from dmpetrov and volkfox July 17, 2024 17:00

revert unneeded changes

d3f4421

Dave Berenbaum added 4 commits July 17, 2024 14:17

drop files

4be98e6

Merge branch 'main' into examples_updates

a34b1f7

update or drop rest of examples

bc92f77

drop chain.results() from examples

e56d6ae

dberenbaum marked this pull request as ready for review July 18, 2024 16:53

Dave Berenbaum added 2 commits July 18, 2024 12:55

Merge branch 'main' into examples_updates

814504b

ruff fix

3c11aaf

mattseddon approved these changes Jul 19, 2024

mattseddon reviewed Jul 19, 2024

mattseddon mentioned this pull request Jul 19, 2024

rm outdated file from lib and examples #66

Closed

mattseddon reviewed Jul 19, 2024

dmpetrov approved these changes Jul 19, 2024

Dave Berenbaum added 3 commits July 19, 2024 09:40

update examples/iptc_exif_xmp_lib.py

d48f4ef

add accelerate dep

6f58496

Merge branch 'main' into examples_updates

c33dfcd

dberenbaum force-pushed the examples_updates branch from 48f2e4d to c33dfcd July 19, 2024 15:36

dberenbaum merged commit 088128b into main Jul 19, 2024
18 of 19 checks passed

dberenbaum deleted the examples_updates branch July 19, 2024 15:54

dberenbaum mentioned this pull request Jul 21, 2024

Examples cleanup #111

Merged

mattseddon mentioned this pull request Jul 23, 2024

removed stale examples for now #146

Closed
