Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to examples #77

Merged
merged 21 commits into from
Jul 19, 2024
Merged

Updates to examples #77

merged 21 commits into from
Jul 19, 2024

Conversation

dberenbaum
Copy link
Contributor

@dberenbaum dberenbaum commented Jul 17, 2024

Updating or dropping old examples (WIP). Let me know if you have a different idea or prefer to drop any particular example. Just trying to do the simplest updates possible for now.

In a follow-up PR, I would like to:

  • reorganize examples into directories
  • fix imports to not use lib
  • either add tests, docstrings, etc. for lib code or move it out of lib and directly into examples
  • move all datasets into gs://datachain-demo

@dberenbaum dberenbaum requested review from dmpetrov and volkfox July 17, 2024 17:00
Copy link

cloudflare-workers-and-pages bot commented Jul 17, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: c33dfcd
Status: ✅  Deploy successful!
Preview URL: https://91e41a04.datachain-documentation.pages.dev
Branch Preview URL: https://examples-updates.datachain-documentation.pages.dev

View logs

@@ -198,7 +198,7 @@ def __call__(self, *rows, cache, download_cb):
flat.extend(flatten(obj))
else:
flat.append(obj)
res.append(flat)
res.append(tuple(flat))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a check below that reshapes the output if a tuple was returned, so we need to keep it as a tuple rather than a list

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! thank you for noticing this.

Copy link

codecov bot commented Jul 17, 2024

The author of this PR, dberenbaum, is not an activated member of this organization on Codecov.
Please activate this user on Codecov to display this PR comment.
Coverage data is still being uploaded to Codecov.io for purposes of overall coverage calculations.
Please don't hesitate to email us at support@codecov.io with any questions.

@dberenbaum dberenbaum marked this pull request as ready for review July 18, 2024 16:53
@dberenbaum
Copy link
Contributor Author

This one should be ready for review. I tried to update to the new api while keeping the changes as light as I could. If you think this isn't worth it and want to drop any/all of these, I'm fine with that.

@shcheklein
Copy link
Member

I think should be good to go, any blockers @dberenbaum ?

@@ -1,8 +1,8 @@
# pip install torch
import torch

from datachain.lib.hf_image_to_text import BLIP2describe
from datachain.query import C, DatasetQuery
from datachain.lib.dc import C, DataChain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[N] Use top-level imports (from datachain) where possible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Lets' use top level imports in all examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'd suggest importing Column instead of C. It looks more readable.

@@ -15,21 +15,19 @@


if __name__ == "__main__":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Do we still need if __name__ == "__main__"? I thought we got rid of this requirement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to run script as is without datachain query command. So, main makes sense or you can just skip it but please do not use the query command.

@@ -1,79 +0,0 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[C] Could use the example from https://github.com/iterative/dvcx/pull/1640 but name it similarity_search or similar as it is not stateful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped it because we have other examples that use similarity search and it seems out of place in this directory which is otherwise about introducing basic udf syntax.

@mattseddon
Copy link
Contributor

IMO it's important to have all examples working to move forward. You can debate the minutiae after the fact.

PR is related to

@@ -1,8 +1,9 @@
# pip install torch
# NOTE: also need to install ffmpeg binary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[I] I got an error because I don't have scipy installed (no longer a dependency)

@mattseddon
Copy link
Contributor

[Q] Can any lib files be dropped as per #66?

@@ -1,15 +1,18 @@
from datachain.lib.iptc_exif_xmp import GetMetadata
from datachain.query import C, DatasetQuery
from datachain.lib.dc import C, DataChain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F] Warns about defusedxml dependency:

warnings.warn("XMP data cannot be read without defusedxml dependency")

.filter(C.name.glob("*.jpg"))
(
DataChain.from_storage(source, type="image")
.filter(C("name").glob("*.jpg"))
.limit(10000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Limit to 100?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also end up with output like this:

Processed: 85022 rows [00:03, 25291.45 rows/s]
Download: 28.7MB [02:03, 244kB/s] 
Processed: 100 rows [02:03,  1.23s/ rows]
                   file xmp exif iptc error
                 source                    
0   gs://dvcx-datalakes  {}   {}   {}      
1   gs://dvcx-datalakes  {}   {}   {}      
2   gs://dvcx-datalakes  {}   {}   {}      
3   gs://dvcx-datalakes  {}   {}   {}      
4   gs://dvcx-datalakes  {}   {}   {}      
5   gs://dvcx-datalakes  {}   {}   {}      
6   gs://dvcx-datalakes  {}   {}   {}      
7   gs://dvcx-datalakes  {}   {}   {}      
8   gs://dvcx-datalakes  {}   {}   {}      
9   gs://dvcx-datalakes  {}   {}   {}      
10  gs://dvcx-datalakes  {}   {}   {}      
11  gs://dvcx-datalakes  {}   {}   {}      
12  gs://dvcx-datalakes  {}   {}   {}      
13  gs://dvcx-datalakes  {}   {}   {}      
14  gs://dvcx-datalakes  {}   {}   {}      
15  gs://dvcx-datalakes  {}   {}   {}      
16  gs://dvcx-datalakes  {}   {}   {}      
17  gs://dvcx-datalakes  {}   {}   {}      
18  gs://dvcx-datalakes  {}   {}   {}      
19  gs://dvcx-datalakes  {}   {}   {}      

[Limited by 20 rows]

Maybe bin this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, unfortunately it's slow and most of the images have no metadata. Added parallelism and filtered for the non-empty rows in the latest update.

@@ -1,8 +1,8 @@
# pip install torch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F] I get ImportError: Using low_cpu_mem_usage=Trueor adevice_maprequires Accelerate:pip install accelerate``

If we wanted to make it easy to run all the examples we could add an [examples] optional dependencies section to the pyproject.toml and ask users to install those in the README or at the top of each example (would be the kitchen sink install).

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing PR, thank you for doing this!

@@ -1,8 +1,8 @@
# pip install torch
import torch

from datachain.lib.hf_image_to_text import BLIP2describe
from datachain.query import C, DatasetQuery
from datachain.lib.dc import C, DataChain
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Lets' use top level imports in all examples.

@@ -15,21 +15,19 @@


if __name__ == "__main__":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to run script as is without datachain query command. So, main makes sense or you can just skip it but please do not use the query command.

@@ -1,8 +1,8 @@
# pip install torch
import torch

from datachain.lib.hf_image_to_text import BLIP2describe
from datachain.query import C, DatasetQuery
from datachain.lib.dc import C, DataChain
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'd suggest importing Column instead of C. It looks more readable.

text = "\n\n".join([str(el) for el in elements])
df = convert_to_dataframe(elements)
return (df.to_json(), title, text, "")
def partition_object(file):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using this code inline from examples. To small module - not need to keep it in lib.

We should do this more with other modules - it's great that our UDFs are expressive enough to minimize amount of code and show to users how it actually works. No need in lib 🙂

@@ -198,7 +198,7 @@ def __call__(self, *rows, cache, download_cb):
flat.extend(flatten(obj))
else:
flat.append(obj)
res.append(flat)
res.append(tuple(flat))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! thank you for noticing this.

self.kwargs = kwargs

def raw_processor(self, obj):
def process(self, file):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as unstructured comment below - it's better to use this code inline and remove this file from lib.

PS: It's ok to keep as is to move faster.

@dberenbaum
Copy link
Contributor Author

Thanks for the comments @mattseddon and @dmpetrov! Leaving some of the comments to be addressed in a follow-up (see the checklist in the description). I wanted to keep this first PR manageable and then follow-up, because it will become very hard to track all the changes if I do it in a single PR.

IMO it's important to have all examples working to move forward.

👍 I think we should make it p1 after release.

@dberenbaum dberenbaum merged commit 088128b into main Jul 19, 2024
18 of 19 checks passed
@dberenbaum dberenbaum deleted the examples_updates branch July 19, 2024 15:54
@dberenbaum dberenbaum mentioned this pull request Jul 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants