-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to examples #77
Conversation
Deploying datachain-documentation with
|
Latest commit: |
c33dfcd
|
Status: | ✅ Deploy successful! |
Preview URL: | https://91e41a04.datachain-documentation.pages.dev |
Branch Preview URL: | https://examples-updates.datachain-documentation.pages.dev |
@@ -198,7 +198,7 @@ def __call__(self, *rows, cache, download_cb): | |||
flat.extend(flatten(obj)) | |||
else: | |||
flat.append(obj) | |||
res.append(flat) | |||
res.append(tuple(flat)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a check below that reshapes the output if a tuple was returned, so we need to keep it as a tuple rather than a list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! thank you for noticing this.
The author of this PR, dberenbaum, is not an activated member of this organization on Codecov. |
This one should be ready for review. I tried to update to the new api while keeping the changes as light as I could. If you think this isn't worth it and want to drop any/all of these, I'm fine with that. |
I think should be good to go, any blockers @dberenbaum ? |
@@ -1,8 +1,8 @@ | |||
# pip install torch | |||
import torch | |||
|
|||
from datachain.lib.hf_image_to_text import BLIP2describe | |||
from datachain.query import C, DatasetQuery | |||
from datachain.lib.dc import C, DataChain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[N] Use top-level imports (from datachain
) where possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Lets' use top level imports in all examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'd suggest importing Column instead of C. It looks more readable.
@@ -15,21 +15,19 @@ | |||
|
|||
|
|||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Q] Do we still need if __name__ == "__main__"
? I thought we got rid of this requirement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to run script as is without datachain query
command. So, main makes sense or you can just skip it but please do not use the query command.
@@ -1,79 +0,0 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[C] Could use the example from https://github.com/iterative/dvcx/pull/1640 but name it similarity_search
or similar as it is not stateful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped it because we have other examples that use similarity search and it seems out of place in this directory which is otherwise about introducing basic udf syntax.
IMO it's important to have all examples working to move forward. You can debate the minutiae after the fact. PR is related to |
@@ -1,8 +1,9 @@ | |||
# pip install torch | |||
# NOTE: also need to install ffmpeg binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[I] I got an error because I don't have scipy
installed (no longer a dependency)
[Q] Can any lib files be dropped as per #66? |
@@ -1,15 +1,18 @@ | |||
from datachain.lib.iptc_exif_xmp import GetMetadata | |||
from datachain.query import C, DatasetQuery | |||
from datachain.lib.dc import C, DataChain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[F] Warns about defusedxml
dependency:
warnings.warn("XMP data cannot be read without defusedxml dependency")
.filter(C.name.glob("*.jpg")) | ||
( | ||
DataChain.from_storage(source, type="image") | ||
.filter(C("name").glob("*.jpg")) | ||
.limit(10000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Q] Limit to 100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also end up with output like this:
Processed: 85022 rows [00:03, 25291.45 rows/s]
Download: 28.7MB [02:03, 244kB/s]
Processed: 100 rows [02:03, 1.23s/ rows]
file xmp exif iptc error
source
0 gs://dvcx-datalakes {} {} {}
1 gs://dvcx-datalakes {} {} {}
2 gs://dvcx-datalakes {} {} {}
3 gs://dvcx-datalakes {} {} {}
4 gs://dvcx-datalakes {} {} {}
5 gs://dvcx-datalakes {} {} {}
6 gs://dvcx-datalakes {} {} {}
7 gs://dvcx-datalakes {} {} {}
8 gs://dvcx-datalakes {} {} {}
9 gs://dvcx-datalakes {} {} {}
10 gs://dvcx-datalakes {} {} {}
11 gs://dvcx-datalakes {} {} {}
12 gs://dvcx-datalakes {} {} {}
13 gs://dvcx-datalakes {} {} {}
14 gs://dvcx-datalakes {} {} {}
15 gs://dvcx-datalakes {} {} {}
16 gs://dvcx-datalakes {} {} {}
17 gs://dvcx-datalakes {} {} {}
18 gs://dvcx-datalakes {} {} {}
19 gs://dvcx-datalakes {} {} {}
[Limited by 20 rows]
Maybe bin this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unfortunately it's slow and most of the images have no metadata. Added parallelism and filtered for the non-empty rows in the latest update.
examples/llava2_image_desc_lib.py
Outdated
@@ -1,8 +1,8 @@ | |||
# pip install torch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[F] I get ImportError: Using
low_cpu_mem_usage=Trueor a
device_maprequires Accelerate:
pip install accelerate``
If we wanted to make it easy to run all the examples we could add an [examples]
optional dependencies section to the pyproject.toml
and ask users to install those in the README or at the top of each example (would be the kitchen sink install).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing PR, thank you for doing this!
@@ -1,8 +1,8 @@ | |||
# pip install torch | |||
import torch | |||
|
|||
from datachain.lib.hf_image_to_text import BLIP2describe | |||
from datachain.query import C, DatasetQuery | |||
from datachain.lib.dc import C, DataChain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Lets' use top level imports in all examples.
@@ -15,21 +15,19 @@ | |||
|
|||
|
|||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to run script as is without datachain query
command. So, main makes sense or you can just skip it but please do not use the query command.
@@ -1,8 +1,8 @@ | |||
# pip install torch | |||
import torch | |||
|
|||
from datachain.lib.hf_image_to_text import BLIP2describe | |||
from datachain.query import C, DatasetQuery | |||
from datachain.lib.dc import C, DataChain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'd suggest importing Column instead of C. It looks more readable.
text = "\n\n".join([str(el) for el in elements]) | ||
df = convert_to_dataframe(elements) | ||
return (df.to_json(), title, text, "") | ||
def partition_object(file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest using this code inline from examples. To small module - not need to keep it in lib.
We should do this more with other modules - it's great that our UDFs are expressive enough to minimize amount of code and show to users how it actually works. No need in lib 🙂
@@ -198,7 +198,7 @@ def __call__(self, *rows, cache, download_cb): | |||
flat.extend(flatten(obj)) | |||
else: | |||
flat.append(obj) | |||
res.append(flat) | |||
res.append(tuple(flat)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! thank you for noticing this.
self.kwargs = kwargs | ||
|
||
def raw_processor(self, obj): | ||
def process(self, file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as unstructured comment below - it's better to use this code inline and remove this file from lib.
PS: It's ok to keep as is to move faster.
Thanks for the comments @mattseddon and @dmpetrov! Leaving some of the comments to be addressed in a follow-up (see the checklist in the description). I wanted to keep this first PR manageable and then follow-up, because it will become very hard to track all the changes if I do it in a single PR.
👍 I think we should make it p1 after release. |
48f2e4d
to
c33dfcd
Compare
Updating or dropping old examples (WIP). Let me know if you have a different idea or prefer to drop any particular example. Just trying to do the simplest updates possible for now.
In a follow-up PR, I would like to:
lib