Readme update #133

dmpetrov · 2024-07-22T22:47:44Z

No description provided.

cloudflare-workers-and-pages · 2024-07-22T22:49:07Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`019907c`
Status:	✅ Deploy successful!
Preview URL:	https://4f2a38b1.datachain-documentation.pages.dev
Branch Preview URL:	https://readme.datachain-documentation.pages.dev

shcheklein · 2024-07-22T22:52:58Z

README.rst


-Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
+🤖 AI-Driven Data Curation: Use local ML models, LLM APIs calls to enreach your data.


enreach -> enrich probably

shcheklein · 2024-07-22T22:55:48Z

README.rst


-For example, let us consider a dataset from Karlsruhe Institute of Technology detailing dialogs between users and customer service chatbots. We can use the chain to read data from the cloud, map it onto the parallel API calls for LLM evaluation, and organize the output into a dataset :
+Datachain can serialize Python objects (via `Pydantic`_) to an embedded


seems a bit too low-level for an intro?

what is the higher-level value of this?

Can we say something like: it combines SQL + GPU / CPU processing ... to blah blah ... (see how it works), and then do a section on this below?

shcheklein · 2024-07-22T22:56:08Z

README.rst


-DataChain is built by composing wrangling operations.
+Datachain enables parallel processing of multiple data files or samples.


what is a sample?

shcheklein · 2024-07-22T22:57:31Z

README.rst


-.. code:: py
+The typical use cases are data curation, LLM analytics and validation, image


image segmentation, pose detection - sounds like we actually do them here (not like we are helping with them)

Maybe clarify with this from the blog post?
We believe that DataChain will serve as a solid foundation for new and upcoming unstructured data wrangling libraries, as well as the custom AI-driven curation solutions.

shcheklein · 2024-07-22T23:02:09Z

README.rst

-Note that DataChain represents file samples as pointers into their respective storage locations. This means a newly created dataset version does not duplicate files in storage, and storage remains the single source of truth for the original samples
+    chain = (
+       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
+                              object_name="file", type="text")


just to double check - do we need anon=True? (I think if someone has some credentials installed they might start getting an error)

right!
Do you have a clean machine to validate this?

yep, let me try this

checked it - works fine on a non-contaminated machine, returns

AttributeError: 'DataChain' object has no attribute 'export_files'

(do the final release before 6am PT tomorrow)

also checked Windows - seems to be fine (miniconda env)

just to clarify - it seems to work fine w/o anon=True on a clean machine

let me check if I have some creds that are limited to a different account ...

okay, works fine as well

one thing though:

python <script.py> - doesn't return anything - let's add some print at the end? or show? cc @dmpetrov

Thank you for verifying this!

codecov · 2024-07-22T23:04:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.43%. Comparing base (aa8f352) to head (b3e0d91).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
- Coverage   86.85%   86.43%   -0.42%     
==========================================
  Files          88       88              
  Lines        9378     9378              
  Branches     1879     1878       -1     
==========================================
- Hits         8145     8106      -39     
- Misses        900      936      +36     
- Partials      333      336       +3

Flag	Coverage Δ
datachain	`86.43% <ø> (-0.36%)`	⬇️
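
The percentages in the table follow from the hit and line counts. A minimal Python sketch of that arithmetic, assuming coverage is simply hits divided by lines (Codecov's exact rounding is not stated in the report, so the values are only compared loosely):

```python
# Counts copied from the coverage diff above.
LINES = 9378
BASE_HITS = 8145  # base: main (aa8f352)
HEAD_HITS = 8106  # head: this PR (b3e0d91)


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


base = coverage_pct(BASE_HITS, LINES)
head = coverage_pct(HEAD_HITS, LINES)
delta = head - base

print(f"base={base:.3f}% head={head:.3f}% delta={delta:+.3f}%")
```

The line count is the same on both sides, so the reported drop comes entirely from the 39 fewer hits.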


shcheklein · 2024-07-22T23:19:26Z

README.rst


-Datachain internally represents datasets as tables, so analytical queries on the chain are automatically vectorized:
+Find files with text dialogs that contains keyword "Thank you".


Create dataset with files ...

Or find files and create dataset ...

(my concern is that the first few examples look like a grep on steroids)

Replaced by evaluation using sentiment analysis (local model)

volkfox · 2024-07-22T23:21:31Z

DataChain is an open-source Python library for processing and curating unstructured data at scale.

Suggestion -

DataChain is an open-source Python library for processing and curating unstructured data at scale:

🤖 AI-Driven Data Curation: Use local ML models, LLM APIs calls to enreach your data.

Suggestion -

🤖 AI-Driven Data Curation: Use multimodal AI inferences and LLM API calls to enrich your data.

🚀 GenAI Dataset scale: Handle 10s of milions of files or file snippets.

Suggestion -
🚀 GenAI Dataset scale: Handle tens of millions of files

🐍 Python-friendly: Python objects instead of JSON to represent annotations

Suggestion -
🐍 Python-friendly: Python objects instead of JSON for annotations and metadata

Datachain enables parallel processing of multiple data files or samples. It can chain different operations such as filtering, aggregation and merging datasets. Resulting datasets can be saved, versioned, and extracted as files or converted to a PyTorch data loader.

Suggestion -
Datachain enables parallel processing of multiple dataset entries. It can chain different operations such as filtering, aggregation, grouping and merging. Upon execution, the chain resolves into datasets that can be saved, versioned, and exported or converted to PyTorch and TensorFlow data loaders.

Datachain can serialize Python objects (via Pydantic) to an embedded SQLite databased. It efficiently deserializes Python object or run vectorized analytical query in the DB without deserialization.

Suggestion -
DataChain automatically handles serialization/deserialization of ([Pydantic](https://github.com/pydantic/pydantic)) Python objects on the chain via an embedded [SQLite](https://www.sqlite.org/) database. It also provides lazy execution and vectorization of analytical queries.

The typical use cases are data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. DataChain excels at optimizing batch operations, such as parallelizing synchronous API calls or leveraging heavy batch processing tasks.

Suggestion -
Typical use cases include data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. DataChain excels at optimizing batch operations – such as parallelizing synchronous API calls or handling large-volume inferences.

dberenbaum · 2024-07-23T00:45:56Z

README.rst


-Now we have parallel-processed an LLM API-based query over cloud data and persisted the results.
+Quick Start


Let's mention somewhere that the data used is publicly available so that you can try all of these yourself.

dberenbaum

Thanks Dmitry! I like the variety of examples. Feel free to take or leave what you want from my comments.

dberenbaum · 2024-07-23T00:56:39Z

README.rst

+       .map(is_good=lambda file: "thank you" in file.read().lower(),
+            output={"is_good": bool})


A couple minor things:

Is it intentional that you do both is_good=lambda... and output={"is_good": ...}? Feels a bit confusing compared to either is_good=lambda ..., output=bool or lambda ..., output={"is_good": bool}.

Maybe consider renaming is_good. It doesn't explain what it does. Maybe thank_you or thank_you_note (if you want to be cute) would be better?

Renamed to is_positive since the 1st algo was replaced with sentiment analysis.
A side effect - we got a positive_chain variable 🙂

positive_chain = chain.filter(Column("is_positive") == True)
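
For reference, the keyword check from the original lambda can be tried on its own, without DataChain. This is a plain-Python sketch: `mentions_thank_you` is a hypothetical name, and `io.StringIO` objects stand in for the file objects the callback would receive. It only illustrates the flag-then-filter idea discussed in this thread, not the library's API.

```python
import io


def mentions_thank_you(file) -> bool:
    """Return True if the file's text contains "thank you" (case-insensitive)."""
    return "thank you" in file.read().lower()


# Hypothetical sample dialogs; StringIO stands in for real file objects.
dialogs = {
    "a.txt": io.StringIO("Thank you, that solved it!"),
    "b.txt": io.StringIO("This did not help."),
}

# "map" step: compute one boolean flag per file.
flags = {name: mentions_thank_you(f) for name, f in dialogs.items()}

# "filter" step: keep only the files whose flag is True.
thankful = [name for name, flag in flags.items() if flag]
print(thankful)
```

A named function makes the flag's meaning explicit, which is the naming concern raised above.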

dberenbaum · 2024-07-23T01:02:55Z

README.rst

+    chain = (
+       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
+       .settings(parallel=4, cache=True)
+       .map(is_good=eval_dialogue)


is_good -> success?

dberenbaum · 2024-07-23T01:25:41Z

README.rst


-The “save” operation makes chain dataset persistent in the current (working) directory of the query. A hidden folder .datachain/ holds the records. A persistent dataset can be accessed later to start a derivative chain:


I think it would help to at least mention that from_dataset is loading the dataset saved from an earlier example. I'm not sure it's clear enough how datasets are being saved and loaded.

jendefig

Typos and suggestions.

Also do we want to think about using toggles for the sections?

jendefig · 2024-07-23T02:25:44Z

README.rst


-.. code:: py
+The typical use cases are data curation, LLM analytics and validation, image


Maybe clarify with this from the blog post?
We believe that DataChain will serve as a solid foundation for new and upcoming unstructured data wrangling libraries, as well as the custom AI-driven curation solutions.

Readme update

d5bb56b

dmpetrov requested review from dberenbaum, shcheklein, volkfox and jendefig July 22, 2024 22:48

formatting

9a340b1

shcheklein reviewed Jul 22, 2024

Merge branch 'main' into readme

cc1e08b

shcheklein reviewed Jul 22, 2024

shcheklein reviewed Jul 22, 2024

dberenbaum reviewed Jul 23, 2024

Merge branch 'main' into readme

91816cd

dberenbaum reviewed Jul 23, 2024

jendefig reviewed Jul 23, 2024

dberenbaum mentioned this pull request Jul 23, 2024

Huggingface test updates and bug fix #140

Merged

dmpetrov and others added 3 commits July 22, 2024 22:27

feedback

c6393f6

typo

f0dc4c8

Merge branch 'main' into readme

aef41a1

shcheklein approved these changes Jul 23, 2024

remove DB section

b3e0d91

skshetry approved these changes Jul 23, 2024

add image

019907c

dmpetrov merged commit f5eec30 into main Jul 23, 2024
17 checks passed

dmpetrov deleted the readme branch July 23, 2024 05:54
