
Readme update #133

Merged
merged 9 commits into from
Jul 23, 2024

dmpetrov
Member

No description provided.


cloudflare-workers-and-pages bot commented Jul 22, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 019907c
Status: ✅  Deploy successful!
Preview URL: https://4f2a38b1.datachain-documentation.pages.dev
Branch Preview URL: https://readme.datachain-documentation.pages.dev


README.rst Outdated

Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
🤖 AI-Driven Data Curation: Use local ML models, LLM APIs calls to enreach your data.
Member

enreach -> enrich probably

README.rst Outdated

For example, let us consider a dataset from Karlsruhe Institute of Technology detailing dialogs between users and customer service chatbots. We can use the chain to read data from the cloud, map it onto the parallel API calls for LLM evaluation, and organize the output into a dataset :
Datachain can serialize Python objects (via `Pydantic`_) to an embedded
Member

seems a bit too low level for an intro?

what is the higher level value of this?

Can we say something - it combines SQL + GPU / CPU processing ... to blah blah ... (see how it works) and do a section on this below?

README.rst Outdated

DataChain is built by composing wrangling operations.
Datachain enables parallel processing of multiple data files or samples.
Member

what is a sample?

README.rst Outdated

.. code:: py
The typical use cases are data curation, LLM analytics and validation, image
Member

image segmentation, pose detection - sounds like we actually do them here (not like we are helping with them)


Maybe clarify with this from the blog post?
We believe that DataChain will serve as a solid foundation for new and upcoming unstructured data wrangling libraries, as well as the custom AI-driven curation solutions.

Note that DataChain represents file samples as pointers into their respective storage locations. This means a newly created dataset version does not duplicate files in storage, and storage remains the single source of truth for the original samples
chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                           object_name="file", type="text")
Member

just to double check - do we need anon=True? (I think if someone has some credentials installed they might start getting an error)

Member Author

right!
Do you have a clean machine to validate this?

Member

yep, let me try this

Member

checked it - works fine on a non-contaminated machine, returns

AttributeError: 'DataChain' object has no attribute 'export_files'

(do the final release before 6am PT tomorrow)

also checked Windows - seems to be fine (miniconda env)

Member

just to clarify - it seems to work fine w/o anon=True on a clean machine

let me check if I have some creds that are limited to a different account ...

Member

okay, works fine as well

one thing though:

python <script.py> - doesn't return anything - let's add some print at the end? or show? cc @dmpetrov
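
One possible way to address this, sketched in plain Python rather than with DataChain itself: a script run with `python script.py` does not echo the value of a bare expression the way an interactive session does, so the last step needs an explicit `print`. The `results` list and the `format_rows` helper are hypothetical stand-ins for whatever the chain produces.

```python
# Hypothetical stand-in for the rows a chain would produce.
results = [
    {"file": "1.txt", "is_positive": True},
    {"file": "2.txt", "is_positive": False},
]

# A bare expression like this is evaluated and discarded when run as a script;
# only an interactive session would echo it.
results


def format_rows(rows):
    """Render each row as one text line with the file name left-aligned."""
    return [f"{row['file']:<8} {row['is_positive']}" for row in rows]


# The explicit print is what makes the script produce visible output.
lines = format_rows(results)
print("\n".join(lines))
```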

Member Author

Thank you for verifying this!


codecov bot commented Jul 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.43%. Comparing base (aa8f352) to head (b3e0d91).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
- Coverage   86.85%   86.43%   -0.42%     
==========================================
  Files          88       88              
  Lines        9378     9378              
  Branches     1879     1878       -1     
==========================================
- Hits         8145     8106      -39     
- Misses        900      936      +36     
- Partials      333      336       +3     
Flag Coverage Δ
datachain 86.43% <ø> (-0.36%) ⬇️


README.rst Outdated

Datachain internally represents datasets as tables, so analytical queries on the chain are automatically vectorized:
Find files with text dialogs that contains keyword "Thank you".
Member

Create dataset with files ...

Or find files and create dataset ...

(my concern is that the first few examples look like a grep on steroids)

Member Author

Replaced by evaluation using sentiment analysis (local model)

Contributor

volkfox commented Jul 22, 2024

DataChain is an open-source Python library for processing and curating unstructured data at scale.

Suggestion -

DataChain is an open-source Python library for processing and curating unstructured data at scale:

🤖 AI-Driven Data Curation: Use local ML models, LLM APIs calls to enreach your data.

Suggestion -

🤖 AI-Driven Data Curation: Use multimodal AI inferences and LLM API calls to enrich your data.

🚀 GenAI Dataset scale: Handle 10s of milions of files or file snippets.

Suggestion -
🚀 GenAI Dataset scale: Handle tens of millions of files

🐍 Python-friendly: Python objects instead of JSON to represent annotations

Suggestion -
🐍 Python-friendly: Python objects instead of JSON for annotations and metadata

Datachain enables parallel processing of multiple data files or samples. It can chain different operations such as filtering, aggregation and merging datasets. Resulting datasets can be saved, versioned, and extracted as files or converted to a PyTorch data loader.

Suggestion -
Datachain enables parallel processing of multiple dataset entries. It can chain different operations such as filtering, aggregation, grouping and merging. Upon execution, the chain resolves into datasets that can be saved, versioned, and exported or converted to PyTorch and TensorFlow data loaders.

Datachain can serialize Python objects (via Pydantic) to an embedded SQLite databased. It efficiently deserializes Python object or run vectorized analytical query in the DB without deserialization.

Suggestion -
DataChain automatically handles serialization/deserialization of ([Pydantic](https://github.com/pydantic/pydantic)) Python objects on the chain via an embedded [SQLite](https://www.sqlite.org/) database. It also provides lazy execution and vectorization of analytical queries.

The typical use cases are data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. DataChain excels at optimizing batch operations, such as parallelizing synchronous API calls or leveraging heavy batch processing tasks.

Suggestion -
Typical use cases include data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. DataChain excels at optimizing batch operations – such as parallelizing synchronous API calls or handling large-volume inferences.


Now we have parallel-processed an LLM API-based query over cloud data and persisted the results.
Quick Start
Contributor

Let's mention somewhere that the data used is publicly available so that you can try all of these yourself.

Member Author

good idea

Contributor

dberenbaum left a comment

Thanks Dmitry! I like the variety of examples. Feel free to take or leave what you want from my comments.

README.rst Outdated
Comment on lines 64 to 65
.map(is_good=lambda file: "thank you" in file.read().lower(),
     output={"is_good": bool})
Contributor

A couple minor things:

  • Is it intentional that you do both is_good=lambda... and output={"is_good": ...}? It feels a bit confusing compared to either is_good=lambda ..., output=bool or lambda ..., output={"is_good": bool}.
  • Maybe consider renaming is_good. It doesn't explain what it does. Maybe thank_you or thank_you_note (if you want to be cute) would be better?
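
For reference, the predicate under discussion is a case-insensitive substring test. A minimal plain-Python sketch follows, using the `thank_you` name floated above; the sample strings are hypothetical and not taken from the chatbot-KiT dataset, and nothing here calls DataChain.

```python
def thank_you(text: str) -> bool:
    """Return True when the dialog text contains "thank you", ignoring case."""
    return "thank you" in text.lower()


# Hypothetical sample dialogs for illustration only.
samples = ["Thank you, that solved it!", "This did not help at all."]
flags = [thank_you(sample) for sample in samples]
print(flags)
```

A descriptive name such as this one tells the reader what the resulting column holds, which is the point of the renaming suggestion.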

Member Author

Renamed to is_positive since the 1st algo was replaced with sentiment analysis.
A side effect - we got a positive_chain variable 🙂

positive_chain = chain.filter(Column("is_positive") == True)
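
The `== True` in that line can look redundant. In expression-style filter APIs, comparing a column object usually builds a filter expression instead of evaluating to a Python bool. A toy sketch of that idea using operator overloading is below; `ToyColumn` and `toy_filter` are hypothetical names and make no claim about how DataChain implements `Column` or `filter`.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ToyColumn:
    """Toy column reference; NOT DataChain's Column class."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Callable[[Row], bool]:  # type: ignore[override]
        # Build a row predicate instead of comparing two objects.
        return lambda row: row[self.name] == other


def toy_filter(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


# Hypothetical rows standing in for a chain's records.
rows = [
    {"file": "1.txt", "is_positive": True},
    {"file": "2.txt", "is_positive": False},
    {"file": "3.txt", "is_positive": True},
]

positive_rows = toy_filter(rows, ToyColumn("is_positive") == True)  # noqa: E712
print([row["file"] for row in positive_rows])
```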

README.rst Outdated
chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
    .settings(parallel=4, cache=True)
    .map(is_good=eval_dialogue)
Contributor

is_good -> success?

Member Author

yes!


The “save” operation makes chain dataset persistent in the current (working) directory of the query. A hidden folder .datachain/ holds the records. A persistent dataset can be accessed later to start a derivative chain:
Contributor

I think it would help to at least mention that from_dataset is loading the dataset saved from an earlier example. I'm not sure it's clear enough how datasets are being saved and loaded.

jendefig left a comment

Typos and suggestions.

Also do we want to think about using toggles for the sections?


dmpetrov merged commit f5eec30 into main Jul 23, 2024
17 checks passed
dmpetrov deleted the readme branch July 23, 2024 05:54
6 participants