Readme update #133
Conversation
README.rst (Outdated)
Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.

🤖 AI-Driven Data Curation: Use local ML models, LLM APIs calls to enreach your data.
enreach -> enrich probably
README.rst (Outdated)
For example, let us consider a dataset from Karlsruhe Institute of Technology detailing dialogs between users and customer service chatbots. We can use the chain to read data from the cloud, map it onto the parallel API calls for LLM evaluation, and organize the output into a dataset:

Datachain can serialize Python objects (via `Pydantic`_) to an embedded
seems a bit too low level for an intro?
what is the higher level value of this?
Can we say something - it combines SQL + GPU / CPU processing ... to blah blah ... (see how it works) and do a section on this below?
README.rst (Outdated)
DataChain is built by composing wrangling operations.

Datachain enables parallel processing of multiple data files or samples.
what is a sample?
README.rst (Outdated)
.. code:: py

The typical use cases are data curation, LLM analytics and validation, image
image segmentation, pose detection - sounds like we actually do them here (not like we are helping with them)
Maybe clarify with this from the blog post?
We believe that DataChain will serve as a solid foundation for new and upcoming unstructured data wrangling libraries, as well as the custom AI-driven curation solutions.
Note that DataChain represents file samples as pointers into their respective storage locations. This means a newly created dataset version does not duplicate files in storage, and storage remains the single source of truth for the original samples.

chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                           object_name="file", type="text")
just to double check - do we need anon=True? (I think if someone has some credentials installed they might start getting errors)
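(For reference, the call under discussion with the flag added would look roughly like this - a sketch, with anon=True being the anonymous-access option debated above:)

.. code:: py

    from datachain import DataChain

    # Sketch of the snippet above with anonymous access made explicit,
    # so locally configured cloud credentials are not picked up.
    chain = DataChain.from_storage(
        "gs://datachain-demo/chatbot-KiT/",
        object_name="file",
        type="text",
        anon=True,  # the flag discussed in this thread
    )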
right!
Do you have a clean machine to validate this?
yep, let me try this
check it - works fine on a non-contaminated machine, returns
AttributeError: 'DataChain' object has no attribute 'export_files'
(do the final release before 6am PT tomorrow)
also checked Windows - seems to be fine (miniconda env)
just to clarify - it seems to work fine w/o anon=True on a clean machine
let me check if I have some creds that are limited to a different account ...
okay, works fine as well
one thing though:
python <script.py> - doesn't return anything - let's add some print at the end? or show? cc @dmpetrov
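(A sketch of the suggested script ending - show() is the call proposed above, and count() is assumed available for the one-line summary:)

.. code:: py

    # Sketch: give `python script.py` some visible output at the end.
    chain.show()                   # prints a tabular preview of the records
    print("rows:", chain.count())  # or a simple summary line; count() assumed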
Thank you for verifying this!
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
- Coverage   86.85%   86.43%   -0.42%
==========================================
  Files          88       88
  Lines        9378     9378
  Branches     1879     1878       -1
==========================================
- Hits         8145     8106      -39
- Misses        900      936      +36
- Partials      333      336       +3

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
README.rst (Outdated)
Datachain internally represents datasets as tables, so analytical queries on the chain are automatically vectorized:

Find files with text dialogs that contains keyword "Thank you".
Create dataset with files ...
Or find files and create dataset ...
(my concern is that the first few examples look like a grep on steroids)
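(For illustration, the "find files and create a dataset" variant might look roughly like this - a sketch reusing calls that appear elsewhere in this thread; the dataset name is made up and the Column import location is an assumption:)

.. code:: py

    from datachain import Column, DataChain  # Column import path assumed

    thank_you_chain = (
        DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                               object_name="file", type="text")
        .map(is_good=lambda file: "thank you" in file.read().lower(),
             output={"is_good": bool})
        .filter(Column("is_good") == True)
        .save("thank-you-dialogs")  # hypothetical dataset name
    )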
Replaced by evaluation using sentiment analysis (local model)
Now we have parallel-processed an LLM API-based query over cloud data and persisted the results.

Quick Start
Let's mention somewhere that the data used is publicly available so that you can try all of these yourself.
good idea
Thanks Dmitry! I like the variety of examples. Feel free to take or leave what you want from my comments.
README.rst (Outdated)
.map(is_good=lambda file: "thank you" in file.read().lower(),
     output={"is_good": bool})
A couple minor things:

- Is it intentional that you do both is_good=lambda ... and output={"is_good": ...}? Feels a bit confusing compared to either is_good=lambda ..., output=bool or lambda ..., output={"is_good": bool}.
- Maybe consider renaming is_good. It doesn't explain what it does. Maybe thank_you or thank_you_note (if you want to be cute) would be better?
Renamed to is_positive since the 1st algo was replaced with sentiment analysis. A side effect - we got a positive_chain variable 🙂

positive_chain = chain.filter(Column("is_positive") == True)
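(A sketch of that filter as a standalone step; the Column import location and the dataset name are assumptions:)

.. code:: py

    from datachain import Column  # import path assumed for this sketch

    # Keep only the records the sentiment model flagged as positive.
    positive_chain = chain.filter(Column("is_positive") == True)
    positive_chain.save("positive-dialogs")  # hypothetical dataset name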
README.rst (Outdated)
chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
    .settings(parallel=4, cache=True)
    .map(is_good=eval_dialogue)
is_good -> success?
yes!
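(With the rename applied, the snippet above would presumably read:)

.. code:: py

    chain = (
        DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
        .settings(parallel=4, cache=True)
        .map(success=eval_dialogue)  # renamed from is_good per the review
    )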
The “save” operation makes the chain dataset persistent in the current (working) directory of the query. A hidden folder .datachain/ holds the records. A persistent dataset can be accessed later to start a derivative chain:
I think it would help to at least mention that from_dataset is loading the dataset saved from an earlier example. I'm not sure it's clear enough how datasets are being saved and loaded.
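(A sketch of the save/load round trip that comment is asking the README to spell out - the dataset name is illustrative, and from_dataset is the loader mentioned above:)

.. code:: py

    from datachain import DataChain

    # Earlier example: build a chain and persist it under a name.
    chain.save("chatbot-eval")  # records land in the hidden .datachain/ folder

    # Later, possibly in another script: load it to start a derivative chain.
    chain2 = DataChain.from_dataset("chatbot-eval")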
Typos and suggestions.
Also do we want to think about using toggles for the sections?