Support object_name in all from_ formats #14
Conversation
else obj
if isinstance(obj, tuple)
If the expected output was of type Feature and the return value was a tuple, it was being wrapped in another tuple instead of returned as is.
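The fix described here can be sketched as follows. This is a minimal illustration, not DataChain's actual code; `normalize` is a hypothetical name:

```python
# Minimal sketch of the behavior described above (hypothetical helper, not
# DataChain's actual code): wrap a single returned object in a tuple, but
# return an existing tuple as is instead of nesting it again.
def normalize(obj):
    return obj if isinstance(obj, tuple) else (obj,)

print(normalize(1))       # (1,)
print(normalize((1, 2)))  # (1, 2) -- not ((1, 2),)
```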
print(static_csv_ds.to_pandas())

uri = "gs://datachain-demo/laion-aesthetics-csv"
print()
print("========================================================================")
print("dynamic CSV with header schema test parsing 3M objects")
print("========================================================================")
- dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion", show_schema=True)
+ dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion")
+ dynamic_csv_ds.print_schema()
ds.print_schema() and the argument show_schema=True serve different purposes. The former just prints the entire schema for the datachain. The latter is a check that the CSV/JSON schema was auto-inferred correctly; if not, the user will need to specify a schema manually.
We can add show_schema to all these methods, but I would like some consensus on whether it's desirable, since it seems a bit superfluous with the schema-related methods added recently by @dmpetrov. Previously, these methods always printed the schema of the table if it was inferred, but @dmpetrov commented here that it looked like a strange side effect, so I dropped it.
There are important differences here. If you are trying to read a new CSV/JSON file, you likely want to check the schema, but you will also probably copy-paste the output if you want to modify it. ds.print_schema() does not serve this purpose. I can talk to Dmitry if needed, as we already had this conversation.
You can get back the actual schema, though, which seems more useful than a print statement, since you can parse it, validate it, and modify it. For example, you could do DataChain.from_csv(uri, object_name="csv").schema["csv"] to get the pydantic-like feature class (I think having something like schema.to_dict() might also be nice).
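The to_dict() idea above could look something like this stdlib-only sketch. LaionRow and schema_to_dict are hypothetical names, not DataChain's API; the point is that a class-backed schema can be turned into an editable mapping the user corrects and passes back:

```python
# Illustrative sketch (hypothetical names, not DataChain's API): `schema`
# maps object names to feature classes, and a to_dict()-style helper turns a
# class into an editable {field: type} mapping.
from typing import get_type_hints

class LaionRow:
    url: str
    caption: str
    pwatermark: float

def schema_to_dict(feature_cls: type) -> dict:
    """Return an editable mapping of field names to types."""
    return dict(get_type_hints(feature_cls))

schema = {"csv": LaionRow}              # what chain.schema["csv"] might hold
editable = schema_to_dict(schema["csv"])
editable["pwatermark"] = str            # user fixes a mis-inferred column
```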
DataChain.from_csv(uri, object_name="csv").schema["csv"]

How does it help to copy-paste the schema to correct the errors? Also, I think you don't want to parse a 500GB CSV file just to discover your auto-schema was wrong. Maybe we should have show_csv_schema() as a companion to show_json_schema() to address this.
How does it help to copy-paste the schema to correct the errors?

That's why I mentioned schema.to_dict() would be nice. Then you could edit the dict and pass it back.

Also I think you don't want to parse a 500GB CSV file just to discover your auto-schema was wrong. Maybe we should have show_csv_schema() as a companion to show_json_schema() to address this.
Only a single block (10k lines) will be parsed to infer the schema (this is how arrow does it). Like all other chain operations, the processing happens lazily.
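The block-based inference described above can be sketched in pure Python. This is an illustration of the idea, not Arrow's or DataChain's implementation; only the first block of rows is examined, so a huge file never has to be fully parsed just to build a schema:

```python
# Illustrative sketch (not Arrow/DataChain internals): infer column types by
# peeking at only the first `block_size` rows of a lazily-read CSV.
import csv
import io
import itertools

def infer_schema(lines, block_size=10_000):
    reader = csv.DictReader(lines)
    block = itertools.islice(reader, block_size)   # only the first block
    schema = {}
    for row in block:
        for col, val in row.items():
            try:
                float(val)
                inferred = float
            except (TypeError, ValueError):
                inferred = str
            # once a column has seen a non-numeric value, keep it as str
            schema[col] = str if schema.get(col) is str else inferred
    return schema

data = io.StringIO("name,score\nfoo,0.5\nbar,0.7\n")
print(infer_schema(data))  # {'name': <class 'str'>, 'score': <class 'float'>}
```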
DataChain.from_csv(uri, object_name="csv").schema["csv"]

This seems to be the right approach for examining the schema alongside print_schema(). Otherwise, we will need to introduce similar side effects to all parsing methods, which explodes the API and is not considered good practice. We can potentially consider improving print_schema(), for example by limiting the number of outputs, like print_schema("csv", "file").
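The filtering suggestion could work along these lines. A plain dict stands in for the real schema object here; this is a hypothetical sketch, not the current API:

```python
# Hypothetical sketch of print_schema("csv", "file"): limit output to the
# named top-level signals; with no names given, print everything.
def print_schema(schema: dict, *names: str) -> dict:
    selected = {n: t for n, t in schema.items() if not names or n in names}
    for name, typ in selected.items():
        print(f"{name}: {typ.__name__}")
    return selected

print_schema({"csv": dict, "file": str, "extra": int}, "csv", "file")
```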
Only a single block (10k lines) will be parsed to infer the schema (this is how arrow does it). Like all other chain operations, the processing happens lazily.

This is not intuitive but might be just okay. What's not okay is not offering a copy-paste Feature or Pydantic-to-feature schema output. Note that neither print_schema() nor schema() gets us there (the latter should actually be renamed model()):
>>> chain.schema["mistral_response"]
<class '__main__.MistralModel'>
>>> chain.schema["mistral_response"].schema()
{'$defs': {'CompletionResponseChoice': {'properties': {'message': {'allOf': [{'$ref': '#/$defs/MyChatMessage'}], 'default': {'role': '', 'content': ''}}}, 'title': 'CompletionResponseChoice', 'type': 'object'}, 'MyChatMessage': {'properties': {'role': {'default': '', 'title': 'Role', 'type': 'string'}, 'content': {'default': '', 'title': 'Content', 'type': 'string'}}, 'title': 'MyChatMessage', 'type': 'object'}, 'Usage': {'properties': {'prompt_tokens': {'default': 0, 'title': 'Prompt Tokens', 'type': 'integer'}, 'completion_tokens': {'default': 0, 'title': 'Completion Tokens', 'type': 'integer'}}, 'title': 'Usage', 'type': 'object'}}, 'properties': {'id': {'default': '', 'title': 'Id', 'type': 'string'}, 'choices': {'items': {'$ref': '#/$defs/CompletionResponseChoice'}, 'title': 'Choices', 'type': 'array'}, 'usage': {'allOf': [{'$ref': '#/$defs/Usage'}], 'default': {'prompt_tokens': 0, 'completion_tokens': 0}}}, 'required': ['choices'], 'title': 'MistralModel', 'type': 'object'}
>>> chain.print_schema()
source: str
parent: str
name: str
version: str
etag: str
size: int
vtype: str
location: dict
file: File
    source: str
    parent: str
    name: str
    size: int
    version: str
    etag: str
    is_latest: bool
    last_modified: datetime
    location: Union[dict, list[dict], NoneType]
    vtype: str
mistral_response: MistralModel
    id: str
    choices: list[CompletionResponseChoice]
        * list of: CompletionResponseChoice
            message: MyChatMessage
                role: str
                content: str
    usage: Usage
        prompt_tokens: int
        completion_tokens: int
a copy-paste Feature or Pydantic-to-feature

Generating code for the user to copy-paste is uncommon because it's error prone (depends on environment) and not secure. It might be better to handle this differently, possibly with a dedicated helper function like pydantic_to_feature(). @volkfox if you think it's an important use case - please share more context. However, it's unlikely it should be part of the data readers.
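The dedicated-helper alternative could build the feature class at runtime instead of generating source for the user to copy-paste. `make_feature` below is a hypothetical name, not DataChain's API:

```python
# Sketch of a pydantic_to_feature-like helper: construct a class with
# annotated fields from a {field: type} dict at runtime, so no generated
# source code ever needs to be copy-pasted. Hypothetical helper name.
def make_feature(name: str, fields: dict) -> type:
    """Create a class with the given annotated fields."""
    return type(name, (), {"__annotations__": dict(fields)})

Laion = make_feature("Laion", {"url": str, "pwatermark": float})
print(Laion.__name__, Laion.__annotations__)
```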
@dreadatour great change! Thank you. A couple of comments are inline, mostly about the naming.

PS: It seems I didn't click the Approve button 😅
@@ -208,6 +208,7 @@ def from_storage(
     catalog: Optional["Catalog"] = None,
     recursive: Optional[bool] = True,
     anon: bool = False,
+    object_name: str = "file",
object_name is a bit too long a name, and object never looks good in naming. How about output or just name?
src/datachain/lib/dc.py
Outdated
@@ -208,6 +208,7 @@ def from_storage(
     catalog: Optional["Catalog"] = None,
Could you please replace catalog with session? We shouldn't show catalog to users - only session.
I replaced it but also added kwargs to pass catalog=catalog, because it looks like it's not that simple to replace it in tests and probably needs to happen in another PR.
src/datachain/lib/dc.py
Outdated
@@ -208,6 +208,7 @@ def from_storage(
     catalog: Optional["Catalog"] = None,
     recursive: Optional[bool] = True,
     anon: bool = False,
I wish we could move anon to session and use only session params instead of catalog+anon.
It's probably not in this PR.
Dropped it as a named arg but left it possible to pass in kwargs.
@@ -706,6 +663,7 @@ def from_features(
     ds_name: str = "",
     session: Optional[Session] = None,
     output: Union[None, FeatureType, Sequence[str], dict[str, FeatureType]] = None,
We should probably rename output to something like schema or spec, and make object_name -> output.
Pandas calls it dtype. I'm not sure it's the best name tho.
src/datachain/lib/dc.py
Outdated
def from_csv(
    cls,
    path,
    anon: bool = False,
    delimiter: str = ",",
    header: bool = True,
    column_names: Optional[list[str]] = None,
    output: Optional[dict[str, FeatureType]] = None,
The same - schema/spec instead?
src/datachain/lib/dc.py
Outdated
def from_csv(
    cls,
    path,
    anon: bool = False,
It would be great to get rid of anon as well. Not sure it belongs to this PR.
@volkfox if you think it's an important use case - please share more context.

The context is obvious - most datasets carry too many features, so parsing them with a schema is the preferred way to go. OTOH, crafting these schemas by hand is a major drawback.
On Thu, Jul 11, 2024 at 3:49 PM, Dmitry Petrov approved this pull request.
In src/datachain/lib/dc.py:

@@ -218,14 +219,17 @@ def from_storage(
     type : read file as "binary", "text", or "image" data. Default is "binary".
     recursive : search recursively for the given path.
     anon : use anonymous mode to access the storage.
+    object_name : Generated object column name.

How about Created instead of Generated? It might be better to reserve the word Generate for our generator specifically.
In src/datachain/lib/dc.py:

     return chain.gen(_func_fr, output=output)

 @classmethod
 def from_pandas(  # type: ignore[override]
-    cls, df: "pd.DataFrame", name: str = "", session: Optional[Session] = None
+    cls,
+    df: "pd.DataFrame",
+    name: str = "",
+    session: Optional[Session] = None,
+    object_name: Optional[str] = None,

It would be great to make it just "str" (without Optional) for consistency with other API calls.
In src/datachain/lib/dc.py:

 def parse_tabular(
     self,
-    output: Optional[dict[str, FeatureType]] = None,
+    output: Union[None, FeatureType, dict[str, FeatureType]] = None,
+    object_name: Optional[str] = None,

Same - please get rid of Optional if possible.
It looks like we need a dedicated method for this, similar to

@dmpetrov Re:

@dberenbaum we also should include the

Unrelated to this issue but a good idea - have the
not to process 3 million rows to extract 5.
src/datachain/lib/dc.py
Outdated
    output = schema_to_output(schema)
except ValueError as e:
    raise DatasetPrepareError(self.name, e) from e

if object_name:
    if isinstance(output, dict):
        output = dict_to_feature(object_name, output)
You don't always want the Feature class name to be the same as the object name. This is why from_json() also has a model_name argument.
One more note on the proposed change: it would be great to have some way to mark the fields of the auto-schema as 'Optional'. Without this, the last example from the test set examples/json-csv-reader.py will fail at first reference:
from datachain.lib.dc import C, DataChain
uri = "gs://datachain-demo/laion-aesthetics-csv"
print()
print("========================================================================")
print("dynamic CSV with header schema test parsing 3M objects")
print("========================================================================")
dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion")
dynamic_csv_ds.collect()
>>> dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion")
Processed: 1 rows [00:00, 1088.02 rows/s]
>>> dynamic_csv_ds.collect()
Processed: 1 rows [00:00, 1221.05 rows/s]
Processed: 1 rows [01:57, 117.96s/ rows]
Generated: 3303469 rows [01:57, 28004.44 rows/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 544, in collect
    return list(self.iterate(*cols))
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 537, in iterate
    yield chain.signals_schema.row_to_features(row, chain.session.catalog)
  File "/Users/dkh/datachain/src/datachain/lib/signal_schema.py", line 189, in row_to_features
    obj = fr_cls(**json)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for laion
pwatermark
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/float_type
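One way to realize the 'Optional' idea is to widen a column's inferred type whenever the sampled block contains missing values, so rows with nulls (like the pwatermark column here) validate instead of raising. This is a hypothetical sketch, not DataChain's implementation:

```python
# Sketch of the Optional auto-schema idea: if any sampled value of a column
# is missing, widen the inferred type to Optional. Hypothetical helper.
from typing import Optional

def widen_if_nullable(inferred: type, sample_values) -> object:
    if any(v is None or v == "" for v in sample_values):
        return Optional[inferred]
    return inferred

print(widen_if_nullable(float, ["0.5", None]))   # typing.Optional[float]
print(widen_if_nullable(float, ["0.5", "0.7"]))
```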
Yeah. It was introduced for tuples that are flattened (
Right, it feels like a lack of terminology and unification! SignalSchema seems like the right data structure. The problem is, it was not designed for user-facing APIs. But it can be done with simple tricks: making it a child of

WDYT?
💯
It makes sense. @dberenbaum WDYT?

PS: @dberenbaum please do not wait long for all the design decisions - we can redesign on top of this.
Added it to unblock and move forward here. TBH I'm not sure this should be exposed to the user, and this goes back to the discussion above about whether we should print pydantic schemas or provide an editable dict-like schema.
Yes, this seems like it's the same issue as https://github.com/iterative/dvcx/issues/1697, so let's follow up there.
Yup, I'm merging and we can follow up on these questions and how to name these args.
* support object_name in all from_ formats
* refactor arrow infer schema
* use session instead of catalog in datachain
* fix tests
* Revert "fix tests" (reverts commit 4ca848f)
* hide anon arg in datachain
* drop optional from object_name
* accept list of col names
* fix csv col names and add model_name
Closes https://github.com/iterative/dvcx/issues/1710

- Replaces parse_parquet/parse_csv with from_parquet/from_csv (replacing the other from_csv)
- Keeps parse_tabular() as a public method for more complex logic (like doing complex filtering before parsing parquet)
- Adds object_name (taken from from_json but name suggestions welcome) to from_storage/from_features/from_pandas/from_parquet/from_csv