Adding JSON / JSON Lines Export Support #538
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #538      +/-   ##
==========================================
+ Coverage   87.39%   87.83%   +0.43%
==========================================
  Files          97       97
  Lines       10197    10235      +38
  Branches     1396     1405       +9
==========================================
+ Hits         8912     8990      +78
+ Misses        923      884      -39
+ Partials      362      361       -1

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
src/datachain/lib/dc.py (outdated)

opener = fsspec_fs.open
headers, _ = self._effective_signals_schema.get_headers_with_length()
column_names = [".".join(filter(None, header)) for header in headers]
Since JSON supports nesting, should we dump it with proper nested field names, like we do in pandas? That would make sure (at least in simple cases) that we can read some nested JSON (it creates a nested model), dump it later in the same way, and get the same result.
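For illustration, a minimal sketch of the flattened vs. nested output shapes under discussion (the record and field names here are made up):

```python
import json

# Hypothetical nested record, as a nested model would produce it.
row = {"file": {"path": "a.jpg", "size": 123}}

# Flattened dump: header parts joined with ".", as in the snippet above.
flat = {"file.path": "a.jpg", "file.size": 123}
print(json.dumps(flat))  # {"file.path": "a.jpg", "file.size": 123}

# A nested dump preserves the structure, so a read -> dump round trip
# reproduces the original document exactly.
assert json.loads(json.dumps(row)) == row
```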
Sure, I changed it to writing a nested dict for JSON / JSON Lines instead of the flattened version, which I was originally doing just to keep consistency with to_csv and to_parquet - although I can see how nested works well for JSON / JSON Lines here.
LGTM
src/datachain/lib/dc.py (outdated)

headers, _ = self._effective_signals_schema.get_headers_with_length()
column_names = [".".join(filter(None, header)) for header in headers]

results_iter = self.collect_flatten()
I would just remove this variable and do for row in self.collect_flatten() below, to avoid having to look up what results_iter actually represents ... but this is not that important, your call.
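In isolation, the suggested change looks like this (collect_flatten is stubbed out here, since the real method lives on the chain object):

```python
def collect_flatten():
    # Stand-in for DataChain.collect_flatten(), which yields flattened rows.
    yield from ([1, "a"], [2, "b"])

# Instead of binding an intermediate name:
#   results_iter = collect_flatten()
#   for row in results_iter: ...
# iterate over the call directly:
for row in collect_flatten():
    print(row)
```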
Sure, changed to just using self.collect_flatten() directly.
# This makes the file JSON instead of JSON lines.
f.write(b"\n]\n")

def to_jsonl(
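For context, a minimal sketch of the distinction that closing bracket creates, streaming rows so the output is not bounded by memory. This is illustrative only, not the PR's actual implementation; as_json_array is a made-up parameter name:

```python
import io
import json

def dump_stream(f, rows, as_json_array=True):
    # Stream one row at a time so output size is not bounded by memory.
    first = True
    for row in rows:
        if as_json_array:
            f.write(b"[\n" if first else b",\n")  # open array / separate items
        f.write(json.dumps(row).encode("utf-8"))
        if not as_json_array:
            f.write(b"\n")  # newline-delimited output -> JSON Lines
        first = False
    if as_json_array:
        if first:
            f.write(b"[")  # no rows at all: still emit a valid empty array
        f.write(b"\n]\n")  # closing the outer list makes the file regular JSON

buf = io.BytesIO()
dump_stream(buf, [{"a": 1}, {"a": 2}])
print(buf.getvalue().decode())  # a valid JSON array, one object per line
```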
It seems a little redundant to have this method if the same can be achieved with to_json() by setting the include_outer_list flag. I'm not sure what the optimal solution is, though ... maybe move all the logic from to_json to some non-public method and make to_json() and to_jsonl() just public wrappers around it that pass the correct include_outer_list flag (similar to the current to_jsonl()). Another solution is to just remove to_jsonl().

But again, this is also not that important, and neither solution seems perfect either.
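A sketch of the wrapper layout that comment describes; _to_json_impl is a hypothetical name for the shared non-public method, not something from the actual codebase:

```python
class DataChain:
    def _to_json_impl(self, path, include_outer_list=True):
        ...  # shared streaming-write logic would live here

    def to_json(self, path):
        # Regular JSON: all rows wrapped in one outer list.
        return self._to_json_impl(path, include_outer_list=True)

    def to_jsonl(self, path):
        # JSON Lines: one object per line, no outer list.
        return self._to_json_impl(path, include_outer_list=False)
```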
Yeah, I wasn't sure what the best approach was either, but this one seems to have the least extra code (making both to_json and to_jsonl wrappers adds extra wrapper code), and also the most consistency / user-friendliness: we have from_json and from_jsonl, so I wanted to keep the to functions consistent with them.
Deploying datachain-documentation with Cloudflare Pages

Latest commit: ae9a541
Status: ✅ Deploy successful!
Preview URL: https://78371a4f.datachain-documentation.pages.dev
Branch Preview URL: https://dtulga-json-export.datachain-documentation.pages.dev
This PR adds the functions to_json and to_jsonl to export to JSON / JSON Lines from DataChain. These functions support remote fsspec filesystems / URLs (such as s3://), similar to to_csv and to_parquet, and implement streaming writes as well, to enable writing data larger than memory. This also adds tests for these functions, tests for the existing from_json and from_jsonl functions, and tests for both of these on remote filesystems (such as S3).

This has been tested locally and fixes #533.
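A usage sketch of the exporters this PR describes; the bucket, file paths, and exact signatures beyond the path argument are assumptions for illustration:

```python
from datachain.lib.dc import DataChain

# Build a chain from existing JSON data (hypothetical bucket and paths).
chain = DataChain.from_json("s3://my-bucket/input.json")

# Export as a single JSON document (rows wrapped in one outer list);
# writes stream, so the output can be larger than memory.
chain.to_json("s3://my-bucket/export/output.json")

# Export as JSON Lines: one object per line.
chain.to_jsonl("s3://my-bucket/export/output.jsonl")
```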