Adding JSON / JSON Lines Export Support #538
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #538      +/-   ##
==========================================
+ Coverage   87.39%   87.83%   +0.43%
==========================================
  Files          97       97
  Lines       10197    10235      +38
  Branches     1396     1405       +9
==========================================
+ Hits         8912     8990      +78
+ Misses        923      884      -39
+ Partials      362      361       -1

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
src/datachain/lib/dc.py (outdated)

opener = fsspec_fs.open
headers, _ = self._effective_signals_schema.get_headers_with_length()
column_names = [".".join(filter(None, header)) for header in headers]
Since JSON supports nesting, should we dump it with proper nested field names, like we do in pandas? That would make sure (at least in simple cases) that we can read some nested JSON (it creates a nested model), dump it later in the same way, and get the same result.
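For illustration, a minimal sketch of the flattened vs. nested output shapes under discussion (the record and field names here are made up):

```python
import json

# Hypothetical nested record, as a nested model would produce it.
row = {"file": {"path": "a.jpg", "size": 123}}

# Flattened dump: header parts joined with ".", as in the snippet above.
flat = {"file.path": "a.jpg", "file.size": 123}
print(json.dumps(flat))  # {"file.path": "a.jpg", "file.size": 123}

# A nested dump preserves the structure, so a read -> dump round trip
# reproduces the original document exactly.
assert json.loads(json.dumps(row)) == row
```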
Sure, I changed it to writing a nested dict for JSON / JSON Lines instead of the flattened version, which I was originally doing just to keep consistency with to_csv and to_parquet - although I can see how nested works well for JSON / JSON Lines here.
LGTM
src/datachain/lib/dc.py (outdated)

headers, _ = self._effective_signals_schema.get_headers_with_length()
column_names = [".".join(filter(None, header)) for header in headers]

results_iter = self.collect_flatten()
I would just remove this variable and do for row in self.collect_flatten() below, to avoid having to look up what results_iter actually represents ... but this is not that important, your call.
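In isolation, the suggested change looks like this (collect_flatten is stubbed out here, since the real method lives on the chain object):

```python
def collect_flatten():
    # Stand-in for DataChain.collect_flatten(), which yields flattened rows.
    yield from ([1, "a"], [2, "b"])

# Instead of binding an intermediate name:
#   results_iter = collect_flatten()
#   for row in results_iter: ...
# iterate over the call directly:
for row in collect_flatten():
    print(row)
```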
Sure, changed to just using self.collect_flatten() directly.
# This makes the file JSON instead of JSON lines.
f.write(b"\n]\n")

def to_jsonl(
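For context, a minimal sketch of the distinction that closing bracket creates, streaming rows so the output is not bounded by memory. This is illustrative only, not the PR's actual implementation; as_json_array is a made-up parameter name:

```python
import io
import json

def dump_stream(f, rows, as_json_array=True):
    # Stream one row at a time so output size is not bounded by memory.
    first = True
    for row in rows:
        if as_json_array:
            f.write(b"[\n" if first else b",\n")  # open array / separate items
        f.write(json.dumps(row).encode("utf-8"))
        if not as_json_array:
            f.write(b"\n")  # newline-delimited output -> JSON Lines
        first = False
    if as_json_array:
        if first:
            f.write(b"[")  # no rows at all: still emit a valid empty array
        f.write(b"\n]\n")  # closing the outer list makes the file regular JSON

buf = io.BytesIO()
dump_stream(buf, [{"a": 1}, {"a": 2}])
print(buf.getvalue().decode())  # a valid JSON array, one object per line
```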
It seems a little redundant to have this method if the same can be achieved with to_json() by setting the include_outer_list flag. I'm not sure what the optimal solution is, though ... maybe move all the logic from to_json to some non-public method and make to_json() and to_jsonl() just public wrappers around it that pass the correct include_outer_list flag (similar to the current to_jsonl()). Another solution is to just remove to_jsonl().

But again, this is also not that important, and neither solution seems perfect either.
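A sketch of the wrapper layout that comment describes; _to_json_impl is a hypothetical name for the shared non-public method, not something from the actual codebase:

```python
class DataChain:
    def _to_json_impl(self, path, include_outer_list=True):
        ...  # shared streaming-write logic would live here

    def to_json(self, path):
        # Regular JSON: all rows wrapped in one outer list.
        return self._to_json_impl(path, include_outer_list=True)

    def to_jsonl(self, path):
        # JSON Lines: one object per line, no outer list.
        return self._to_json_impl(path, include_outer_list=False)
```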
Yeah, I wasn't sure what the best approach was either, but this one seems to have the least extra code (making both to_json and to_jsonl wrappers adds extra wrapper code), and also the most consistency / user-friendliness: we have from_json and from_jsonl, so I wanted to keep the to functions consistent with them.
Deploying datachain-documentation with Cloudflare Pages

Latest commit: ae9a541
Status: ✅ Deploy successful!
Preview URL: https://78371a4f.datachain-documentation.pages.dev
Branch Preview URL: https://dtulga-json-export.datachain-documentation.pages.dev
This PR adds the functions to_json and to_jsonl to export to JSON / JSON Lines from DataChain. These functions support remote fsspec filesystems / URLs (such as s3://), similar to to_csv and to_parquet, and implement streaming writes as well, to enable writing data larger than memory. This also adds tests for these functions, tests for the existing from_json and from_jsonl functions, and tests for both of these on remote filesystems (such as S3).

This has been tested locally and fixes #533.
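A usage sketch of the exporters this PR describes; the bucket, file paths, and exact signatures beyond the path argument are assumptions for illustration:

```python
from datachain.lib.dc import DataChain

# Build a chain from existing JSON data (hypothetical bucket and paths).
chain = DataChain.from_json("s3://my-bucket/input.json")

# Export as a single JSON document (rows wrapped in one outer list);
# writes stream, so the output can be larger than memory.
chain.to_json("s3://my-bucket/export/output.json")

# Export as JSON Lines: one object per line.
chain.to_jsonl("s3://my-bucket/export/output.jsonl")
```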