
Adding JSON / JSON Lines Export Support #538

Merged
merged 2 commits into main from dtulga/json-export on Oct 26, 2024
Conversation

dtulga (Contributor) commented Oct 24, 2024

This PR adds the functions `to_json` and `to_jsonl` to export data from DataChain to JSON / JSON Lines. Like `to_csv` and `to_parquet`, these functions support remote fsspec filesystems / URLs (such as `s3://`), and they implement streaming writes to enable exporting data larger than memory. This PR also adds tests for these functions, tests for the existing `from_json` and `from_jsonl` functions, and tests for both on remote filesystems (such as S3).

This has been tested locally and fixes #533
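To illustrate the streaming-write idea described above, here is a minimal sketch (not DataChain's actual implementation; the `write_jsonl` name and signature are hypothetical, and the binary file object stands in for what `fsspec.open(url, "wb")` would return for a remote URL):

```python
import json
from typing import IO, Iterable


def write_jsonl(f: IO[bytes], rows: Iterable[dict]) -> None:
    """Stream rows out one at a time, so the full dataset never sits in memory."""
    for row in rows:
        f.write(json.dumps(row).encode("utf-8"))
        f.write(b"\n")
```

With fsspec, `f` could come from a context manager such as `fsspec.open("s3://bucket/out.jsonl", "wb")`, which is how remote URLs would be handled.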

@dtulga dtulga self-assigned this Oct 24, 2024
codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.83%. Comparing base (34e7c2b) to head (ae9a541).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #538      +/-   ##
==========================================
+ Coverage   87.39%   87.83%   +0.43%     
==========================================
  Files          97       97              
  Lines       10197    10235      +38     
  Branches     1396     1405       +9     
==========================================
+ Hits         8912     8990      +78     
+ Misses        923      884      -39     
+ Partials      362      361       -1     
Flag Coverage Δ
datachain 87.77% <100.00%> (+0.43%) ⬆️


@dtulga dtulga marked this pull request as ready for review October 24, 2024 20:15
@dtulga dtulga requested a review from a team October 24, 2024 20:15
opener = fsspec_fs.open

headers, _ = self._effective_signals_schema.get_headers_with_length()
column_names = [".".join(filter(None, header)) for header in headers]
Member commented:

Since JSON supports nesting, should we dump it with proper field names, like we do in pandas?

That way, at least in simple cases, we can read some nested JSON (which creates a nested model), dump it later in the same way, and get the same result.

Contributor (Author) commented:

Sure, I changed it to write a nested dict for JSON / JSON Lines instead of the flattened version. I was originally flattening just to stay consistent with `to_csv` and `to_parquet`, but I can see how nesting works well for JSON / JSON Lines here.
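To illustrate the nested output discussed here, a helper could rebuild a nested dict from the flattened header paths (a sketch only; `unflatten` is not part of DataChain, and the header format — tuples of name parts — is assumed from the `get_headers_with_length()` snippet above):

```python
from typing import Any, Sequence


def unflatten(headers: Sequence[Sequence[str]], row: Sequence[Any]) -> dict:
    """Rebuild a nested dict from flattened header paths and one row of values."""
    out: dict = {}
    for path, value in zip(headers, row):
        node = out
        # Walk/create intermediate dicts for all but the last path component.
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return out
```

The flattened form would instead join each path with dots (`"file.path"`), as in the `column_names` line quoted above.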

@ilongin (Contributor) left a comment:

LGTM

headers, _ = self._effective_signals_schema.get_headers_with_length()
column_names = [".".join(filter(None, header)) for header in headers]

results_iter = self.collect_flatten()
Contributor commented:
I would just remove this variable and do `for row in self.collect_flatten()` below, to avoid having to work out what `results_iter` actually represents ... but this is not that important; your call.

Contributor (Author) commented:
Sure, changed to using `self.collect_flatten()` directly.

# This makes the file JSON instead of JSON lines.
f.write(b"\n]\n")

def to_jsonl(
@ilongin (Contributor) commented Oct 25, 2024:
It seems a little redundant to have this method if the same can be achieved with `to_json()` by setting the `include_outer_list` flag. I'm not sure what the optimal solution is, though ... maybe move all the logic from `to_json` into a non-public method and make the public `to_json()` and `to_jsonl()` just wrappers around it that pass the correct `include_outer_list` flag (similar to the current `to_jsonl()`). Another option is to just remove `to_jsonl()`.
But again, this is also not that important, and neither solution seems perfect either.

Contributor (Author) commented:
Yeah, I wasn't sure what the best approach was either, but this one seems to have the least extra code (making both `to_json` and `to_jsonl` wrappers adds extra wrapper code). It is also the most consistent / user-friendly option: since we have `from_json` and `from_jsonl`, I wanted to keep the `to_` functions consistent with them.
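The wrapper design discussed above can be sketched like this (assumptions: `_write_json` and the exact handling of `include_outer_list` are illustrative stand-ins, not DataChain's real API; the real methods also take an output path and open it via fsspec):

```python
import json
from typing import IO, Iterable


def _write_json(f: IO[bytes], rows: Iterable[dict], include_outer_list: bool) -> None:
    """Shared streaming writer: a JSON array when include_outer_list is set,
    otherwise newline-delimited JSON Lines."""
    first = True
    if include_outer_list:
        f.write(b"[\n")
    for row in rows:
        if not first:
            # Array elements are comma-separated; JSONL rows are newline-separated.
            f.write(b",\n" if include_outer_list else b"\n")
        f.write(json.dumps(row).encode("utf-8"))
        first = False
    if include_outer_list:
        # This makes the file JSON instead of JSON Lines.
        f.write(b"\n]\n")
    else:
        f.write(b"\n")


def to_json(f: IO[bytes], rows: Iterable[dict]) -> None:
    _write_json(f, rows, include_outer_list=True)


def to_jsonl(f: IO[bytes], rows: Iterable[dict]) -> None:
    _write_json(f, rows, include_outer_list=False)
```

Both public functions stay one line each, while the streaming logic lives in the single non-public helper, which is the trade-off ilongin describes.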

cloudflare-workers-and-pages bot commented Oct 26, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: ae9a541
Status: ✅  Deploy successful!
Preview URL: https://78371a4f.datachain-documentation.pages.dev
Branch Preview URL: https://dtulga-json-export.datachain-documentation.pages.dev


@dtulga dtulga merged commit cfe3d9c into main Oct 26, 2024
38 checks passed
@dtulga dtulga deleted the dtulga/json-export branch October 26, 2024 01:41
Successfully merging this pull request may close these issues: Add JSON / JSON lines export support