
Auto load json cols #444

Merged 5 commits into main from load_json on Sep 18, 2024
Conversation

dberenbaum (Contributor)

Extracted from #441.

Python dict values are converted to json columns but read back as strings instead of loading the json. This PR loads the json values before returning them so that values saved as dicts are returned as dicts.
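The fix can be illustrated with a plain-Python sketch. The `maybe_load_json` helper below is hypothetical, not DataChain's actual code: it stands in for the reader-side step this PR adds, applied only to values coming out of JSON columns.

```python
import json

def maybe_load_json(raw):
    # Hypothetical reader-side conversion: a value stored in a JSON column
    # comes back from the database as a string, so load it before returning.
    # (In practice this would only run for columns declared as JSON/dict.)
    if isinstance(raw, str):
        return json.loads(raw)
    return raw

stored = json.dumps({"a": 1})               # how a dict value is persisted
assert maybe_load_json(stored) == {"a": 1}  # returned as a dict again
```

Without the loading step, callers would receive the raw string `'{"a": 1}'` instead of the dict they originally saved.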

@dberenbaum dberenbaum requested a review from a team September 13, 2024 15:34
@dberenbaum dberenbaum mentioned this pull request Sep 13, 2024
codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload a report for BASE (main@ee43fd1). Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #444   +/-   ##
=======================================
  Coverage        ?   86.78%           
=======================================
  Files           ?       93           
  Lines           ?     9782           
  Branches        ?     2023           
=======================================
  Hits            ?     8489           
  Misses          ?      936           
  Partials        ?      357           
Flag        Coverage Δ
datachain   86.72% <100.00%> (?)

Flags with carried forward coverage won't be shown.


dberenbaum (Contributor, Author)

Looks like this will require a Studio change also. I'm having trouble getting those tests to run, but I think the change will need to be here.

dberenbaum (Contributor, Author)

Studio test failures should be covered by https://github.com/iterative/studio/pull/10656

dtulga (Contributor) left a comment

LGTM, thanks for adding this! I would also recommend that https://github.com/iterative/studio/pull/10656 is merged in quick succession to this PR, to avoid test failures in the Studio tests.

dberenbaum (Contributor, Author)

Thanks @dtulga! I will leave it to you and the team to merge this and the companion PR so we don't end up with broken tests.

cloudflare-workers-and-pages bot commented Sep 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 1170b9d
Status: ✅  Deploy successful!
Preview URL: https://9f197af7.datachain-documentation.pages.dev
Branch Preview URL: https://load-json.datachain-documentation.pages.dev


@dtulga dtulga self-assigned this Sep 17, 2024
dreadatour (Contributor) left a comment

Looks good to me, thank you! 🙏
@dtulga are you going to handle this?

dtulga (Contributor) commented Sep 18, 2024

I am working on this, yes, although it appears we no longer support this feature: for it to work, a column has to be marked as the JSON type from datachain.sql.types. However, new UDFs using the DataChain API cannot have an output column of this type; I get this error:

datachain.lib.udf_signature.UdfSignatureError: processor signature error: output type 'JSON' of signal 'json_col' is not supported. Please use DataModel types: BaseModel, int, str, float, bool, list, dict, bytes, datetime

This means the code that converts the JSON string back into a dict is never called. (I found this while fixing the tests for this PR.)
In addition, the old-style udf decorator was removed in #438, which also deleted the tests for this feature and for this type. With DatasetQuery also removed, there doesn't seem to be any way to use the JSON sql type anymore (and possibly none of the sql types), so this feature cannot work as currently designed.

dreadatour (Contributor) commented Sep 18, 2024

> Which means that the code to convert the JSON string back into a dict is never called. (This was found while working on fixing the tests for this PR.) As well, the old-style udf decorator has been removed in #438 which also deleted the tests for this feature and the tests using this type. And with DatasetQuery also being removed, there doesn't seem to be any way of using the JSON sql type anymore (and possibly none of the sql types anymore), which means this feature cannot work as described in the current design.

Oh my. This looks like a bigger issue now than it was before 😢 Should we create a new GH issue for this and close these PRs? 🤔

dtulga (Contributor) commented Sep 18, 2024

That makes sense to me, but I'm not really sure what the plan was for this feature, or what the plan should be going forward.

mattseddon (Member) commented Sep 18, 2024

If it helps, I moved test_udf_different_types from tests/func/test_dataset_query.py to tests/func/test_datachain.py. Edit: no, it doesn't help.

Would the conversion be that dict is auto-loaded?

This test currently passes:

```python
@pytest.mark.parametrize(
    "cloud_type,version_aware",
    [("s3", True)],
    indirect=True,
)
def test_udf_different_types(cloud_test_catalog):
    obj = {"name": "John", "age": 30}

    def test_types():
        return {"a": 1}

    dc = (
        DataChain.from_storage(
            cloud_test_catalog.src_uri, session=cloud_test_catalog.session
        )
        .filter(C("file.path").glob("*cat1"))
        .map(
            test_types,
            params=[],
            output={
                "dict_col": dict,
            },
        )
    )

    results = dc.select("dict_col").results()

    assert results == [(json.dumps({"a": 1}),)]
```
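The final assertion above pins down the pre-change behaviour: the stored dict comes back as its JSON string, not as a dict. A stdlib-only illustration of the two sides of that equality (not DataChain code):

```python
import json

saved = {"a": 1}
stored = json.dumps(saved)  # how the dict is persisted in the JSON column

# Before this PR: results() hands back the raw JSON string
assert stored == '{"a": 1}'

# After this PR: the string is loaded before being returned
assert json.loads(stored) == saved
```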

@mattseddon
Copy link
Member

I looked into this a bit further. After merging main into this PR we can update the test_udf_different_types test to be:

```python
@pytest.mark.parametrize(
    "cloud_type,version_aware",
    [("s3", True)],
    indirect=True,
)
def test_udf_different_types(cloud_test_catalog):
    obj = {"name": "John", "age": 30}

    def test_types():
        return (
            5,
            5,
            5,
            0.5,
            0.5,
            0.5,
            [0.5],
            [[0.5], [0.5]],
            [0.5],
            [0.5],
            "s",
            True,
            {"a": 1},
            pickle.dumps(obj),
        )

    dc = (
        DataChain.from_storage(
            cloud_test_catalog.src_uri, session=cloud_test_catalog.session
        )
        .filter(C("file.path").glob("*cat1"))
        .map(
            test_types,
            params=[],
            output={
                "int_col": int,
                "int_col_32": int,
                "int_col_64": int,
                "float_col": float,
                "float_col_32": float,
                "float_col_64": float,
                "array_col": list[float],
                "array_col_nested": list[list[float]],
                "array_col_32": list[float],
                "array_col_64": list[float],
                "string_col": str,
                "bool_col": bool,
                "dict_col": dict,
                "binary_col": bytes,
            },
        )
    )

    results = dc.to_records()
    col_values = [
        (
            r["int_col"],
            r["int_col_32"],
            r["int_col_64"],
            r["float_col"],
            r["float_col_32"],
            r["float_col_64"],
            r["array_col"],
            r["array_col_nested"],
            r["array_col_32"],
            r["array_col_64"],
            r["string_col"],
            r["bool_col"],
            r["dict_col"],
            pickle.loads(r["binary_col"]),  # noqa: S301
        )
        for r in results
    ]

    assert col_values == [
        (
            5,
            5,
            5,
            0.5,
            0.5,
            0.5,
            [0.5],
            [[0.5], [0.5]],
            [0.5],
            [0.5],
            "s",
            True,
            {"a": 1},
            obj,
        )
    ]
```

This test does not pass without this change.

dberenbaum (Contributor, Author)

> Would the conversion be that dict is auto-loaded?

That's what I had in mind. AFAIK we used to require SQL types like JSON as output types but now expect Python types like dict.

dtulga (Contributor) commented Sep 18, 2024

Yep, looks like dict is the correct solution. I have updated this PR to merge from main and changed the tests to use dict for the JSON columns instead. Thanks!

@dtulga dtulga merged commit 16c2729 into main Sep 18, 2024
38 checks passed
@dtulga dtulga deleted the load_json branch September 18, 2024 21:58