
Auto load json cols #444

Merged 5 commits into main from load_json on Sep 18, 2024
Conversation

dberenbaum (Contributor)

Extracted from #441.

Python dict values are converted to json columns but read back as strings instead of loading the json. This PR loads the json values before returning them so that values saved as dicts are returned as dicts.
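The fix can be illustrated with a plain-Python sketch. The `maybe_load_json` helper below is hypothetical, not DataChain's actual code: it stands in for the reader-side step this PR adds, applied only to values coming out of JSON columns.

```python
import json

def maybe_load_json(raw):
    # Hypothetical reader-side conversion: a value stored in a JSON column
    # comes back from the database as a string, so load it before returning.
    # (In practice this would only run for columns declared as JSON/dict.)
    if isinstance(raw, str):
        return json.loads(raw)
    return raw

stored = json.dumps({"a": 1})               # how a dict value is persisted
assert maybe_load_json(stored) == {"a": 1}  # returned as a dict again
```

Without the loading step, callers would receive the raw string `'{"a": 1}'` instead of the dict they originally saved.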

@dberenbaum dberenbaum requested a review from a team September 13, 2024 15:34
@dberenbaum dberenbaum mentioned this pull request Sep 13, 2024
codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload a report for BASE (main@ee43fd1). Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #444   +/-   ##
=======================================
  Coverage        ?   86.78%           
=======================================
  Files           ?       93           
  Lines           ?     9782           
  Branches        ?     2023           
=======================================
  Hits            ?     8489           
  Misses          ?      936           
  Partials        ?      357           
Flag        Coverage Δ
datachain   86.72% <100.00%> (?)

Flags with carried forward coverage won't be shown.


dberenbaum (Contributor, Author)

Looks like this will require a Studio change also. I'm having trouble getting those tests to run, but I think the change will need to be here.

dberenbaum (Contributor, Author)

Studio test failures should be covered by https://github.com/iterative/studio/pull/10656

dtulga (Contributor) left a comment

LGTM, thanks for adding this! I would also recommend that https://github.com/iterative/studio/pull/10656 is merged in quick succession to this PR, to avoid test failures in the Studio tests.

dberenbaum (Contributor, Author)

Thanks @dtulga! I will leave it to you and the team to merge this and the companion PR so we don't end up with broken tests.

cloudflare-workers-and-pages bot commented Sep 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 1170b9d
Status: ✅  Deploy successful!
Preview URL: https://9f197af7.datachain-documentation.pages.dev
Branch Preview URL: https://load-json.datachain-documentation.pages.dev


@dtulga dtulga self-assigned this Sep 17, 2024
dreadatour (Contributor) left a comment

Looks good to me, thank you! 🙏
@dtulga are you going to handle this?

dtulga (Contributor) commented Sep 18, 2024

I am working on this, yes, although it appears we no longer support this feature: for it to work, a column has to be marked as the JSON type from datachain.sql.types. However, new UDFs using the DataChain API cannot have an output column of this type; I get this error:

datachain.lib.udf_signature.UdfSignatureError: processor signature error: output type 'JSON' of signal 'json_col' is not supported. Please use DataModel types: BaseModel, int, str, float, bool, list, dict, bytes, datetime

This means the code that converts the JSON string back into a dict is never called. (I found this while fixing the tests for this PR.)
In addition, the old-style udf decorator was removed in #438, which also deleted the tests for this feature and for this type. With DatasetQuery also removed, there doesn't seem to be any way to use the JSON sql type anymore (and possibly none of the sql types), so this feature cannot work as currently designed.

dreadatour (Contributor) commented Sep 18, 2024

> Which means that the code to convert the JSON string back into a dict is never called. (This was found while working on fixing the tests for this PR.) As well, the old-style udf decorator has been removed in #438 which also deleted the tests for this feature and the tests using this type. And with DatasetQuery also being removed, there doesn't seem to be any way of using the JSON sql type anymore (and possibly none of the sql types anymore), which means this feature cannot work as described in the current design.

Oh my. This looks like a bigger issue now than it was before 😢 Should we create a new GH issue for this and close these PRs? 🤔

dtulga (Contributor) commented Sep 18, 2024

That makes sense to me, but I'm not really sure what the plan was for this feature, or what the plan should be going forward.

mattseddon (Member) commented Sep 18, 2024

If it helps, I moved test_udf_different_types from tests/func/test_dataset_query.py to tests/func/test_datachain.py. Edit: no, it doesn't help.

Would the conversion be that dict is auto-loaded?

This test currently passes:

```python
@pytest.mark.parametrize(
    "cloud_type,version_aware",
    [("s3", True)],
    indirect=True,
)
def test_udf_different_types(cloud_test_catalog):
    obj = {"name": "John", "age": 30}

    def test_types():
        return {"a": 1}

    dc = (
        DataChain.from_storage(
            cloud_test_catalog.src_uri, session=cloud_test_catalog.session
        )
        .filter(C("file.path").glob("*cat1"))
        .map(
            test_types,
            params=[],
            output={
                "dict_col": dict,
            },
        )
    )

    results = dc.select("dict_col").results()

    assert results == [(json.dumps({"a": 1}),)]
```
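The final assertion above pins down the pre-change behaviour: the stored dict comes back as its JSON string, not as a dict. A stdlib-only illustration of the two sides of that equality (not DataChain code):

```python
import json

saved = {"a": 1}
stored = json.dumps(saved)  # how the dict is persisted in the JSON column

# Before this PR: results() hands back the raw JSON string
assert stored == '{"a": 1}'

# After this PR: the string is loaded before being returned
assert json.loads(stored) == saved
```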

@mattseddon
Copy link
Member

I looked into this a bit further. After merging main into this PR we can update the test_udf_different_types test to be:

```python
@pytest.mark.parametrize(
    "cloud_type,version_aware",
    [("s3", True)],
    indirect=True,
)
def test_udf_different_types(cloud_test_catalog):
    obj = {"name": "John", "age": 30}

    def test_types():
        return (
            5,
            5,
            5,
            0.5,
            0.5,
            0.5,
            [0.5],
            [[0.5], [0.5]],
            [0.5],
            [0.5],
            "s",
            True,
            {"a": 1},
            pickle.dumps(obj),
        )

    dc = (
        DataChain.from_storage(
            cloud_test_catalog.src_uri, session=cloud_test_catalog.session
        )
        .filter(C("file.path").glob("*cat1"))
        .map(
            test_types,
            params=[],
            output={
                "int_col": int,
                "int_col_32": int,
                "int_col_64": int,
                "float_col": float,
                "float_col_32": float,
                "float_col_64": float,
                "array_col": list[float],
                "array_col_nested": list[list[float]],
                "array_col_32": list[float],
                "array_col_64": list[float],
                "string_col": str,
                "bool_col": bool,
                "dict_col": dict,
                "binary_col": bytes,
            },
        )
    )

    results = dc.to_records()
    col_values = [
        (
            r["int_col"],
            r["int_col_32"],
            r["int_col_64"],
            r["float_col"],
            r["float_col_32"],
            r["float_col_64"],
            r["array_col"],
            r["array_col_nested"],
            r["array_col_32"],
            r["array_col_64"],
            r["string_col"],
            r["bool_col"],
            r["dict_col"],
            pickle.loads(r["binary_col"]),  # noqa: S301
        )
        for r in results
    ]

    assert col_values == [
        (
            5,
            5,
            5,
            0.5,
            0.5,
            0.5,
            [0.5],
            [[0.5], [0.5]],
            [0.5],
            [0.5],
            "s",
            True,
            {"a": 1},
            obj,
        )
    ]
```

This test does not pass without this change.

dberenbaum (Contributor, Author)

> Would the conversion be that dict is auto-loaded?

That's what I had in mind. AFAIK we used to require SQL types like JSON as output types but now expect Python types like dict.

dtulga (Contributor) commented Sep 18, 2024

Yep, looks like dict is the correct solution. I have updated this PR to merge from main and changed the tests to use dict for the JSON columns instead. Thanks!

@dtulga dtulga merged commit 16c2729 into main Sep 18, 2024
38 checks passed
@dtulga dtulga deleted the load_json branch September 18, 2024 21:58