Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Struct json_decode/3 for decoding json from string #841

Merged
merged 6 commits into from
Feb 1, 2024

Conversation

lkarthee
Copy link
Member

iex(1)> require Explorer.DataFrame, as: DF
Explorer.DataFrame
iex(2)> d = DF.new([%{a: "{\"n\": 1}"}])
#Explorer.DataFrame<
  Polars[1 x 1]
  a string ["{\"n\": 1}"]
>
iex(3)> DF.mutate(d, aj: json_decode(a))
#Explorer.DataFrame<
  Polars[1 x 2]
  a string ["{\"n\": 1}"]
  aj struct[1] [%{"n" => 1}]
>

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the string is not valid JSON?

@lkarthee
Copy link
Member Author

lkarthee commented Jan 29, 2024

Currently I get error unless I pass dtype: {:struct, %{"n" => {:s, 64}}} - due to out_df and check_df mismatch. When dtype is not passed, it is equivalent of from_list() scenario.

Please advise on how to handle this:

  • Should we accept the check_df.dtypes returned by polars in case of :df_mutate_with_exprs ?
  • Should we narrow down check to don't worry about {:struct, _} when out_df.dtypes != check_df.dtypes and accept check_df if everything else matches ?
** (RuntimeError) DataFrame mismatch.

expected:

    names: ["a", "aj"]
    dtypes: %{"a" => :string, "aj" => nil}

got:

    names: ["a", "aj"]
    dtypes: %{"a" => :string, "aj" => {:struct, %{"n" => {:s, 64}}}}

    (explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:62: Explorer.PolarsBackend.Shared.apply_dataframe/4
    iex:3: (file)

def apply_dataframe(%DataFrame{} = df, %DataFrame{} = out_df, fun, args) do
case apply(Native, fun, [df.data | args]) do
{:ok, %module{} = new_df} when module in @polars_df ->
if @check_frames do
# We need to collect here, because the lazy frame may not have
# the full picture of the result yet.
check_df =
if match?(%PolarsLazyFrame{}, new_df) do
{:ok, new_df} = Native.lf_collect(new_df)
create_dataframe(new_df)
else
create_dataframe(new_df)
end
if Enum.sort(out_df.names) != Enum.sort(check_df.names) or
out_df.dtypes != check_df.dtypes do
raise """
DataFrame mismatch.
expected:
names: #{inspect(out_df.names)}
dtypes: #{inspect(out_df.dtypes)}
got:
names: #{inspect(check_df.names)}
dtypes: #{inspect(check_df.dtypes)}
"""
end
end
%{out_df | data: new_df}
{:error, error} ->
raise runtime_error(error)
end
end

@lkarthee
Copy link
Member Author

What happens if the string is not valid JSON?

Raises a polars error.

iex(6)> d = DF.new([%{a: "{\"n\": 1"}])
iex(7)> DF.mutate(d, aj: json_decode(a, dtype: {:struct, %{"n" => {:s, 64}}}))
** (RuntimeError) Polars Error: error deserializing JSON: json parsing error: 'ExpectedObjectContent at character 8 (']')'
    (explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:81: Explorer.PolarsBackend.Shared.apply_dataframe/4
    iex:7: (file)

@josevalim
Copy link
Member

I would start with the simplest API possible:

  1. no options
  2. document it errors on invalid json

Update expressions.rs

add tests

Update series.ex
@@ -36,6 +36,9 @@ object_store = { version = "0.8", default-features = false, optional = true }
[target.'cfg(not(any(all(windows, target_env = "gnu"), all(target_os = "linux", target_env = "musl"))))'.dependencies]
mimalloc = { version = "*", default-features = false }

[patch.crates-io]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context here - pola-rs/polars#14008

lib/explorer/polars_backend/shared.ex Outdated Show resolved Hide resolved
@@ -172,6 +173,11 @@ defmodule Explorer.PolarsBackend.Expression do
raise ArgumentError, "missing #{inspect(__MODULE__)} nodes: #{inspect(missing)}"
end

def to_expr(%LazySeries{op: :from_json, args: _}) do
Copy link
Member Author

@lkarthee lkarthee Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def to_expr(%LazySeries{op: :from_json, args: _}) do
def to_expr(%LazySeries{op: :from_json}) do

@josevalim If i drop args: _ i get a lot of warnings even though all tests pass. Why does this happen ?

Log containing warnings
09:59 $ mix ci
    Finished dev [unoptimized + debuginfo] target(s) in 0.42s
Compiling 3 files (.ex)
     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

     warning: this clause cannot match because a previous clause at line 265 always matches
     │
 314def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/explorer/polars_backend/expression.ex:314

Compiling crate explorer in debug mode (native/explorer)
   Compiling explorer v0.1.0 (explorer/native/explorer)
    Finished dev [unoptimized + debuginfo] target(s) in 5.26s
Excluding tags: [:cloud_integration]

.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................*................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Finished in 2.8 seconds (2.8s async, 0.00s sync)
582 doctests, 1390 tests, 0 failures, 18 excluded, 1 skipped
</details>

@josevalim
Copy link
Member

josevalim commented Jan 31, 2024

I think supporting option 2 is a low hanging fruit:

Unfortunately I think this is inconsistent. All of our from_* operations work on data types outside of Series and DataFrame. So we would be breaking a rule here.

Therefore, if we introduce from_json (or read_json), the argument must be either a file or a string. Ideally we would support it in the dataframe, because all of our from_json (or read_json) functions are defined in the dataframe. However, from_json supports files and URLs, so it is quite some work, but it would be consistent with everything else including polars itself: https://docs.pola.rs/py-polars/html/reference/api/polars.read_json.html

So, after sleeping on it, I think we should first add read_json/write_json, and then from_json and to_json. If it is too much work, let us know, and @philss can pick it up soon (he pretty much wrote all file integration stuff :D).

@lkarthee
Copy link
Member Author

lkarthee commented Jan 31, 2024

@josevalim I can contribute an hour a day for few more months. Now that I am familiar explorer/polars - it does not take much time for implementing things from polars. Maintainers can feel free to tag me or assign PRs/bugs to me - i will try my best to resolve them.

I am confused about your comments and from_json - feel free to correct me.

Json Decode

example_file.csv

active,ask_code,ask_id,ask_price,code,data,expiry,id,index,inserted_at,item_id,level,project_id,properties,state,type,updated_at,user_id,value
True,,,,pull,{},1970-01-01T00:00:00Z,752e072c-9907-4fd0-9276-8fd30f7d1d1c,{},2023-02-24 15:41:57+00:00,e71a8d18-16c0-4efa-8442-dc17344ae5f8,1,890d0555-f266-42b2-a1ba-66925d8d47f6,"{'r_total': 0, 'u_total': 1, 'xp_total': 1}",ok,durable,2023-03-20 09:09:44+00:00,5755aa4d-d91e-4ab4-830e-791996f41a99,1

Description about data:

  • json_decode is useful in scenario where data frame is already loaded from a non json file like from .csv
  • Not all columns are json, rather some columns are json columns - rest are primitive types.

Possible Operations:

  • convert those json columns to struct pl.col('data').str.json_decode() .
  • extract data from a json path using pl.col('data').str.json_path_match('$. xp_total').alias('xp') .

Problems with having Series.json_decode as DataFrame operation:

  • we don't know dtype in advance, hence we can't use DF.mutate as apply_dataframe validates out_df.types == check_df.dtypes

Solution:

  • implement Series.json_decode as a series operation which wont work as data frame operation/expression.
  • raise error suggesting user to use Series.json_decode/2 rather than json_decode/2 in data frame mutate.

from_json

This is a complete different operation from json_decode:

  • from_json whole file is json
  • from_ndjson every line is json
  • json_decode one or more variables/columns are json. there might be columns which are just primitives, example csv file

example.json

{
  "columns": [
    {"name": "integer",  "datatype": "Int64",  "bit_settings": "",  "values": [1,  2,  3 ]},
    {"name": "float", "datatype": "Float64", "bit_settings": "",  "values": [4, 5, 6 ]},
    {"name": "string",  "datatype": "String", "bit_settings": "", "values": ["a", "b", "c"] }
  ]
}
df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "float": [4.0, 5.0, 6.0],
        "string": ["a", "b", "c"],
    }
)

df.write_json("output.json")
df_json = pl.read_json("output.json")
print(df_json)

@josevalim
Copy link
Member

Thank you for the context. I see what you mean and I don't have a good answer for it.

The best I can think of is to call it "json_decode" but always expect the "schema"/"dtype" to be given, so we don't have to guess.

@josevalim
Copy link
Member

We could but I would like to avoid "branching" for now, if that makes any sense.

@lkarthee
Copy link
Member Author

lkarthee commented Jan 31, 2024

We could but I would like to avoid "branching" for now, if that makes any sense.

Jose reply is to my question - whether we could have Series only op that way we could have avoid passing dtype?

I felt my question was partly answered in his previous reply and deleted it (without seeing him answer it). Posting it again after seeing Jose's reply.

@lkarthee
Copy link
Member Author

I removed infer_schema_length, we don't need it as we are passing dtype.

lib/explorer/series.ex Outdated Show resolved Hide resolved
lib/explorer/series.ex Outdated Show resolved Hide resolved
lib/explorer/series.ex Outdated Show resolved Hide resolved
@josevalim josevalim merged commit 42075a0 into elixir-explorer:main Feb 1, 2024
4 checks passed
@josevalim
Copy link
Member

Thank you! It was a bit bumpy but I am glad with where we arrived!

@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@lkarthee lkarthee deleted the struct_json_decode branch February 1, 2024 11:32
@lkarthee
Copy link
Member Author

lkarthee commented Feb 1, 2024

Thank you! It was a bit bumpy but I am glad with where we arrived!

Thank you Jose, I don't mind bumpy rides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants