Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: JSON data types #782

Closed
andrioni opened this issue Jan 20, 2016 · 12 comments
Closed

feat: JSON data types #782

andrioni opened this issue Jan 20, 2016 · 12 comments
Labels
datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) feature Features or general enhancements
Milestone

Comments

@andrioni
Copy link
Contributor

Hi, I was wondering what would be the best way to treat JSON/JSONB columns from Postgres in Ibis. Currently I'm trying to treat it as a string (although it gets converted to a dict whenever I send the data to a pandas dataframe), but I'm not sure if that's the right solution.

@wesm
Copy link
Member

wesm commented Feb 5, 2016

Good question! What are you trying to do with the data? I haven't thought through how we could incorporate schema-less JSON into Ibis's data model. For nested data in Hive, Impala, Spark SQL, etc., it's a little more clear what should be done.

@wesm wesm added this to the Future milestone Feb 5, 2016
@cpcloud
Copy link
Member

cpcloud commented May 4, 2016

FWIW, Presto has a JSON type stored as a string. The main purpose it serves is to indicate that the underlying string is valid JSON. For example,

select json '[a"]'

fails because it can't be parsed as valid json, whereas

select json '["a"]'

is converted to json.

@cpcloud
Copy link
Member

cpcloud commented May 15, 2016

Additionally Presto uses a subset of JSONPath to let users pick out parts of the JSON blob that they want.

An ibis-worthy API might look something like this:

t.blob_of_json.extract('$.foo.bar')
t.blob_of_json.extract('$.foo.bar', type='int64')
t.blob_of_json.extract('$.foo').cast('array<int64>')

@cpcloud
Copy link
Member

cpcloud commented May 15, 2016

We could also leverage jsonpath-rw which is a Python DSL that generates JSONPath expressions.

@cpcloud
Copy link
Member

cpcloud commented May 25, 2016

Should we limit the scope here? E.g., arbitrary JSON is out of scope, but JSON with a well-defined, uniform schema per record (including arbitrary nesting) is in scope.

The Apache Drill data model page is an excellent read and IMO a nice indication of reasonable expectations to set for JSON functionality.

@wesm thoughts?

@wesm
Copy link
Member

wesm commented May 26, 2016

Makes sense to me. I think one of the challenges will be supporting nested data generally given the diverging semantics of different SQL engines (for example: comparing Hive, Presto, and Impala).

@cpcloud
Copy link
Member

cpcloud commented Jun 15, 2016

For the record it looks like each of the systems mentioned by @wesm have three different ways of unnesting for arrays (similar for maps):

Given a table t:

CREATE TABLE t (
  elements ARRAY<STRING>
)

Hive:

SELECT element
FROM t
LATERAL VIEW explode(elements) s AS element;

Presto:

SELECT s.element
FROM t
CROSS JOIN unnest(elements) AS s (element)

Impala:

SELECT elements.item
FROM t, t.elements

I think we can come up with a reasonable API for doing these operations.

@wesm
Copy link
Member

wesm commented Jun 15, 2016

Cool thanks for looking into this. Might also be good to look at BigQuery for another data point. From the Ibis perspective, it might make the most sense to model this explicitly as a join. Expression analysis/validation may well become more complex, though.

@cpcloud
Copy link
Member

cpcloud commented Jun 16, 2016

BigQuery:

SELECT element
FROM t.elements

or

SELECT element
FROM UNNEST (t.elements)

The UNNEST is optional according to the docs

@cpcloud
Copy link
Member

cpcloud commented Jun 16, 2016

I'll make a new issue for the array API. I've started work on it by way of getting the type system in place and also added an "array_agg" function (called collect), indexing/slicing, and length computation. More details in the issue.

@twiecki
Copy link

twiecki commented Nov 27, 2018

Just upvoting that JSON/JSONB support would be a nice feature.

@datapythonista datapythonista removed this from the Future milestone Nov 13, 2020
@cpcloud cpcloud changed the title JSON data types feat: JSON data types Dec 29, 2021
@cpcloud cpcloud added the feature Features or general enhancements label Dec 29, 2021
@cpcloud cpcloud added this to the 4.0.0 milestone Apr 19, 2022
@cpcloud cpcloud modified the milestones: 4.0.0, 4.x Aug 31, 2022
@cpcloud cpcloud added the datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) label Sep 2, 2022
@cpcloud
Copy link
Member

cpcloud commented Sep 21, 2022

We have JSON and JSONB types now, and with 3e2efb4 we have a minimal API for extracting object values or array elements. The remaining work is implement feature requests for additional APIs, so I'm going to close this out as completed.

@cpcloud cpcloud closed this as completed Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) feature Features or general enhancements
Projects
None yet
Development

No branches or pull requests

5 participants