Motivation, performance and intended use? #11
Comments
It looks like these benchmarks could already be built using … I don't have as good a sense for serialization speed, but I'd be tempted to guess that avoiding formatting integers etc. is a similar win. Relative size: I assume compressed JSON should be close to native OVT.
Yes, but we could push this compute onto user frontends and just transfer open-variant across the wire if it's a standard. In that sense, the transfer may be more efficient and the presentation can be lazy.
@samuelcolvin There is some discussion of performance throughout this presentation: https://www.youtube.com/watch?v=jtjOfggD4YY
Overall, I haven't gotten far enough to have done much benchmarking. If you have some time and motivation, you are more than welcome to do some experiments and share your results. I'll try to share my thoughts so far and answer your questions.
Not 100% sure what you mean by vectorized. When I think vectorized I think operating on columnar data, and this is decidedly not columnar. But the idea is that we would have fast kernels for extracting data into a columnar format, and from there we would have fast vectorized operations. So my initial idea is that an expression like ovt_column->'foo' = 42 would be evaluated in two phases: first extract the field into a columnar array, then run a vectorized comparison over it.
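To make that concrete, here is a minimal sketch of the two-phase shape. `VariantValue`, `get_i64`, `extract_i64_column`, and `eq_scalar` are made-up stand-ins, not the real decoder API; a real implementation would build Arrow arrays and use Arrow compute kernels rather than plain `Vec`s:

```rust
use std::collections::HashMap;

// A stand-in for a decoded variant value; the real format stores metadata +
// value bytes, but an already-decoded map is enough to show the shape of the idea.
struct VariantValue {
    fields: HashMap<String, i64>,
}

impl VariantValue {
    fn get_i64(&self, key: &str) -> Option<i64> {
        self.fields.get(key).copied()
    }
}

// Phase 1: a kernel that extracts one field from every row into a columnar buffer
// (in practice this would produce something like an Arrow Int64Array).
fn extract_i64_column(rows: &[VariantValue], key: &str) -> Vec<Option<i64>> {
    rows.iter().map(|v| v.get_i64(key)).collect()
}

// Phase 2: a vectorized comparison over the extracted column
// (in practice an Arrow compute kernel, e.g. an equality comparison).
fn eq_scalar(column: &[Option<i64>], rhs: i64) -> Vec<bool> {
    column.iter().map(|v| *v == Some(rhs)).collect()
}

fn main() {
    let rows = vec![
        VariantValue { fields: HashMap::from([("foo".to_string(), 42)]) },
        VariantValue { fields: HashMap::from([("foo".to_string(), 7)]) },
        VariantValue { fields: HashMap::new() },
    ];
    // Roughly the shape of `ovt_column->'foo' = 42`:
    let column = extract_i64_column(&rows, "foo");
    let mask = eq_scalar(&column, 42);
    println!("{:?}", mask); // [true, false, false]
}
```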
I think the question is: what format are you serializing/deserializing from? It's hard to answer without knowing that. My intention for use is that unstructured data will be written to OVT. Then OVT will be used for querying.
Basically, you get the space savings of moving all keys into a dictionary. Otherwise, I would think it's roughly similar. But I don't have any benchmarks yet.
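As a rough back-of-the-envelope illustration of that key-dictionary saving (made-up keys and row count, not the actual OVT encoding):

```rust
fn main() {
    let keys = ["timestamp", "user_id", "event_type"];
    let rows = 1_000;

    // JSON repeats every key string in every row.
    let json_key_bytes: usize = rows * keys.iter().map(|k| k.len()).sum::<usize>();

    // A dictionary stores each key string once; rows refer to keys by a small
    // integer id (assume 1-byte ids, since this dictionary is tiny).
    let dict_bytes: usize = keys.iter().map(|k| k.len()).sum::<usize>();
    let id_bytes: usize = rows * keys.len();

    println!("repeated JSON keys: ~{json_key_bytes} bytes");          // ~26,000
    println!("dictionary + ids:   ~{} bytes", dict_bytes + id_bytes); // ~3,026
}
```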
I guess I imagine that eventually I'd have some direct way to print. Then the cost of formatting wouldn't be much different from the cost of formatting JSON to a terminal or HTML or whatever.
Thanks so much @wjones127, this is useful. I've just watched the video you linked to, which was illuminating.
@wjones127 this is very interesting, but having read https://github.com/apache/spark/tree/master/common/variant, I'm struggling to understand which queries would be fast with the Open Variant Type (OVT) — that document dives straight into the details of the format without taking time to explain the motivation for it, i.e. where they expect it to be useful, let alone giving any quantitative measure of how much faster it should be.
To make these questions more concrete, here are some direct questions:
Will the following queries be faster than datafusion-functions-json, and if so by roughly how much?

select count(*) from my_table where ovt_column ? 'foo'
select count(*) from my_table where not(ovt_column ? 'foo')
select count(*) from my_table where ovt_column->'foo'->'bar' ? 'spam'
select count(*) from my_table where ovt_column->'foo'=42
select count(*) from my_table where ovt_column->'foo'->'bar'->'spam'=42
select sum(ovt_column->'foo') from my_table
In particular, is there any way to exclude some rows from a filter by knowing they don't contain a specific key?
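For what it's worth, one way this could work, assuming the per-value metadata really does hold a dictionary of every key used in that value, is a cheap pre-check that rules rows out without decoding the value at all. A purely illustrative sketch with made-up types, not the real byte layout:

```rust
// A made-up in-memory stand-in for a variant value: the real format keeps the
// key strings in a per-value metadata dictionary alongside the value bytes.
struct Variant {
    key_dictionary: Vec<String>, // every key appearing anywhere in this value
    // ... the encoded value bytes would live here
}

impl Variant {
    // Cheap pre-check: if `key` never appears in the dictionary, the value
    // cannot contain it at any depth. (Presence only means "maybe": the key
    // could be nested rather than top-level, so a full check may still be
    // needed for rows that pass.)
    fn might_contain_key(&self, key: &str) -> bool {
        self.key_dictionary.iter().any(|k| k == key)
    }
}

// Roughly the shape of `where ovt_column ? 'foo'`: rows whose dictionary lacks
// the key are excluded without decoding the value at all.
fn filter_has_key(rows: &[Variant], key: &str) -> Vec<bool> {
    rows.iter().map(|v| v.might_contain_key(key)).collect()
}

fn main() {
    let rows = vec![
        Variant { key_dictionary: vec!["foo".into(), "bar".into()] },
        Variant { key_dictionary: vec!["spam".into()] },
    ];
    println!("{:?}", filter_has_key(&rows, "foo")); // [true, false]
}
```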
I guess benchmarks vs datafusion-functions-json would be extremely interesting.

cc @davidhewitt @adriangb
Some background: based on my experience building jiter, I have the kernel of an idea for a binary format to encode JSON data, but I want to avoid building it if OVT is really going to work, especially since OVT is (to one degree or another) a standard endorsed by Apache / Spark.