
[Format] Passing column statistics through Arrow C data interface #38837

Open
ianmcook opened this issue Nov 21, 2023 · 66 comments

@ianmcook (Member) commented Nov 21, 2023

Describe the enhancement requested

Is there any standard or convention for passing column statistics through the C data interface?

For example, say there is a module that reads a Parquet file into memory in Arrow format then passes the data arrays to another module through the C data interface. If the Parquet file metadata includes Parquet column statistics such as distinct_count, max, min, and null_count, can the sending module pass those statistics through the C data interface, to allow the receiving module to use the statistics to perform computations more efficiently?
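For illustration, this is roughly the information a producer already has on hand after reading Parquet metadata with pyarrow (the file name is hypothetical); the question is how to forward it across the C data interface:

import pyarrow.parquet as pq

# Hypothetical file; Parquet writers typically store these statistics per column chunk.
metadata = pq.ParquetFile("example.parquet").metadata
stats = metadata.row_group(0).column(0).statistics
if stats is not None and stats.has_min_max:
    print(stats.min, stats.max, stats.null_count, stats.distinct_count)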

Component(s)

Format

@zeroshade (Member)

There's no current standard convention. There are a few ways such statistics could be sent, for example as a struct array with one row and a field for each statistic, or in various other configurations.

Passing such statistics would be beyond the scope of the current C Data Interface, IMHO.
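As a rough illustration of the struct-array idea (not an agreed-upon convention; the field names are made up), a producer could build a one-row struct array of statistics and ship it across the C data interface like any other array:

import pyarrow as pa

# One row; one field per statistic. Field names here are illustrative only.
stats = pa.StructArray.from_arrays(
    [
        pa.array([0], pa.int64()),           # min
        pa.array([99_999_999], pa.int64()),  # max
        pa.array([0], pa.int64()),           # null_count
    ],
    names=["min", "max", "null_count"],
)
# The array can then cross the C data interface like any other Arrow data,
# e.g. via the PyCapsule protocol (pyarrow >= 14):
schema_capsule, array_capsule = stats.__arrow_c_array__()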

@ianmcook (Member, Author)

cc @Tishj @pdet

There is a related discussion at duckdb/duckdb#4636 (reply in thread)

@pdet commented Nov 27, 2023

There's no current standard convention. There are a few ways such statistics could be sent, for example as a struct array with one row and a field for each statistic, or in various other configurations.

Passing such statistics would be beyond the scope of the current C Data Interface, IMHO.

I would argue that it makes sense to include statistical information in the C data interface. Efficient execution of complex queries requires statistical information, and I believe that most Arrow producers possess this information, so it should be possible to pass it somehow. An alternative I can think of for the C data interface is to wrap the statistical information in top-level objects (e.g., ArrowRecordBatch in Python), but that approach is quite cumbersome and would necessitate specific implementations for every client API.

@pdet commented Apr 2, 2024

Just to provide a clearer example of how statistics can be useful for query optimization.

In DuckDB, join ordering is currently determined using heuristics based on table cardinalities, with future plans to enhance this with sample statistics.

Not only is join ordering affected by statistics; even the choice of the probe side in a hash join is determined based on the expected cardinalities.

One example of a query that is affected by join ordering is Q21 of TPC-H. However, its plan is too big to share easily in a GitHub discussion.

To give a simpler example of how cardinalities affect this, I've created two tables.

  1. t - Has 10^8 rows, ranging from 0 to 10^8
  2. t_2 - Has 10 rows, ranging from 0 to 10

My example query is a simple inner join of these two tables and we calculate the sum of t.i.

SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i)

Because the optimizer doesn't have any statistics from the Arrow side, it basically picks the probe side depending on how the query is written.

SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i)

This would result in a slightly different plan, yet with significant differences in performance.

(Screenshot: execution of both queries, showing the timing difference between the two orderings.)

As depicted in the screenshot of executing both queries, choosing the incorrect probe side for this query already results in a performance difference of an order of magnitude. For more complex queries, the variations in execution time could be not only larger but also more difficult to trace.

For reference, the code I used for this example:

import duckdb
import time
import statistics

con = duckdb.connect()

# Create a table with 10^8 rows
con.execute("CREATE TABLE t as SELECT * FROM RANGE(0, 100000000) tbl(i)")
# Create a table with 10 rows
con.execute("CREATE TABLE t_2 as SELECT * FROM RANGE(0, 10) tbl(k)")

query_slow = '''
SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i);
'''

query_fast = '''
SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i);'''

# Materialize both tables as Arrow tables; con_2 resolves the names t and t_2
# to these Python variables via DuckDB's replacement scans
t = con.execute("FROM t").fetch_arrow_table()
t_2 = con.execute("FROM t_2").fetch_arrow_table()

con_2 = duckdb.connect()

print("DuckDB Arrow - Query Slow")

print(con_2.execute("EXPLAIN " + query_slow).fetchall()[0][1])

execution_times = []

for _ in range(5):
    start_time = time.time()
    con_2.execute(query_slow)
    end_time = time.time()
    
    execution_times.append(end_time - start_time)

median_time = statistics.median(execution_times)

print(median_time)

print("DuckDB Arrow - Query Fast")

print(con_2.execute("EXPLAIN " + query_fast).fetchall()[0][1])

execution_times = []

for _ in range(5):
    start_time = time.time()
    con_2.execute(query_fast)
    end_time = time.time()
    
    execution_times.append(end_time - start_time)

median_time = statistics.median(execution_times)

print(median_time)

@kou (Member) commented Apr 24, 2024

I'm considering some approaches for this use case. This isn't complete yet, but I'll share my idea so far. Feedback is appreciated.

ADBC uses the following schema to return statistics:

https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L1739-L1778

It's designed for returning statistics of a database.

We can simplify this schema because we can just return statistics of a record batch. For example:

| Field Name               | Field Type            | Comments |
|--------------------------|-----------------------|----------|
| column_name              | utf8                  | (1)      |
| statistic_key            | int16 not null        | (2)      |
| statistic_value          | VALUE_SCHEMA not null |          |
| statistic_is_approximate | bool not null         | (3)      |

  1. If null, then the statistic applies to the entire table.
  2. A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for ADBC. Other values are for implementation-specific statistics. For the definitions of predefined statistic types, see adbc-table-statistics. To get driver-specific statistic names, use AdbcConnectionGetStatisticNames().
  3. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type |
|------------|------------|
| int64      | int64      |
| uint64     | uint64     |
| float64    | float64    |
| binary     | binary     |

TODO: How should we represent the statistic key? Should we use the ADBC style (assigning an ID to each statistic key and using it)?

If we represent statistics as a record batch, we can pass statistics through the Arrow C data interface. This may be a reasonable approach.
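A sketch (using pyarrow, with made-up statistic key IDs and values) of what a statistics record batch matching this simplified schema could look like:

import pyarrow as pa

# Two illustrative rows: a table-level statistic (column_name is null) and a
# column-level statistic for "column1". Key IDs 0 and 1 are placeholders.
types = pa.array([0, 0], pa.int8())      # both values use the int64 union member
offsets = pa.array([0, 1], pa.int32())
children = [
    pa.array([100_000_000, 42], pa.int64()),  # int64
    pa.array([], pa.uint64()),                # uint64
    pa.array([], pa.float64()),                # float64
    pa.array([], pa.binary()),                 # binary
]
values = pa.UnionArray.from_dense(
    types, offsets, children, ["int64", "uint64", "float64", "binary"]
)

statistics = pa.record_batch(
    [
        pa.array([None, "column1"], pa.utf8()),  # null => statistic applies to the table
        pa.array([0, 1], pa.int16()),            # placeholder statistic_key IDs
        values,
        pa.array([False, False], pa.bool_()),
    ],
    names=["column_name", "statistic_key", "statistic_value",
           "statistic_is_approximate"],
)
print(statistics.schema)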

If we use this approach, we need to do the following:

  • Define a schema as a specification
  • Add statistics-related APIs to Apache Arrow C++ and at least one other implementation, because we need two implementations for a specification change
  • Apache Arrow C++: Add support for importing statistics from Apache Parquet C++
  • ...

TODO: Consider a statistics-related API for Apache Arrow C++.

@zeroshade (Member)

I'd be curious what others think of this approach as opposed to actually making a format change to include statistics alongside the record batches in the API, particularly in the case of a stream of batches.

I'm not against it; I just don't know whether others would be opposed to needing an entirely separate record batch to be sent containing the statistics.

@ianmcook (Member, Author)

I'd be curious what others think of this approach as opposed to actually making a format change to include statistics alongside the record batches in the API

I think the top priority should be to avoid breaking ABI compatibility. I suspect that most users of the C data interface will not want to pass statistics. We should avoid doing anything that would cause disruption for those users.

@kou (Member) commented Apr 25, 2024

Ah, I should have written an approach that changes the current Arrow C data interface.
I'll write it later.

@mapleFU (Member) commented Apr 25, 2024

If the Parquet file metadata includes Parquet column statistics such as distinct_count, max, min, and null_count, can the sending module pass those statistics through the C data interface, to allow the receiving module to use the statistics to perform computations more efficiently?

This proposal is great. Just an unrelated note: Parquet's distinct_count may be rarely used at the moment, and the Dataset Scanner currently parses the min-max / null statistics as an Expression.

@mapleFU (Member) commented Apr 25, 2024

#38837 (comment)

Are {u}int64/float64/string the only supported types? What would the min/max schema be for other types?

@lidavidm (Member)

At least for ADBC, the idea was that other types can be encoded in those choices. (Decimals can be represented as bytes, dates can be represented as int64, etc.)
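For example, a sketch of the encoding idea (not an ADBC API): a date statistic could travel in the int64 member and a decimal in the binary member:

import datetime
import decimal

# A date32 min/max can be encoded as days since the Unix epoch -> int64 member.
date_min_days = (datetime.date(2020, 1, 1) - datetime.date(1970, 1, 1)).days

# A decimal value can be encoded as its unscaled integer in two's-complement
# bytes (the scale is known from the column type) -> binary member.
dec = decimal.Decimal("123.45")            # scale = 2
dec_max_bytes = int(dec.scaleb(2)).to_bytes(16, "big", signed=True)

print(date_min_days, dec_max_bytes.hex())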

@kou (Member) commented May 1, 2024

Some approaches based on the C Data interface (https://arrow.apache.org/docs/format/CDataInterface.html):

(1) Add get_statistics callback to ArrowArray

For example:

struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;

  // Callback to return statistics of this ArrowArray
  struct ArrowArray *(*get_statistics)(struct ArrowArray*);
  // Release callback
  void (*release)(struct ArrowArray*);
  // Opaque producer-specific data
  void* private_data;
};

This uses a struct ArrowArray to represent statistics, like #38837 (comment), but we could define a struct ArrowStatistics or something instead.

Note that this is a backward-incompatible change. struct ArrowArray has neither version information nor space for extension, so we can't do this without breaking backward compatibility.

(2) Add statistics to ArrowSchema::metadata

https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata

If we choose this approach, we will reserve some metadata keys such as ARROW:XXX, like we did for the IPC format: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

Here are some ideas how to put statistics into ArrowSchema::metadata:

  1. Use struct ArrowArray* (pointer) as ARROW:statistics metadata value
  2. Use multiple metadata to represent statistics

Here is an example of the second approach:

{
  "ARROW:statistics:column1:max": 2.9,
  "ARROW:statistics:column1:max:approximate": true,
  "ARROW:statistics:column2:average_byte_width": 29.9
}

TODO:

  • How to encode each value (2.9, true and 29.9) to raw byte data? We can use only raw byte data for a value of ArrowSchema::metadata.
  • Can we support columns with the same name with this approach?
  • This isn't space-efficient because there is a lot of duplicated text such as ARROW:statistics:.
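A minimal pyarrow sketch of idea 2, naively encoding the values as strings (which is exactly the first open question above):

import pyarrow as pa

# Hypothetical schema; metadata values must be bytes/strings, so the numeric
# statistics are string-encoded here.
schema = pa.schema([("column1", pa.float64()), ("column2", pa.utf8())])
schema = schema.with_metadata({
    "ARROW:statistics:column1:max": "2.9",
    "ARROW:statistics:column1:max:approximate": "true",
    "ARROW:statistics:column2:average_byte_width": "29.9",
})
print(schema.metadata)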

Note that this is a (roughly) backward-compatible change. I think that most users don't use ARROW:XXX as a metadata key.

This may not work with the C stream interface (https://arrow.apache.org/docs/format/CStreamInterface.html), because it shares one struct ArrowSchema across multiple struct ArrowArrays, and each struct ArrowArray may have different statistics.

@lidavidm (Member) commented May 1, 2024

Do consumers want per-batch metadata anyway? I would assume that, in the context of say DuckDB, they'd like to get statistics for the whole stream up front, without reading any data, and use those to inform their query plan.

@kou (Member) commented May 1, 2024

Ah, then we should not mix statistics and ArrowArray. ArrowArray::get_statistics() may be too late.

@lidavidm (Member) commented May 1, 2024

This is just off the cuff, but maybe we could somehow signal to an ArrowArrayStream that a next call to get_next should instead return a batch of Arrow-encoded statistics data. That wouldn't break ABI, so long as we come up with a good way to differentiate the call. Then consumers like DuckDB could fetch the stats up front (if available) and the schema would be known to them (if we standardize on a single schema for this).

@kou (Member) commented May 1, 2024

Hmm. It may be better to provide a separate API to get statistics, like the #38837 (comment) approach.
It could be confusing if ArrowArrayStream::get_next() returned either statistics or data.

@lidavidm (Member) commented May 2, 2024

We talked about this a little, but what about approach (2) from Kou's comment above, but for now only defining table-level statistics (primarily row count)? AIUI, row count is the important statistic to have for @pdet's use case, and it is simple to define. We can wait and see on more complicated or column-level statistics.

Also, for ArrowDeviceArray there is some room for extension (arrow/cpp/src/arrow/c/abi.h, lines 134 to 135 in 14c54bb):

// Reserved bytes for future expansion.
int64_t reserved[3];

Could that be useful here? (Unfortunately there isn't any room for extension on ArrowDeviceArrayStream.)

@kou (Member) commented May 2, 2024

Thanks for sharing the idea we talked about.

I took a look at the DuckDB implementation. It seems that DuckDB uses only column-level statistics:

duckdb::TableFunction::statistics returns the statistics of a specified column:

https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/function/table_function.hpp#L253-L255
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/function/table_function.hpp#L188-L189

duckdb::BaseStatistics doesn't have a row count. It has distinct count, has-NULL, and has-non-NULL:

https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146

It seems that a numeric/string column can have min/max statistics:

https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/numeric_stats.hpp#L22-L31
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/string_stats.hpp#L23-L36

(A string column can have more statistics, such as has-Unicode and max length.)

Hmm. It seems that column-level statistics are also needed for real-world use cases.

@pdet commented May 3, 2024

Hey guys,

Thank you very much for starting the design of Arrow statistics! That's exciting!

We are currently interested in up-front full-column statistics. Specifically:

  • Count Distinct (Approximate)
  • Cardinality of the table
  • Min-Max

As a clarification, we also utilize row-group min-max statistics for filtering optimizations on DuckDB tables, but these cannot benefit Arrow. In Arrow, we either push down filters to an Arrow scanner or create a filter node on top of the scanner, and we do not utilize the min-max statistics of chunks for filter optimization.

For a more detailed picture of the statistics we hold, you can also look in our statistics folder.

But I think that approx count distinct, cardinality, and min-max are enough for a first iteration.

@lidavidm (Member) commented May 3, 2024

Table cardinality would be table-level, right? But of course the others are column-level. Hmm. We didn't leave ourselves room in ArrowDeviceArrayStream...

And just to be clear "up front" means at the start of the stream, not per-batch, right?

@pdet commented May 3, 2024

Table cardinality would be table-level, right? But of course the others are column-level. Hmm. We didn't leave ourselves room in ArrowDeviceArrayStream...

Exactly!

And just to be clear "up front" means at the start of the stream, not per-batch, right?

Yes, at the start of the stream.

@zeroshade (Member)

@lidavidm, technically the C Device structs are still marked as experimental in the documentation and haven't been adopted by much of the ecosystem yet (we're still adding more tooling in libarrow and the PyCapsule definitions for using them), so adding room in ArrowDeviceArrayStream should still be viable without breaking people.

Either by adding members or an entirely separate callback function?

@lidavidm (Member) commented May 4, 2024

I think it would be interesting to add an extra callback to get the statistics, yeah.

@lidavidm (Member) commented May 4, 2024

It's hard though, because arguably it's kind of outside the core intent of the C Data Interface. But on the other hand, they kind of need to be here to have any hope of being supported more broadly.

@mapleFU (Member) commented May 5, 2024

Count Distinct (Approximate)

Hmm, I don't know whether we should represent this as "Approximate".

@kou (Member) commented May 5, 2024

It seems that DuckDB uses HyperLogLog for computing distinct count: https://github.com/duckdb/duckdb/blob/d26007417b7770860ae78278c898d2ecf13f08fd/src/include/duckdb/storage/statistics/distinct_statistics.hpp#L25-L26

It may be the reason why "Approximate" is included here.
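A quick illustration (a sketch; the exact numbers depend on DuckDB's HyperLogLog sketch) of the difference between the approximate and exact distinct counts:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT * FROM range(0, 1000000) tbl(i)")
# approx_count_distinct uses a HyperLogLog-style sketch; count(DISTINCT ...) is exact.
print(con.execute(
    "SELECT approx_count_distinct(i), count(DISTINCT i) FROM t"
).fetchone())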

@mapleFU (Member) commented May 6, 2024

Nice. I mean, should "distinct count" (or NDV) be "estimated" in our data interface? And if we add an "estimated" value, should we also add an "exact" NDV, or is just "estimated" OK here?

In (1) #38837 (comment), would adding a key here be heavy?

@lidavidm (Member) commented May 6, 2024

The ADBC encoding allows the producer to mark statistics as exact/approximate, fwiw

In (1) #38837 (comment) , would adding a key here be heavy?

See #38837 (comment)

@kou (Member) commented May 14, 2024

How about applying the "Member allocation" semantics?

https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation

Therefore, the consumer MUST not try to interfere with the producer’s handling of these members’ lifetime. The only way the consumer influences data lifetime is by calling the base structure’s release callback.

The current ImportRecordBatch() doesn't have an option to skip calling the statistics array's release(). So we need to add more APIs for this use case (importing a member array, not a base array), something like ImportMemberRecordBatch().

Or we can add Schema::statistics() and have ImportSchema() import ARROW:statistics automatically.

@lidavidm (Member)

Ah, that works. I suppose we could also just make release a no-op for the child.

@kou (Member) commented May 15, 2024

It's a good idea!

@lidavidm (Member)

@ianmcook @zeroshade how does the sketch sound?

@kou (Member) commented May 22, 2024

Updated version. Feedback is welcome. I'll share this idea to the dev@arrow.apache.org mailing list too.

It's based on #38837 (comment) .

If we have a record batch that has an int32 column1 and a string column2, we have the following ArrowSchema. Note that the metadata has "ARROW:statistics" => ArrowArray*, where ArrowArray* is the address of an ArrowArray as a base-10 string, because only strings can be used as metadata values. You can't release the statistics ArrowArray*. (Its release is a no-op function.) It follows the https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation semantics. (The base ArrowSchema owns the statistics ArrowArray*.)

ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

The ArrowArray* for statistics uses the following schema:

| Field Name     | Field Type            | Comments |
|----------------|-----------------------|----------|
| key            | string not null       | (1)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (2)      |

  1. We'll provide pre-defined keys such as max, min, byte_width and distinct_count, but users can use application-specific keys too.
  2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                                                       | Comments |
|------------|------------------------------------------------------------------|----------|
| int64      | int64                                                            |          |
| uint64     | uint64                                                           |          |
| float64    | float64                                                          |          |
| value      | The same type as the column (ArrowSchema) this array belongs to  | (3)      |

  3. If the ArrowSchema's type is string, this type is also string.

     TODO: Is value a good name?
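A pyarrow sketch of that statistics item type, assuming the statistics array is a struct array with these three fields (the helper name is made up). The last union member takes the column's own type, so the type differs per column; it is shown here for the int32 column1:

import pyarrow as pa

def statistics_item_type(column_type: pa.DataType) -> pa.DataType:
    # Each statistics entry: key, value (dense union), is_approximate.
    return pa.struct([
        pa.field("key", pa.utf8(), nullable=False),
        pa.field("value", pa.dense_union([
            pa.field("int64", pa.int64()),
            pa.field("uint64", pa.uint64()),
            pa.field("float64", pa.float64()),
            pa.field("value", column_type),  # same type as the described column
        ]), nullable=False),
        pa.field("is_approximate", pa.bool_(), nullable=False),
    ])

print(statistics_item_type(pa.int32()))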

@lidavidm (Member)

One nit:

Its release is NULL

It probably shouldn't be NULL, since either (1) the caller may expect it to be non-NULL and try to call it, or (2) the caller may assume NULL means that the array was already released.

It would be safer to have a release() that just does nothing.

@kou (Member) commented May 22, 2024

Ah, sorry. You're right. It should be a no-op release().
I'll fix the comment.

@kou (Member) commented May 22, 2024

The discussion thread on dev@arrow.apache.org: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx

@westonpace (Member)

And just to be clear "up front" means at the start of the stream, not per-batch, right?

Apologies if I am misunderstanding but, given David's point here, aren't we talking about two different API calls? For example, from the perspective of duckdb it is often something like (this is maybe over-simplifying things)...

left_stats = get_statistics(left_table.join_id, filter)
right_stats = get_statistics(right_table.join_id, filter)
if left_stats.approx_count_distinct < right_stats.approx_count_distinct:
  hashtable = make_hashtable(load_array(left_table.join_id))
  probe(hashtable, right_table)
else:
  hashtable = make_hashtable(load_array(right_table.join_id))
  probe(hashtable, left_table)

In other words, first I get the statistics (and no data), then I do some optimization / calculation with the statistics, and then I might choose to actually load the data.

If these are two separate API calls then the statistics are just another record batch. What is the advantage of making a single stream / message that contains both data and statistics?

@kou (Member) commented May 23, 2024

Thanks for joining this discussion.

My understanding of the "up front" is the same as you explained.

#38837 (comment) uses ArrowSchema, not ArrowArray/ArrowArrayStream, so it doesn't contain both data and statistics. I think the following flow would be used (ArrowSchema is used before we use ArrowArrayStream):

left_stats = get_statistics(get_schema(left_table), left_table.join_id, filter)
right_stats = get_statistics(get_schema(right_table), right_table.join_id, filter)
if left_stats.approx_count_distinct < right_stats.approx_count_distinct:
  hashtable = make_hashtable(load_array(left_table.join_id))
  probe(hashtable, right_table)
else:
  hashtable = make_hashtable(load_array(right_table.join_id))
  probe(hashtable, left_table)

Anyway, it seems that the ArrowSchema-based approach isn't favored in the mailing list discussion: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
We'll provide a separate API that provides a statistics ArrowArray, like #38837 (comment).

@drin (Contributor) commented May 23, 2024

Coming from the mailing list, I caught up on some of the comments above. I agree that a separate API call is nicer than packing statistics into the schema. Packing into the schema doesn't seem bad to me, but it certainly seems more limited. An additional case to consider is that a separate API call provides additional flexibility of where to get the statistics in case they may come from a different source than the schema (in the case of a table view or a distributed table).

I also interpret Dewey's comment as: the schema should describe the "structure" of the data, whereas the statistics describe the "content" of the data. This aligns with Weston's point that the statistics are essentially "just another record batch" (or at least more similar to the data itself than to the schema). I agree with both of these.

Below are some relevant points I tried considering, and I see no downside to using an additional API call:

  • duckdb does eager binding. This only requires the schema to determine number of columns and their types (arrow.cpp#238-254).
  • statistics are only considered at optimization time. Relative chronology is: binding -> logical plan in hand -> invoke optimizer.
  • duckdb's optimization workflow calls EstimateCardinality on LogicalOperator (join_order_optimizer.cpp#L68-75). Delegating this through an API call to Arrow is trivial, and there's no need for it to go through the schema.
  • as of now, duckdb seems to set cardinality statistics as record batches are processed (arrow.cpp#L389-L400)

It's likely relevant that the code referenced above is how duckdb interacts with Arrow that is already in memory (as far as I know). In the future, I can imagine that duckdb will still keep a similar sequence when requesting Arrow data from a data source (file or remote connection) by binding to the schema, then accessing statistics, then accessing the data.

@kou (Member) commented May 24, 2024

Thanks for your comment.

Could you explain more about a separate API call?

Do you imagine a language specific API such as DataFusion's PruningStatistics? https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html
(It's introduced in the mailing thread: https://lists.apache.org/thread/hd79kp59796sqmwo24vmsnbk8t6x7h8g )

Or do you imagine an API that transmits Arrow data (arrays or a record batch) as statistics like ADBC does?

@pdet commented May 24, 2024

  • arrow.cpp#L389-L400

In DuckDB, these statistics are provided by a callback function that exists in the scanner. For example, in our Arrow integration, the statistics of the scanner are basically empty, as you can see here.

In practice, we need to be able to get the statistics at bind time.

The code at this link is not really related to cardinality estimation. This code sets the size of the current data chunk we scanned, which will later be pushed to upper nodes.

@drin (Contributor) commented May 24, 2024

The code at this link is not really related to cardinality estimation

Ah, I misunderstood then.

In DuckDB, these statistics are created as a callback function that exists in the scanner

This is why I figured either approach to getting statistics (schema or API call) is viable, since the callback should be able to accommodate either.

Could you explain more about a separate API call

I just meant a function call of the scanner API (or I guess something like ADBC, but I don't know that API at all). I interpreted weston's comment to mean that a function call to get statistics is comparable to a function call to get a record batch.

But when I say it provides extra flexibility, I mean that it provides a standard way for an independent producer to specify logic to a consumer for how to get statistics. This could allow for things like storing statistics independently from the data stream itself. I also feel like this doesn't preclude statistics metadata being packed into the schema (maybe in some application-specific way).

@westonpace (Member)

#38837 (comment) uses ArrowSchema, not ArrowArray/ArrowArrayStream, so it doesn't contain both data and statistics. I think the following flow would be used (ArrowSchema is used before we use ArrowArrayStream).

I see now, I had misunderstood and thought we were talking about ArrowArray. I understand why you would want statistics in the ArrowSchema instead of ArrowArray. We were already talking about two function calls (GetTableSchema() and GetTableData()). If we don't use ArrowSchema then we need three function calls (GetTableSchema(), GetTableStatistics(), and GetTableData()).

I think both approaches make sense but this is the tricky part:

How to encode each value (2.9, true and 29.9) to raw byte data? We can use only raw byte data for a value of ArrowSchema::metadata.

There are ways to solve this but they all seem like a lot of work for alignment and maintenance. I don't think the benefit (combining GetTableSchema and GetTableStatistics) is worth the development cost.

@kou (Member) commented May 24, 2024

I think both approaches make sense but this is the tricky part:

How to encode each value (2.9, true and 29.9) to raw byte data? We can use only raw byte data for a value of ArrowSchema::metadata.

There are ways to solve this but they all seem like a lot of work for alignment and maintenance. I don't think the benefit (combining GetTableSchema and GetTableStatistics) is worth the development cost.

Right. #38837 (comment) uses an ArrowArray to encode statistics and puts its address into the metadata value as a base-10 string. It's a tricky approach.

@amol- (Member) commented Jun 3, 2024

I don't want to derail the discussion, but given that formats might also provide statistics at the record-batch level (think of Parquet row groups), what would be the solution in that case? Binding the statistics to the schema seems to make it hard to provide per-batch statistics in ArrowArrayStream. Do we expect to have two very different ways to provide statistics per batch and per column?

@lidavidm (Member) commented Jun 3, 2024

Are per-batch statistics even useful? You've presumably already made the query plan and done the I/O, so there's little opportunity to make use of the statistics at that point.

@kou (Member) commented Jun 4, 2024

If per-batch statistics are needed, the separate-API approach (where only the schema for statistics is defined) is better than using ArrowSchema/ArrowArray/ArrowArrayStream.
With the separate-API approach, you can control when you get statistics and which statistics you get.

@amol- (Member) commented Jun 4, 2024

Are per-batch statistics even useful? You've presumably already made the query plan and done the I/O, so there's little opportunity to make use of the statistics at that point.

When exchanging data via C Data, I don't think you can take for granted how that data was fetched. So while in the majority of cases you are correct, there might be cases where you are receiving the data from a source that doesn't know anything about the filtering that the receiver is going to apply. In that case there is value in passing per-batch statistics.

@kou (Member) commented Jun 6, 2024

Summary for the discussion on the mailing list so far: https://lists.apache.org/thread/6m9xrhfktnt0nnyss7qo333n0cl76ypc

@felipecrv (Contributor)

Summary for the discussion on the mailing list so far: https://lists.apache.org/thread/6m9xrhfktnt0nnyss7qo333n0cl76ypc

Thank you for the summary @kou. I replied to that thread [1] with a statistics encoding scheme that avoids free-form strings. Strings lead to metadata bloat and to parsing bugs when consumers match only on prefixes of string identifiers (or use other "clever" key-matching ideas), hindering format evolution.

The value of a standard for statistics comes from producers and consumers agreeing on the statistics available, so I think centralizing the definitions in the Arrow C Data Interface specification is better than letting the ecosystem work out the naming of statistics from different producers. I'm trying to avoid query processors (consumers of statistics) having to deal with quirks based on the system producing the statistics.

In the scheme I proposed, a single array would contain statistics for the table/batch and all the columns. That's more friendly to the memory allocator and allows for all the statistics to be processed in a single loop. If the array grows too big, it can be chunked at any position allowing for the streaming of statistics if big values start being produced.

@kou proposed this schema for an ArrowArray* carrying statistics in [2]:

| Field Name     | Field Type            | Comments |
|----------------|-----------------------|----------|
| key            | string not null       | (1)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (2)      |

  1. We'll provide pre-defined keys such as max, min, byte_width and distinct_count, but users can use application-specific keys too.
  2. If true, then the value is approximate or best-effort.

I will post what I tried to explain in the mailing list [1] in this format so the comparison is easier:

| Field Name | Field Type                            | Comments                       |
|------------|---------------------------------------|--------------------------------|
| subject    | int32                                 | Field/column index (1) (2) (3) |
| statistics | map<int32, dense_union<...>> NOT NULL | (4)                            |

  1. 0-based.
  2. When subject is NULL, the statistics refer to the whole schema (table or batch; this is inferable from the context of the ArrowArray* of statistics).
  3. The subject column isn't unique, to allow for the streaming of statistics: a field or NULL can come up again in the array with more statistics about the table (NULL) or the column.
  4. This column stores maps from ArrowStatKind (int32) [3] to a dense_union that changes according to the kinds of statistics generated by a specific provider. The meaning of each key is standardized in the C Data Interface spec together with the list of types in the dense_union that consumers should expect.

Like @kou's proposal, it uses a dense_union for the values, but it doesn't restrict the statistic to the same type as the column (a statistic about a column of a given type might be integer- or boolean-typed, like null counts and any predicate). The is_approximate field was removed because that is a property of the statistic kind and affects the expected type of the value: the approximate value of an integer-typed column might be a floating-point number if it results from some estimation process.

[1] https://lists.apache.org/thread/gnjq46wn7dbkj2145dskr9tkgfg1ncdw
[2] #38837 (comment)
[3] Illustrative examples of statistic kinds

// Statistics values are identified by specified int32-valued keys
// so that producers and consumers can agree on physical
// encoding and semantics. Statistics can be about a column,
// a record batch, or both.
typedef int32_t ArrowStatKind;

// Used for non-standard statistic kinds.
// Value must be a struct<name: utf8, value: dense_union<...>>
#define ARROW_STAT_ANY 0
// Exact number of nulls in a column. Value must be int32 or int64.
#define ARROW_STAT_NULL_COUNT_EXACT 1
// Approximate number of nulls in a column. Value must be float32 or float64.
#define ARROW_STAT_NULL_COUNT_APPROX 2
// The minimum and maximum values of a column.
// Value must be the same type of the column or a wider type.
// Supported types are: ...
#define ARROW_STAT_MIN_APPROX 3
#define ARROW_STAT_MIN_EXACT 4
#define ARROW_STAT_MAX_APPROX 5
#define ARROW_STAT_MAX_EXACT 6
#define ARROW_STAT_CARDINALITY_APPROX 7
#define ARROW_STAT_CARDINALITY_EXACT 8
#define ARROW_STAT_COUNT_DISTINCT_APPROX 9
#define ARROW_STAT_COUNT_DISTINCT_EXACT 10
// ... Represented as a
// list<
//   struct<quantile: float32 | float64,
//          sum: "same as column type or a type with wider precision">>
#define ARROW_STAT_CUMULATIVE_QUANTILES 11

@kou (Member) commented Aug 5, 2024

We'll use the following schema based on the mailing list discussion:

map<
  // The column index, or null if the statistics refer to the whole table or batch.
  int32,
  // Statistics key is string. Dictionary is used for efficiency.
  // Different keys are assigned for exact value and
  // approximate value.
  map<
    dictionary<int32, utf8>,
    dense_union<...needed types based on stat kinds in the keys...>
  >
>
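For reference, a pyarrow sketch of that type (the union members shown here are illustrative; the actual set would follow the statistic kinds used). Note that Arrow map keys are non-nullable, so the "null column index means whole table/batch" case needs special handling:

import pyarrow as pa

# Illustrative union members; the real list would be derived from the statistic kinds.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("uint64", pa.uint64()),
    pa.field("float64", pa.float64()),
    pa.field("binary", pa.binary()),
])
per_column_statistics = pa.map_(pa.dictionary(pa.int32(), pa.utf8()), value_type)
statistics_type = pa.map_(pa.int32(), per_column_statistics)
print(statistics_type)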
