GH-38837: [Format] Add the specification to pass statistics through the Arrow C data interface #43553

Draft
wants to merge 9 commits into
base: main
from

Conversation

kou
Member

@kou kou commented Aug 5, 2024

Rationale for this change

Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plans.

The Apache Arrow format doesn't have statistics, but other formats that
can be read as Apache Arrow data may have them. For example, Apache
Parquet C++ can read an Apache Parquet file as Apache Arrow data, and
an Apache Parquet file may have statistics.

One of the Arrow C data interface use cases is the following:

  1. Module A reads an Apache Parquet file as Apache Arrow data
  2. Module A passes the read Apache Arrow data to module B through the
    Arrow C data interface
  3. Module B processes the passed Apache Arrow data

If module A can pass the statistics associated with the Apache Parquet
file to module B through the Arrow C data interface, module B can use
the statistics to optimize its query plan.
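
As a rough illustration of that last point, the sketch below shows how module B might use per-column minimum and maximum values to skip data that cannot match a filter. It is plain Python with no Arrow dependency. The `ColumnStatistics` class, the `can_skip` helper, and the sample values are hypothetical and are not part of the proposed specification; the sketch also assumes the statistics are exact rather than approximate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStatistics:
    """Hypothetical container for statistics that module A passes along."""

    min_value: Optional[int] = None
    max_value: Optional[int] = None


def can_skip(stats: ColumnStatistics, lower: int, upper: int) -> bool:
    """Return True if no value in the column can fall in [lower, upper].

    Assumes exact statistics. Missing statistics never allow skipping,
    so the consumer falls back to scanning the data.
    """
    if stats.min_value is None or stats.max_value is None:
        return False
    return stats.max_value < lower or stats.min_value > upper


# Module A read a Parquet file whose column holds values between 10 and 99.
stats = ColumnStatistics(min_value=10, max_value=99)

# Module B plans "WHERE column BETWEEN 200 AND 300" and can skip the data.
skip_filtered = can_skip(stats, 200, 300)

# Without statistics, module B has to scan the data.
skip_unknown = can_skip(ColumnStatistics(), 200, 300)
```

The conservative fallback matters here: a consumer that skipped data when statistics were absent could silently drop matching rows.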

What changes are included in this PR?

Add the specification to pass statistics through the Arrow C data interface based on the discussion on the dev@ mailing list: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou marked this pull request as draft August 5, 2024 09:28
Member Author

kou commented Aug 5, 2024

@github-actions crossbow submit preview-docs


github-actions bot commented Aug 5, 2024

⚠️ GitHub issue #38837 has been automatically assigned in GitHub to PR creator.

Member Author

kou commented Aug 5, 2024

I'm not a native English speaker. Wording suggestions are very welcome.

I'll add examples after I implement a convenient API in C++.


github-actions bot commented Aug 5, 2024

Revision: 22336f4

Submitted crossbow builds: ursacomputing/crossbow @ actions-28c2a45b3d

Task Status
preview-docs GitHub Actions

github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label Aug 5, 2024
Comment on lines +56 to +58
* Provide a common way to pass statistics that can be used for
other interfaces such Arrow Flight too.
Member

What about the Arrow IPC format? Can you add a sentence here that explains why we do not recommend using this to pass statistics over Arrow IPC?

Member Author

Good point. This may fit the Arrow IPC format. (A producer sends the data and the statistics as two separate pieces of Arrow IPC format data.)

But the Arrow IPC format has more options. For example, the Arrow IPC format can carry metadata for each record batch: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format (The Arrow C data interface can't carry metadata for an ArrowArray.)
The Arrow IPC format can also be used with other mechanisms such as Arrow Flight and ADBC.

So this may not be the best approach for the Arrow IPC format. We should discuss the Arrow IPC use case separately.

I'll add something here.

Comment on lines 59 to 60
For example, ADBC has the statistics related APIs. This specification
doesn't replace them.
Member

Member Author

Thanks. I should have done it.

Member

ianmcook commented Aug 5, 2024

@pdet please take a look and add comments if you have any, thanks!


pdet commented Aug 6, 2024

The format and contents LGTM! I was just slightly confused for one second that the second mapping is the value items in the first mapping.

@Tmonster I think the proposed Statistics keys already cover the table statistics we use/produce. Could you double-check?

Member Author

@kou kou left a comment

@ianmcook Thanks for your suggestions! I've merged all of them!

docs/source/format/CDataInterfaceStatistics.rst (outdated)
The ``ARROW`` pattern is a reserved namespace for pre-defined
statistics keys. User-defined statistics must not use it.

Here are pre-defined statistics keys:
github-actions bot added the "awaiting change review" and "awaiting changes" labels and removed the "awaiting changes" and "awaiting change review" labels Aug 7, 2024
Member Author

kou commented Aug 7, 2024

> I was just slightly confused for one second that the second mapping is the value items in the first mapping.

Ah, it makes sense. I'll improve it. Thanks.

Member Author

kou commented Aug 7, 2024

@github-actions crossbow submit preview-docs

github-actions bot removed the "awaiting changes" label Aug 7, 2024
github-actions bot added the "awaiting changes" label Aug 22, 2024
kou force-pushed the docs-statistics-c-data-interface branch from 64dea6b to 3cdb559 August 27, 2024 00:09
github-actions bot added the "awaiting change review" and "awaiting changes" labels and removed the "awaiting changes" and "awaiting change review" labels Aug 27, 2024
kou force-pushed the docs-statistics-c-data-interface branch from 3cdb559 to e22ec6c September 9, 2024 02:25
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Sep 9, 2024
github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label Sep 24, 2024
kou force-pushed the docs-statistics-c-data-interface branch from e22ec6c to 9087113 September 30, 2024 08:20
github-actions bot added the "awaiting change review" and "awaiting changes" labels and removed the "awaiting changes" and "awaiting change review" labels Sep 30, 2024
kou added a commit that referenced this pull request Nov 8, 2024
### Rationale for this change

The statistics schema for the Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map and union), so a reusable implementation that builds the statistics array is useful.

### What changes are included in this PR?

`arrow::RecordBatch::MakeStatisticsArray()` is a convenience function that converts the `arrow::ArrayStatistics` in an `arrow::RecordBatch` to an `arrow::Array` for the Arrow C data interface.
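
To give a feel for the nested layout, here is a minimal plain-Python sketch that flattens per-column statistics into records, each holding a column index and a key-to-value map. It does not call the Arrow C++ API. The `make_statistics_records` function, the record layout (including the use of `None` for the record batch as a whole), and the key strings are assumptions for illustration only and may differ from what `MakeStatisticsArray()` actually produces.

```python
from typing import Any, Dict, List, Optional


def make_statistics_records(
    row_count: int,
    column_stats: List[Optional[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Flatten statistics into one record per target (hypothetical layout).

    Each record mimics a struct with two fields: "column" (an index, or
    None for the record batch as a whole) and "statistics" (a map from
    key to value). Values differ in type from key to key, which is why
    a real schema would need a union for the map values.
    """
    records = [{"column": None, "statistics": {"row_count": row_count}}]
    for index, stats in enumerate(column_stats):
        if stats:  # Columns without statistics are omitted.
            records.append({"column": index, "statistics": dict(stats)})
    return records


# Two columns: the first has statistics, the second has none.
records = make_statistics_records(
    3,
    [{"null_count": 0, "min_value": 1, "max_value": 5}, None],
)
```

Keeping one record per target means a consumer can look up the statistics for a column without decoding the statistics of every other column.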

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #44010

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>