
Commit 33a2531

alamb and comphead authored
Clarify documentation about gathering statistics for parquet files (#16157)
* Improve CREATE EXTERNAL TABLE documentation, add note about statistics
* Add comments to SessionContext::read_parquet and register_parquet
* fmt
* Apply suggestions from code review

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
1 parent 081e95c commit 33a2531

2 files changed: +54, -2 lines changed

datafusion/core/src/execution/context/parquet.rs

Lines changed: 23 additions & 0 deletions
@@ -31,6 +31,22 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`super::ListingTable`].
     ///
     /// For an example, see [`read_csv`](Self::read_csv)
+    ///
+    /// # Note: Statistics
+    ///
+    /// NOTE: By default, statistics are not collected when reading the Parquet
+    /// files, as this can slow down the initial DataFrame creation. However,
+    /// collecting statistics can greatly accelerate queries with certain
+    /// filters.
+    ///
+    /// To enable statistics collection, set the [config option]
+    /// `datafusion.execution.collect_statistics` to `true`. See
+    /// [`ConfigOptions`] and [`ExecutionOptions::collect_statistics`] for more
+    /// details.
+    ///
+    /// [config option]: https://datafusion.apache.org/user-guide/configs.html
+    /// [`ConfigOptions`]: crate::config::ConfigOptions
+    /// [`ExecutionOptions::collect_statistics`]: crate::config::ExecutionOptions::collect_statistics
     pub async fn read_parquet<P: DataFilePaths>(
         &self,
         table_paths: P,
@@ -41,6 +57,13 @@ impl SessionContext {

     /// Registers a Parquet file as a table that can be referenced from SQL
     /// statements executed against this context.
+    ///
+    /// # Note: Statistics
+    ///
+    /// Statistics are not collected by default. See [`read_parquet`] for more
+    /// details and how to enable them.
+    ///
+    /// [`read_parquet`]: Self::read_parquet
     pub async fn register_parquet(
         &self,
         table_ref: impl Into<TableReference>,
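
The doc comments above point readers at the `datafusion.execution.collect_statistics` option. As a rough companion sketch (not part of this commit), the option can also be enabled programmatically before reading or registering Parquet files. This assumes a recent `datafusion` crate with a Tokio runtime; the path `data/example.parquet` is purely illustrative, and builder names such as `SessionConfig::with_collect_statistics` should be verified against the version in use.

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Enable statistics collection up front; this corresponds to the
    // `datafusion.execution.collect_statistics` config option.
    let config = SessionConfig::new().with_collect_statistics(true);
    let ctx = SessionContext::new_with_config(config);

    // Hypothetical local path used for illustration. Statistics are now
    // gathered while the table is set up, which can slow down DataFrame
    // creation but may speed up later filtered queries.
    let df = ctx
        .read_parquet("data/example.parquet", ParquetReadOptions::default())
        .await?;
    df.show().await?;

    // register_parquet picks up the same session-level setting.
    ctx.register_parquet("example", "data/example.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM example").await?.show().await?;

    Ok(())
}
```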

docs/source/user-guide/sql/ddl.md

Lines changed: 31 additions & 2 deletions
@@ -82,6 +82,8 @@ For a comprehensive list of format-specific options that can be specified in the
 a path to a file or directory of partitioned files locally or on an
 object store.

+### Example: Parquet
+
 Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement such as the following. It is not necessary to
 provide schema information for Parquet files.

@@ -91,6 +93,23 @@ STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```

+:::{note}
+Statistics
+: By default, when a table is created, DataFusion will _NOT_ read the files
+to gather statistics, which can be expensive but can accelerate subsequent
+queries substantially. If you want to gather statistics
+when creating a table, set the `datafusion.execution.collect_statistics`
+configuration option to `true` before creating the table. For example:
+
+```sql
+SET datafusion.execution.collect_statistics = true;
+```
+
+See the [config settings docs](../configs.md) for more details.
+:::
+
+### Example: Comma Separated Value (CSV)
+
 CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. The schema will be inferred based on
 scanning a subset of the file.

@@ -101,6 +120,8 @@ LOCATION '/path/to/aggregate_simple.csv'
 OPTIONS ('has_header' 'true');
 ```

+### Example: Compression
+
 It is also possible to use compressed files, such as `.csv.gz`:

 ```sql
@@ -111,6 +132,8 @@ LOCATION '/path/to/aggregate_simple.csv.gz'
 OPTIONS ('has_header' 'true');
 ```

+### Example: Specifying Schema
+
 It is also possible to specify the schema manually.

 ```sql
@@ -134,6 +157,8 @@ LOCATION '/path/to/aggregate_test_100.csv'
 OPTIONS ('has_header' 'true');
 ```

+### Example: Partitioned Tables
+
 It is also possible to specify a directory that contains a partitioned
 table (multiple files with the same schema)

@@ -144,7 +169,9 @@ LOCATION '/path/to/directory/of/files'
 OPTIONS ('has_header' 'true');
 ```

-With `CREATE UNBOUNDED EXTERNAL TABLE` SQL statement. We can create unbounded data sources such as following:
+### Example: Unbounded Data Sources
+
+We can create unbounded data sources using the `CREATE UNBOUNDED EXTERNAL TABLE` SQL statement.

 ```sql
 CREATE UNBOUNDED EXTERNAL TABLE taxi
@@ -154,6 +181,8 @@ LOCATION '/mnt/nyctaxi/tripdata.parquet';

 Note that this statement actually reads data from a fixed-size file, so a better example would involve reading from a FIFO file. Nevertheless, once DataFusion sees the `UNBOUNDED` keyword in a data source, it tries to execute queries that refer to this unbounded source in streaming fashion. If this is not possible according to the query specification, plan generation fails, stating that it is not possible to execute the given query in streaming fashion. Note that queries that can run with unbounded sources (i.e. in streaming mode) are a subset of those that can run with bounded sources. A query that fails with unbounded source(s) may work with bounded source(s).

+### Example: `WITH ORDER` Clause
+
 When creating an output from a data source that is already ordered by
 an expression, you can pre-specify the order of the data using the
 `WITH ORDER` clause. This applies even if the expression used for
@@ -190,7 +219,7 @@ WITH ORDER (sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]
 [, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...])
 ```

-### Cautions when using the WITH ORDER Clause
+#### Cautions when using the WITH ORDER Clause

 - It's important to understand that using the `WITH ORDER` clause in the `CREATE EXTERNAL TABLE` statement only specifies the order in which the data should be read from the external file. If the data in the file is not already sorted according to the specified order, then the results may not be correct.