Merged
23 changes: 23 additions & 0 deletions datafusion/core/src/execution/context/parquet.rs
@@ -31,6 +31,22 @@ impl SessionContext {
/// [`read_table`](Self::read_table) with a [`super::ListingTable`].
///
/// For an example, see [`read_csv`](Self::read_csv)
///
/// # Note: Statistics
///
/// By default, statistics are not collected when reading the Parquet
/// files, as this can slow down the initial DataFrame creation. However,
/// collecting statistics can greatly accelerate queries with certain
/// filters.
///
/// To enable statistics collection, set the [config option]
/// `datafusion.execution.collect_statistics` to `true`. See
/// [`ConfigOptions`] and [`ExecutionOptions::collect_statistics`] for more
/// details.
///
/// [config option]: https://datafusion.apache.org/user-guide/configs.html
/// [`ConfigOptions`]: crate::config::ConfigOptions
/// [`ExecutionOptions::collect_statistics`]: crate::config::ExecutionOptions::collect_statistics
pub async fn read_parquet<P: DataFilePaths>(
&self,
table_paths: P,
@@ -41,6 +57,13 @@ impl SessionContext {

/// Registers a Parquet file as a table that can be referenced from SQL
/// statements executed against this context.
///
/// # Note: Statistics
///
/// Statistics are not collected by default. See [`read_parquet`] for more
/// details and how to enable them.
///
/// [`read_parquet`]: Self::read_parquet
pub async fn register_parquet(
&self,
table_ref: impl Into<TableReference>,
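An aside on the option this documentation references: `datafusion.execution.collect_statistics` can also be inspected and changed through DataFusion's SQL interface. A minimal sketch, assuming the usual `SHOW`/`SET` support for configuration variables:

```sql
-- Inspect the current value (the option defaults to false)
SHOW datafusion.execution.collect_statistics;

-- Enable statistics collection for tables and DataFrames created afterwards
SET datafusion.execution.collect_statistics = true;
```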
33 changes: 31 additions & 2 deletions docs/source/user-guide/sql/ddl.md
@@ -82,6 +82,8 @@ For a comprehensive list of format-specific options that can be specified in the
a path to a file or directory of partitioned files locally or on an
object store.

### Example: Parquet

Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement such as the following. It is not necessary to
provide schema information for Parquet files.

```sql
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```

:::{note}
Statistics
: By default, when a table is created, DataFusion will _NOT_ read the files
to gather statistics. Gathering statistics can be expensive, but it can
substantially accelerate subsequent queries. If you want to gather statistics
when creating a table, set the `datafusion.execution.collect_statistics`
configuration option to `true` before creating the table. For example:

```sql
SET datafusion.execution.collect_statistics = true;
```

See the [config settings docs](../configs.md) for more details.
:::
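Putting the note together with the Parquet example above, a complete sequence might look like the following sketch (the table name is illustrative):

```sql
-- Enable statistics collection; it only affects tables created afterwards
SET datafusion.execution.collect_statistics = true;

-- Statistics are gathered while this table is created
CREATE EXTERNAL TABLE taxi_with_stats
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```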

### Example: Comma Separated Value (CSV)

CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. The schema will be inferred based on
scanning a subset of the file.

@@ -101,6 +120,8 @@ LOCATION '/path/to/aggregate_simple.csv'
OPTIONS ('has_header' 'true');
```

### Example: Compression

It is also possible to use compressed files, such as `.csv.gz`:

```sql
@@ -111,6 +132,8 @@ LOCATION '/path/to/aggregate_simple.csv.gz'
OPTIONS ('has_header' 'true');
```

### Example: Specifying Schema

It is also possible to specify the schema manually.

```sql
@@ -134,6 +157,8 @@ LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');
```
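For instance, a sketch with a few illustrative columns (the table name, column names, types, and path below are hypothetical, not taken from the files above):

```sql
CREATE EXTERNAL TABLE my_csv_table (
    c1 VARCHAR NOT NULL,
    c2 INT,
    c3 DOUBLE
)
STORED AS CSV
LOCATION '/path/to/my_table.csv'
OPTIONS ('has_header' 'true');
```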

### Example: Partitioned Tables

It is also possible to specify a directory that contains a partitioned
table (multiple files with the same schema)

@@ -144,7 +169,9 @@ LOCATION '/path/to/directory/of/files'
OPTIONS ('has_header' 'true');
```
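As a sketch, a Hive-style directory layout such as `/path/to/directory/of/files/year=2024/month=01/...` could be declared with a `PARTITIONED BY` clause (the table and partition column names here are illustrative, and the clause placement assumes DataFusion's `CREATE EXTERNAL TABLE` grammar):

```sql
CREATE EXTERNAL TABLE sales
STORED AS CSV
PARTITIONED BY (year, month)
LOCATION '/path/to/directory/of/files'
OPTIONS ('has_header' 'true');
```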

### Example: Unbounded Data Sources

We can create unbounded data sources using the `CREATE UNBOUNDED EXTERNAL TABLE` SQL statement.

```sql
CREATE UNBOUNDED EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```

Note that this statement actually reads data from a fixed-size file, so a better example would involve reading from a FIFO file. Nevertheless, once DataFusion sees the `UNBOUNDED` keyword in a data source, it tries to execute queries that refer to this unbounded source in a streaming fashion. If this is not possible given the query specification, plan generation fails, stating that the given query cannot be executed in streaming fashion. Note that queries that can run with unbounded sources (i.e., in streaming mode) are a subset of those that can run with bounded sources; a query that fails with unbounded source(s) may work with bounded source(s).
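For intuition only, two hypothetical queries against the `taxi` table above (with a made-up `fare_amount` column): an operator that can emit results row by row can stream, whereas a blocking operator such as a full sort must consume the entire input first and therefore cannot run over an unbounded source.

```sql
-- Can run in streaming fashion: rows are filtered as they arrive
SELECT * FROM taxi WHERE fare_amount > 100.0;

-- Generally cannot: a total sort would have to buffer the unbounded input,
-- so planning this query against an unbounded source fails
SELECT * FROM taxi ORDER BY fare_amount;
```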

### Example: `WITH ORDER` Clause

When creating an output from a data source that is already ordered by
an expression, you can pre-specify the order of the data using the
`WITH ORDER` clause. This applies even if the expression used for
@@ -190,7 +219,7 @@ WITH ORDER (sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]
[, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...])
```
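A concrete sketch of the clause (the table, columns, and path are hypothetical, and the file's contents are assumed to already be sorted this way):

```sql
CREATE EXTERNAL TABLE ordered_trips (
    pickup_time TIMESTAMP,
    fare DOUBLE
)
STORED AS CSV
WITH ORDER (pickup_time ASC NULLS LAST)
LOCATION '/path/to/sorted_trips.csv'
OPTIONS ('has_header' 'true');
```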

#### Cautions when using the WITH ORDER Clause

- It's important to understand that using the `WITH ORDER` clause in the `CREATE EXTERNAL TABLE` statement only specifies the order in which the data should be read from the external file. If the data in the file is not already sorted according to the specified order, then the results may not be correct.
