feat: add parquet file reading support for s3fdw #103

Merged: 3 commits, Jun 12, 2023

docs/s3.md (63 changes: 59 additions & 4 deletions)

@@ -1,7 +1,8 @@
-[AWS S3](https://aws.amazon.com/s3/) is an object storage service offering industry-leading scalability, data availability, security, and performance. The S3 wrapper is under development. It is read-only and supports 2 file formats:
+[AWS S3](https://aws.amazon.com/s3/) is an object storage service offering industry-leading scalability, data availability, security, and performance. The S3 wrapper is under development. It is read-only and supports the following file formats:

1. CSV - with or without header line
2. [JSON Lines](https://jsonlines.org/)
3. [Parquet](https://parquet.apache.org/)

The S3 FDW also supports the following compression algorithms:

@@ -10,7 +11,27 @@ The S3 FDW also supports the following compression algorithms:
3. xz
4. zlib

-**Note: currently all columns in S3 files must be defined in the foreign table and their types must be `text` type**
+**Note for CSV and JSONL files: currently all columns in S3 files must be defined in the foreign table, and their type must be `text`**

**Note for Parquet files: a compressed Parquet file is loaded into local memory in its entirety, so keep compressed Parquet files small**

### Supported Data Types for Parquet Files

The S3 FDW reads Parquet data using the types in [arrow_array::types](https://docs.rs/arrow-array/41.0.0/arrow_array/types/index.html); their mappings to Postgres data types are listed below.

| Postgres Type | Parquet Type |
| ------------------ | ------------------------ |
| boolean | BooleanType |
| char | Int8Type |
| smallint | Int16Type |
| real | Float32Type |
| integer | Int32Type |
| double precision | Float64Type |
| bigint | Int64Type |
| numeric | Float64Type |
| text | ByteArrayType |
| date | Date64Type |
| timestamp | TimestampNanosecondType |

### Wrapper
To get started with the S3 wrapper, create a foreign data wrapper specifying `handler` and `validator` as below.
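
The full statement sits in the collapsed part of the diff; as a minimal sketch, assuming the handler and validator functions follow the Wrappers naming convention (`s3_fdw_handler`, `s3_fdw_validator`):

```sql
-- Assumed function names per the Wrappers convention; check the full docs.
create foreign data wrapper s3_wrapper
  handler s3_fdw_handler
  validator s3_fdw_validator;
```
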
@@ -90,14 +111,17 @@ create server s3_server

The S3 wrapper is implemented with an [ELT](https://hevodata.com/learn/etl-vs-elt/) approach, so data transformation should be performed locally after the data is extracted from the remote data source.

-One file in S3 corresponds a foreign table in Postgres, all columns must be present in the foreign table and type must be `text`. You can do custom transformations, like type conversion, by creating a view on top of the foreign table or using a subquery.
+One file in S3 corresponds to one foreign table in Postgres. For CSV and JSONL files, all columns must be present in the foreign table and their type must be `text`. You can do custom transformations, like type conversion, by creating a view on top of the foreign table or using a subquery.
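
As an illustration of the view-based approach, a minimal sketch, assuming a CSV-backed foreign table named `s3_table_csv` with made-up columns (all `text`, as required):

```sql
-- Hypothetical view that casts the text columns of a CSV foreign table
-- into proper Postgres types; table and column names are examples only.
create view s3_table_csv_typed as
select
  name,
  age::integer as age,
  height::numeric as height
from s3_table_csv;
```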

For Parquet files, not all columns need to be defined in the foreign table, but the column names must match between the Parquet file and the foreign table.
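
For example, a sketch of a foreign table exposing only a subset of a Parquet file's columns (assuming the same hypothetical file used in the examples further below):

```sql
-- Hypothetical: expose only two of the Parquet file's columns;
-- the column names must match those stored in the Parquet file.
create foreign table s3_table_parquet_subset (
  id integer,
  bool_col boolean
)
server s3_server
options (
  uri 's3://bucket/s3_table.parquet',
  format 'parquet'
);
```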


#### Foreign Table Options

The full list of foreign table options is below:

- `uri` - S3 URI, required. For example, `s3://bucket/s3_table.csv`
-- `format` - File format, required. `csv` or `jsonl`
+- `format` - File format, required. `csv`, `jsonl`, or `parquet`
- `has_header` - Whether the CSV file has a header line, optional. `true` or `false`, default is `false`
- `compress` - Compression algorithm, optional. One of `gzip`, `bzip2`, `xz`, `zlib`, default is no compression
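
The collapsed portion of the diff carries the full examples; as a quick illustration of these options, a hedged sketch for a JSON Lines file (bucket path and column names are made up, and columns are `text` as required):

```sql
-- Hypothetical example: JSON Lines file, no compression.
-- All columns must be declared as text for the jsonl format.
create foreign table s3_table_jsonl (
  name text,
  age text
)
server s3_server
options (
  uri 's3://bucket/s3_table.jsonl',
  format 'jsonl'
);
```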

@@ -148,5 +172,36 @@ create foreign table s3_table_csv_gzip (
has_header 'true',
compress 'gzip'
);

-- Parquet file, no compression
create foreign table s3_table_parquet (
id integer,
bool_col boolean,
bigint_col bigint,
float_col real,
date_string_col text,
timestamp_col timestamp
)
server s3_server
options (
uri 's3://bucket/s3_table.parquet',
format 'parquet'
);

-- GZIP compressed Parquet file
create foreign table s3_table_parquet_gz (
id integer,
bool_col boolean,
bigint_col bigint,
float_col real,
date_string_col text,
timestamp_col timestamp
)
server s3_server
options (
uri 's3://bucket/s3_table.parquet.gz',
format 'parquet',
compress 'gzip'
);
```
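
Once defined, these foreign tables can be queried like ordinary tables; a small usage sketch against the Parquet example above:

```sql
-- Projection and filtering happen in Postgres after the file is read from S3.
select id, timestamp_col
from s3_table_parquet
where bool_col;
```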
