Description
Motivation
Doris' current data export method is relatively simple: data can only be exported through the Export feature, which exports data only at the partition level and does not support SQL features such as expressions, joins, and GROUP BY.
In some cases, Doris users need to export query results:
- Export query results so that other users can download them.
- Export query results for further processing by other systems.
- Bitmap export: users sometimes store a very large number of IDs (100 million+) in bitmap columns and need to get those IDs back. Doris currently does not support UDTFs, and returning such a result set through the MySQL protocol is impractical because it is too large.
Alternatives
Support SELECT ... FROM tbl INTO OUTFILE xxx ...
This is the most intuitive and user-friendly way. The grammar is taken from the MySQL grammar manual. In MySQL's implementation, the exported file is written to the local file system of the MySQL server. Since Doris is a distributed system, query results may be produced on any node, so there are two options for exporting:
1. Write the result to the local file system of a BE, and provide users with an HTTP link for downloading.
This option is relatively simple, but it involves some problems:
- File cleanup. The result file may be very large and take up too much BE disk space, and it cannot be kept forever, so a mechanism is needed to clean it up regularly.
- Connectivity. The user may not have permission to access a BE node, or may not be able to reach it directly, so the file cannot be downloaded.
The first problem can be addressed by limiting the size of a single file, monitoring disk space, and adding a file timeout with regular cleanup. The second problem cannot be fully solved, but this option can still serve as a temporary solution for some users.
2. Write the result to a remote storage system, such as HDFS.
This option can use the Broker to write files to the remote system, which solves both problems of Option 1.
I originally planned to implement Option 1 in the first phase so that users without remote storage could use this feature right away, and then support Option 2 later with only small changes by adding a different file-writing interface. However, I now plan to implement Option 2 first: for local files we would have to handle all kinds of messy details, such as authentication, GC, and disk space, whereas with remote storage all of this becomes simple. A possible file-writing abstraction is sketched below.
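For illustration, a minimal sketch of such a file-writing abstraction. All class and method names here are hypothetical placeholders, not the actual BE interface:

```cpp
#include <cstddef>
#include <string>

// Hypothetical abstraction: the result writer pushes bytes through this
// interface, so supporting Option 2 only means adding a Broker-backed
// implementation next to the local one.
class FileWriterInterface {
public:
    virtual ~FileWriterInterface() = default;
    virtual void open(const std::string& path) = 0;
    virtual void write(const char* data, size_t len) = 0;
    virtual void close() = 0;
};

// Option 1: write to the BE's local file system.
class LocalFileWriter : public FileWriterInterface {
public:
    void open(const std::string& path) override { /* open a local file */ }
    void write(const char* data, size_t len) override { /* append bytes */ }
    void close() override { /* flush and close the file */ }
};

// Option 2: forward the bytes to a Broker, which writes to HDFS etc.
class BrokerFileWriter : public FileWriterInterface {
public:
    void open(const std::string& path) override { /* broker open RPC */ }
    void write(const char* data, size_t len) override { /* broker write RPC */ }
    void close() override { /* broker close RPC */ }
};
```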
Other systems
AWS Redshift provides an UNLOAD command to satisfy this requirement, which acts much like SELECT INTO OUTFILE in MySQL. Vertica and Snowflake provide shell tools to export query results, and Snowflake also supports a COPY INTO command.
Export format
CSV is a common way to export query results. But some columns have binary types, such as BITMAP_UNION and HLL, so we need a way to store these kinds of types. Parquet, I think, is a good format for this: binary values can be stored using Parquet's BYTE_ARRAY type, as sketched below. Parquet does not support LARGEINT; I will think about that later, but Parquet works for most cases.
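As a sketch, assuming the parquet-cpp API that ships with Apache Arrow, a column holding serialized bitmap bytes could be declared as BYTE_ARRAY like this (the column names are made up):

```cpp
#include <memory>
#include <parquet/schema.h>

using parquet::Repetition;
using parquet::Type;
using parquet::schema::GroupNode;
using parquet::schema::NodeVector;
using parquet::schema::PrimitiveNode;

// Schema for an export with one key column and one binary (bitmap) column.
std::shared_ptr<GroupNode> make_export_schema() {
    NodeVector fields;
    fields.push_back(
        PrimitiveNode::Make("id", Repetition::REQUIRED, Type::INT64));
    // The serialized bytes of a BITMAP/HLL value go into a BYTE_ARRAY column.
    fields.push_back(
        PrimitiveNode::Make("user_bitmap", Repetition::OPTIONAL,
                            Type::BYTE_ARRAY));
    return std::static_pointer_cast<GroupNode>(
        GroupNode::Make("export_schema", Repetition::REQUIRED, fields));
}
```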
Implementation
Syntax
```sql
query_stmt
INTO OUTFILE "file:///path/to/file_prefix"
FORMAT AS CSV|PARQUET
WITH BROKER broker_name
(broker_properties)
PROPERTIES
(other_properties);
```
Example:
```sql
SELECT a, sum(b) FROM tbl1 GROUP BY a LIMIT 10
INTO OUTFILE "file:///path/to/file_prefix"
FORMAT AS CSV
WITH BROKER my_broker
("username" = "user", "password" = "pwd")
PROPERTIES
("column_separator" = ",");
```
In MySQL, the grammar is slightly different: it puts clauses such as FIELDS TERMINATED BY ... at the end of the statement. I think that style is not good for future extension, so I decided to use a property map instead.
The file_prefix is the prefix of the exported files. The final file names are assembled as prefix + sequence number + format suffix, like:
file_prefix_0.csv
file_prefix_1.csv
file_prefix_2.csv
file_prefix_0.parquet
file_prefix_1.parquet
file_prefix_2.parquet
Result Sink
The ResultSink class in the BE currently contains a ResultWriter and a Sender: the ResultWriter writes result batches to the Sender, and the Sender sends them back to the client through the FE. I will make ResultWriter an abstract class and implement two subclasses: MysqlResultWriter and FileResultWriter. MysqlResultWriter behaves the same as the old ResultWriter, while FileResultWriter is a new class that writes the results to files, as sketched below.
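A minimal sketch of the proposed hierarchy; the method names and the simplified RowBatch stand-in are placeholders, not the actual BE code:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the BE's batch of result rows.
using RowBatch = std::vector<std::string>;

// Abstract writer: consumes result batches regardless of destination.
class ResultWriter {
public:
    virtual ~ResultWriter() = default;
    virtual void append_row_batch(const RowBatch& batch) = 0;
};

// Sends each batch back over the MySQL protocol, like the old ResultWriter.
class MysqlResultWriter : public ResultWriter {
public:
    void append_row_batch(const RowBatch& batch) override {
        // ... pack rows into MySQL protocol packets and hand them to the Sender
    }
};

// Encodes each batch as CSV or Parquet and appends it to the export file.
class FileResultWriter : public ResultWriter {
public:
    void append_row_batch(const RowBatch& batch) override {
        // ... format rows and write them through a file-writing interface
    }
};
```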
Query Plan
There is almost no need to change the query plan; we only add information to TResultSink telling the BE which ResultWriter to use, for example as in the sketch below.
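For illustration, the BE could select the writer from a type field carried in TResultSink. The enum and factory below are assumptions, reusing the writer classes sketched above:

```cpp
#include <memory>

// Hypothetical sink type carried in TResultSink.
enum class TResultSinkType { MYSQL_PROTOCOL, FILE };

// Pick the concrete writer based on the plan's sink type.
std::unique_ptr<ResultWriter> create_result_writer(TResultSinkType type) {
    if (type == TResultSinkType::FILE) {
        return std::make_unique<FileResultWriter>();
    }
    return std::make_unique<MysqlResultWriter>();
}
```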
Statement return value
In MySQL, SELECT INTO OUTFILE returns the number of rows exported:

```
mysql> select * from tbl1 into outfile "/tmp/file_prefix";
Query OK, 100 rows affected (0.06 sec)
```

We will keep consistent with this behavior.
Options
Some options can be specified in the PROPERTIES clause of the statement:
- column_separator: the column separator of the output file. CSV format only.
- line_delimiter: the line delimiter of the output file. CSV format only.
- max_file_size_bytes: if the exported result is larger than this size, it is split into multiple files, as sketched below.
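A sketch of how max_file_size_bytes could drive file rolling; the class name and the byte-counting strategy are assumptions:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Rolls to a new output file whenever the current one would exceed the limit,
// producing file_prefix_0.csv, file_prefix_1.csv, and so on.
class SplittingFileWriter {
public:
    SplittingFileWriter(std::string prefix, std::string suffix,
                        int64_t max_file_size_bytes)
        : prefix_(std::move(prefix)),
          suffix_(std::move(suffix)),
          max_bytes_(max_file_size_bytes) {
        open_next();
    }

    void write_row(const std::string& row) {
        if (written_ + static_cast<int64_t>(row.size()) > max_bytes_) {
            open_next();  // close the current file and start the next one
        }
        out_ << row;
        written_ += row.size();
    }

private:
    void open_next() {
        if (out_.is_open()) out_.close();
        out_.open(prefix_ + "_" + std::to_string(seq_++) + suffix_);
        written_ = 0;
    }

    std::string prefix_;
    std::string suffix_;
    int64_t max_bytes_;
    int seq_ = 0;
    int64_t written_ = 0;
    std::ofstream out_;
};
```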
Remaining problems
- Whether to support asynchronous export operations.
Some queries may take a long time or produce a large result set, which may cause the request to occupy the connection for a long time. We could support asynchronous commands and use a session variable to control whether they are used.
Scheduling Plan
- Implement file export in CSV format
- Support PARQUET format
- Support splitting output files