
Add export capabilities to MSQ with SQL syntax #15689

Merged: 53 commits into apache:master on Feb 7, 2024

Conversation

@adarshsanjeev (Contributor) commented Jan 16, 2024

Problem

Druid currently does not allow tables to be exported programmatically. While it is possible to download results from a SELECT query, this relies on writing the results to a single query report, which cannot support large datasets. An export syntax that writes the results in a desired format directly to an external location (such as S3 or HDFS) would be useful.

(INSERT/REPLACE) INTO
EXTERN(<external source function>)
AS <format>
[OVERWRITE ALL]
<select query>

For example, a statement to export all rows from a table into S3 as CSV files would look like:

REPLACE INTO 
EXTERN(s3(bucket='bucket1', prefix='export/', tempDir='/var/temp'))
AS CSV
OVERWRITE ALL
SELECT * FROM wikipedia

Initially, only CSV is supported as an export format, but this can be expanded to support other formats easily.
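Going by the grammar above, the INSERT form of an export (which appends rather than overwrites, and so presumably omits the optional OVERWRITE ALL clause) would look like:

```sql
INSERT INTO
EXTERN(s3(bucket='bucket1', prefix='export/', tempDir='/var/temp'))
AS CSV
SELECT * FROM wikipedia
```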

Release note

  • Adds export statements to MSQ as part of INSERT and REPLACE statements. This allows the results of a query to be written to a destination in a configurable format.

Key changed/added classes in this PR
  • sql/src/main/codegen/includes/common.ftl
  • sql/src/main/codegen/includes/replace.ftl
  • IngestHandler

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jan 16, 2024
@cryptoe cryptoe added this to the Druid 29.0.0 milestone Jan 17, 2024
@adarshsanjeev adarshsanjeev marked this pull request as ready for review January 23, 2024 05:30
@adarshsanjeev adarshsanjeev added the Needs web console change Backend API changes that would benefit from frontend support in the web console label Jan 23, 2024
@vogievetsky (Contributor) left a comment:

Thank you for incorporating my feedback

@317brian (Contributor) left a comment:

I made some copyedits to the docs as suggestions. They can either be merged as part of this PR, or I can open a followup PR with the changes.

(Resolved review comments on docs/multi-stage-query/concepts.md and docs/multi-stage-query/reference.md)
Comment on lines +99 to +100
This variation of EXTERN requires one argument, the details of the destination as specified below.
This variation additionally requires an `AS` clause to specify the format of the exported rows.
@317brian (Contributor) commented Feb 6, 2024:
Suggested change:
- This variation of EXTERN requires one argument, the details of the destination as specified below.
- This variation additionally requires an `AS` clause to specify the format of the exported rows.
+ This variation of EXTERN has two required parts: an argument that details the destination and an `AS` clause to specify the format of the exported rows.

@adarshsanjeev (Contributor, Author) commented Feb 7, 2024:

The AS clause would not be an argument to extern, it's present elsewhere in the query. Would it be confusing to call it an argument?

Contributor reply:

How about the change I just made?

(Additional resolved review comments on docs/multi-stage-query/reference.md)
@cryptoe (Contributor) left a comment:

Left some comments around removal of dead code and error messages.

} else {
-  throw new ISE("Unsupported destination [%s]", querySpec.getDestination());
+  shuffleSpecFactory = querySpec.getDestination()
+      .getShuffleSpecFactory(MultiStageQueryContext.getRowsPerPage(querySpec.getQuery().context()));
Contributor:

Thanks for the refactor. It's much cleaner now.
We should add a comment saying all select partitions are controlled by a context value rowsPerPage.

@adarshsanjeev (Contributor, Author):

Do you mean a comment everywhere the function is being called? We don't pass the whole context to getShuffleSpecFactory(), just the integer, so would this need to be specifically mentioned somewhere?
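For what it's worth, one possible placement of the requested comment at the call site quoted above (a sketch of the suggestion, not the merged code):

```java
} else {
  // All select/export partitions are sized by the 'rowsPerPage' query context value;
  // only the resolved integer is passed down to the factory here.
  shuffleSpecFactory = querySpec.getDestination()
      .getShuffleSpecFactory(MultiStageQueryContext.getRowsPerPage(querySpec.getQuery().context()));
}
```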

if (Intervals.ONLY_ETERNITY.equals(exportMSQDestination.getReplaceTimeChunks())) {
StorageConnector storageConnector = storageConnectorProvider.get();
try {
storageConnector.deleteRecursively("");
Contributor:

Also, code-flow-wise, I think make query definition may not be the correct place to delete the file. Maybe it can be done after we create the query definition object (clear files if needed).

(Resolved review comments on docs/multi-stage-query/concepts.md and docs/multi-stage-query/reference.md)
return;
}
}
throw DruidException.forPersona(DruidException.Persona.USER)
Contributor:

This is a user-facing error. Please mention that the user should reach out to the cluster admin for the paths for export. The paths are controlled via xxx property.

Contributor:

Should the error be better addressed for Persona.ADMIN then?

@adarshsanjeev (Contributor, Author):

The error message is more likely to be due to user error in specifying the path than a permission issue, so keeping it as user makes sense.
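For reference, a minimal sketch of a user-facing message along the lines suggested, continuing the DruidException.forPersona chain quoted above (the ofCategory call, the message text, and the exportPath variable are assumptions for illustration, not the merged code):

```java
throw DruidException.forPersona(DruidException.Persona.USER)
                    .ofCategory(DruidException.Category.INVALID_INPUT)
                    .build(
                        "Export path [%s] is not allowed. Please reach out to your cluster "
                        + "administrator for the list of allowed export paths.",
                        exportPath
                    );
```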

@cryptoe cryptoe merged commit 514b3b4 into apache:master Feb 7, 2024
83 checks passed
adarshsanjeev added a commit to adarshsanjeev/druid that referenced this pull request Feb 8, 2024
* Add test

* Parser changes to support export statements

* Fix builds

* Address comments

* Add frame processor

* Address review comments

* Fix builds

* Update syntax

* Webconsole workaround

* Refactor

* Refactor

* Change export file path

* Update docs

* Remove webconsole changes

* Fix spelling mistake

* Parser changes, add tests

* Parser changes, resolve build warnings

* Fix failing test

* Fix failing test

* Fix IT tests

* Add tests

* Cleanup

* Fix unparse

* Fix forbidden API

* Update docs

* Update docs

* Address review comments

* Address review comments

* Fix tests

* Address review comments

* Fix insert unparse

* Add external write resource action

* Fix tests

* Add resource check to overlord resource

* Fix tests

* Add IT

* Update syntax

* Update tests

* Update permission

* Address review comments

* Address review comments

* Address review comments

* Add tests

* Add check for runtime parameter for bucket and path

* Add check for runtime parameter for bucket and path

* Add tests

* Update docs

* Fix NPE

* Update docs, remove deadcode

* Fix formatting
abhishekagarwal87 pushed a commit that referenced this pull request Feb 8, 2024
[Backport] Add export capabilities to MSQ with SQL syntax
abhishekagarwal87 pushed a commit that referenced this pull request Apr 5, 2024
Support for exporting msq results to gcs bucket. This is essentially copying the logic of s3 export for gs, originally done by @adarshsanjeev in this PR - #15689
Labels
Area - Batch Ingestion Area - Documentation Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Needs web console change Backend API changes that would benefit from frontend support in the web console
7 participants