backupccl: allow planning a backup in user txn #47539
Ideally, we would like to return just the job id as the result of executing backup/import/restore statements if those statements run under an explicit transaction. To do this, we need to add a new AST node to indicate that we would like to execute a job in the background. We have discussed various options:
-- regular backup for reference
-- regular imports
IMPORT IN BACKGROUND INTO foo CSV DATA 'foo://bar';
My preference is the "ASYNC" word, but
Some ideas that have come out of prior discussion for syntax to request this behavior are below. A regular BACKUP TO 'foo://bar' WITH x; is shown for reference. Some of these would then look like:
Similarly, here are a few forms of IMPORTs that run blocking right now:
And with some of those proposed modifiers in them:
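The concrete statement lists were truncated in this transcript; based on the modifiers discussed in this thread (IN BACKGROUND, ASYNC, DETACHED), the proposed forms would look roughly like the following. These are hypothetical sketches of candidate syntax, not the final grammar:

```sql
-- blocking form, for reference
IMPORT INTO foo CSV DATA 'foo://bar';

-- proposed non-blocking variants (suffix modifiers)
IMPORT INTO foo CSV DATA 'foo://bar' IN BACKGROUND;
IMPORT INTO foo CSV DATA 'foo://bar' RUNNING ASYNC;
BACKUP TO 'foo://bar' WITH x DETACHED;
```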
One of the motivations here is jobs that run arbitrary SQL. In that case we'll wrap it up in some serialized record and then, when it is time to run it, we'll unpack that and execute the payload. Importantly, we will not be executing that payload or planning the statements in it using the user's transaction in which they ran it.
I think my preference leans towards a trailing modifier, of either "RETURNING JOB" or "IN BACKGROUND".
Between "RETURNING JOB" and "IN BACKGROUND" my preference would be "IN BACKGROUND". "RETURNING JOB" -- not sure what that means, really. I think "IN BACKGROUND" is more explicit. |
Okay, let's go ahead and prototype with "IN BACKGROUND".
Looks like "IN BACKGROUND" is a no go. "IN" is an operator, so "IN BACKGROUND" makes I would still prefer to have suffix options (as opposed to e.g. IMPORT ASYNC...).
IMPORT INTO foo CSV DATA 'foo://bar' RUNNING ASYNC;
I kinda like option 2: DETACHED. @dt what's your take? Any other options/suggestions?
Just published #49172 to have something concrete to discuss. I implemented the "DETACHED" option.
I'd like to hear more:
I'm no yacc expert (never actually had a chance to work with yacc), but it appears that if I use "IN BACKGROUND", I get an error: the grammar is ambiguous because there are 2 ways to interpret "IN BACKGROUND", as "IN" (args) and as "IN BACKGROUND". The "IN" operator is currently marked as a "%nonassoc" token, which I think explains why there is ambiguity. Perhaps I could spend (probably quite a bit more) time to get it to work (and perhaps "IN" can be changed to be "%left" associative). I didn't think it was that critical to get "IN BACKGROUND" working. Also, I personally really dislike overloading the behavior of an operator.

Since I didn't want to spend more time wrestling w/ yacc, changing the grammar to make a (likely) dangerous change to the "IN" precedence and associativity, I just wanted to come up with something that is synonymous in meaning and spirit to "in background". The first candidate was ASYNC, and my initial attempt actually did use ASYNC. However, I felt that "async" is a bit too much "tech speak". It's an abbreviation; it's not as clear as "in background". Then I thought, how do I specify "in background" normally? That's the "&" ampersand on the command line; that's also how "daemons" run. And that's where "DETACHED" came in. I chose "DETACHED" because I thought it better than "ASYNC", though "ASYNC" at the end of the statement works equally well. I guess it's a matter of what we think is a nicer/clearer syntax.

Regarding "RETURNING NOTHING" -- I discarded this approach because we are returning something, namely the job id. Also, I didn't want to use "RETURNING" at all, because I thought: what if the user wants to run an (attached) job, but doesn't care about all of the result fields and just wants to see a subset of results, e.g. "IMPORT ... RETURNING status" (that's not currently supported, but I didn't want to close the door on that).
I could have used "RETURNING NOTHING", but then I'm overloading the meaning of "RETURNING" to mean both how the job runs and what it returns. Again, I don't like overloading like that. Another knock against "RETURNING" is that what it returns is actually an expression list. Expressions are evaluated at the time the plan hook runs.
The following diff does not produce any yacc error:

diff --git a/pkg/sql/parser/sql.y b/pkg/sql/parser/sql.y
index 2ed9c5f892..923fd26128 100644
--- a/pkg/sql/parser/sql.y
+++ b/pkg/sql/parser/sql.y
@@ -549,7 +549,7 @@ func (u *sqlSymUnion) alterTypeAddValuePlacement() *tree.AlterTypeAddValuePlacem
%token <str> ALL ALTER ALWAYS ANALYSE ANALYZE AND AND_AND ANY ANNOTATE_TYPE ARRAY AS ASC
%token <str> ASYMMETRIC AT ATTRIBUTE AUTHORIZATION AUTOMATIC
-%token <str> BACKUP BEFORE BEGIN BETWEEN BIGINT BIGSERIAL BIT
+%token <str> BACKGROUND BACKUP BEFORE BEGIN BETWEEN BIGINT BIGSERIAL BIT
%token <str> BUCKET_COUNT
%token <str> BOOLEAN BOTH BUNDLE BY
@@ -2111,46 +2111,50 @@ import_format:
// comment = '...' [CSV-specific]
//
// %SeeAlso: CREATE TABLE
+
+opt_in_bg:
+ | IN BACKGROUND
+
import_stmt:
- IMPORT import_format '(' string_or_placeholder ')' opt_with_options
+ IMPORT opt_in_bg import_format '(' string_or_placeholder ')' opt_with_options
{
/* SKIP DOC */
- $$.val = &tree.Import{Bundle: true, FileFormat: $2, Files: tree.Exprs{$4.expr()}, Options: $6.kvOptions()}
+ $$.val = &tree.Import{Bundle: true, FileFormat: $3, Files: tree.Exprs{$5.expr()}, Options: $7.kvOptions()}
}
-| IMPORT import_format string_or_placeholder opt_with_options
+| IMPORT opt_in_bg import_format string_or_placeholder opt_with_options
{
- $$.val = &tree.Import{Bundle: true, FileFormat: $2, Files: tree.Exprs{$3.expr()}, Options: $4.kvOptions()}
+ $$.val = &tree.Import{Bundle: true, FileFormat: $3, Files: tree.Exprs{$4.expr()}, Options: $5.kvOptions()}
}
-| IMPORT TABLE table_name FROM import_format '(' string_or_placeholder ')' opt_with_options
+| IMPORT opt_in_bg TABLE table_name FROM import_format '(' string_or_placeholder ')' opt_with_options
{
/* SKIP DOC */
- name := $3.unresolvedObjectName().ToTableName()
- $$.val = &tree.Import{Bundle: true, Table: &name, FileFormat: $5, Files: tree.Exprs{$7.expr()}, Options: $9.kvOptions()}
+ name := $4.unresolvedObjectName().ToTableName()
+ $$.val = &tree.Import{Bundle: true, Table: &name, FileFormat: $6, Files: tree.Exprs{$8.expr()}, Options: $10.kvOptions()}
}
-| IMPORT TABLE table_name FROM import_format string_or_placeholder opt_with_options
+| IMPORT opt_in_bg TABLE table_name FROM import_format string_or_placeholder opt_with_options
{
- name := $3.unresolvedObjectName().ToTableName()
- $$.val = &tree.Import{Bundle: true, Table: &name, FileFormat: $5, Files: tree.Exprs{$6.expr()}, Options: $7.kvOptions()}
+ name := $4.unresolvedObjectName().ToTableName()
+ $$.val = &tree.Import{Bundle: true, Table: &name, FileFormat: $6, Files: tree.Exprs{$7.expr()}, Options: $8.kvOptions()}
}
-| IMPORT TABLE table_name CREATE USING string_or_placeholder import_format DATA '(' string_or_placeholder_list ')' opt_with_options
+| IMPORT opt_in_bg TABLE table_name CREATE USING string_or_placeholder import_format DATA '(' string_or_placeholder_list ')' opt_with_options
{
- name := $3.unresolvedObjectName().ToTableName()
- $$.val = &tree.Import{Table: &name, CreateFile: $6.expr(), FileFormat: $7, Files: $10.exprs(), Options: $12.kvOptions()}
+ name := $4.unresolvedObjectName().ToTableName()
+ $$.val = &tree.Import{Table: &name, CreateFile: $7.expr(), FileFormat: $8, Files: $11.exprs(), Options: $13.kvOptions()}
}
-| IMPORT TABLE table_name '(' table_elem_list ')' import_format DATA '(' string_or_placeholder_list ')' opt_with_options
+| IMPORT opt_in_bg TABLE table_name '(' table_elem_list ')' import_format DATA '(' string_or_placeholder_list ')' opt_with_options
{
- name := $3.unresolvedObjectName().ToTableName()
- $$.val = &tree.Import{Table: &name, CreateDefs: $5.tblDefs(), FileFormat: $7, Files: $10.exprs(), Options: $12.kvOptions()}
+ name := $4.unresolvedObjectName().ToTableName()
+ $$.val = &tree.Import{Table: &name, CreateDefs: $6.tblDefs(), FileFormat: $8, Files: $11.exprs(), Options: $13.kvOptions()}
}
-| IMPORT INTO table_name '(' insert_column_list ')' import_format DATA '(' string_or_placeholder_list ')' opt_with_options
+| IMPORT opt_in_bg INTO table_name '(' insert_column_list ')' import_format DATA '(' string_or_placeholder_list ')' opt_with_options
{
- name := $3.unresolvedObjectName().ToTableName()
- $$.val = &tree.Import{Table: &name, Into: true, IntoCols: $5.nameList(), FileFormat: $7, Files: $10.exprs(), Options: $12.kvOptions()}
+ name := $4.unresolvedObjectName().ToTableName()
+ $$.val = &tree.Import{Table: &name, Into: true, IntoCols: $6.nameList(), FileFormat: $8, Files: $11.exprs(), Options: $13.kvOptions()}
}
-| IMPORT INTO table_name import_format DATA '(' string_or_placeholder_list ')' opt_with_options
+| IMPORT opt_in_bg INTO table_name import_format DATA '(' string_or_placeholder_list ')' opt_with_options
{
- name := $3.unresolvedObjectName().ToTableName()
- $$.val = &tree.Import{Table: &name, Into: true, IntoCols: nil, FileFormat: $4, Files: $7.exprs(), Options: $9.kvOptions()}
+ name := $4.unresolvedObjectName().ToTableName()
+ $$.val = &tree.Import{Table: &name, Into: true, IntoCols: nil, FileFormat: $5, Files: $8.exprs(), Options: $10.kvOptions()}
}
| IMPORT error // SHOW HELP: IMPORT
@@ -10201,6 +10205,7 @@ unreserved_keyword:
| ATTRIBUTE
| AUTOMATIC
| AUTHORIZATION
+| BACKGROUND
| BACKUP
| BEFORE
| BEGIN
Also you could out
I personally recommend putting the new qualifier completely at the end. The rationale is that it makes "adding the qualifier to an existing SQL string" easier, by simple string concatenation.
Interesting... I didn't even try "IN BACKGROUND" as a "prefix" option (IMPORT IN BACKGROUND); I always had it as a suffix. I don't see in your patch that you actually set any values for e.g. the import specification. Like, where is the information on whether or not that option was specified?
As I said, I'm no yacc expert; it may well be possible to make "in background" work. I have defined the detached option (and previously tried the same with "in background") as follows:
I think I need to have a true/false value on whether that option was set. And then, the statements need to set the true/false value for that option as well. Regardless though: do we really think "IN BACKGROUND" is better than "DETACHED"?
Yes, looking into it now, thanks.
Here is my updated backup statement:
btw it's SQL, so we don't do abbreviations. It would be "ASYNCHRONOUS", not "async" -- but that's a detail. I'm not going to let the keyword discussion block you, but give me a minute to try out the other UX thing I'm thinking about.
Google gives the definition of detached as "separate or disconnected"... The disconnected bit is exactly what this is. Detached, I think, is equally familiar to anybody who has used a command line, or a batch control system. I'm not hard set on "detached" either way -- it's just something that I could prototype to start the discussion.
I have an intermediate finding for y'all: there was a mistake made in the past in what gets passed here. There is thus a large exposure by the backup statement to future breakage due to further innovation/additions in SQL scalar expressions. Not sure what to make of it, but it explains why "IN" didn't work at the end.
Ok so the idea I had doesn't work because of the thing I just found. How am I to remember that "AS OF" must be present before "WITH", and "INCREMENTAL" must be after "AS OF" but before "WITH"? Conversely, what's fundamentally blocking the ability to accept these things in arbitrary order? (There's not a very good reason -- the main reason is the mistake pointed to above.) And now you're thinking about adding yet one more positional argument, which compounds the difficulty of remembering the order. IMHO I'd rather see the new "detached" option come as an option after WITH, for this reason. This way I know I can stick it in there, interleaved in any order with the other options. Arguably I'd find it easier to use if the "incremental" option was also after WITH, for the same reason.
Ooh, actually I just discovered that if we adopt DETACHED instead of IN BACKGROUND, we can do the "any order" thing. I'll show you the diff in a moment.
The problem with adding an "async/detached/background" option to the "WITH" option clause is that it is evaluated too late.
FWIW, I think we can add keyword/existence-only options to the WITH clause.
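A minimal sketch of how an existence-only (value-less) option could be admitted into the kv_option production shown in the diffs above, assuming a zero tree.KVOption Value denotes "no value". This is illustrative only; the merged change may have taken a different route:

```yacc
kv_option:
  name '=' string_or_placeholder
  {
    $$.val = tree.KVOption{Key: tree.Name($1), Value: $3.expr()}
  }
| name
  {
    // existence-only option, e.g. WITH detached: no '=' value attached
    $$.val = tree.KVOption{Key: tree.Name($1)}
  }
```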
Here's your diff:

diff --git a/pkg/sql/parser/sql.y b/pkg/sql/parser/sql.y
index 2ed9c5f892..0710c3f528 100644
--- a/pkg/sql/parser/sql.y
+++ b/pkg/sql/parser/sql.y
@@ -563,7 +563,7 @@ func (u *sqlSymUnion) alterTypeAddValuePlacement() *tree.AlterTypeAddValuePlacem
%token <str> CURRENT_USER CYCLE
%token <str> DATA DATABASE DATABASES DATE DAY DEC DECIMAL DEFAULT DEFAULTS
-%token <str> DEALLOCATE DECLARE DEFERRABLE DEFERRED DELETE DESC
+%token <str> DEALLOCATE DECLARE DEFERRABLE DEFERRED DELETE DESC DETACHED
%token <str> DISCARD DISTINCT DO DOMAIN DOUBLE DROP
%token <str> ELSE ENCODING END ENUM ESCAPE EXCEPT EXCLUDE EXCLUDING
@@ -835,9 +835,9 @@ func (u *sqlSymUnion) alterTypeAddValuePlacement() *tree.AlterTypeAddValuePlacem
%type <tree.Statement> declare_cursor_stmt
%type <tree.Statement> reindex_stmt
-%type <[]string> opt_incremental
+%type <[]string> opt_incremental incremental
%type <tree.KVOption> kv_option
-%type <[]tree.KVOption> kv_option_list opt_with_options var_set_list
+%type <[]tree.KVOption> kv_option_list opt_with_options kv_with var_set_list
%type <str> import_format
%type <tree.StorageParam> storage_parameter
%type <[]tree.StorageParam> storage_parameter_list opt_table_with
@@ -2014,9 +2014,9 @@ alter_attribute_action:
//
// %SeeAlso: RESTORE, WEBDOCS/backup.html
backup_stmt:
- BACKUP TO partitioned_backup opt_as_of_clause opt_incremental opt_with_options
+ BACKUP TO partitioned_backup opt_bulk_io_options
{
- $$.val = &tree.Backup{DescriptorCoverage: tree.AllDescriptors, To: $3.partitionedBackup(), IncrementalFrom: $5.exprs(), AsOf: $4.asOfClause(), Options: $6.kvOptions()}
+ $$.val = &tree.Backup{DescriptorCoverage: tree.AllDescriptors, To: $3.partitionedBackup()}
}
| BACKUP targets TO partitioned_backup opt_as_of_clause opt_incremental opt_with_options
{
@@ -2185,6 +2185,23 @@ string_or_placeholder:
$$.val = p
}
+opt_bulk_io_options:
+ /* EMPTY */ {}
+| bulk_io_options {}
+
+bulk_io_options:
+ bulk_io_option {}
+| bulk_io_options bulk_io_option {}
+
+bulk_io_option:
+ as_of_clause {}
+| incremental {}
+| kv_with {}
+| detached {}
+
+detached:
+ DETACHED {}
+
string_or_placeholder_list:
string_or_placeholder
{
@@ -2196,15 +2213,18 @@ string_or_placeholder_list:
}
opt_incremental:
- INCREMENTAL FROM string_or_placeholder_list
- {
- $$.val = $3.exprs()
- }
+ incremental
| /* EMPTY */
{
$$.val = tree.Exprs(nil)
}
+incremental:
+ INCREMENTAL FROM string_or_placeholder_list
+ {
+ $$.val = $3.exprs()
+ }
+
kv_option:
name '=' string_or_placeholder
{
@@ -2234,6 +2254,10 @@ kv_option_list:
}
opt_with_options:
+ kv_with {}
+| /* EMPTY */ {}
+
+kv_with:
WITH kv_option_list
{
$$.val = $2.kvOptions()
@@ -2242,10 +2266,6 @@ opt_with_options:
{
$$.val = $4.kvOptions()
}
-| /* EMPTY */
- {
- $$.val = nil
- }
copy_from_stmt:
COPY table_name opt_column_list FROM STDIN opt_with_options
@@ -10241,6 +10261,7 @@ unreserved_keyword:
| DELETE
| DEFAULTS
| DEFERRED
+| DETACHED
| DISCARD
| DOMAIN
| DOUBLE

Note that I have left the yacc actions empty. With this change, all the orderings are now valid:
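The example list was truncated in this transcript; under the sketched grammar above, any interleaving of the bulk_io_option clauses would parse. Hypothetical examples (placeholder URIs):

```sql
BACKUP TO 'foo://bar' AS OF SYSTEM TIME '-10s' INCREMENTAL FROM 'foo://base' WITH revision_history DETACHED;
BACKUP TO 'foo://bar' DETACHED WITH revision_history AS OF SYSTEM TIME '-10s';
BACKUP TO 'foo://bar' WITH revision_history INCREMENTAL FROM 'foo://base' DETACHED;
```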
which IMHO is a big UX improvement.
I still prefer a WITH-based solution, with the caveat that we need to do what @dt suggests.
Sorry to resurrect this issue... What exactly have we agreed on? Change the WITH option so that it also takes "DETACHED"? My concern is that this suddenly and significantly expands the scope of this particular issue. Just the import statement alone has a few dozen options that would now need to be explicitly parsed and set inside tree structures (not to mention the expanded scope in proving that the newly parsed WITH options behave the same way as the old key/value based ones, plus the much larger code change to all of the bulk-y statement implementations). Or, just take @knz's diff above and try to make the DETACHED option more flexible (wrt its position)? Or, alternatively, go with what I have in #49172 and refactor/modify/improve later?
50723: sql: Switch BACKUP statement to use allowed options r=miretskiy a=miretskiy

Informs #47539

Modify the BACKUP statement's "WITH" clause to only accept allowed explicit options. This change replaces the "WITH" clause implementation that used to accept any set of key/value pairs with one that only accepts options allowed by the BACKUP statement. This change does not introduce any new options. However, it does introduce two changes in how we parse those options:
1. We now check to make sure that each option is specified only once, and produce an error message if the option is specified multiple times (which is a mistake that could not have been detected prior to this change).
2. We now disallow quoted option names.

Release Notes (usability): Improve WITH option parsing for BACKUP. When a BACKUP statement is used incorrectly, with the same option specified multiple times inside the WITH clause, we now produce a useful error message. In addition, we now disallow quoted option names. This is a backward-incompatible change.

50752: cloud: update orchestrator configs to point to v20.1.3 image r=jlinder a=asubiotto

Release note: None (orchestrator config change)

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
50724: sql: Add "DETACHED" option to the list of allowed BACKUP options r=miretskiy a=miretskiy

Informs #47539

Add the `DETACHED` option to the list of allowed BACKUP options. This option requests that the backup run in detached mode: that is, the backup statement should not block, and instead simply return the job id.

Release Notes: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
50775: bulkio: Support running BACKUP under transaction. r=miretskiy a=miretskiy

Informs #47539

Allow execution of a BACKUP statement under an explicit transaction, provided the "DETACHED" option is specified:
```
BACKUP TO <location_uri> ... WITH DETACHED
```
When a backup runs in `detached` mode, we do not wait for job completion. Instead, we return the job id to the caller, and the job executes asynchronously.

Release notes (enterprise change): BACKUP can now run in "detached" mode. That is, we do not wait for the BACKUP job to finish; instead, the job ID is returned immediately and the job itself runs in the background.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
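Putting the merged pieces together, a usage sketch of the final behavior (the URI and job id are placeholders, not values from this thread):

```sql
BEGIN;
-- with DETACHED, BACKUP returns the job id immediately instead of blocking
BACKUP TO 'nodelocal://1/backups/foo' WITH detached;
COMMIT;

-- after commit, inspect the job to get the actual backup results
SHOW JOB <job_id>;
```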
Closed by #50775.
Currently BACKUP executes, including running the actual job and its side-effects of writing files to external storage, during the execution of the statement. This differs slightly from most OLTP statements, where side-effects (e.g. writing rows) are not observable outside the txn until it commits -- i.e. BACKUP lacks transactional isolation. This is why the BACKUP statement currently refuses to run inside an explicitly started transaction: the explicit transaction implies we want transactional isolation, but we can't provide that.
In theory though, the backup could be planned inside a transaction that then wrote a row to the jobs table, and then once that transaction committed, the job would become runnable and be executed. Doing this will take some technical legwork to ensure we use the planner txn and move all the external storage writes to the job itself, which is somewhat debatable: right now we like to confirm the destination can be written to before scheduling a job that writes to it -- do we want to lose that? Are we okay with a slightly leaky transaction?
Even more fundamental though is that the backup needs to actually run in order to compute things like rows backed up, bytes, etc., and those are currently returned as the result rows from the BACKUP statement. Backups planned in transactions will be unable to return such rows, so we'll only be able to say OK or return the job ID inside the txn, and when it commits, our chance to send results has passed. We can use SHOW JOB or something to get the actual results.