Feature: Returning informative exit codes #1409
Comments
I've updated the above so that each error type is grouped together with similar errors. And I have added specific notes about the actions Meltano (or another orchestrator) might take on seeing a given code. In regards to monitoring/reporting connector quality, "Group F" and "Group G" are the ones we'd watch for. We'd generally also want to watch for Group D (configuration errors) as a sign of poor or outdated documentation.
@aaronsteers I made https://github.com/meltano/internal-product/issues/187 to track the meta requirements
@tayloramurphy - sounds good! I've added to the office hours board to collect ideas. Spec-wise, and in terms of defining the path forward from here: The last piece I'm not sure of would be which specific codes to use. We could try to find precedent of integer codes used already in prior art, or we could just start fresh and declare a new domain of custom return code integer values.
I've updated the issue description to include a set of proposed (and tentative) exit code integers and reserved ranges. Feedback and counter-suggestions much appreciated.
Short comment to say I like this a lot! One situation we would like to report back is when the tap runs out of quota. I'm not sure which code above I'd use for this specific situation (maybe a custom one for the specific tap, though I feel like this use-case might be general enough to consider adding it to the sdk itself?).
As a way of communicating back to the orchestrator, it would be helpful to have 10-15 predefined exit codes for common failure scenarios.
Guidelines and best practices for exit codes
From https://tldp.org/LDP/abs/html/exitcodes.html:
From https://unix.stackexchange.com/a/604262/487180:
Therefore, a foundational strategy could be:
Grouping of suggested return types, by category and remediation path
Each of these categories and sub-items could have a distinct return code so that the caller can understand what happened during the requested operation:
Code 0:
- 0: Success.
Code 3:
- 3: No-op sync.
- enable_noop_exit_code: True to return a non-0 exit code for no-op sync operations, False to return 0. Default is False (no-op sync tracked as success).
Codes 4-9:
- FULL_TABLE syncs were fully completed, or one or more INCREMENTAL state messages were successfully delivered as resumable bookmarks.
- ... FULL_TABLE streams and reached at least one resumable bookmark for an INCREMENTAL stream. (Full table syncs may need to be ordered first by the tap to prioritize a partial success status. Not yet implemented in the SDK.)
- ... STATE to resume sync on the same stream where the previous sync left off. (Not yet implemented in the SDK.)
- 4: SIGTERM/KeyboardInterrupt received and sync operation was wrapped up successfully; sync is resumable.
- 5: Max record volume limit reached; sync is resumable. (Additional records available on source.)
- 6: Max elapsed time limit reached; sync is resumable. (Additional records available on source.)
- 7-8: Reserved for future use.
- 9: General/Other. (Additional records available on source.)
Codes 10-19, 130, 137:
- 10: Operation aborted due to elapsed time restriction.
- 11: Operation aborted due to record count restriction.
- 130: Operation aborted by SIGINT or KeyboardInterrupt (Control+C).
- 137: Operation aborted by SIGKILL.
- 12-18: Reserved for future use.
- 19: General/Other.
Codes 20-29 (Configuration Errors):
- 20: Config validation error: missing required value.
- 21: Config validation error: data type mismatch.
- 22: Config validation error: validation failed (other).
- 23: Authentication or authorization error. (Permission denied, password incorrect, etc.)
- 24: Invalid input file paths. (For instance, the config.json or catalog.json do not exist or cannot be reached.)
- 25-28: Reserved for future use.
- 29: General/Other.
Codes 30-39:
- 30: Out of memory.
- 31: Out of disk space.
- 32: Network issue or host-not-found.
- 33: File not found.
- 34: File not writeable.
- 35-38: Reserved for future use.
- 39: General/Other.
Codes 1, 40-59, 141 (Connector Failure):
- 40: Singer Spec error in STDIN stream or input files.
- 41-48: Reserved for future use.
- 49 or 1: General/Other. Connector experienced unhandled exception.
- 50: Misshapen data from source system or source data failed validation.
- 51: Source data processing error.
- 141: Target stopped listening ("Broken pipe").
- 52-54: Reserved for future use.
- 55: Data validation error in input stream.
- 56: Data processing error.
- 57-59: Reserved for future use.
Codes 60-79 (Application Failures):
- 60-69: Application Failures (Retriable). These are likely to succeed if retried later.
- 70-79: Application Failures (Non-Retriable). These are unlikely to succeed unless action is taken by the user.
For purposes of monitoring and reporting the quality and stability of taps and targets, really only the "Connector Failure" codes are relevant here. The "Configuration Errors" category might also be a sign of poor or outdated docs. Assuming the other errors are correctly raised, all other issue groups are user errors, OS/container issues, or networking issues.
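As a rough illustration of how the tentative ranges above could be expressed in code, here is a minimal Python sketch. The `ExitCode` member names, the `category` helper, and the category labels "aborted" and "system or environment error" are assumptions made for this example; they are not part of the SDK or of the proposal itself. Only a subset of the proposed codes is shown.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """A subset of the proposed codes. Member names are hypothetical."""

    SUCCESS = 0
    UNHANDLED_EXCEPTION = 1
    NOOP_SYNC = 3
    PARTIAL_TIME_LIMIT = 6
    CONFIG_MISSING_VALUE = 20
    AUTH_ERROR = 23
    OUT_OF_MEMORY = 30
    SINGER_SPEC_ERROR = 40
    ABORTED_SIGINT = 130
    BROKEN_PIPE = 141


def category(code: int) -> str:
    """Map an exit code to the proposed range it falls in (labels are illustrative)."""
    if code == 0:
        return "success"
    if code == 3:
        return "no-op"
    if 4 <= code <= 9:
        return "partial success"
    if 10 <= code <= 19 or code in (130, 137):
        return "aborted"
    if 20 <= code <= 29:
        return "configuration error"
    if 30 <= code <= 39:
        return "system or environment error"
    if code == 1 or 40 <= code <= 59 or code == 141:
        return "connector failure"
    if 60 <= code <= 69:
        return "application failure (retriable)"
    if 70 <= code <= 79:
        return "application failure (non-retriable)"
    return "unknown"
```

Because `IntEnum` members compare as plain integers, a caller could pass either `ExitCode.AUTH_ERROR` or the raw `23` to `category`.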
Why do we need this?
Today orchestrators like Meltano have no way to distinguish what actually happened if a subprocess fails - except for a human to manually read over the detailed log files. By adding this into the SDK, the return code of the subprocess would immediately tell Meltano how to advise the user on next steps. Other orchestrators like Airflow could also incorporate these return codes when deciding whether to attempt a retry, and how to message back to users on next steps.
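To make the orchestrator side concrete, here is a minimal sketch of how a caller might branch on a subprocess return code. The `advise` and `run_and_advise` functions and the advice strings are hypothetical, and the mapping only covers ranges whose handling is stated in the proposal (partial success is resumable, 60-69 is retriable); it is not how Meltano or Airflow behave today.

```python
import subprocess
from typing import Sequence


def advise(returncode: int) -> str:
    """Suggest a next step for a given exit code (illustrative mapping only)."""
    if returncode == 0:
        return "done"
    if 4 <= returncode <= 9:
        # Partial success: the proposal describes these syncs as resumable.
        return "resume later"
    if 20 <= returncode <= 29:
        return "fix configuration"
    if 60 <= returncode <= 69:
        # Retriable application failures.
        return "retry"
    return "inspect logs"


def run_and_advise(cmd: Sequence[str]) -> str:
    """Run a tap or target as a subprocess and translate its exit code."""
    result = subprocess.run(cmd, check=False)
    return advise(result.returncode)
```

A caller would pass the full command line for the connector, for example `run_and_advise(["tap-example", "--config", "config.json"])`, where `tap-example` is a placeholder name.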
Regarding "partial success" codes
All of the partial-success codes discussed here should probably have a config option that lets them return 0 if the caller doesn't care about one or all of the detailed status codes.
There are use cases where we want to open up the idea of "partial" success, but it is important to tell the caller of the process what actually happened that made the sync not a "full" success.
For instance, if running in Lambda, we will need an execution time limit. At the end of that time limit (most likely provided in config.json), we'll expect the tap to try to wrap things up and close out its processes. Its return value in these cases should be 0 if all upstream records were successfully received within the window, or something non-zero if more records were available but not synced.
An orchestrator like Meltano will also want to know the difference between "Sync complete" and "Sync complete (no data found)". By providing a non-zero return code for the "no data found" case, we let Meltano message this properly to the user, rather than only being able to provide a simple "sync completed" message.
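The decision described in these paragraphs could be sketched roughly as follows. `enable_noop_exit_code` is the option named in the proposal; `suppress_partial_exit_codes` and the `resolve_exit_code` function are hypothetical names invented for this example, and the codes 3 and 6 are the tentative values from the issue description.

```python
def resolve_exit_code(records_synced: int, more_available: bool, config: dict) -> int:
    """Pick an exit code at the end of a time-limited sync (a sketch, not SDK code)."""
    if more_available:
        # The time limit was reached with records still on the source:
        # report partial success (6) unless the caller opted out.
        if config.get("suppress_partial_exit_codes", False):  # hypothetical option
            return 0
        return 6
    if records_synced == 0 and config.get("enable_noop_exit_code", False):
        # Sync completed but found no data, and the caller asked to be told.
        return 3
    return 0
```

With both options left at their defaults, a no-op sync still returns 0, which matches the proposal's default of tracking a no-op sync as success.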
Precedent and existing return code conventions
Below is a subset of Linux return codes found with some googling. We don't need to use these integer codes, and we don't need to keep these categories. They are listed here for discussion and inspiration.