Add link to scheduled pipeline #7536
Conversation
) (apache#7518)

* [WIP] Live query validation, where supported

  This builds on apache#7422 to add check-as-you-type SQL query validation to SQL Lab, and closes apache#6707 as well. It adds a (debounced) call to the validate_sql_json API endpoint with the query text; on Lyft infra this returns end-to-end feedback to the user in $TBD seconds. At present feedback is provided only through the "annotations" mechanism built into ACE, although I'd be open to adding full text elsewhere on the page if there's interest.

* fix: Unbreak lints and tests
…#7517) (apache#7519) This change makes the query progress bar only show whole number percentage changes, instead of numbers like 12.13168276%.
* Making Talisman configurable
* Fixing double quotes
* Fixing flake8
* Removing default
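As a rough illustration of the live-validation commit above, here is a minimal sketch of exercising the `validate_sql_json` endpoint directly. The host, form fields, and response shape are assumptions for illustration, not confirmed details of the endpoint's contract:

```python
# Minimal sketch: calling the SQL-validation endpoint mentioned in the
# live-validation commit. Host, form fields, and response shape are
# assumptions, not the endpoint's documented contract.
import requests

resp = requests.post(
    "https://superset.example.com/superset/validate_sql_json/",  # hypothetical host
    data={
        "sql": "SELECT name FROM users LIMIT 10",  # query text to validate
        "database_id": 1,  # assumed: which database to validate against
    },
    timeout=10,
)
# Assumed: the endpoint returns validation messages that Sql Lab renders
# via ACE's "annotations" mechanism.
print(resp.json())
```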
Codecov Report

@@            Coverage Diff              @@
##         lyft-develop    #7536    +/- ##
================================================
- Coverage       65.19%   65.18%    -0.01%
================================================
  Files             433      434        +1
  Lines           21431    21446       +15
  Branches         2362     2368        +6
================================================
+ Hits            13971    13980        +9
- Misses           7340     7346        +6
  Partials          120      120
================================================

Continue to review full report at Codecov.
@@ -906,6 +906,12 @@ To allow scheduled queries, add the following to your `config.py`:
            'container': 'end_date',
        },
    ],
    # link to the scheduler; this example links to an Airflow pipeline
    # that uses the query id and the output table as its name
    'linkback': (
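For readers skimming the truncated diff above, a fuller sketch of what the resulting `config.py` entry might look like follows. Everything beyond the lines shown in the diff (the schema fields, the `FEATURE_FLAGS` nesting, and the placeholder syntax in the URL) is an assumption; the updated docs in the PR are authoritative.

```python
# Hypothetical, fleshed-out version of the config diff above; fields not
# visible in the diff are assumptions for illustration.
FEATURE_FLAGS = {
    'SCHEDULED_QUERIES': {
        'JSONSCHEMA': {
            'title': 'Schedule',  # assumed
            'properties': {
                'start_date': {'type': 'string', 'format': 'date-time'},
                'end_date': {'type': 'string', 'format': 'date-time'},
            },
        },
        # link to the scheduler; this example links to an Airflow pipeline
        # that uses the query id and the output table as its name
        'linkback': (
            'https://airflow.example.com/admin/airflow/tree?'
            'dag_id=query_{id}_{output_table}'  # placeholder syntax assumed
        ),
    },
}
```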
Is this not a circular dependency? Superset should not know anything about the scheduler (e.g. Airflow), should it? The scheduler knows about Superset and grabs work from a known endpoint; neither the user nor the Superset system itself should care who is doing that work.
I think we should consider letting any arbitrary scheduler PUT back information (like a URL) about how to view its pipelines, or whatever representation it uses for the work it is doing.
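To make the proposal concrete, here is a sketch of the decoupled flow being suggested, where the scheduler pushes its own link back to Superset. The endpoint path, payload shape, and auth scheme are all hypothetical; no such endpoint exists in this PR:

```python
# Hypothetical: an arbitrary scheduler PUTs its pipeline URL back to Superset
# after picking up a scheduled query. Endpoint path, payload, and auth are
# invented for illustration only.
import requests

requests.put(
    "https://superset.example.com/api/scheduled_query/42/linkback",  # hypothetical
    json={"pipeline_url": "https://airflow.example.com/tree?dag_id=query_42"},
    headers={"Authorization": "Bearer <scheduler-service-token>"},  # assumed auth
    timeout=10,
)
```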
I agree it establishes a bi-directional connection, but Superset still doesn't know anything about Airflow with this (it's just an example config). The user is simply saying "when you show the scheduled information, put a link to this URL", and Superset does.
But it does require the running instance of Superset to have internal Airflow details (via its configuration). This has a "correctness" problem IMO which could manifest as actual issues: it requires the configurator of Superset to know (at deployment time?) who will be servicing these queries and what their URLs look like.
With this approach the coupling prevents multiple systems from being able to service these scheduled queries, and if the owners of those services decide to migrate them to a new system, it will break the feature in Superset. Imagine if the load were partially migrated to another internal system like Flyte, for example: this would unnecessarily force significant engineering work to accommodate it (if it even can be accommodated at all), whereas if the servicer itself PUTs the URL to Superset, we have no concerns or opinions about that at all; it will just work.
> But it does require the running instance of Superset to have internal Airflow details (via its configuration). This has a "correctness" problem IMO which could manifest as actual issues: it requires the configurator of Superset to know (at deployment time?) who will be servicing these queries and what their URLs look like.
The `SCHEDULED_QUERIES` feature flag config is already a way of informing Superset of the internals of a scheduler: it basically tells Superset what information a given scheduler needs. I don't see how the linkback is different from the information stored in the configuration, since the configuration is already scheduler-specific.
> With this approach the coupling prevents multiple systems from being able to service these scheduled queries, and if the owners of those services decide to migrate them to a new system, it will break the feature in Superset. Imagine if the load were partially migrated to another internal system like Flyte, for example: this would unnecessarily force significant engineering work to accommodate it (if it even can be accommodated at all), whereas if the servicer itself PUTs the URL to Superset, we have no concerns or opinions about that at all; it will just work.
Migrating to a new scheduler would most probably require updating the `extra_json` in all the existing queries, in addition to updating the `SCHEDULED_QUERIES` config, so significant engineering work would already be expected.
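For context, a rough sketch of the kind of scheduler-specific metadata that might live in a query's `extra_json`. Only the start/end dates are implied by the diff and commits above; the other field names are assumptions:

```python
# Hypothetical shape of a scheduled query's extra_json. Only start_date /
# end_date are suggested by this PR; the remaining keys are assumed.
extra_json = {
    "schedule_info": {
        "start_date": "2019-05-01T00:00:00",
        "end_date": "2019-06-01T00:00:00",
        "schedule_interval": "@daily",  # assumed
        "output_table": "analytics.daily_rollup",  # assumed
    },
}
```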
And while I agree that having the consumers `PUT` the URL would be nice because it could support multiple schedulers (and we'd get the information from the system that knows more about it), I don't think it's a scenario likely to happen in practice.
I'm also worried about `PUT`ting the URL because, in order for the consumer to update the scheduled query with the pipeline URL, it needs to impersonate the user, opening a backdoor for running arbitrary queries in the user's name. Technically it could also result in race conditions, but I think that's an unlikely scenario.
* Validate start/end when scheduling queries
* Use chrono instead of Sugar
* Show scheduled queries
* Remove column
* Secure views
* Add import
* Fix unit tests
* Reuse existing db connection from view
* Remove unnecessary import
* feat: add header tooltip (apache#7531)
@DiggidyDave is this good to go then?
👍
Closing since I merged to master in #7584.
SUMMARY
This PR makes it possible to add a link from a scheduled query to the pipeline running it. The user can provide a URL template that gets formatted with the query's attributes to produce a link to the corresponding pipeline (see the example in the updated docs).
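A minimal sketch of the template-formatting idea described above; the helper name, placeholder syntax, and query attributes are assumptions, not the PR's actual helper functions:

```python
# Minimal sketch of expanding a user-supplied linkback template with a
# scheduled query's attributes. Helper name and placeholders are assumed.
from typing import Any, Dict


def build_linkback_url(template: str, query_attrs: Dict[str, Any]) -> str:
    """Fill the URL template with attributes of the scheduled query."""
    return template.format(**query_attrs)


url = build_linkback_url(
    "https://airflow.example.com/admin/airflow/tree?dag_id=query_{id}_{output_table}",
    {"id": 42, "output_table": "analytics.daily_rollup"},
)
# -> "...?dag_id=query_42_analytics.daily_rollup"
```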
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
I also added some CSS to remove the disabled `+` button in the form, so it looks better.
TEST PLAN
Tested locally, and added unit tests for the helper functions.
ADDITIONAL INFORMATION
REVIEWERS
@khtruong @DiggidyDave @datability-io