
Add link to scheduled pipeline #7536

Closed

wants to merge 11 commits

Conversation

betodealmeida
Member

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

This PR makes it possible to add a link from the scheduled query to the pipeline running it. The user can provide a URL template that is formatted with the query's attributes to produce a link to the corresponding pipeline (see the example in the updated docs).
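As a rough sketch of the mechanism described above (the helper name, template, and query fields here are illustrative assumptions, not the actual code in this PR): a user-provided URL template is filled in with the scheduled query's attributes via ordinary string formatting.

```python
def get_scheduler_link(url_template: str, query: dict) -> str:
    """Fill a user-provided URL template with the query's attributes.

    Illustrative helper; the PR's real helper functions may differ.
    """
    return url_template.format(**query)


# Hypothetical template pointing at an Airflow DAG named after the
# query id and output table (names are assumptions, not this PR's code).
template = (
    "https://airflow.example.com/admin/airflow/tree"
    "?dag_id=query_{id}_{output_table}"
)
print(get_scheduler_link(template, {"id": 42, "output_table": "revenue_daily"}))
# → https://airflow.example.com/admin/airflow/tree?dag_id=query_42_revenue_daily
```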

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

[Screenshot: Screen Shot 2019-05-16 at 10 28 18 PM]

I also added some CSS to remove the disabled + button in the form, so it looks better.

TEST PLAN

Tested locally, and added unit tests for the helper functions.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

REVIEWERS

@khtruong @DiggidyDave @datability-io

Alex Berghage and others added 4 commits May 15, 2019 15:32
) (apache#7518)

* [WIP] Live query validation, where supported

This builds on apache#7422 to add check-as-you-type SQL
query validation in SQL Lab. This closes apache#6707 too.

It adds a (debounced) call to the validate_sql_json
API endpoint with the querytext, and on Lyft infra is
able to return feedback to the user (end to end) in
$TBD seconds.

At present feedback is provided only through the
"annotations" mechanism built into ACE, although
I'd be open to adding full text elsewhere on the
page if there's interest.

* fix: Unbreak lints and tests
…#7517) (apache#7519)

This change makes the query progress bar only show
whole number percentage changes, instead of numbers
like 12.13168276%.
* Making Talisman configurable

* Fixing double quotes

* Fixing flake8

* Removing default
@codecov-io

codecov-io commented May 17, 2019

Codecov Report

Merging #7536 into lyft-develop will decrease coverage by <.01%.
The diff coverage is 45%.

Impacted file tree graph

@@               Coverage Diff                @@
##           lyft-develop    #7536      +/-   ##
================================================
- Coverage         65.19%   65.18%   -0.01%     
================================================
  Files               433      434       +1     
  Lines             21431    21446      +15     
  Branches           2362     2368       +6     
================================================
+ Hits              13971    13980       +9     
- Misses             7340     7346       +6     
  Partials            120      120
Impacted Files Coverage Δ
superset/views/sql_lab.py 85.71% <0%> (-1.25%) ⬇️
superset/assets/src/showSavedQuery/index.jsx 0% <0%> (ø) ⬆️
superset/assets/src/showSavedQuery/utils.js 100% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0acbb04...0702068.

@@ -906,6 +906,12 @@ To allow scheduled queries, add the following to your `config.py`:
'container': 'end_date',
},
],
# link to the scheduler; this example links to an Airflow pipeline
# that uses the query id and the output table as its name
'linkback': (
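(The hunk above is truncated by the review view. Purely for illustration, an entry of this shape might look like the following; the template and hostname are assumptions, not the text of the updated docs:)

```
'linkback': (
    'https://airflow.example.com/admin/airflow/tree'
    '?dag_id=query_{id}_{output_table}'
),
```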
Contributor

Is this not a circular dependency? Superset should not know anything about the scheduler (eg Airflow), should it? The scheduler knows about superset, and grabs work from a known endpoint, and neither the user nor superset system itself should actually care who is doing that work.

I think we should consider letting any arbitrary scheduler PUT back information (like a URL) about how to view its pipelines, or whatever representation it uses for the work it is doing.

Member Author

I agree it establishes a bi-directional connection, but Superset still doesn't know anything about Airflow with this (it's just an example config). The user is simply saying "when you show the scheduled information, put a link to this URL", and Superset does.

Contributor

But it does require the running instance of superset to have internal airflow details (via its configuration). This has a "correctness" problem IMO which could manifest as actual issues. It requires the configurator of superset to know (at deployment time?) who will be servicing these and what their URLs look like.

With this approach it is coupled such that it prevents the possibility of multiple systems being able to service these scheduled queries, or if the owners of those services decide to migrate them to a new system it will break the feature in superset. Imagine if the load was migrated partially to another internal system like flyte for example, this would unnecessarily cause us to have to do significant eng work to accommodate that (if it even can be accommodated at all), whereas if the servicer itself PUTs the URL to superset, we don't have any concerns or opinions about that at all, it will just work.

Member Author

But it does require the running instance of superset to have internal airflow details (via its configuration). This has a "correctness" problem IMO which could manifest as actual issues. It requires the configurator of superset to know (at deployment time?) who will be servicing these and what their URLs look like.

The SCHEDULED_QUERIES feature flag config is a way of informing Superset of the internals of a scheduler: it basically tells Superset what information is needed for a given scheduler. I don't see how the linkback is different from the information stored in the configuration, since the configuration is already scheduler-specific.

With this approach it is coupled such that it prevents the possibility of multiple systems being able to service these scheduled queries, or if the owners of those services decide to migrate them to a new system it will break the feature in superset. Imagine if the load was migrated partially to another internal system like flyte for example, this would unnecessarily cause us to have to do significant eng work to accommodate that (if it even can be accommodated at all), whereas if the servicer itself PUTs the URL to superset, we don't have any concerns or opinions about that at all, it will just work.

Migrating to a new scheduler would most probably require updating the extra_json in all the existing queries, in addition to updating the SCHEDULED_QUERIES config, so significant engineering work would already be expected.

And while I agree that having the consumers PUT the URL would be nice because it could support multiple schedulers (and we'd get the information from the system that knows more about it), I don't think it's a scenario likely to happen in practice.

I'm also worried about PUTting the URL because in order for the consumer to update the scheduled query with the pipeline URL it needs to impersonate the user, opening a backdoor for running arbitrary queries in the user's name. And technically it could also result in race conditions, but I think that's an unlikely scenario.
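The PUT-back alternative debated above could be sketched roughly as follows. The endpoint path and payload are invented for illustration; no such API exists in this PR or in Superset:

```python
import json
from urllib.request import Request


def build_linkback_put(superset_base: str, query_id: int, pipeline_url: str) -> Request:
    """Build the PUT request a scheduler could send to report its pipeline URL.

    The endpoint path below is hypothetical, purely to illustrate the
    reviewer's proposal; it is not a real Superset API.
    """
    return Request(
        f"{superset_base}/api/saved_query/{query_id}/linkback",  # hypothetical endpoint
        data=json.dumps({"pipeline_url": pipeline_url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_linkback_put(
    "http://localhost:8088", 42, "https://airflow.example.com/tree?dag_id=query_42"
)
```

In this model the scheduler, which knows its own URLs, pushes the link after picking up the work, so Superset's config never encodes scheduler internals; the trade-off raised above is that the scheduler then needs credentials to write back to the saved query.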

betodealmeida and others added 3 commits May 17, 2019 17:30
* Validate start/end when scheduling queries

* Use chrono instead of Sugar
* Show scheduled queries

* Remove column

* Secure views

* Add import

* Fix unit tests

* Reuse existing db connection from view

* Remove unnecessary import
@betodealmeida
Member Author

@DiggidyDave is this good to go then?

@DiggidyDave
Contributor

👍

@betodealmeida
Member Author

Closing since I merged to master in #7584.

5 participants