Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #927, #928: Schema refresh improvements. #930

Merged
merged 1 commit into from
Apr 3, 2019

Conversation

emtwo
Copy link

@emtwo emtwo commented Apr 3, 2019

This PR makes a variety of changes to fix bugs and more gracefully handle schema processing for data sources with many tables. A summary:

  • Batch insert and update of column and table metadata
  • Time limit on schema refresh increased to a soft limit of 5 min and hard limit of 10
  • Column types are truncated to fit 255 character limit
  • Fixed bug with str() function on TableMetadata accessing a non-existent variable
  • Refresh schema every 12 hours instead of 30 min
  • Check whether we should collect data samples before we spin off tasks to do so
  • When a user requests a refresh, do it asynchronously
  • Don’t force a schema refresh when there is no schema data to display (this clogs up the queue), just wait for the next refresh
  • Changes to data sources (in data_sources.py) should trigger a refresh for the changed data source only (instead of all data sources)

@emtwo emtwo requested a review from jezdez April 3, 2019 16:35
Copy link

@washort washort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice improvement, r+

@@ -46,7 +46,7 @@ def all_settings():
QUERY_RESULTS_CLEANUP_COUNT = int(os.environ.get("REDASH_QUERY_RESULTS_CLEANUP_COUNT", "100"))
QUERY_RESULTS_CLEANUP_MAX_AGE = int(os.environ.get("REDASH_QUERY_RESULTS_CLEANUP_MAX_AGE", "7"))

SCHEMAS_REFRESH_SCHEDULE = int(os.environ.get("REDASH_SCHEMAS_REFRESH_SCHEDULE", 30))
SCHEMAS_REFRESH_SCHEDULE = int(os.environ.get("REDASH_SCHEMAS_REFRESH_SCHEDULE", 720))
SCHEMAS_REFRESH_QUEUE = os.environ.get("REDASH_SCHEMAS_REFRESH_QUEUE", "celery")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the faster schema refresh do we still want this longer interval?

Copy link
Author

@emtwo emtwo Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary at the moment, but I'm being cautious because the column examples are still a bit slow since they do select * from table limit 1 for each table. We can always turn it back up if it seems too slow I think.

redash/tasks/queries.py Outdated Show resolved Hide resolved
@emtwo emtwo force-pushed the emtwo/fix_schema_bugs branch from 527eddc to 01ef94d Compare April 3, 2019 22:50
@emtwo emtwo force-pushed the emtwo/fix_schema_bugs branch from 01ef94d to f7c12a2 Compare April 3, 2019 22:52
@emtwo emtwo merged commit 1662f47 into master Apr 3, 2019
@emtwo emtwo deleted the emtwo/fix_schema_bugs branch April 3, 2019 23:05
emtwo pushed a commit that referenced this pull request Apr 4, 2019
emtwo pushed a commit that referenced this pull request Apr 4, 2019
emtwo pushed a commit that referenced this pull request Apr 4, 2019
@jezdez jezdez mentioned this pull request May 13, 2019
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants