
Fix #8 - Migration script to backup database file and restart the server #9

Closed
wants to merge 3 commits

Conversation

deepthivenkat
Member


r? @miketaylr @karlcow

# check if a previous version of a backup exists and delete it
print 'Backup database already exists. Removing older version of the backup'
call('rm -r backup_db', shell=True)
call('git rm -r backup_db/.', shell=True)


@@ -47,7 +60,7 @@ def main():
# stuff data into database..
for bug in data['bugs']:
db_session.add(
-            Issue(bug['id'], bug['summary'], bug['url'], bug['body']))
+            Issue(bug['id'], bug['summary'], bug['url'], extract_domain_name(bug['url']), bug['body'], bug['state'], bug['creation_time'], bug['last_change_time']))
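
(For reference, a helper along these lines could back the new domain column; only the name extract_domain_name comes from the diff, the body below is a sketch, not the PR's actual code:)

# Sketch only: the real extract_domain_name in the PR may differ.
from urlparse import urlparse  # Python 2, matching the print-statement style elsewhere in this script

def extract_domain_name(url):
    """Return the host part of a bug URL, e.g. 'example.com' for
    'http://example.com/page' (assumed behaviour)."""
    return urlparse(url).netloc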


@miketaylr
Member

@deepthivenkat can you summarize the strategy and design here, before addressing review comments? Like, what is the main entry point, by whom and when is this script run, and what are the expected outcomes?

Right now it just looks like it copies issues.db and then puts fresh data into a new issues.db. Why are we doing this? Presumably something to do with migrating db schemas -- but I don't see any code to handle the actual schema changes.

I'm also concerned about the idea of creating and populating a new db between killing and starting the app. What if GitHub is down, or slow? Or what happens when there are 10,000 or 100,000 issues? Do we hit API limits? Does this take 2 or 10 minutes, and does that mean the site is down until it's finished?

@miketaylr
Member

I also have concerns generally about basing this on extract_id_title_url.py -- it was written by Hallvord for a different project. Right now it's not up to date (for example, it doesn't know about needstriage, and it only collects info about Firefox-related bugs).

Have you verified that the database created from the extract_data method matches what's on GitHub?

@deepthivenkat
Member Author

@miketaylr As you mentioned on IRC, a few lines in extract_id_title_url.py, like

https://github.com/webcompat/issue_parser/blob/master/extract_id_title_url.py#L84

are currently unused. Should there be another script for extracting issues specifically for webcompat inside the webcompat.com root folder?

@deepthivenkat
Member Author

deepthivenkat commented Jul 27, 2016

Summarising the strategy:

The script would be run by the person handling the fabfile for deployment. It will update the db schema, create a new data dump for that schema, and back up the old db inside a .gitignored folder in the webcompat root repository.

The entry point would be to use one of the following tools to handle the schema changes:

  1. SQLAlchemy
  2. Alembic

If I choose the second option, I will be able to use flask-migrate, which runs on alembic.
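
A rough sketch of the flask-migrate wiring (this assumes a Flask-SQLAlchemy db object, which webcompat.com doesn't have today, so it is illustrative only):

# Sketch only: assumes Flask-SQLAlchemy, not the plain SQLAlchemy setup webcompat.com uses now.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flask_migrate import Migrate

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///issues.db'
db = SQLAlchemy(app)
migrate = Migrate(app, db)

# Schema changes would then be generated and applied with roughly:
#   flask db init      (once)
#   flask db migrate -m "add state/creation_time/last_change_time columns"
#   flask db upgrade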

The GitHub API rate limit is 5000 requests per hour for authenticated users, and it is linked to the GitHub account. The number of issues we currently have is already about half of that maximum, so if we run extract_id_title_url.py a couple of times we will end up with a 403 status code.

So it is not very viable to kill the app while populating the new db.

Suggestions:

We can start the app as soon as the db schema migration script (yet to be written using flask-migrate) has finished running.

While we make the db schema changes, the db will be locked. Once the schema change is done, the webhooks we have in webhooks/__init__.py may try to insert a newly opened issue, which is ok.

Solutions:

  1. Extract the issues from GitHub before killing the app, and handle only dump_to_db after killing the app (see the sketch below).
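
Roughly, the split could look like this, treating extract_data and dump_to_db as the functions in extract_id_title_url.py (their exact signatures here are guesses):

import json

# Phase 1: while the app is still running, pull everything from GitHub
# and save it to disk so no API calls happen during downtime.
issues = extract_data()               # assumed to return the parsed issue list
with open('issues_dump.json', 'w') as f:
    json.dump(issues, f)

# Phase 2: stop the app, back up issues.db, then only replay the saved
# dump into the new database before restarting.
with open('issues_dump.json') as f:
    dump_to_db(json.load(f))          # assumed to take the parsed issue list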

Suggestions for extracting issues:

  1. We can make use of Github-Backup. I read the following from their documentation:

Bear in mind that this uses the GitHub API; don't run it every 5 minutes. GitHub rate limits the API to some small number of requests per hour when used without authentication. To avoid this limit, you can set GITHUB_USER and GITHUB_PASSWORD (or GITHUB_OAUTH_TOKEN obtained from https://github.com/settings/tokens) in the environment and it will log in when making (most) API requests.

Anyway, github-backup does do an incremental backup, picking up where it left off, so will complete the backup eventually even if it's rate limited.

  2. Write a new extract_dump_issues Python script to retrieve issues and dump them to the db as and when the issues are retrieved. We can rely on

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Reset: 1350085394

and sleep until the X-RateLimit-Reset time before retrieving issues again (a small sketch follows).
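
Something like this, using the requests library (function and parameter names are just illustrative):

import time
import requests

def get_with_rate_limit(url, token):
    # One authenticated GET; if the quota is exhausted, sleep until the
    # X-RateLimit-Reset timestamp (epoch seconds) and try once more.
    headers = {'Authorization': 'token {0}'.format(token)}
    resp = requests.get(url, headers=headers)
    if int(resp.headers.get('X-RateLimit-Remaining', 1)) == 0:
        reset_at = int(resp.headers.get('X-RateLimit-Reset', time.time()))
        time.sleep(max(reset_at - time.time(), 0) + 1)
        resp = requests.get(url, headers=headers)
    return resp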

If webhooks/__init__.py tries to insert a newly created issue while the db dump is happening in the extract_dump_issues Python script, the db will be locked for editing, which will throw:

sqlalchemy.exc.OperationalError: (OperationalError) database is locked

We can edit the webhook code to handle this error and retry the insert after a short delay (see the sketch below).
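
For example (names are illustrative, not existing webcompat code):

import time
from sqlalchemy.exc import OperationalError

def add_issue_with_retry(db_session, issue, attempts=3, delay=5):
    # Retry the insert a few times if SQLite reports the db is locked.
    for _ in range(attempts):
        try:
            db_session.add(issue)
            db_session.commit()
            return True
        except OperationalError:
            db_session.rollback()
            time.sleep(delay)
    return False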

We can also move this discussion and these files to the webcompat repository and leave issue_parser in peace ^_^

@miketaylr @karlcow

@miketaylr
Member

> The script will update the db schema

This current PR doesn't touch any db schema though, right? (Or am I missing something)?

> backup old db inside a .gitignored folder in webcompat root repository

I think we should start here. Doing a "back-up" is simple, it's literally copying a file and moving it somewhere. But we also want some mechanism to bootstrap a new database with current GitHub information.
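
For example, the back-up step could be as small as this (paths here are just placeholders):

import os
import shutil
import time

def backup_db(db_path='issues.db', backup_dir='backup_db'):
    # Copy the current database into a .gitignored folder, keeping a
    # timestamp so older backups are not clobbered.
    if not os.path.exists(backup_dir):
        os.makedirs(backup_dir)
    dest = os.path.join(backup_dir, 'issues-{0}.db'.format(int(time.time())))
    shutil.copy2(db_path, dest)
    return dest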

Right now, the issues.db is 100% unused (and doesn't even contain 100% of the issues). If we made any schema changes today and re-built the DB, nothing would break. So I think it's probably too early to add the complexity of alembic, etc. when we don't need it yet.

> Write a new extract_dump_issues python script to retrieve issues and dump to db as and when the issues are retrieved

I think this is what we should do first. It should live in the webcompat.com repo as well -- this issue_parser repo is a side-project that was not written with our needs in mind. github-backup can be used for inspiration, but it shouldn't be too complicated to just query the GitHub API and pull the data we want.
