
Fix #8 - Migration script to backup database file and restart the server #9

Closed
wants to merge 3 commits

Conversation

deepthivenkat
Member


r? @miketaylr @karlcow

# check if a previous version of a backup exists and delete it
print 'Backup database already exists. Removing older version of the backup'
call('rm -r backup_db', shell=True)
call('git rm -r backup_db/.', shell=True)


@@ -47,7 +60,7 @@ def main():
# stuff data into database..
for bug in data['bugs']:
db_session.add(
-            Issue(bug['id'], bug['summary'], bug['url'], bug['body']))
+            Issue(bug['id'], bug['summary'], bug['url'], extract_domain_name(bug['url']), bug['body'], bug['state'], bug['creation_time'], bug['last_change_time']))
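
(For reference, a helper along these lines could back the new domain column; only the name extract_domain_name comes from the diff, the body below is a sketch, not the PR's actual code:)

# Sketch only: the real extract_domain_name in the PR may differ.
from urlparse import urlparse  # Python 2, matching the print-statement style elsewhere in this script

def extract_domain_name(url):
    """Return the host part of a bug URL, e.g. 'example.com' for
    'http://example.com/page' (assumed behaviour)."""
    return urlparse(url).netloc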


@miketaylr
Member

@deepthivenkat can you summarize the strategy and design here, before addressing review comments? Like, what is the main entry point, by whom and when is this script run, and what are the expected outcomes?

Right now it just looks like it copies issues.db and then puts fresh data into a new issues.db. Why are we doing this? Presumably something to do with migrating db schemas -- but I don't see any code to handle the actual schema changes.

I'm also concerned about the idea of creating and populating a new db between killing and starting the app. What if GitHub is down, or slow? Or what happens when there are 10,000 or 100,000 issues? Do we hit API limits? Does this take 2 or 10 minutes, and does that mean the site is down until it's finished?

@miketaylr
Member

I also have concerns generally about basing this on extract_id_title_url.py -- it was written by Hallvord for a different project. Right now it's not up to date (for example, it doesn't know about needstriage, and it only collects info about Firefox-related bugs).

Have you verified that the database created from the extract_data method matches what's on GitHub?

@deepthivenkat
Member Author

@miketaylr As you mentioned on IRC, a few lines in extract_id_title_url.py, like

https://github.com/webcompat/issue_parser/blob/master/extract_id_title_url.py#L84

are currently unused. Should there be another script for extracting issues specifically for webcompat inside the webcompat.com root folder?

@deepthivenkat
Member Author

deepthivenkat commented Jul 27, 2016

Summarising the strategy:

The script would be run by the person handling the fabfile for deployment. It will update the db schema, create a new data dump for that schema, and back up the old db inside a .gitignored folder in the webcompat root repository.

The entry point would be to use one of the following tools to handle the schema changes:

  1. SQLAlchemy
  2. Alembic

If I choose the second option, I will be able to use flask-migrate, which runs on alembic.
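
A rough sketch of the flask-migrate wiring (this assumes a Flask-SQLAlchemy db object, which webcompat.com doesn't have today, so it is illustrative only):

# Sketch only: assumes Flask-SQLAlchemy, not the plain SQLAlchemy setup webcompat.com uses now.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flask_migrate import Migrate

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///issues.db'
db = SQLAlchemy(app)
migrate = Migrate(app, db)

# Schema changes would then be generated and applied with roughly:
#   flask db init      (once)
#   flask db migrate -m "add state/creation_time/last_change_time columns"
#   flask db upgrade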

The GitHub API rate limit is 5000 requests per hour for authenticated users, and it is linked to the GitHub account. The number of issues we currently have is already about half of that maximum, so if we run extract_id_title_url.py a couple of times we will end up with a 403 status code.

So it is not very viable to kill the app while populating the new db.

Suggestions:

We can start the app as soon as the db schema migration script (yet to be written using flask-migrate) has finished running.

While we make the db schema changes, the db will be locked. Once the schema change is done, the webhooks we have in webhooks/__init__.py may try to insert a newly opened issue, which is ok.

Solutions:

  1. Extract the issues from GitHub before killing the app, and handle only dump_to_db after killing the app (see the sketch below).
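
Roughly, the split could look like this, treating extract_data and dump_to_db as the functions in extract_id_title_url.py (their exact signatures here are guesses):

import json

# Phase 1: while the app is still running, pull everything from GitHub
# and save it to disk so no API calls happen during downtime.
issues = extract_data()               # assumed to return the parsed issue list
with open('issues_dump.json', 'w') as f:
    json.dump(issues, f)

# Phase 2: stop the app, back up issues.db, then only replay the saved
# dump into the new database before restarting.
with open('issues_dump.json') as f:
    dump_to_db(json.load(f))          # assumed to take the parsed issue list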

Suggestions for extracting issues:

  1. We can make use of Github-Backup. I read the following from their documentation:

Bear in mind that this uses the GitHub API; don't run it every 5 minutes. GitHub rate limits the API to some small number of requests per hour when used without authentication. To avoid this limit, you can set GITHUB_USER and GITHUB_PASSWORD (or GITHUB_OAUTH_TOKEN obtained from https://github.com/settings/tokens) in the environment and it will log in when making (most) API requests.

Anyway, github-backup does do an incremental backup, picking up where it left off, so will complete the backup eventually even if it's rate limited.

  2. Write a new extract_dump_issues Python script to retrieve issues and dump them to the db as and when the issues are retrieved. We can rely on

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Reset: 1350085394

and sleep until the X-RateLimit-Reset time before retrieving issues again (a small sketch follows).
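
Something like this, using the requests library (function and parameter names are just illustrative):

import time
import requests

def get_with_rate_limit(url, token):
    # One authenticated GET; if the quota is exhausted, sleep until the
    # X-RateLimit-Reset timestamp (epoch seconds) and try once more.
    headers = {'Authorization': 'token {0}'.format(token)}
    resp = requests.get(url, headers=headers)
    if int(resp.headers.get('X-RateLimit-Remaining', 1)) == 0:
        reset_at = int(resp.headers.get('X-RateLimit-Reset', time.time()))
        time.sleep(max(reset_at - time.time(), 0) + 1)
        resp = requests.get(url, headers=headers)
    return resp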

If webhooks/__init__.py tries to insert a newly created issue while the db dump is happening in the extract_dump_issues Python script, the db will be locked for editing, which will throw:

sqlalchemy.exc.OperationalError: (OperationalError) database is locked

We can edit the webhook code to handle this error and retry the insert after a short delay (see the sketch below).
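
For example (names are illustrative, not existing webcompat code):

import time
from sqlalchemy.exc import OperationalError

def add_issue_with_retry(db_session, issue, attempts=3, delay=5):
    # Retry the insert a few times if SQLite reports the db is locked.
    for _ in range(attempts):
        try:
            db_session.add(issue)
            db_session.commit()
            return True
        except OperationalError:
            db_session.rollback()
            time.sleep(delay)
    return False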

We can also move this discussion and these files to the webcompat repository and leave issue_parser in peace ^_^

@miketaylr @karlcow

@miketaylr
Member

> The script will update the db schema

This current PR doesn't touch any db schema though, right? (Or am I missing something)?

> backup old db inside a .gitignored folder in webcompat root repository

I think we should start here. Doing a "back-up" is simple, it's literally copying a file and moving it somewhere. But we also want some mechanism to bootstrap a new database with current GitHub information.
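
For example, the back-up step could be as small as this (paths here are just placeholders):

import os
import shutil
import time

def backup_db(db_path='issues.db', backup_dir='backup_db'):
    # Copy the current database into a .gitignored folder, keeping a
    # timestamp so older backups are not clobbered.
    if not os.path.exists(backup_dir):
        os.makedirs(backup_dir)
    dest = os.path.join(backup_dir, 'issues-{0}.db'.format(int(time.time())))
    shutil.copy2(db_path, dest)
    return dest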

Right now, the issues.db is 100% unused (and doesn't even contain 100% of the issues). If we made any schema changes today and re-built the DB, nothing would break. So I think it's probably too early to add the complexity of alembic, etc. when we don't need it yet.

> Write a new extract_dump_issues python script to retrieve issues and dump to db as and when the issues are retrieved

I think this is what we should do first. It should live in the webcompat.com repo as well -- this issue_parser repo is a side-project that was not written with our needs in mind. github-backup can be used for inspiration, but it shouldn't be too complicated to just query the GitHub API and pull the data we want.
