Discussion for Issue database needs a migration script. #1131

Closed · 1 of 6 tasks

deepthivenkat opened this issue Jul 14, 2016 · 15 comments
Comments


deepthivenkat commented Jul 14, 2016

I have some questions about Issue #968 - writing a migration script for the issues.db backup.

Issue #968 is a blocker for Issue #865 - backend implementation of detecting bugs with a similar domain.

The subtasks for the script, as mentioned by @karlcow in Issue #968, are:

  • Stop webcompat.com
  • Backup the existing DB file.
  • Delete the DB.
  • "Dump to DB": parse GitHub (pagination involved, probably many HTTP requests; see the sketch after this list):
    • Check the HTTP status code for each request
    • Grab the pagination link
    • Check that the body contains the required information
    • Report any issues during the dump to DB.
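
For the "dump to DB" step, here is a minimal sketch of what the paginated fetch could look like, assuming the public GitHub issues API and the requests library; the function and variable names are illustrative, not taken from the actual script.

```python
# Hypothetical sketch of the "dump to DB" fetch loop; not the actual script.
import requests

def fetch_all_issues(repo="webcompat/web-bugs"):
    """Fetch every issue, page by page, following the pagination links."""
    url = "https://api.github.com/repos/%s/issues?state=all&per_page=100" % repo
    issues = []
    while url:
        response = requests.get(url)
        # Check the HTTP status code for each request.
        if response.status_code != 200:
            raise RuntimeError("GitHub returned %s for %s"
                               % (response.status_code, url))
        body = response.json()
        # Check that the body contains the required information.
        if not all("number" in issue and "title" in issue for issue in body):
            raise ValueError("Unexpected payload shape at %s" % url)
        issues.extend(body)
        # Grab the pagination link (requests parses the Link header for us).
        url = response.links.get("next", {}).get("url")
    return issues
```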

Currently, there are two scripts in the Issue-parser repository:

  1. extract_id_title_url.py - this script reads the set of all open issues from web-bugs and creates two files:
    a) webcompatdata-bzlike.json
    b) webcompatdata.csv

Right now these files contain only 254 issues, the number of issues that were open in the web-bugs repository when I ran extract_id_title_url.py.

  2. dump_webcompat_to_db.py - this script adds the contents of webcompatdata-bzlike.json to the webcompat_issues table in the issues.db file.

Additionally, I am adding a few more subtasks:

  • Modify the dump-to-db script to read issues irrespective of status (currently only open issues are extracted)
  • Add a separate entry for status in issues.db and update it when the issue status changes

Questions to be discussed:

  1. Should the data being backed up be the webcompat_issues table or the whole issues.db file?
  2. Can I use libraries like:
    - rdiff-backup
    - bup
    If we make use of these, the backup data may not be saved on the webcompat.com server. We need to investigate this.
  3. Right now, I plan to append a timestamp to the backup file every time the script is run. We could set up a cron job to take a backup periodically (a sample crontab entry follows this list). Why should we stop webcompat.com? Wouldn't the issues reported during a backup be extracted the next time the backup is taken? Does running a cron job defeat the purpose of the migration script? I am not sure.
  4. Should I also include code for creating a local backup copy on the machine of the person running the migration script, in addition to the server copy?
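
For question 3, the cron idea could be as simple as the following crontab entry; the script path and schedule here are made up for illustration:

```
# Hypothetical crontab line: run the backup script every day at 03:00.
0 3 * * * /usr/bin/python /path/to/backup_issues_db.py
```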

Any ideas? @miketaylr @magsout @adamopenweb @karlcow


karlcow commented Jul 15, 2016

  • About 2: I don't think you need an incremental backup system. We don't intend to keep the DB for recovery in the future (or at least not yet; GitHub didn't kick us out).
  • About 3 and 4, the scope of the project:
    • Let's say we have issues.db, which has a record of our issues.
    • Each time a new issue is added through webcompat.com, it is also recorded in issues.db.
    • issues.db is used, for now, to do dynamic domain search.
    • If an issue is added through GitHub and not through webcompat.com, issues.db is incomplete. So we need a script to refresh issues.db with the latest data. It could be once a day.
    • During the process of mv issues_new.db issues.db, if the Flask app is trying to write to issues.db, we will get a mess (that's why I talked about a migration script; see the sketch below).
    • The dump of the db has to be done prior to the mv so we do not block for long.

but I'm open to suggestions.
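
One way to avoid that mess, sketched below under the assumption that issues_new.db is built completely first, is to swap the files in a single rename, which is atomic on a POSIX filesystem. Note that a connection the Flask app already holds would still point at the old file until it reconnects.

```python
# Illustrative sketch of the swap step; not the actual migration script.
import os

def swap_in_new_db(new_path="issues_new.db", live_path="issues.db"):
    # os.rename() replaces the target atomically on the same POSIX
    # filesystem, so no reader ever sees a half-written issues.db.
    os.rename(new_path, live_path)
```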

The columns for the table:

  • id
  • issue_number
  • issue_title
  • issue_status
  • issue_updated_date (aka last time it was updated, practical if we end up using the DB for feeds too)
  • issue_creation_date
  • domain_name

something else?
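
For illustration, those columns as a SQLite table created from Python; the column types are assumptions, nothing here is settled.

```python
# Sketch only: the proposed webcompat_issues columns as a SQLite schema.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS webcompat_issues (
    id INTEGER PRIMARY KEY,
    issue_number INTEGER UNIQUE,
    issue_title TEXT,
    issue_status TEXT,
    issue_updated_date TEXT,
    issue_creation_date TEXT,
    domain_name TEXT
)
"""

with sqlite3.connect("issues.db") as conn:
    conn.execute(SCHEMA)
```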

deepthivenkat added a commit to deepthivenkat/webcompat.com that referenced this issue Jul 20, 2016
deepthivenkat commented:

I plan to append a timestamp to the backup file every time the script is run. I want to store this backup db file in a git repository under github.com/webcompat, but I do not have permission to create a new repository.

@karlcow Is there an alternative way to store the backup file on the webcompat server?


karlcow commented Jul 21, 2016

@deepthivenkat What is the intent of keeping all versions of the DB, in the current circumstances of the project? My feeling is that it is too early, and it doesn't really work given that multiple people might play with this script.

@miketaylr ?


deepthivenkat commented Jul 21, 2016

One possible intent may be to track the changes in the db and have the option to retrieve any version of the backup available in the repo.

If we keep all versions, we end up making multiple copies of almost the same set of issues. If that is not necessary, should we keep only the latest version of the db? Is storing the latest version in a new GitHub repository ok?


deepthivenkat commented Jul 21, 2016

@karlcow @miketaylr Why do we need the 'status' column in the webcompat_issues table? Or do we need a column called issue_state instead?

The values of issue_state would be 'OPEN' and 'RESOLVED'.

miketaylr commented:

> If an issue is added through GitHub and not through webcompat.com, issues.db is incomplete. So we need a script to refresh issues.db with the latest data. It could be once a day.

(I don't think this is true: as long as the webcompat server is online, GitHub will send issue-creation hook payloads to us, even if the issues come from GitHub. But if the server is down, we'll miss them.)
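
For context, a minimal illustration of the mechanism Mike describes: GitHub POSTs an "issues" event payload to a webhook endpoint, which can record the new issue. The route and the save_issue helper below are invented for illustration; this is not webcompat.com's actual code.

```python
# Hypothetical webhook receiver; the route and save_issue() are invented
# for illustration and are not webcompat.com's actual implementation.
from flask import Flask, request

app = Flask(__name__)

def save_issue(number, title):
    """Stub: a real handler would upsert the issue into issues.db."""
    print("recording issue #%s: %s" % (number, title))

@app.route("/webhooks/issues", methods=["POST"])
def issues_hook():
    payload = request.get_json()
    # GitHub sends an "issues" event whenever an issue is opened,
    # even when it was filed directly on github.com.
    if payload and payload.get("action") == "opened":
        issue = payload["issue"]
        save_issue(issue["number"], issue["title"])
    return "", 204
```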

miketaylr commented:

> One possible intent may be to track the changes in the db and have the option to retrieve any version of the backup available in the repo.

I'm trying to think of why we would want this... is there a specific use case we're trying to address with this? Seems like just having the most recent one is good enough.

miketaylr commented:

> Is storing the latest version in a new GitHub repository ok?

It seems like putting database backups in a git repo is probably overkill. What if we just made a directory somewhere in the project root (which is .gitignore'd) and copied the backup file there?
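
That could be as small as one ignore rule plus a copy; the directory name backup_db/ here is just an example:

```
# .gitignore — keep local database backups out of version control
backup_db/
```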

deepthivenkat commented:

There is no specific use case for having all the versions of the backup db file.

Cool! I will just add the recent one to a folder like:

https://github.com/webcompat/webcompat.com/backup_db

deepthivenkat commented:

Now I have written a Python script that:

  1. Takes a backup of issues.db.
  2. Deletes issues.db.

Now I want to regenerate issues.db. To do this on my localhost, I run python run.py and issues.db gets regenerated. Then I can run the scripts extract_id_title_url.py and dump_webcompat_to_db.py to get the data dump (the backup step is sketched below).

But this method applies only to localhost.
Should I instead restart the webcompat server?
If so, I want to know how to restart the webcompat server. How is it being done now?
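
For reference, the backup-with-timestamp part could look roughly like this sketch; the paths and names are illustrative, not the actual script:

```python
# Sketch of the backup step: copy issues.db aside with a timestamp suffix.
import os
import shutil
from datetime import datetime

def backup_db(db_path="issues.db", backup_dir="backup_db"):
    os.makedirs(backup_dir, exist_ok=True)
    stamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_dir, "issues-%s.db" % stamp)
    shutil.copy2(db_path, dest)  # copy2 preserves file metadata
    return dest
```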


karlcow commented Jul 25, 2016

> But this method applies only to localhost.

This is fine, because the admins of the webcompat server will likely be the ones to run the script. It's better that a minimal number of people have access to the server itself, for security reasons.

miketaylr commented:

Let's tackle the very first item in #968 (comment). Then go from there.


miketaylr commented Jul 28, 2016

backup existing db: #1145
bootstrap db with GitHub data: #1146

deepthivenkat commented:

@miketaylr There are two separate implementations for extracting the domain name from a URL:

  1. dshgna@ebf615f
  2. form.py

To decide which implementation to go ahead with, I am comparing the domains extracted by both implementations for the same sample set of URLs; the results are tabulated below.
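
A rough reconstruction of the comparison harness, assuming the tldextract package and urllib; this is not the exact code from either implementation, and the dshgna implementation additionally re-attaches the subdomain for a blacklist of domains, described after the results.

```python
# Rough reconstruction of the comparison harness; not the exact code
# from either implementation.
import tldextract
from urllib.parse import urlparse

def compare(url):
    ext = tldextract.extract(url)  # splits subdomain / domain / suffix
    # urlparse only fills netloc when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    print("Input:", url)
    print("TldExtract:", ext.domain)
    print("UrlParser:", parsed.netloc)

compare("https://www.google.co.uk/")  # TldExtract: google / UrlParser: www.google.co.uk
```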

| Input | TldExtract | UrlParser |
| --- | --- | --- |
| http://money.cnn.com/2004/01/05/commentary/game_over/column_gaming/ | cnn | money.cnn.com |
| www.google.co.uk | google | www.google.co.uk |
| https://www.google.co.uk/ | google | www.google.co.uk |
| http://www.mx.iucr.org/iucr-top/comm/cpd/QARR/raw/cpd-1a.raw | iucr | www.mx.iucr.org |
| http://sdpd.univ-lemans.fr/course/week-1/sample2.raw | univ-lemans | sdpd.univ-lemans.fr |
| samples/Cu3Au-1.raw | samples | samples |
| mail.google.com | mail.google | mail.google.com |
| https://www.mail.google.com | mail.google | www.mail.google.com |
| deepthivenkat.github.io | github | deepthivenkat.github.io |
| http://www.empassion.com.au | empassion | www.empassion.com.au |
| https://myaccount.shaw.ca/MyBills | shaw | myaccount.shaw.ca |
| https://touch.www.linkedin.com/ | linkedin | touch.www.linkedin.com |
| http://www.droid-life.com/ | droid-life | www.droid-life.com |
| http://www.nb.no/nbsok/search?mediatype=bøker | nb | www.nb.no |
| http://paramountproperty.my/ | paramountproperty | paramountproperty.my |
| http://m.mlb.com/game/2016/07/07/448148/angels-vs-rays | mlb | m.mlb.com |
| http://news.nicovideo.jp/watch/nw2258329?news_ref=sp_list_latest | nicovideo | news.nicovideo.jp |
| http://www.tsn.ca/draft-day-blog-flames-heating-up-1.514591 | tsn | www.tsn.ca |
| http://beta.euronews.com/ | euronews | beta.euronews.com |
| https://m.reddit.com/r/PersonalFinanceCanada/search | reddit | m.reddit.com |
| https://redsox.m.mlb.tickets.com/seat-select/8158391 | tickets | redsox.m.mlb.tickets.com |
| http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 | mozilla | mxr.mozilla.org |

The URL parser keeps the subdomain and the suffix, such as 'co.uk'. If we use the URL parser, we need to tweak the code to remove 'www.' from the domain names; it retains 'www.' currently.

In TLD Extract, the subdomain, domain, and suffix information is all available. It has been coded to retain only the domain name and discard subdomains, except for the domains in:

blacklist = ['.google.com', '.live.com', '.yahoo.com', 'go.com', '.js']

For the domains in the above list, subdomain.domain is returned (a sketch of this behaviour follows the options below). Now we have two options:

  1. Edit the blacklist and include more domains for which we need to retain subdomains.
  2. Tweak the code to retain the subdomain and suffix for all domains, since they also contain valuable information and help with exact mapping when the user starts typing the subdomain.
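
A sketch of the blacklist behaviour just described, reconstructed for illustration; this is not the actual dshgna code:

```python
# Reconstruction of the described blacklist behaviour; illustrative only.
import tldextract

BLACKLIST = ('.google.com', '.live.com', '.yahoo.com', 'go.com', '.js')

def extract_domain(url):
    ext = tldextract.extract(url)
    host = ".".join(p for p in (ext.subdomain, ext.domain, ext.suffix) if p)
    if ext.subdomain and host.endswith(BLACKLIST):
        # For blacklisted domains, keep the subdomain, e.g. mail.google
        return "%s.%s" % (ext.subdomain, ext.domain)
    return ext.domain
```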

TLD Extract maintains a cache file containing the list of all suffixes, which is updated regularly. I have attached the file.

test.txt

We need to decide whether to go ahead with TLD Extract or the URL parser.


miketaylr commented Aug 3, 2016

(To keep the discussion focused, I asked @deepthivenkat to open an issue for the specific problem she's trying to solve and move the previous comment there, because it's tangential to db migrations.)
