Discussion for Issue database needs a migration script. #1131

Closed · 1 of 6 tasks

deepthivenkat opened this issue Jul 14, 2016 · 15 comments
Comments


deepthivenkat commented Jul 14, 2016

I have some questions about Issue #968 - writing a migration script for the issues.db backup.

Issue #968 is a blocker for Issue #865 - backend implementation of detecting bugs with a similar domain.

The subtasks for the script, as mentioned by @karlcow in Issue #968, are:

  • Stop webcompat.com
  • Backup the existing DB file.
  • Delete the DB.
  • "Dump to DB": parse GitHub (pagination involved, probably many HTTP requests; see the sketch after this list):
    • Check the HTTP status code for each request
    • Grab the pagination link
    • Check that the body contains the required information
    • Report any issues during the dump to DB.
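
For the "dump to DB" step, here is a minimal sketch of what the paginated fetch could look like, assuming the public GitHub issues API and the requests library; the function and variable names are illustrative, not taken from the actual script.

```python
# Hypothetical sketch of the "dump to DB" fetch loop; not the actual script.
import requests

def fetch_all_issues(repo="webcompat/web-bugs"):
    """Fetch every issue, page by page, following the pagination links."""
    url = "https://api.github.com/repos/%s/issues?state=all&per_page=100" % repo
    issues = []
    while url:
        response = requests.get(url)
        # Check the HTTP status code for each request.
        if response.status_code != 200:
            raise RuntimeError("GitHub returned %s for %s"
                               % (response.status_code, url))
        body = response.json()
        # Check that the body contains the required information.
        if not all("number" in issue and "title" in issue for issue in body):
            raise ValueError("Unexpected payload shape at %s" % url)
        issues.extend(body)
        # Grab the pagination link (requests parses the Link header for us).
        url = response.links.get("next", {}).get("url")
    return issues
```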

Currently, there are two scripts in the Issue-parser repository:

  1. extract_id_title_url.py - this script reads the set of all open issues from web-bugs and creates two files:
    a) webcompatdata-bzlike.json
    b) webcompatdata.csv

Right now these files contain only 254 issues, the number of issues that were open in the web-bugs repository when I ran extract_id_title_url.py.

  2. dump_webcompat_to_db.py - this script adds the contents of webcompatdata-bzlike.json to the webcompat_issues table in the issues.db file.

Additionally, I am adding a few more subtasks:

  • Modify the dump-to-db script to read issues irrespective of status (currently only open issues are extracted)
  • Add a separate entry for status in issues.db and update it when the issue status changes

Questions to be discussed:

  1. Should the data being backed up be the webcompat_issues table or the whole issues.db file?
  2. Can I use libraries like:
    - rdiff-backup
    - bup
    If we make use of these, the backup data may not be saved on the webcompat.com server. We need to investigate this.
  3. Right now, I plan to append a timestamp to the backup file every time the script is run. We could set up a cron job to take a backup periodically (a sample crontab entry follows this list). Why should we stop webcompat.com? Wouldn't the issues reported during a backup be extracted the next time the backup is taken? Does running a cron job defeat the purpose of the migration script? I am not sure.
  4. Should I also include code for creating a local backup copy on the machine of the person running the migration script, in addition to the server copy?
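
For question 3, the cron idea could be as simple as the following crontab entry; the script path and schedule here are made up for illustration:

```
# Hypothetical crontab line: run the backup script every day at 03:00.
0 3 * * * /usr/bin/python /path/to/backup_issues_db.py
```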

Any ideas? @miketaylr @magsout @adamopenweb @karlcow


karlcow commented Jul 15, 2016

  • About 2: I don't think you need an incremental backup system. We don't intend to keep the DB for recovery in the future (or at least not yet; GitHub didn't kick us out).
  • About 3 and 4, the scope of the project:
    • Let's say we have issues.db, which has a record of our issues.
    • Each time a new issue is added through webcompat.com, it is also recorded in issues.db.
    • issues.db is used, for now, to do dynamic domain search.
    • If an issue is added through GitHub and not through webcompat.com, issues.db is incomplete. So we need a script to refresh issues.db with the latest data. It could be once a day.
    • During the process of mv issues_new.db issues.db, if the Flask app is trying to write to issues.db, we will get a mess (that's why I talked about a migration script; see the sketch below).
    • The dump of the db has to be done prior to the mv so we do not block for long.

but I'm open to suggestions.
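
One way to avoid that mess, sketched below under the assumption that issues_new.db is built completely first, is to swap the files in a single rename, which is atomic on a POSIX filesystem. Note that a connection the Flask app already holds would still point at the old file until it reconnects.

```python
# Illustrative sketch of the swap step; not the actual migration script.
import os

def swap_in_new_db(new_path="issues_new.db", live_path="issues.db"):
    # os.rename() replaces the target atomically on the same POSIX
    # filesystem, so no reader ever sees a half-written issues.db.
    os.rename(new_path, live_path)
```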

The columns for the table:

  • id
  • issue_number
  • issue_title
  • issue_status
  • issue_updated_date (aka last time it was updated, practical if we end up using the DB for feeds too)
  • issue_creation_date
  • domain_name

something else?
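
For illustration, those columns as a SQLite table created from Python; the column types are assumptions, nothing here is settled.

```python
# Sketch only: the proposed webcompat_issues columns as a SQLite schema.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS webcompat_issues (
    id INTEGER PRIMARY KEY,
    issue_number INTEGER UNIQUE,
    issue_title TEXT,
    issue_status TEXT,
    issue_updated_date TEXT,
    issue_creation_date TEXT,
    domain_name TEXT
)
"""

with sqlite3.connect("issues.db") as conn:
    conn.execute(SCHEMA)
```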

deepthivenkat added a commit to deepthivenkat/webcompat.com that referenced this issue Jul 20, 2016
deepthivenkat commented:

I plan to append a timestamp to the backup file every time the script is run. I want to store this backup db file in a git repository under github.com/webcompat, but I do not have permission to create a new repository.

@karlcow Is there an alternative way to store the backup file on the webcompat server?


karlcow commented Jul 21, 2016

@deepthivenkat What is the intent of keeping all versions of the DB, in the current circumstances of the project? My feeling is that it is too early, and it doesn't really work given that multiple people might play with this script.

@miketaylr ?


deepthivenkat commented Jul 21, 2016

One possible intent may be to track the changes in the db and have the option to retrieve any version of the backup available in the repo.

If we keep all versions, we end up making multiple copies of almost the same set of issues. If that is not necessary, should we keep only the latest version of the db? Is storing the latest version in a new GitHub repository ok?


deepthivenkat commented Jul 21, 2016

@karlcow @miketaylr Why do we need the 'status' column in the webcompat_issues table? Or do we need a column called issue_state instead?

The values of issue_state would be 'OPEN' and 'RESOLVED'.

miketaylr commented:

> If an issue is added through GitHub and not through webcompat.com, issues.db is incomplete. So we need a script to refresh issues.db with the latest data. It could be once a day.

(I don't think this is true: as long as the webcompat server is online, GitHub will send issue-creation hook payloads to us, even if the issues come from GitHub. But if the server is down, we'll miss them.)
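
For context, a minimal illustration of the mechanism Mike describes: GitHub POSTs an "issues" event payload to a webhook endpoint, which can record the new issue. The route and the save_issue helper below are invented for illustration; this is not webcompat.com's actual code.

```python
# Hypothetical webhook receiver; the route and save_issue() are invented
# for illustration and are not webcompat.com's actual implementation.
from flask import Flask, request

app = Flask(__name__)

def save_issue(number, title):
    """Stub: a real handler would upsert the issue into issues.db."""
    print("recording issue #%s: %s" % (number, title))

@app.route("/webhooks/issues", methods=["POST"])
def issues_hook():
    payload = request.get_json()
    # GitHub sends an "issues" event whenever an issue is opened,
    # even when it was filed directly on github.com.
    if payload and payload.get("action") == "opened":
        issue = payload["issue"]
        save_issue(issue["number"], issue["title"])
    return "", 204
```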

miketaylr commented:

> One possible intent may be to track the changes in the db and have the option to retrieve any version of the backup available in the repo.

I'm trying to think of why we would want this... is there a specific use case we're trying to address with this? Seems like just having the most recent one is good enough.

miketaylr commented:

> Is storing the latest version in a new GitHub repository ok?

It seems like putting database backups in a git repo is probably overkill. What if we just made a directory somewhere in the project root (which is .gitignore'd) and copied the backup file there?
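
That could be as small as one ignore rule plus a copy; the directory name backup_db/ here is just an example:

```
# .gitignore — keep local database backups out of version control
backup_db/
```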

deepthivenkat commented:

There is no specific use case for having all the versions of the backup db file.

Cool! I will just add the recent one to a folder like:

https://github.com/webcompat/webcompat.com/backup_db

deepthivenkat commented:

Now I have written a Python script that:

  1. Takes a backup of issues.db.
  2. Deletes issues.db.

Now I want to regenerate issues.db. To do this on my localhost, I run python run.py and issues.db gets regenerated. Then I can run the scripts extract_id_title_url.py and dump_webcompat_to_db.py to get the data dump (the backup step is sketched below).

But this method applies only to localhost.
Should I instead restart the webcompat server?
If so, I want to know how to restart the webcompat server. How is it being done now?
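
For reference, the backup-with-timestamp part could look roughly like this sketch; the paths and names are illustrative, not the actual script:

```python
# Sketch of the backup step: copy issues.db aside with a timestamp suffix.
import os
import shutil
from datetime import datetime

def backup_db(db_path="issues.db", backup_dir="backup_db"):
    os.makedirs(backup_dir, exist_ok=True)
    stamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_dir, "issues-%s.db" % stamp)
    shutil.copy2(db_path, dest)  # copy2 preserves file metadata
    return dest
```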


karlcow commented Jul 25, 2016

> But this method applies only to localhost.

This is fine, because the admins of the webcompat server will likely be the ones to run the script. It's better that a minimal number of people have access to the server itself, for security reasons.

miketaylr commented:

Let's tackle the very first item in #968 (comment). Then go from there.


miketaylr commented Jul 28, 2016

backup existing db: #1145
bootstrap db with GitHub data: #1146

deepthivenkat commented:

@miketaylr There are two separate implementations for extracting the domain name from a URL:

  1. dshgna@ebf615f
  2. form.py

To decide which implementation to go ahead with, I am comparing the domains extracted by both implementations for the same sample set of URLs; the results are tabulated below.
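
A rough reconstruction of the comparison harness, assuming the tldextract package and urllib; this is not the exact code from either implementation, and the dshgna implementation additionally re-attaches the subdomain for a blacklist of domains, described after the results.

```python
# Rough reconstruction of the comparison harness; not the exact code
# from either implementation.
import tldextract
from urllib.parse import urlparse

def compare(url):
    ext = tldextract.extract(url)  # splits subdomain / domain / suffix
    # urlparse only fills netloc when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    print("Input:", url)
    print("TldExtract:", ext.domain)
    print("UrlParser:", parsed.netloc)

compare("https://www.google.co.uk/")  # TldExtract: google / UrlParser: www.google.co.uk
```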

| Input | TldExtract | UrlParser |
| --- | --- | --- |
| http://money.cnn.com/2004/01/05/commentary/game_over/column_gaming/ | cnn | money.cnn.com |
| www.google.co.uk | google | www.google.co.uk |
| https://www.google.co.uk/ | google | www.google.co.uk |
| http://www.mx.iucr.org/iucr-top/comm/cpd/QARR/raw/cpd-1a.raw | iucr | www.mx.iucr.org |
| http://sdpd.univ-lemans.fr/course/week-1/sample2.raw | univ-lemans | sdpd.univ-lemans.fr |
| samples/Cu3Au-1.raw | samples | samples |
| mail.google.com | mail.google | mail.google.com |
| https://www.mail.google.com | mail.google | www.mail.google.com |
| deepthivenkat.github.io | github | deepthivenkat.github.io |
| http://www.empassion.com.au | empassion | www.empassion.com.au |
| https://myaccount.shaw.ca/MyBills | shaw | myaccount.shaw.ca |
| https://touch.www.linkedin.com/ | linkedin | touch.www.linkedin.com |
| http://www.droid-life.com/ | droid-life | www.droid-life.com |
| http://www.nb.no/nbsok/search?mediatype=bøker | nb | www.nb.no |
| http://paramountproperty.my/ | paramountproperty | paramountproperty.my |
| http://m.mlb.com/game/2016/07/07/448148/angels-vs-rays | mlb | m.mlb.com |
| http://news.nicovideo.jp/watch/nw2258329?news_ref=sp_list_latest | nicovideo | news.nicovideo.jp |
| http://www.tsn.ca/draft-day-blog-flames-heating-up-1.514591 | tsn | www.tsn.ca |
| http://beta.euronews.com/ | euronews | beta.euronews.com |
| https://m.reddit.com/r/PersonalFinanceCanada/search | reddit | m.reddit.com |
| https://redsox.m.mlb.tickets.com/seat-select/8158391 | tickets | redsox.m.mlb.tickets.com |
| http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 | mozilla | mxr.mozilla.org |

The URL parser keeps the subdomain and the suffix, such as 'co.uk'. If we use the URL parser, we need to tweak the code to remove 'www.' from the domain names; it retains 'www.' currently.

In TLD Extract, the subdomain, domain, and suffix information is all available. It has been coded to retain only the domain name and discard subdomains, except for the domains in:

blacklist = ['.google.com', '.live.com', '.yahoo.com', 'go.com', '.js']

For the domains in the above list, subdomain.domain is returned (a sketch of this behaviour follows the options below). Now we have two options:

  1. Edit the blacklist and include more domains for which we need to retain subdomains.
  2. Tweak the code to retain the subdomain and suffix for all domains, since they also contain valuable information and help with exact mapping when the user starts typing the subdomain.
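
A sketch of the blacklist behaviour just described, reconstructed for illustration; this is not the actual dshgna code:

```python
# Reconstruction of the described blacklist behaviour; illustrative only.
import tldextract

BLACKLIST = ('.google.com', '.live.com', '.yahoo.com', 'go.com', '.js')

def extract_domain(url):
    ext = tldextract.extract(url)
    host = ".".join(p for p in (ext.subdomain, ext.domain, ext.suffix) if p)
    if ext.subdomain and host.endswith(BLACKLIST):
        # For blacklisted domains, keep the subdomain, e.g. mail.google
        return "%s.%s" % (ext.subdomain, ext.domain)
    return ext.domain
```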

TLD Extract maintains a cache file containing the list of all suffixes, which is updated regularly. I have attached the file.

test.txt

We need to decide whether to go ahead with TLD Extract or the URL parser.


miketaylr commented Aug 3, 2016

(To keep the discussion focused, I asked @deepthivenkat to open an issue for the specific problem she's trying to solve and move the previous comment there, because it's tangential to db migrations.)
