Discussion for Issue database needs a migration script. #1131
but I'm open to suggestions. The columns for the table:
something else?
I plan to append a timestamp to the backup file every time the script is run. I want to store this backup db file in a git repository under github.com/webcompat. I do not have permission to create a new repository. @karlcow Is there an alternative way to store the backup file on the webcompat server?
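A minimal sketch of the timestamped-backup idea; the paths and file names here are illustrative, not the actual script:

```python
# Illustrative sketch only: copy the current db to a timestamped backup.
# Assumes Python 3 and that issues.db sits in the working directory.
import os
import shutil
import time

backup_dir = "backup_db"
os.makedirs(backup_dir, exist_ok=True)

stamp = time.strftime("%Y%m%d-%H%M%S")
shutil.copyfile("issues.db", os.path.join(backup_dir, "issues-%s.db" % stamp))
```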
@deepthivenkat What is the intent of keeping all versions of the DB (that is, in the current circumstances of the project)? My feeling is that it is too early, and it doesn't really work given that multiple people might play with this script.
One possible intent may be to track changes in the db and to have the option of retrieving any version of the backup available in the repo. But if we keep all versions, we end up making multiple copies of almost the same set of issues.
@karlcow @miketaylr Why do we need the 'status' column in the webcompat_issues table? Do we need a column called issue_state for the webcompat_issues table? The values of issue_state would be 'OPEN' and 'RESOLVED'.
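For concreteness, a hypothetical sqlite3 sketch of the proposed issue_state column; the schema here is invented for illustration and is not the project's actual one:

```python
# Hypothetical schema sketch; not the project's real schema.
import sqlite3

conn = sqlite3.connect("issues.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS webcompat_issues (
        id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT,
        -- issue_state as proposed: either 'OPEN' or 'RESOLVED'
        issue_state TEXT CHECK (issue_state IN ('OPEN', 'RESOLVED'))
    )
""")
conn.commit()
conn.close()
```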
(I don't think this is true: as long as the webcompat server is online, GitHub will send issue creation hook payloads to us, even if the issues come from GitHub. But if the server is down, we'll miss them.)
I'm trying to think of why we would want this... is there a specific use case we're trying to address with this? Seems like just having the most recent one is good enough.
It seems like putting database backups in a git repo is probably overkill. What if we just made a directory somewhere in the project root (which is …
There is no specific use case for having all the versions of the backup db file. Cool! I will just add the most recent one to a folder like: https://github.com/webcompat/webcompat.com/backup_db
Now I have written a Python script that:
Now I want to regenerate issues.db. To do this on my local host, I run python run.py, and issues.db gets regenerated. Then I can run the scripts extract_id_title_url.py and dump_webcompat_to_db.py to get the data dump. But this method applies only to my local host.
This is fine, because the admins of the webcompat server will likely be the ones to run the script. It's better that a minimal number of people have access to the server itself, for security reasons.
Let's tackle the very first item in #968 (comment). Then go from there.
@miketaylr There are two separate implementations for extracting the domain name from a URL. To decide which implementation to go ahead with, I am checking the domain extracted by both implementations for the same sample set of URLs:

- http://money.cnn.com/2004/01/05/commentary/game_over/column_gaming/
- www.google.co.uk
- https://www.google.co.uk/
- http://www.mx.iucr.org/iucr-top/comm/cpd/QARR/raw/cpd-1a.raw
- http://sdpd.univ-lemans.fr/course/week-1/sample2.raw
- samples/Cu3Au-1.raw
- mail.google.com
- https://www.mail.google.com
- deepthivenkat.github.io
- http://www.empassion.com.au
- https://myaccount.shaw.ca/MyBills
- https://touch.www.linkedin.com/
- http://www.droid-life.com/
- http://www.nb.no/nbsok/search?mediatype=bøker
- http://paramountproperty.my/
- http://m.mlb.com/game/2016/07/07/448148/angels-vs-rays
- http://news.nicovideo.jp/watch/nw2258329?news_ref=sp_list_latest
- http://www.tsn.ca/draft-day-blog-flames-heating-up-1.514591
- http://beta.euronews.com/
- https://m.reddit.com/r/PersonalFinanceCanada/search
- https://redsox.m.mlb.tickets.com/seat-select/8158391
- http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

The URL parser keeps the subdomain and the suffix (such as 'co.uk'). TLD Extract exposes the subdomain, domain, and suffix separately; it has been coded to retain only the domain name and discard subdomains, except for the domains in:

blacklist = ['.google.com', '.live.com', '.yahoo.com', 'go.com', '.js']

For the domains in the above list, subdomain.domain is returned. Now we have 2 options:
TLD Extract maintains a cache file containing the list of all suffixes, which is updated regularly. I have attached the file. We need to decide whether to go ahead with TLD Extract or the URL parser.
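For reference, a small sketch of how the two approaches differ on one of the sample URLs above, assuming Python 3 and the tldextract package:

```python
# Compare urlparse (keeps the full host) with tldextract (splits it).
from urllib.parse import urlparse

import tldextract  # pip install tldextract

url = "https://redsox.m.mlb.tickets.com/seat-select/8158391"

print(urlparse(url).netloc)
# -> redsox.m.mlb.tickets.com  (subdomains and suffix kept)

ext = tldextract.extract(url)
print(ext.subdomain, ext.domain, ext.suffix)
# -> redsox.m.mlb tickets com
print(ext.domain + "." + ext.suffix)
# -> tickets.com  (bare registrable domain)
```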
(To keep the discussion focused, I asked @deepthivenkat to open an issue for the specific problem she's trying to solve and move the previous comment there, because it's tangential to db migrations.)
I have some questions about Issue #968 - writing a migration script for issues.db backup.
Issue #968 is a blocker for Issue #865 - Backend implementation of detect bugs with similar domain
The subtasks for the script, as mentioned by @karlcow in Issue #968 (a rough sketch follows the list):
- Check the HTTP status code for each request
- Grab the pagination link
- Check that the body contains the required information
- Report any issues during the dump to the DB
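A rough sketch of how those four subtasks could fit together, assuming the requests library and the public GitHub issues API; the actual db-insert step is elided:

```python
# Illustrative dump loop; endpoint and field names are assumptions.
import requests

url = "https://api.github.com/repos/webcompat/web-bugs/issues"
params = {"state": "open", "per_page": 100}

while url:
    resp = requests.get(url, params=params)
    # 1. Check the HTTP status code for each request.
    if resp.status_code != 200:
        # 4. Report any issue during the dump.
        print("Dump stopped: HTTP %d on %s" % (resp.status_code, url))
        break
    # 3. Check that the body contains the required information.
    for issue in resp.json():
        if not all(key in issue for key in ("number", "title", "html_url")):
            print("Skipping malformed payload")
            continue
        # ... write (number, title, url) to issues.db here ...
    # 2. Grab the pagination link from the Link header.
    url = resp.links.get("next", {}).get("url")
    params = None  # the "next" URL already embeds the query string
```

In practice an auth token would likely be needed as well, to stay under GitHub's anonymous rate limits.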
Currently there are two scripts in the issue-parser repository: extract_id_title_url.py and dump_webcompat_to_db.py. extract_id_title_url.py extracts the open issues from web-bugs and creates two files: a) webcompatdata-bzlike.json, b) webcompatdata.csv.
Right now these files contain only 254 issues, the number of issues that were open in the web-bugs repository when I ran extract_id_title_url.py.
Additionally, I am adding a few more subtasks:
Questions to be discussed:
- rdiff-backup
- bup
If we make use of these, the backup data may not be saved on the webcompat.com server itself. We need to investigate this.
Does running a cron job defeat the purpose of the migration script? I am not sure.
Any ideas? @miketaylr @magsout @adamopenweb @karlcow