Two data sources are used: (1) Wikipedia articles of politicians and (2) world population data.
Wikipedia articles - The Wikipedia articles can be found on Figshare. The dataset contains politicians by country from the English-language Wikipedia. The already downloaded dataset can be found in Wiki articles.
Population data - This dataset is drawn from the world population datasheet published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). The already downloaded dataset can be found in World population data.
The predicted quality scores for each article in the Wikipedia dataset are accessed via a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:
ID | Quality Category | Explanation |
---|---|---|
1 | FA | Featured article |
2 | GA | Good article |
3 | B | B-class article |
4 | C | C-class article |
5 | Start | Start-class article |
6 | Stub | Stub-class article |
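As a rough illustration of how a set of per-category probabilities maps to a single quality category, the sketch below picks the most probable one. The function name `most_likely_category` and the sample probabilities are hypothetical and not taken from this project or from a real ORES response.

```python
# Quality categories from the table above, ordered from best to worst.
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]


def most_likely_category(probabilities):
    """Return the category with the highest probability.

    Ties go to the better category, which is an arbitrary choice
    made for this sketch.
    """
    return max(QUALITY_ORDER, key=lambda cat: probabilities.get(cat, 0.0))


# Illustrative numbers only; not a real ORES result.
example = {"FA": 0.05, "GA": 0.10, "B": 0.15, "C": 0.20, "Start": 0.30, "Stub": 0.20}
print(most_likely_category(example))
```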
The ORES REST API is used to access ORES scores. It expects the following parameters:
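A minimal sketch of how a request to the ORES v3 scores endpoint might be assembled and its response read, using only the standard library. The project name `enwiki`, the model name `articlequality`, and the helper names `build_ores_url` and `extract_predictions` are assumptions made for this example; the notebook in this repository may do this differently.

```python
from urllib.parse import urlencode

# ORES v3 scores endpoint; the project (wiki) is part of the path.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{project}/"


def build_ores_url(rev_ids, project="enwiki", model="articlequality"):
    """Build a scores URL for a batch of revision IDs (joined with '|')."""
    query = urlencode({"models": model, "revids": "|".join(str(r) for r in rev_ids)})
    return ORES_ENDPOINT.format(project=project) + "?" + query


def extract_predictions(response_json, project="enwiki", model="articlequality"):
    """Map each revision ID to its predicted category, or None if no score."""
    scores = response_json.get(project, {}).get("scores", {})
    predictions = {}
    for rev_id, models in scores.items():
        score = models.get(model, {}).get("score")
        predictions[rev_id] = score["prediction"] if score else None
    return predictions
```

The resulting URL can be fetched with `requests.get(url).json()` (requests is already listed in the dependencies), and the parsed JSON passed to `extract_predictions`. Revisions for which ORES returns no score are kept with a value of `None` so they can be logged or dropped later.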
The "Amazing Python Data Workflow with Poetry, Pandas, and Jupyter" [1] is used for this project.
To take part in the course, please ensure that you have Python 3.6.1 or later, a working installation of Poetry, and git installed.
- Clone this repository (or use SSH) and change into the repo root.

      git clone https://github.com/Francosinus/A3-hcds-hcc.git
      cd A3-hcds-hcc

- Install the dependencies in the repo root.

      poetry install

- Create a subshell within the virtual environment by running:

      poetry shell

- Open the project with Jupyter in your browser.

      jupyter notebook

Dependencies:
python = "^3.6.1"
pandas = "^1.1.3"
jupyter = "^1.0.0"
ipykernel = "^5.3.4"
requests = "^2.24.0"
matplotlib = "^3.3.2"
seaborn = "^0.11.0"
altair = "^4.1.0"
bs4 = "^0.0.1"
xlrd = "^1.2.0"
The analysis consists of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country and for each region. A "high-quality" article is an article that ORES predicted as FA (featured article) or GA (good article).
Examples:
- If a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.
- If a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality-articles would be 20%.
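The two examples can be reproduced with a few lines of Python. This is a sketch only: the helper name `percentage` is hypothetical, and the notebook may compute the same values with pandas instead.

```python
def percentage(part, whole):
    """Return part / whole as a percentage, or None when whole is zero."""
    if whole == 0:
        return None
    return 100 * part / whole


# Example 1: 10 articles in a country of 10,000 people.
articles_per_population = percentage(10, 10_000)

# Example 2: 2 FA/GA articles out of 10 articles.
high_quality_share = percentage(2, 10)

print(f"{articles_per_population:.1f}%")
print(f"{high_quality_share:.0f}%")
```

Returning `None` for a zero denominator is a design choice for this sketch; it keeps countries without population data or without articles from raising a division error.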
The final tables can be found in the folder Results.
- Problems when installing poetry? After installation, poetry is not automatically on your PATH, so running poetry --version does nothing. If you use zsh or oh-my-zsh, add the following line to your .zshrc file:

      export PATH="$HOME/.poetry/bin:$PATH"

- Trouble with previewing notebooks directly in GitHub? --> https://nbviewer.jupyter.org/
[1] https://mungingdata.com/python/jupyter-workflow-poetry-pandas/, accessed: 2020-10-28