A small web scraper for Transparency International-Russia that supported an investigation into horse-related tax evasion.
Hippodrom.ru is a Russian website about horse racing. It maintains a database of race horses, horse owners, breeders, race results, etc. However, you can't query the database in an arbitrary way.
This script helps you find out how much money the horses won for their owner: it compiles a table of all of an owner's horses that won prizes.
The output is an .xlsx or .csv file with the following fields:
- horse_name: name of the horse
- horse_url: link to the horse's page
- race_url: link to the race page
- race_date: date of the race
- owner: name of the horse owner at the time of the race
- prize: prize money that the horse won
- currency: currency of the prize
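Once a .csv file has been produced, the fields listed above can be aggregated with the standard library. The following is a minimal sketch, not part of the project: it assumes the file has a header row with exactly these field names, is UTF-8 encoded, and stores `prize` as a plain number. The function name `total_prizes` is hypothetical.

```python
import csv
from collections import defaultdict
from decimal import Decimal


def total_prizes(csv_path):
    """Sum prize money per (horse_name, currency) pair from a results CSV.

    Assumes the column names match the fields described in this README
    and that 'prize' holds a plain numeric string (an assumption).
    """
    totals = defaultdict(Decimal)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prize = (row.get("prize") or "").strip()
            if not prize:
                continue  # skip rows with no prize recorded
            # Keep currencies separate so different currencies are never added together.
            totals[(row["horse_name"], row["currency"])] += Decimal(prize)
    return dict(totals)
```

Calling `total_prizes("results/some_owner.csv")` (a hypothetical path) would return a dictionary keyed by horse name and currency, which keeps prizes paid in different currencies apart.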
- Clone or download https://github.com/Bormoglot/horse_scraper
  cd horse_scraper
- Create a virtual environment:
  python -m venv .
- Activate a virtual environment:
  - Windows cmd:
    .\Scripts\activate.bat
    (to deactivate use .\Scripts\deactivate.bat)
  - Windows PowerShell:
    .\Scripts\Activate.ps1
    (to deactivate use deactivate)
- Install the dependencies (Windows cmd):
  pip install -r requirements.txt
The script runs on Python >= 3.6.
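To confirm that the active interpreter meets this requirement before running the scraper, a small check like the one below can be used. It is only a sketch: `MIN_PYTHON` and `python_is_supported` are hypothetical names and are not part of the project.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info):
    """Return True if the given interpreter version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


print("Python version OK" if python_is_supported() else "Python 3.6 or newer is required")
```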
python horse_scraper.py <owner_id> --min_year <2010> --path <somewhere/there> -csv
<owner_id> - the last digits in URLs like https://hippodrom.ru/modules/owners/owner.php?owner_id=582
--min_year <2010> - optional, the earliest race year to parse. Default year is 2005.
--path <somewhere/there> - optional, path to the place to save results. By default creates directory 'results'.
-csv - optional, change default .xlsx output format to .csv
For help on arguments use python horse_scraper.py --help.
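The arguments described above could be modelled roughly as follows. This is an illustrative sketch only: whether the script actually uses `argparse` is an assumption, the argument types are guesses based on the descriptions, and `owner_id_from_url` and `build_parser` are hypothetical helpers that do not come from the project.

```python
import argparse
from urllib.parse import parse_qs, urlparse


def owner_id_from_url(url):
    """Extract the numeric owner_id query parameter from an owner page URL."""
    values = parse_qs(urlparse(url).query).get("owner_id")
    if not values or not values[0].isdigit():
        raise ValueError(f"no numeric owner_id in {url!r}")
    return int(values[0])


def build_parser():
    """Hypothetical parser mirroring the documented command-line arguments."""
    parser = argparse.ArgumentParser(prog="horse_scraper.py")
    parser.add_argument("owner_id", type=int,
                        help="last digits of the owner page URL")
    parser.add_argument("--min_year", type=int, default=2005,
                        help="earliest race year to parse")
    parser.add_argument("--path", default="results",
                        help="directory in which to save results")
    # A single-dash multi-letter flag, as documented; argparse stores it as args.csv.
    parser.add_argument("-csv", action="store_true",
                        help="write .csv instead of the default .xlsx")
    return parser
```

For instance, `owner_id_from_url("https://hippodrom.ru/modules/owners/owner.php?owner_id=582")` yields the value to pass as `<owner_id>`, and `build_parser().parse_args(["582", "-csv"])` falls back to the documented defaults for the year and the output directory.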