- Web Scraping
# get the code
git clone https://github.com/sottom/scraping_tutorial.git
cd scraping_tutorial
# create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
# install dependencies
pip install -r requirements.txt
# run any file you like
python {any_file}.py
- https://www.ratemyprofessors.com/
- https://www.premierleague.com/players
- https://www.zacks.com/stock/research/AAPL/earnings-announcements
- https://free-proxy-list.net/
- https://www.timeanddate.com/holidays/us/
- https://codepen.io/gaearon/pen/oWWQNa
- Look at and follow the /robots.txt file
- Don't make too many requests
- Don't publish data that isn't yours (be careful about this)
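
As a rough illustration of following `/robots.txt`, the sketch below uses Python's standard `urllib.robotparser` to check whether a path is allowed and whether a crawl delay is requested. The robots.txt content, the `my-scraper` agent name, and the `example.com` URLs are made up for the example; they are not taken from any of the sites listed above.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, for illustration only.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = build_robot_rules(sample_robots_txt)

# "my-scraper" is an assumed User-Agent name.
blocked = rules.can_fetch("my-scraper", "https://example.com/private/data")
allowed = rules.can_fetch("my-scraper", "https://example.com/players")
delay = rules.crawl_delay("my-scraper")

print(blocked, allowed, delay)
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would download the file instead of parsing a string. Honoring the crawl delay between requests also helps with the "don't make too many requests" rule.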
- Yes.
- Change User-Agents
- Change IP addresses
- Don't follow the same pattern every time you scrape
- Scraping Intervals
- Scraping Click Paths
- Click Timing
- Click position is hard, because a click has no `screenX` or `screenY`
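
A minimal sketch of the ideas above (rotating User-Agents, varying intervals, and varying the order of pages visited), using only the standard library. The User-Agent strings, the function names, and the default timings are assumptions chosen for illustration, not values from the tutorial files.

```python
import random

# Illustrative User-Agent strings; in practice, pick from a list of common ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def next_delay(base_seconds=2.0, jitter_seconds=3.0, rng=random):
    """Return a wait time that varies between requests (scraping interval)."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)


def shuffled_paths(paths, rng=random):
    """Return the pages in a random order so the click path is not identical each run."""
    paths = list(paths)
    rng.shuffle(paths)
    return paths


headers = build_headers()
delay = next_delay()
order = shuffled_paths(["/players", "/clubs", "/results"])
print(headers, round(delay, 2), order)
```

In a real scraper, the headers would be passed to the HTTP client on each request and `time.sleep(next_delay())` would run between requests. Changing IP addresses would additionally require a proxy, which this sketch leaves out.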
- Great overview of web scraping
- How to prevent getting blacklisted
- Most common User-Agents
- Solving Captchas
- Really come to understand how the web works
- get data not available from APIs
- interview prep
- check for APIs
- 1_professor.py
- 2_sports.py
- unfortunately, this site has started blocking unauthorized calls.
- check for data in the global scope
- 3_zachs.py
- You will need to download the chromedriver that matches your Chrome version. I have Chrome version 80 on Windows, so I am using that chromedriver.
- 3_zachs.py
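
Checking for data in the global scope usually means looking for a JavaScript variable that the page assigns inside a `<script>` tag. The sketch below shows one way to pull such a value out of raw HTML with `re` and `json`. The variable name `pageData`, the sample HTML, and the helper `extract_global_json` are hypothetical, and this is not necessarily how `3_zachs.py` does it.

```python
import json
import re


def extract_global_json(html, var_name):
    """Return the JSON value assigned to `var_name` in an inline script, or None."""
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (such as the trailing ";").
        value, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value


# Made-up page content, for illustration only.
sample_html = """
<html><body>
<script>
  window.pageData = {"ticker": "XYZ", "rows": [["2020-01-01", "1.23"]]};
</script>
</body></html>
"""

data = extract_global_json(sample_html, "pageData")
print(data)
```

This only works when the assigned value is valid JSON. A JavaScript object literal with unquoted keys or single-quoted strings would need a different approach, such as reading the variable from a real browser.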
- when the data you want is loaded on page startup
- 4_proxy.py
- 5_holidays.py
- could use regex, could use other parsers, doesn't matter
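
When the data is already in the HTML at page load, any parser will do. As one possible sketch, the example below reads the rows of a table with the standard library's `html.parser`. The `TableParser` class and the sample table are assumptions for illustration; the tutorial files may well use a different parser.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Sample markup standing in for a downloaded page.
sample_html = """
<table>
  <tr><th>Date</th><th>Holiday</th></tr>
  <tr><td>Jan 1</td><td>New Year's Day</td></tr>
  <tr><td>Jul 4</td><td>Independence Day</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
print(parser.rows)
```

If the page is rendered by JavaScript, the downloaded HTML will not contain the table and this approach returns no rows.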
- when code is rendered by javascript, otherwise you don't get what you expect (React App)
- when you need to login
- use for web automation
- 6_holidays2.py
- learningsuite (not included)
- when selenium doesn't work
- chrome extensions do a ton more than scrape
- out of scope
- When you want to run a big operation and scrape on multiple threads (for complex projects)
- comparison site
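
To give a feel for scraping on multiple threads, here is a small standard-library sketch built on `concurrent.futures.ThreadPoolExecutor`. It is a simplified stand-in, not a replacement for a full scraping framework. The `scrape_all` helper, the `fake_fetch` function, and the URLs are hypothetical; a real `fetch` would download and parse each page.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, max_workers=4):
    """Run `fetch` on every URL using a pool of threads; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for a real download: just report the URL length."""
    return len(url)


pages = scrape_all(
    ["https://example.com/a", "https://example.com/bb"],
    fake_fetch,
)
print(pages)
```

Keep `max_workers` low and add delays inside `fetch` so that running in parallel does not turn into making too many requests.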