Web Scraping

Getting Started

# get the code
git clone https://github.com/sottom/scraping_tutorial.git
cd scraping_tutorial

# create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate

# install dependencies
pip install -r requirements.txt

# run any file you like
python {any_file}.py

Websites for Scraping Tutorial (Tech Talk):

Important Points

Be Respectful

Read and follow the site's /robots.txt file
Don't make too many requests in a short time
Don't publish data that isn't yours (be careful about this)
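The first two points can be sketched with the standard library alone. This is a minimal sketch, not code from the repo: the robots.txt content, the `my-tutorial-bot` user agent, and the helper names are made up for illustration. A real scraper would download the file from the site (for example with `RobotFileParser.set_url` and `read`) and then `time.sleep(delay)` between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, user_agent, urls, default_delay=1.0):
    """Return (urls we may fetch, seconds to wait between requests)."""
    # Use the site's Crawl-delay if it declares one, else our own default.
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, delay


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed, delay = polite_fetch_plan(
    parser,
    "my-tutorial-bot",  # hypothetical user agent name
    ["https://example.com/", "https://example.com/private/data"],
)
print(allowed, delay)
```
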

Can websites figure out you're scraping them?

Yes.

How do I make my scraper more humanlike?

Change User-Agents
Change IP addresses
Don't follow the same pattern every time you scrape
- Scraping Intervals
- Scraping Click Paths
- Click Timing
- Click position is hard to vary, because a scripted click carries no meaningful screenX or screenY
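A rough sketch of the first and third ideas (changing User-Agents and not repeating the same pattern), using only the standard library. The User-Agent strings and helper names here are illustrative assumptions, not values from the repo; swap in whatever fits your setup.

```python
import random

# Illustrative User-Agent strings (assumptions; replace with current ones).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]


def random_headers(rng=random):
    """Pick a different User-Agent header for each request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def random_delay(base_seconds=2.0, jitter_seconds=3.0, rng=random):
    """Seconds to wait before the next request: a base plus random jitter."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def shuffled_visit_order(urls, rng=random):
    """Return the URLs in a random order so the click path varies per run."""
    order = list(urls)
    rng.shuffle(order)
    return order
```

With `requests` installed, these would plug in as `requests.get(url, headers=random_headers())` followed by `time.sleep(random_delay())`. Changing IP addresses needs a proxy and is not covered by this sketch.
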

When you run into issues, start Googling

Why Scrape?

Come to really understand how the web works
Get data that isn't available from APIs
Interview prep

What to do before scraping

Check for APIs
- 1_professors.py
- 2_sports.py
  - Unfortunately, this site has started blocking unauthorized calls.
Check for data in the global scope
- 3_zachs.py
  - You will need to download the chromedriver that matches your Chrome version. I have Chrome version 80 on Windows, so I am using that chromedriver.
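When a page assigns its data to a global variable inside an inline script tag, you can sometimes read it straight from the HTML without a browser. The sketch below is a lighter alternative to the chromedriver approach, not what 3_zachs.py does. The variable name `window.__DATA__` and the sample page are hypothetical, and the regex assumes the assignment is a JSON object that ends right before `</script>`.

```python
import json
import re

# Hypothetical page source with data assigned to a global variable.
SAMPLE_PAGE = """
<html><body>
<script>window.__DATA__ = {"items": [{"name": "example", "price": 7.5}]};</script>
</body></html>
"""


def extract_global_json(html, variable_name):
    """Return the JSON object assigned to window.<variable_name>, or None."""
    pattern = re.compile(
        r"window\." + re.escape(variable_name)
        + r"\s*=\s*(\{.*?\})\s*;\s*</script>",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_global_json(SAMPLE_PAGE, "__DATA__")
print(data["items"][0]["name"])
```

If the variable is built up by JavaScript at runtime rather than written inline, this will not find it, and reading it through a real browser is the fallback.
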

Requests & BeautifulSoup

When to use it

When the data you want is already in the HTML returned by the initial page load

Example usage

4_proxy.py
5_holidays.py

Notes

You could use regex or another parser instead of BeautifulSoup; the approach is the same
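To illustrate the point about other parsers, here is a minimal sketch that pulls table cells out of static HTML with Python's built-in `html.parser` instead of BeautifulSoup. The sample table, the `name` class, and the helper names are made up for the example; they are not taken from 5_holidays.py.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, standing in for a downloaded page.
HOLIDAY_HTML = """
<table>
  <tr><td class="name">New Year's Day</td><td class="date">January 1</td></tr>
  <tr><td class="name">Independence Day</td><td class="date">July 4</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class matches css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == self.css_class:
            self._inside = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.cells[-1] += data


def extract_cells(html, css_class):
    """Return the stripped text of all matching cells, in document order."""
    collector = CellCollector(css_class)
    collector.feed(html)
    collector.close()
    return [cell.strip() for cell in collector.cells]


print(extract_cells(HOLIDAY_HTML, "name"))
```

BeautifulSoup does the same job with far less code, which is why the examples in this repo use it; the built-in parser is shown here only because it needs no extra install.
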

Headless Browser

When to use it

When the page is rendered by JavaScript (for example, a React app); a plain request would not return the content you expect
When you need to log in
When you want to automate a website
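A minimal sketch of the JavaScript-rendered case, assuming Selenium is installed and a chromedriver matching your Chrome version is available. The function names and the particular Chrome flags are illustrative choices, not taken from 6_holidays2.py.

```python
def headless_arguments():
    """Chrome command-line flags for running without a visible window."""
    return ["--headless", "--disable-gpu", "--window-size=1280,800"]


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the page source from the browser."""
    # Imported here so this module still loads when Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_arguments():
        options.add_argument(argument)

    # Assumes a chromedriver matching your Chrome version can be found.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `fetch_rendered_html("https://example.com/")` would return the HTML as the browser sees it. Pages that load data after the initial render may need an explicit wait (Selenium provides `WebDriverWait` for this) before reading `page_source`.
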

Example usage

6_holidays2.py
learningsuite (not included)

Notes

Chrome Extension

When to use it

When Selenium doesn't work

Notes

Chrome extensions can do a lot more than scrape
Out of scope for this tutorial

Scraping Framework like Scrapy

When to use it

When you want to run a big operation and scrape many pages at once on multiple threads (for complex projects)
Example: a comparison site
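To give a feel for scraping on multiple threads, here is a small sketch using the standard library's `ThreadPoolExecutor`. It is not Scrapy and does not use Scrapy's API; a framework adds scheduling, retries, and pipelines on top of this idea. The `fake_fetch` function is a stand-in so the example runs offline; in practice you would pass a function that downloads and parses a page.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised, so one failure does not stop the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:
                results[url] = error
    return results


def fake_fetch(url):
    """Hypothetical stand-in for a real download function."""
    if url.endswith("/broken"):
        raise ValueError("simulated failure for " + url)
    return "<html>" + url + "</html>"


pages = scrape_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/broken"],
    fake_fetch,
)
for url, result in pages.items():
    print(url, "->", type(result).__name__)
```

Keep `max_workers` small when all URLs point at the same site, so the extra threads do not turn into too many requests at once.
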
