TechCrunch Web Scraper

Description

Web scraper for TechCrunch and related articles. It scrapes the TechCrunch main page, visits every article, and prints each article's title, publication date, and associated tags. It keeps loading further pages of articles until the user quits the program or no more articles are available. It also uses the GNews API to find related articles and scrapes those as well.
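
As a rough illustration of the per-article extraction described above, here is a minimal sketch using requests and BeautifulSoup. The URL and CSS selectors are assumptions for illustration only, not the project's actual parsing code (the project itself drives a Chrome browser via ChromeDriver):

```python
# Illustrative sketch only: the URL and selectors below are assumptions about
# TechCrunch's markup, not this project's actual scraping logic.
import requests
from bs4 import BeautifulSoup

url = "https://techcrunch.com/2021/01/01/example-article/"  # hypothetical URL
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

title = soup.find("h1").get_text(strip=True) if soup.find("h1") else None
date_tag = soup.find("time")
date = date_tag.get("datetime") if date_tag else None
tags = [a.get_text(strip=True) for a in soup.select("a[href*='/tag/']")]

print(title, date, tags)
```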

Requirements & Installation

Use the package manager pip to install the required packages.

pip3 install -r requirements.txt

Install the ChromeDriver release that matches your Chrome version.

Instructions & Usage Options (Command-Line Interface)

Setting up the database (first-time use)

  1. Clone the repository.
  2. Unzip the downloaded ChromeDriver and place it in your project folder (see the sketch after these steps).
  3. For first-time use, you must set up the database. In your terminal, run the following commands:
mysql -u root -p 
# enter your password
mysql> CREATE SCHEMA techcrunch_cp_2;
# exit mysql
  4. Run the scraper with the make-database option set to True to create the tables.
# Final step
python3 main.py --make_db=True 

# This runs the scraper with its default behavior: it scrapes all articles with 
# no constraints and adds them to the database. 
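
For reference, here is a minimal sketch of how a scraper might point Selenium at the ChromeDriver binary placed in the project folder (step 2 above). The path and options are illustrative assumptions, not necessarily how main.py configures its driver:

```python
# Minimal sketch (assumes Selenium 4 is installed via requirements.txt).
# The driver path and options are illustrative, not the project's actual config.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# ChromeDriver unzipped into the project folder (step 2 above)
service = Service(executable_path="./chromedriver")

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run without opening a browser window

driver = webdriver.Chrome(service=service, options=options)
driver.get("https://techcrunch.com/")
print(driver.title)
driver.quit()
```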

Command Line Interface Usage Options

  --tags TEXT        Option to scrape a subset of tags (separated by commas, no
                     spaces). Default: all 
**Example: python3 main.py --tags=gaming,fintech**

  --authors TEXT     Option to scrape a subset of authors (format:
                     firstname_lastname, separated by commas, no spaces).
                     Default: all 
**Example: python3 main.py --authors=Julian_Willson,Martha_Janes**

  --today BOOLEAN    Option to scrape only today's articles. Default: False

**Example: python3 main.py --today=True**

  --months TEXT      Option to scrape only articles from specified months
                     (number indexes separated by commas, no spaces).
                     Default: all 

**Example: python3 main.py --months=1,2**

  --display TEXT     Option to select which information to display from: tags,
                     title, author, twitter, date, count (separated by commas,
                     no spaces). Default: all 

**Example: python3 main.py --display=tags,title**

  --limit INTEGER    Option to limit number of articles. Default: None
  
**Example: python3 main.py --limit=250**

  --make_db BOOLEAN  Option to initialize the database and create the necessary
                     tables. Set to True the first time you run the scraper.
                     
**Example: python3 main.py --make_db=True**

  --help             Show this message and exit.
# Example:
python3 main.py --display=tags,count,title --limit=10 --months=10,11 --today=False --authors=mary_johnson,john_doe --tags=blockchain,gaming
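
The option listing above resembles Click-style help output. Below is a hedged sketch of how such a CLI might be declared with Click; the structure and defaults are assumptions for illustration, not the project's actual main.py:

```python
# Sketch of how the CLI options might be declared with Click; the actual
# main.py may differ in structure, defaults, and behavior.
import click

@click.command()
@click.option("--tags", default="all", help="Comma-separated subset of tags to scrape.")
@click.option("--authors", default="all", help="Comma-separated authors (firstname_lastname).")
@click.option("--today", default=False, type=bool, help="Scrape only today's articles.")
@click.option("--months", default="all", help="Comma-separated month numbers to scrape.")
@click.option("--display", default="all", help="Comma-separated fields to display.")
@click.option("--limit", default=None, type=int, help="Maximum number of articles to scrape.")
@click.option("--make_db", default=False, type=bool, help="Initialize the database tables.")
def main(tags, authors, today, months, display, limit, make_db):
    """Scrape TechCrunch articles according to the given filters."""
    click.echo(f"tags={tags} authors={authors} today={today} limit={limit}")

if __name__ == "__main__":
    main()
```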

Database ERD

Tables

  • Tags
    1. tag_id: primary key (Integer)
    2. tag_text: the specific tag included in the article (String)
  • Articles
    1. article_id: primary key identifying the article (Integer)
    2. link: link to the article (String)
    3. title: title of the article (String)
    4. date: date when the article was published (Datetime)
  • Authors
    1. author_id: primary key identifying the author (Integer)
    2. full_name: full name of the author (String)
    3. twitter_handle: Twitter handle of the author (String)
  • Article to tags
    1. Join table creating a one-to-many relationship between articles and tags
  • Article to authors
    1. Join table creating a one-to-many relationship between authors and articles
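
For reference, here is a sketch of the kind of table creation --make_db might perform. The column names and types are inferred from the ERD description above, and the snippet assumes a MySQL client library such as pymysql; the project's actual DDL may differ:

```python
# Sketch of the table creation --make_db might perform. Column names and types
# are inferred from the ERD above; the project's actual schema may differ.
# Assumes a MySQL client such as pymysql is installed.
import pymysql

DDL = [
    """CREATE TABLE IF NOT EXISTS tags (
           tag_id INT AUTO_INCREMENT PRIMARY KEY,
           tag_text VARCHAR(255)
       )""",
    """CREATE TABLE IF NOT EXISTS articles (
           article_id INT AUTO_INCREMENT PRIMARY KEY,
           link VARCHAR(512),
           title VARCHAR(512),
           date DATETIME
       )""",
    """CREATE TABLE IF NOT EXISTS authors (
           author_id INT AUTO_INCREMENT PRIMARY KEY,
           full_name VARCHAR(255),
           twitter_handle VARCHAR(255)
       )""",
    """CREATE TABLE IF NOT EXISTS article_to_tags (
           article_id INT,
           tag_id INT,
           FOREIGN KEY (article_id) REFERENCES articles(article_id),
           FOREIGN KEY (tag_id) REFERENCES tags(tag_id)
       )""",
    """CREATE TABLE IF NOT EXISTS article_to_authors (
           article_id INT,
           author_id INT,
           FOREIGN KEY (article_id) REFERENCES articles(article_id),
           FOREIGN KEY (author_id) REFERENCES authors(author_id)
       )""",
]

# Credentials below are placeholders; use your own MySQL login.
conn = pymysql.connect(host="localhost", user="root", password="",
                       database="techcrunch_cp_2")
with conn.cursor() as cur:
    for statement in DDL:
        cur.execute(statement)
conn.commit()
conn.close()
```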
