montyhall/insureData

insureData

insurance data pipelines

Setup

> pip install conda
> cd $PROJECT_ROOT
> conda env create; conda activate insuredata-env

Then install the following tools:

  • tor, privoxy, TorIpChanger
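
Before crawling, it can help to confirm that the local proxies are actually running. The sketch below is not part of this repository; it assumes Tor and Privoxy listen on their default ports (9050 for Tor's SOCKS port, 8118 for Privoxy) on localhost, and the helper names `is_listening` and `check_proxies` are hypothetical. Adjust the ports if your configuration differs.

```python
import socket

# Assumed defaults; change these if torrc or the Privoxy config says otherwise.
TOR_SOCKS_PORT = 9050  # Tor's default SOCKS port
PRIVOXY_PORT = 8118    # Privoxy's default listen port


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_proxies(host: str = "127.0.0.1") -> dict:
    """Report whether the assumed Tor and Privoxy ports accept connections."""
    return {
        "tor": is_listening(host, TOR_SOCKS_PORT),
        "privoxy": is_listening(host, PRIVOXY_PORT),
    }
```

Calling `check_proxies()` returns a dictionary with a boolean per service. A `False` entry only means that nothing accepted a connection on the assumed port; it does not verify that traffic is really routed through Tor.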

Steps

We use Scrapy to scrape the EDGAR site. Set $CRAWLER_ROOT to the directory that contains the crawler.

Generate crawl seeds

Crawl all the state and city links from the Yellow Pages sitemap page:

> cd $PROJECT_ROOT/crawler
> scrapy crawl yp_locations -a statsFile=cities_stats.csv -a seedsFile=seeds.json

Next, crawl each city:

> scrapy crawl yp_insurance \
-a seedsFile='seeds/seeds.json' \
-a searchTerm=insurance \
-a statsFile=stats.json \
-a failedFile=failed.txt \
-o data.json
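
The two crawl steps above can also be driven from one script. The following is a minimal sketch, not code from this repository: the helper names `build_crawl_command` and `run_pipeline` are hypothetical, and the spider names and `-a` arguments are copied verbatim from the commands above. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List, Mapping, Optional


def build_crawl_command(
    spider: str,
    spider_args: Mapping[str, str],
    output: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `scrapy crawl <spider> -a key=value ... [-o output]`."""
    cmd = ["scrapy", "crawl", spider]
    for key, value in spider_args.items():
        cmd += ["-a", f"{key}={value}"]
    if output is not None:
        cmd += ["-o", output]
    return cmd


def run_pipeline(crawler_dir: str, dry_run: bool = True) -> List[List[str]]:
    """Print both crawl commands in order; execute them unless dry_run is set."""
    commands = [
        # Step 1: generate the crawl seeds.
        build_crawl_command(
            "yp_locations",
            {"statsFile": "cities_stats.csv", "seedsFile": "seeds.json"},
        ),
        # Step 2: crawl each city listed in the seeds file.
        build_crawl_command(
            "yp_insurance",
            {
                "seedsFile": "seeds/seeds.json",
                "searchTerm": "insurance",
                "statsFile": "stats.json",
                "failedFile": "failed.txt",
            },
            output="data.json",
        ),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a crawl exits with an error.
            subprocess.run(cmd, cwd=crawler_dir, check=True)
    return commands
```

For example, `run_pipeline("crawler")` prints both commands without running them, and `run_pipeline("crawler", dry_run=False)` executes them inside the crawler directory. Note that Scrapy's `-o` option appends to an existing output file, so it is safer to remove or rename an old `data.json` before re-running the second step.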
