insurance data pipelines
> pip install conda
> cd $PROJECT_ROOT
> conda env create; conda activate insuredata-env
Then install tor, privoxy, and TorIpChanger.
We use Scrapy to scrape the EDGAR site. Set $CRAWLER_ROOT to the path of the crawler directory.
First, crawl all the state and city links from the Yellow Pages sitemap page:
> cd $PROJECT_ROOT/crawler
> scrapy crawl yp_locations -a statsFile=cities_stats.csv -a seedsFile=seeds.json
Next, crawl each city:
> scrapy crawl yp_insurance \
-a seedsFile='seeds/seeds.json' \
-a searchTerm=insurance \
-a statsFile=stats.json \
-a failedFile=failed.txt \
-o data.json
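Once the crawl finishes, a quick sanity check of the output can be useful. The sketch below is not part of this project: `summarize_items` is a hypothetical helper, and it assumes only that `data.json` is the JSON array of item objects that Scrapy's JSON feed export writes. It makes no assumption about which fields the `yp_insurance` spider emits; it just counts the items and reports how often each field name appears.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_items(path):
    """Return (item_count, field_counts) for a Scrapy JSON feed file.

    Hypothetical helper, not part of the crawler. Assumes the file holds
    a single JSON array of objects, one object per scraped item.
    """
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    field_counts = Counter()
    for item in items:
        # Count how many items carry each field name.
        field_counts.update(item.keys())
    return len(items), field_counts
```

Calling `summarize_items("data.json")` from the crawler directory returns the item count and a `Counter` of field names. A field that appears in far fewer items than the total may point to pages that failed to parse.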