DesiQuant: News Scraper

A scrapy crawler that scrapes market news from Indian financial news outlets A scrapy crawler that scrapes market news from Indian financial news outlets

Publisher	Sitemap Type
News 18	Daily
The Hindu	Daily
The Hindu Business Line	Daily
Business Today	Daily
Money Control	Monthly
Business Standard	Monthly
Economic Times	Monthly
Firstpost	Daily
NDTV Profit	Daily
Free Press Journal	Daily
Outlook India	Daily
Zee News	Monthly
Financial Express	Daily
Indian Express	Daily
CNBCTV18	Daily

Data Usage

You can readily consume the data without setting up any infrastructure. The data is publicly accessible and updated every 1 hour. The entire data dump is around 5 GB and increases by ~10 MB everyday.

Data Files

The entire consolidated data is accessible at s3://desiquant/data/news.parquet

The individual news publisher data files are accessible at s3://desiquant/data/news/cnbctv18.csv. To view a list of all available publishers you can check out below

Pandas Usage

import s3fs
import pandas as pd

df = pd.read_parquet("s3://desiquant/data/news.parquet", storage_options={
    "key": "sceN1eFOJQmBIWHNEMd8",
    "secret": "w1BERx7F6LTe87sk9K9deoBcfYXNCwlol5xcLeev",
    "endpoint_url": "http://data.desiquant.com:9000",
})
df

Data Preview

title	author	description	url	article_text	date_modified	date_published	scrapy_scraped_at	scrapy_parsed_at
These smallcaps gain 10-50% despite the underperformance...	Rakesh Patil	A sustainable plunge below 24500 is likely to provide some more respite...	https://www.moneycontrol.com/news/business/markets/these-smallcaps-gain-10-50-despite-broader-indices-underperform-12773539.html	The broader indices underperformed the main indices...	2024-07-20T12:07:07+05:30	2024-07-20T12:05:48+05:30	Sun, 21 Jul 2024 17:09:08 GMT	2024-07-21T17:12:27.979827
Budget Buzz: Quantum AMC’s Chirag Mehta expects a populist...	Anishaa Kumar	Mehta says that the impact of earlier government spends has been skewed...	https://www.moneycontrol.com/news/business/markets/budget-buzz-quantum-amcs-chirag-mehta-expects-a-populist-budget-airs-his-mf-tax-wishlist-12773526.html	Talking about the upcoming budget, Quantum AMC's Chief Investment Officer (CIO)...	2024-07-20T13:18:33+05:30	2024-07-20T13:17:21+05:30	Sun, 21 Jul 2024 17:12:27 GMT	2024-07-21T17:12:28.985228
Digital curbs impact Kotak Bank's unsecured loan book growth...	Lovisha Darad	Kotak Mahindra Bank reported a 20 basis points (bps) sequential drop...	https://www.moneycontrol.com/news/business/markets/digital-curbs-impact-kotak-banks-unsecured-loan-book-growth-margin-in-q1fy25-12773719.html	As Kotak Mahindra Bank marks three months of facing a ban...	2024-07-20T18:55:10+05:30	2024-07-20T18:55:10+05:30	Sun, 21 Jul 2024 17:12:27 GMT	2024-07-21T17:12:29.051476

Local Development

Setup

Install all python dependencies

pip install -e . # editable mode

To scrape all market articles from MoneyControl, you can run the following spider. By default, only the articles from the last week are scraped.

scrapy crawl moneycontrol

That's it! You can now view the scraped news articles at data/outputs/moneycontrol.csv. The articles are saved in CSV format in a structured format like:

{
  "article_text": "As Kotak Mahindra Bank marks three months of facing a ban on digital onboarding of customers, the private sector lender highlighted the impact the ban had on its unsecured loan book growth and margins during the April-June quarter (Q1FY25). \"As I mentioned in the last quarter's results, the RBI order would affect our 811 and credit card businesses. This has had some impact on unsecured book growth and consequently on net interest margin (NIM). However, we believe that when the embargo is lifted, we will come out even more strongly. If you take out the impact on the unsecured businesses and 811, the rest of the business grew very well,\" the bank's management said in their earnings conference call. In the June-ended quarter, Kotak Mahindra Bank reported a 20 basis points (bps) sequential drop in its unsecured loan book growth to 11.6 percent from 11.8 percent. On the other hand, NIM remained flat quarter-on-quarter at 5.02 percent in Q1FY24, but was down 55 bps year-on-year from 5.57 percent in Q1FY23. On April 24, 2024, the Reserve Bank of India barred Kotak Mahindra Bank from taking on new customers via its online and mobile banking channels and from issuing new credit cards. The central bank took this action after examining the country's fourth-largest private lender's IT systems in 2022 and 2023 and finding concerns that Kotak failed to adequately address information technology-related drawbacks. When asked about the progress made in improving its IT systems, Kotak Bank said they are committed to completing the process efficiently. \"It is hard to predict when the RBI will approve the complete overhaul of our technology systems. But, we remain committed to finishing everything and are working on this with great determination. We will continue to perfect our tech systems with incredible gusto and mitigate this impact as soon as possible,\" the management stated. ",
  "author": "Lovisha Darad",
  "date_modified": "2024-07-20T18:55:10+05:30",
  "date_published": "2024-07-20T18:55:10+05:30",
  "description": "Kotak Mahindra Bank reported a 20 basis points (bps) sequential drop in its unsecured loan book growth to 11.6 percent  in Q1FY25 from 11.8 percent",
  "scrapy_parsed_at": "2024-07-21T17:12:29.051476",
  "scrapy_scraped_at": "Sun, 21 Jul 2024 17:12:27 GMT",
  "title": "Digital curbs impact Kotak Bank's unsecured loan book growth, margin in Q1FY25",
  "url": "https://www.moneycontrol.com/news/business/markets/digital-curbs-impact-kotak-banks-unsecured-loan-book-growth-margin-in-q1fy25-12773719.html"
}

Testing

The tests check that data is correctly pulled from news article webpages. When a website structure changes, the tests should ideally fail so that we can promptly make updates to the spiders.

pip install .[test] # install all test deps
pytest # test all spiders

Scrapy CLI

Here are some useful built-in scrapy commands:

scrapy list # view a list of all available spiders
scrapy bench # view scraping benchmark tests performed by scrapy

# view the parsed article output
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11

Production

Deploying the spider in production involves the following steps:

Configure Settings: All the relevant data generated by scrapy is stored in the data folder which can be set in scrapy.cfg. Moreover, the following custom scrapy settings are available in settings.py

Setting	Description	Default
`SKIP_OUTPUT_URLS`	Skips fetching and parsing of articles that already exist in the output of a spider.	`True`
`USE_FLOATING_IPS`	Uses all available floating IPs on `eth0` to prevent IP-based blocks while scraping.	`True`
`DATE_RANGE`	Gathers and scrapes articles within the specified date range from the sitemap.	`(datetime.today(), datetime.now())`
`SCRAPE_MODE`	Dump the entire date_range or Check for new articles in the date_range	`"update"`

Infrastructure:
- Create a server.
- Mount a 100 GB+ volume to save output and cache data. Point the above mentioned data directory to this volume.
- Attach 40+ floating IPs. These IPs will automatically be used by the scraper.
Run Scraper - Start the scraping by running python scrape.py. The logs are saved at scrape.log
Verify Output - You should see the articles saved in data/outputs folder. Timely check the scrape.log to verify that the scraper is running as expected. That's it!

You also can seamlessly integrate scrapy into an ETL pipeline with frameworks like Prefect to maintain an updated dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
.github/workflows		.github/workflows
.vscode		.vscode
examples		examples
news_scraper		news_scraper
tests		tests
.gitignore		.gitignore
NOTES.md		NOTES.md
README.md		README.md
pytest.ini		pytest.ini
scrape.py		scrape.py
scrapy.cfg		scrapy.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DesiQuant: News Scraper

Data Usage

Data Files

Pandas Usage

Data Preview

Local Development

Setup

Testing

Scrapy CLI

Production

About

Releases

Packages

Contributors 4

Languages

desiquant/news_scraper

Folders and files

Latest commit

History

Repository files navigation

DesiQuant: News Scraper

Data Usage

Data Files

Pandas Usage

Data Preview

Local Development

Setup

Testing

Scrapy CLI

Production

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages