NewsCrawler

NewsCrawler is a Python-based web scraping tool that extracts news articles from a variety of sources. To retrieve content that sits behind paywalls and anti-bot measures, it combines several techniques: Google Cache, Selenium in stealth mode, and Archive.is.
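As a rough illustration of how cached or archived copies of an article can be addressed, the sketch below builds lookup URLs for Google Cache and Archive.is. The function names are hypothetical and the URL patterns are assumptions based on the publicly known formats of these services, not NewsCrawler's actual implementation. Either service may change or retire these endpoints.

```python
from urllib.parse import quote


def google_cache_url(url: str) -> str:
    """Build a Google Cache lookup URL (assumed pattern, hypothetical helper)."""
    # The article URL is percent-encoded so it survives as a query value.
    return "https://webcache.googleusercontent.com/search?q=cache:" + quote(url, safe="")


def archive_is_url(url: str) -> str:
    """Build an Archive.is 'newest snapshot' URL (assumed pattern, hypothetical helper)."""
    return "https://archive.is/newest/" + url


article = "https://example.com/news/story"
print(google_cache_url(article))
print(archive_is_url(article))
```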

Features

Multiple Parsing Methods: Fetches articles through Google Cache, Selenium in stealth mode, Archive.is, and direct requests.
HTML Validation: Checks the integrity of downloaded content and filters out insufficient or irrelevant data.
Dynamic News Source Handling: Uses a custom NewsUrlGetter to fetch news URLs dynamically for the specified topics.
Robust Error Handling: Raises custom exceptions for HTML validation and download errors.
Extensible Design: Can be extended with additional news sources or parsing methods.
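The features above suggest a fallback pattern: try each fetch method in turn, validate the HTML, and raise a custom exception if nothing works. The following is a minimal sketch of that idea, not the project's actual code. The exception names, the `validate_html` length threshold, and the stub fetchers are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple


class HtmlValidationError(Exception):
    """Raised when downloaded HTML is empty or too short (hypothetical name)."""


class DownloadError(Exception):
    """Raised when every fetch method has failed (hypothetical name)."""


def validate_html(html: Optional[str], min_length: int = 500) -> str:
    """Reject missing or suspiciously short pages; min_length is an assumed threshold."""
    if not html or len(html.strip()) < min_length:
        raise HtmlValidationError("page is empty or too short")
    return html


def fetch_with_fallback(
    url: str, fetchers: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try each (name, fetcher) pair in order and return the first valid page."""
    errors = []
    for name, fetch in fetchers:
        try:
            return name, validate_html(fetch(url))
        except Exception as exc:  # a real crawler would catch narrower exceptions
            errors.append(f"{name}: {exc}")
    raise DownloadError(f"all methods failed for {url}: " + "; ".join(errors))


# Stub fetchers standing in for direct requests, a cache lookup, and an archive lookup.
def blocked(url: str) -> str:
    raise ConnectionError("403 Forbidden")


def too_short(url: str) -> str:
    return "<html></html>"


def ok(url: str) -> str:
    return "<html>" + "x" * 600 + "</html>"


method, html = fetch_with_fallback(
    "https://example.com/news/story",
    [("direct", blocked), ("cache", too_short), ("archive", ok)],
)
print(method)
```

Adding a new parsing method in this design only means appending another `(name, fetcher)` pair to the list, which is one way the extensibility described above could be achieved.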

Dependencies

Python 3.x
requests
selenium
newspaper3k
selenium-stealth
beautifulsoup4

Ensure you have Chrome WebDriver installed and accessible in your system's PATH for Selenium to function properly.
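A quick way to confirm the driver is reachable before running the crawler is to look it up on the PATH. This is a small sketch using only the standard library. The helper name `find_chromedriver` is hypothetical, and the executable name `chromedriver` is assumed to match your installation.

```python
import shutil
from typing import Optional


def find_chromedriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH; Selenium-based fetching will likely fail")
else:
    print(f"chromedriver found at {path}")
```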

Installation

Clone the repository:

git clone https://github.com/yourgithubusername/newscrawler.git

Install the required Python packages:

pip install -r requirements.txt

Usage

To use NewsCrawler, instantiate the NewsParser class with a NewsUrlGetter and optional parameters such as headless browsing, then call the get_news method with your topic of interest. In the example below, the NewsUrlGetter limits the number of results and the date range:

from newscrawler import NewsParser, NewsUrlGetter

# Configure which URLs to collect (result cap and date range)
url_getter = NewsUrlGetter(
    max_results=20,
    start_date=(2023, 1, 20),
    end_date=(2023, 12, 25),
)

# Initialize the NewsParser with the URL getter and headless browsing
news_parser = NewsParser(url_getter, headless=True)

# Fetch news articles about "Interest rates"
articles = news_parser.get_news("Interest rates")

Contributing

Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
