This project is a Python-based web scraper built on Selenium WebDriver. It retrieves elements by class name or ID, handles pagination, and is configured through a JSON file for flexible usage.
- Scrapes data from web pages based on class names or IDs
- Handles paginated web pages
- Configurable through a JSON file
- Supports headless mode for background execution
- Parallel processing for faster data retrieval
- Logs errors and progress
- Python 3.x
- Google Chrome browser
- ChromeDriver compatible with your version of Chrome
- Required Python packages (see below)
- Clone the repository:

  ```bash
  git clone https://github.com/gtrtuugii/python-web-scraper.git
  cd python-web-scraper
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
- Download ChromeDriver and place it in your PATH, or specify its path in the `config.json` file.
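Before running the scraper, it can help to confirm that a usable driver binary is actually where the configuration points. A minimal sketch using only the standard library (the `check_driver` helper is illustrative, not part of the project):

```python
import os
import shutil

def check_driver(driver_path: str) -> bool:
    """Return True if a usable chromedriver binary can be found.

    Checks the configured path first, then falls back to searching PATH
    (covering the "place it in your PATH" installation option).
    """
    if driver_path and os.path.isfile(driver_path) and os.access(driver_path, os.X_OK):
        return True
    return shutil.which("chromedriver") is not None
```

If this returns `False`, Selenium will fail at startup with a driver-not-found error, so it is worth checking early.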
The scraper reads its settings from a configuration file, `config.json`. An example configuration is provided below:
```json
{
  "driver_path": "path/to/chromedriver",
  "implicit_wait_time": 10,
  "base_url": "https://www.playhq.com/basketball-victoria/org/melbourne-central-basketball-association/sunday-cyms-senior-domestic-summer-202324/sunday-senior-men-a/a112a9d0/R{}",
  "pagination_pattern": "{}",
  "output_file": "output.csv",
  "headless": true
}
```
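The scraper presumably loads these values at startup; a minimal sketch of such a loader, filling in defaults for optional keys (the `load_config` helper and the chosen defaults are illustrative, not the project's actual code):

```python
import json

# Assumed defaults for optional keys; required keys (driver_path,
# base_url) must come from the file itself.
DEFAULTS = {
    "implicit_wait_time": 10,   # seconds Selenium waits for elements
    "output_file": "output.csv",
    "headless": True,
}

def load_config(path: str = "config.json") -> dict:
    """Read the JSON config and merge it over the defaults."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    return {**DEFAULTS, **config}
```

Merging over a defaults dict keeps the config file short: only keys that differ from the defaults need to be listed.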
Run the scraper with:

```bash
python webscraper.py
```
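The `{}` placeholder in `base_url`, together with `pagination_pattern`, suggests that page URLs are built by string substitution. A sketch of how the per-page URLs might be generated (this is assumed behavior; the actual logic lives in `webscraper.py`, and `example.com` is a stand-in URL):

```python
def page_urls(base_url: str, pagination_pattern: str, num_pages: int):
    """Yield one URL per page by filling the {} placeholder in base_url.

    pagination_pattern ("{}" in the example config) is formatted with the
    1-based page number, and the result is substituted into base_url.
    """
    for page in range(1, num_pages + 1):
        yield base_url.format(pagination_pattern.format(page))

# With a base_url ending in "R{}", pages become ...R1, ...R2, ...
for url in page_urls("https://example.com/R{}", "{}", 3):
    print(url)
```

Keeping the pattern separate from the base URL lets sites with other schemes (e.g. `?page=N`) be supported by changing only the config.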