VEJA News Scraper

This program was developed as part of an extension program at the Laboratory of Neologism at UFES, for non-commercial purposes. Its main objective is to build a corpus for identifying neologisms in Brazilian Portuguese.

Installation

Clone this repository to your local machine:

```
git clone https://github.com/ivarejao/veja-news-scraper.git
```

Change into the project directory:
```
cd veja-news-scraper
```
Install the required Python packages using pip:
```
pip install -r requirements.txt
```
Create a .env file in the src directory with the following environment variables:
- VEJA_EMAIL: Your VEJA account email.
- VEJA_PASSWORD: Your VEJA account password.
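
As an illustration of how these two variables might be consumed, the following is a minimal sketch that reads a `.env` file using only the Python standard library. It assumes the file holds plain `KEY=VALUE` lines; the helper names `load_env` and `require_credentials` are hypothetical and are not taken from this project, which may load the file differently.

```python
from pathlib import Path

# Variables the README asks for in src/.env
REQUIRED_KEYS = ("VEJA_EMAIL", "VEJA_PASSWORD")


def load_env(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict (assumed format)."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_credentials(values):
    """Return (email, password), or raise if either variable is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise KeyError(f"Missing in .env: {', '.join(missing)}")
    return values["VEJA_EMAIL"], values["VEJA_PASSWORD"]
```

With a file containing `VEJA_EMAIL=...` and `VEJA_PASSWORD=...`, calling `require_credentials(load_env("src/.env"))` would return the pair, and it fails early with a clear message if one of them is absent.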

Usage

Step 1: Collect Links

Run the get_links.py script to collect links to VEJA news articles.

```
python get_links.py --headless --sector <sector> --time-range <start_year,end_year> --data-path <data_directory>
```

- --headless: If this flag is set, the program runs without a GUI.
- --sector: The sector whose links are collected. By default, links are collected from all sectors on the site.
- --time-range: The range of years for which links are collected. The default is 2008 to 2023.
- --data-path: The root directory where the links are stored. The default is ./data, relative to the current working directory.
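
To make the options above concrete, here is a sketch of how a command-line interface with these flags and defaults could be declared with `argparse`. It is not the actual source of get_links.py; the function names `build_parser` and `parse_time_range` are hypothetical, and the comma-separated year format is taken from the command shown above.

```python
import argparse


def parse_time_range(text):
    """Turn 'start_year,end_year' into a (start, end) tuple of ints."""
    # A malformed value raises ValueError, which argparse reports as a usage error.
    start, end = (int(part) for part in text.split(","))
    if start > end:
        raise argparse.ArgumentTypeError("start year must not exceed end year")
    return start, end


def build_parser():
    """Declare the four documented options with their documented defaults."""
    parser = argparse.ArgumentParser(prog="get_links.py")
    parser.add_argument("--headless", action="store_true",
                        help="run without a GUI")
    parser.add_argument("--sector", default=None,
                        help="sector to collect; None stands for all sectors")
    parser.add_argument("--time-range", type=parse_time_range,
                        default=(2008, 2023),
                        help="years to cover, written as start_year,end_year")
    parser.add_argument("--data-path", default="./data",
                        help="root directory where the links are stored")
    return parser
```

Parsing `["--headless", "--time-range", "2010,2012"]` with this parser yields `headless=True`, `time_range=(2010, 2012)`, and the defaults for the remaining options.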

Step 2: Generate News

After collecting the links, run the generate_news.py script to download the news articles.

```
python generate_news.py --sector <sector> --time-range <start_year:end_year> --data-path <data_directory>
```

- --sector: The sector whose news articles are downloaded. By default, news is collected from all sectors on the site.
- --time-range: The range of years for which news articles are collected. The default is 2008 to 2023.
- --data-path: The root directory where the links are stored and where the news files will be saved. The default is ./data, relative to the current working directory.
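
Since the two steps share most of their options, they can be chained from a small driver. The sketch below only assembles the two command lines as documented in this README; `build_commands` and `run_pipeline` are hypothetical helpers, and the sector value used in the example is a placeholder. Note that the commands above write the year range with a comma for get_links.py and with a colon for generate_news.py; the sketch mirrors each as written, so verify the separators against the scripts before relying on it.

```python
import subprocess


def build_commands(sector, start_year, end_year, data_path="./data", headless=True):
    """Return the argument lists for step 1 (links) and step 2 (news)."""
    links_cmd = ["python", "get_links.py"]
    if headless:
        links_cmd.append("--headless")
    links_cmd += [
        "--sector", sector,
        # Step 1 is documented with a comma between the years.
        "--time-range", f"{start_year},{end_year}",
        "--data-path", data_path,
    ]
    news_cmd = [
        "python", "generate_news.py",
        "--sector", sector,
        # Step 2 is documented with a colon between the years.
        "--time-range", f"{start_year}:{end_year}",
        "--data-path", data_path,
    ]
    return links_cmd, news_cmd


def run_pipeline(sector, start_year, end_year, data_path="./data"):
    """Run both steps in order, stopping if the first one fails."""
    for cmd in build_commands(sector, start_year, end_year, data_path):
        subprocess.run(cmd, check=True)
```

For example, `build_commands("politica", 2010, 2012)` returns the two lists without running anything, which is convenient for checking the arguments before starting a long scrape.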

License

This project is licensed under the MIT License. See the LICENSE file for details.
