This program was developed for non-commercial purposes as part of an extension program at the Laboratory of Neologism at UFES. Its main objective is to build a corpus for identifying neologisms in Brazilian Portuguese.
- Clone this repository to your local machine:
git clone https://github.com/ivarejao/veja-news-scraper.git
- Change into the project directory:
cd veja-news-scraper
- Install the required Python packages using pip:
pip install -r requirements.txt
- Create a .env file in the src directory with the following environment variables:
  VEJA_EMAIL: Your VEJA account email.
  VEJA_PASSWORD: Your VEJA account password.
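How the scripts read these variables is not described here. As a rough illustration only, the sketch below assumes the .env file holds plain KEY=VALUE lines and reads the two variables with the standard library. The helper names parse_env and load_credentials, and the placeholder values, are hypothetical and not part of this project.

```python
from pathlib import Path

# Hypothetical example of what src/.env might contain (placeholder values).
EXAMPLE_ENV = """\
# VEJA account credentials
VEJA_EMAIL=reader@example.com
VEJA_PASSWORD='change-me'
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_credentials(env_path: Path) -> tuple[str, str]:
    """Return (email, password) from the given .env file, or raise KeyError."""
    values = parse_env(env_path.read_text(encoding="utf-8"))
    missing = [k for k in ("VEJA_EMAIL", "VEJA_PASSWORD") if not values.get(k)]
    if missing:
        raise KeyError(f"Missing in {env_path}: {', '.join(missing)}")
    return values["VEJA_EMAIL"], values["VEJA_PASSWORD"]
```

Calling load_credentials(Path("src/.env")) would then return the email and password, failing early with a clear message if either variable is missing.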
Run the get_links.py script to collect links to VEJA news articles:
python get_links.py --headless --sector <sector> --time-range <start_year,end_year> --data-path <data_directory>
--headless: If this flag is set, the program runs without a GUI.
--sector: Specify the sector to be listed. The default is to collect links from all sectors on the site.
--time-range: Define the time range for collecting links. The default is from 2008 to 2023.
--data-path: Set the root directory where the links will be stored. The default is ./data, relative to the current working directory.
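The options above can be pictured as an argparse interface. The following is a minimal sketch under stated assumptions, not the script's actual code: the function names are hypothetical, the comma separator for --time-range follows the usage line above, and using None to mean "all sectors" is an assumption.

```python
import argparse


def parse_time_range(value: str) -> tuple[int, int]:
    """Turn 'start_year,end_year' into a pair of ints (comma assumed)."""
    start, end = (int(part) for part in value.split(","))
    if start > end:
        raise argparse.ArgumentTypeError("start_year must not exceed end_year")
    return start, end


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="get_links.py")
    parser.add_argument("--headless", action="store_true",
                        help="run without a GUI")
    # Assumption: None stands for "collect from all sectors".
    parser.add_argument("--sector", default=None,
                        help="sector to list (default: all sectors)")
    parser.add_argument("--time-range", type=parse_time_range,
                        default=(2008, 2023),
                        help="start_year,end_year (default: 2008,2023)")
    parser.add_argument("--data-path", default="./data",
                        help="root directory for the collected links")
    return parser
```

For instance, build_parser().parse_args(["--headless", "--time-range", "2010,2012"]) yields headless=True and time_range=(2010, 2012), with the other options left at their defaults.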
After collecting the links, run the generate_news.py script to download the news articles:
python generate_news.py --sector <sector> --time-range <start_year:end_year> --data-path <data_directory>
--sector: Specify the sector to be listed. The default is to collect news from all sectors on the site.
--time-range: Define the time range for collecting news articles. The default is from 2008 to 2023.
--data-path: Set the root directory where the links are stored and where the news files will be saved. The default is ./data, relative to the current working directory.
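Since the second script reads the links written by the first, the two steps are meant to be run in order with the same sector, time range, and data path. The sketch below shows one way to chain them from Python. It is illustrative only: build_commands and run_pipeline are hypothetical helpers, the sector name is a placeholder, and the separators (comma for get_links.py, colon for generate_news.py) simply mirror the two usage lines as written in this README.

```python
import subprocess
import sys


def build_commands(sector: str, start_year: int, end_year: int,
                   data_path: str = "./data") -> list[list[str]]:
    """Build the argument lists for both steps, in execution order."""
    links_cmd = [
        sys.executable, "get_links.py", "--headless",
        "--sector", sector,
        "--time-range", f"{start_year},{end_year}",  # comma, per usage line
        "--data-path", data_path,
    ]
    news_cmd = [
        sys.executable, "generate_news.py",
        "--sector", sector,
        "--time-range", f"{start_year}:{end_year}",  # colon, per usage line
        "--data-path", data_path,
    ]
    return [links_cmd, news_cmd]


def run_pipeline(sector: str, start_year: int, end_year: int,
                 data_path: str = "./data") -> None:
    """Run link collection, then article download; stop on the first failure."""
    for cmd in build_commands(sector, start_year, end_year, data_path):
        subprocess.run(cmd, check=True)


# Example (placeholder sector name):
# run_pipeline("economia", 2020, 2021)
```

Using check=True means a failure in link collection stops the run before the download step starts.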
This project is licensed under the MIT License. See the LICENSE file for details.