This repository contains scripts for extracting, processing, and crawling data related to DiGA (Digitale Gesundheitsanwendung, German for "digital health application") and study data. Setup and usage of each script are described below.
- Clone the repository:
git clone https://github.com/suriija/digaCrawler.git
cd digaCrawler
- Install the required dependencies:
pip install pandas selenium webdriver-manager
The sqlite3 module is part of the Python standard library, so it does not need to be installed with pip.
This script is used for extracting and transforming study data from Excel files.
- Ensure your Excel files are named and formatted correctly.
- Modify the excel_file_path and sheet_name variables in the script to match your file and sheet names.
- Run the script:
python study_data_extraction_and_transformation.py
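As a rough illustration of how excel_file_path and sheet_name might be used, the sketch below reads one sheet with pandas and normalizes the column headers. The helper names clean_column_name and load_study_sheet are hypothetical and not taken from the script, and the header normalization is an assumed example of a "transformation" step; reading .xlsx files with pandas also requires an engine such as openpyxl.

```python
from pathlib import Path


def clean_column_name(name) -> str:
    """Trim a header cell, lowercase it, and join its words with underscores."""
    return "_".join(str(name).strip().lower().split())


def load_study_sheet(excel_file_path: str, sheet_name: str):
    """Read one sheet into a DataFrame with normalized column names.

    Hypothetical helper: the real script may structure this differently.
    """
    path = Path(excel_file_path)
    if not path.is_file():
        # Fail early with a clear message if the file name is wrong.
        raise FileNotFoundError(f"Excel file not found: {path}")

    # Imported here so the path check above works even without pandas installed.
    import pandas as pd

    df = pd.read_excel(path, sheet_name=sheet_name)
    df.columns = [clean_column_name(c) for c in df.columns]
    return df
```

A call such as load_study_sheet("studies.xlsx", "Sheet1") would then return the sheet as a DataFrame; both argument values here are placeholders.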
This script is used for extracting and processing DiGA data and importing it into an SQLite database.
- Ensure your Excel files are named and formatted correctly.
- Modify the excel_file, sheet_name, database_file, and other relevant variables in the script to match your files and database.
- Run the script:
python diga_data_extraction_and_processing.py
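The import step can be pictured roughly as follows. This is a minimal sketch using only the standard-library sqlite3 module, assuming the Excel rows are already available as a list of dictionaries with identical keys; the function name import_rows, the table name, and the column names are hypothetical and not taken from the script.

```python
import sqlite3


def import_rows(database_file: str, table: str, rows: list) -> int:
    """Create `table` if needed and insert `rows` (dicts with identical keys).

    Returns the number of rows inserted. Table and column names are quoted,
    but they should still come from trusted input.
    """
    if not rows:
        return 0

    columns = list(rows[0].keys())
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)

    conn = sqlite3.connect(database_file)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
            conn.executemany(
                f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )
    finally:
        conn.close()
    return len(rows)
```

For example, import_rows("diga.db", "apps", [{"name": "Example App", "status": "listed"}]) would create an apps table with two columns and insert one row; all of these values are placeholders.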
This script is designed to scrape information from a DiGA (Digitale Gesundheitsanwendung) website using Selenium. It extracts data related to various health apps and their details.
The Chrome and ChromeDriver versions must be compatible for Selenium to operate without errors. Find a driver version that is compatible with your installed Chrome version using the JSON endpoints listed at https://github.com/GoogleChromeLabs/chrome-for-testing#json-api-endpoints, then specify the latest_release_url and driver_version parameters in:
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager(
latest_release_url='https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json',
driver_version='124.0.6367.91').install()), options=options)
Here, the driver version chosen is 124.0.6367.91.
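Picking a compatible version by hand can be error-prone, so the sketch below shows one way to check it programmatically. It assumes that a matching major version is sufficient for compatibility and that the parsed JSON exposes a "channels" mapping whose entries each carry a "version" string; both points are assumptions to verify against the endpoint you use. The function names are hypothetical and the version numbers in the example are illustrative only.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version such as '124.0.6367.91'."""
    return int(version.split(".")[0])


def find_matching_driver(channels: dict, chrome_version: str):
    """Return the first listed version whose major version matches Chrome's.

    `channels` is assumed to look like {"Stable": {"version": "..."}, ...}.
    Returns None when no channel matches.
    """
    wanted = major_version(chrome_version)
    for info in channels.values():
        if major_version(info["version"]) == wanted:
            return info["version"]
    return None


# Illustrative data only; real values come from the JSON endpoint.
example_channels = {
    "Stable": {"version": "124.0.6367.91"},
    "Beta": {"version": "125.0.6422.4"},
}
print(find_matching_driver(example_channels, "124.0.6367.60"))  # 124.0.6367.91
```

The returned string could then be passed as driver_version. If the function returns None, none of the listed versions shares a major version with the installed Chrome, and a different endpoint or a Chrome update may be needed.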
- Download the Jupyter Notebook file to your local machine.
- Open the notebook using Jupyter Notebook or Jupyter Lab.
- Run each cell in the notebook sequentially. You can do this by clicking the "Run" button or using the shortcut Shift + Enter.
- The notebook extracts information from the DiGA website and displays the results in the output of the respective cells. The extracted data is also collected in a DataFrame and saved to a CSV file.