This project contains scripts for setting up MariaDB databases and generating synthetic data for educational institutions using local or cloud-based LLMs.
This project is organized into four main components:
Contains original seed data files (Excel/CSV) used as templates for generating synthetic data. These files provide the structure and examples for cohort, course, and financial aid data.
All database setup, management, and testing utilities:
- Database creation and table setup
- Record counting and summary generation
- Database testing and verification scripts
- Prediction table creation
Scripts for generating and managing LLM-based student recommendations:
- Student readiness assessments
- LLM recommendation table management
- Progress monitoring and viewing recommendations
School-specific synthetic data generation scripts organized by institution:
- Individual school data generators (AL, CSUSB, KCTCS, KY, OH)
- Shared configuration and utilities
- Master scripts for bulk generation
Installation:
- Windows:
- Go to ollama.ai and download the Windows installer
- Run the installer and follow setup instructions
- Alternative:
winget install Ollama.Ollama
Start Ollama Service:
ollama serveInstall Mistral Model:
ollama pull mistralSystem Requirements:
- RAM: At least 8GB (16GB recommended)
- Storage: 4-8GB for model files
- CPU: Any modern CPU (more cores = faster generation)
Requirements:
- AWS account with Bedrock access
- IAM user with
bedrock:InvokeModelpermissions - AWS CLI configured with valid credentials
Environment Variables:
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=your_region
This project uses MariaDB as the database system. You can run it locally or connect to a remote instance.
Windows:
- Download MariaDB from mariadb.org
- Run the installer and follow setup instructions
- Note the root password you set during installation
- MariaDB will run as a Windows service
Verify Installation:
mysql --versionConnect to MariaDB:
mysql -u root -pConnect to an existing MariaDB or MySQL server by configuring the connection details in .env.
Create virtual environment:
python -m venv venv
.\venv\Scripts\Activate.ps1Install required packages:
pip install -r requirements.txtConfigure database connection in .env:
DB_HOST=localhost # Use 'localhost' for local MariaDB or remote host IP
DB_USER=root # Your database username
DB_PASSWORD=your_password # Your database password
DB_PORT=3306 # Default MariaDB/MySQL port
This project supports both local and cloud-based database development. Choose the option that best fits your development needs:
Advantages:
- Fast Development Cycle: No network latency for database operations
- Offline Development: Work without internet connectivity
- Full Control: Complete control over database configuration and data
- Cost-Effective: No cloud hosting costs during development
- Privacy: All data stays on your local machine
Best For:
- Initial development and testing
- Learning the codebase
- Experimenting with data generation
- Working with synthetic data only
Setup:
- Install MariaDB locally (see Prerequisites section)
- Use
DB_HOST=localhostin your.envfile - All database operations run on your local machine
Advantages:
- Team Collaboration: Shared database access for multiple developers
- Production-Like Environment: Test against cloud infrastructure
- Scalability: Handle larger datasets and concurrent users
- Backup & Recovery: Automated backups and disaster recovery
- Remote Access: Access from anywhere with internet
Best For:
- Team development projects
- Production deployments
- Working with large datasets
- Multi-developer collaboration
- Integration testing
Setup:
- Use a cloud MariaDB service (AWS RDS MariaDB, Google Cloud SQL MariaDB, etc.)
- Configure remote connection details in
.env:DB_HOST=your-cloud-mariadb-host.com DB_USER=your_username DB_PASSWORD=your_password DB_PORT=3306 - Ensure proper security groups/firewall rules for database access
Choose ONE of the following options for your development environment:
Option 1: Local MariaDB Server
# In your .env file
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=your_password
DB_PORT=3306Option 2: Cloud MariaDB Server
# In your .env file
DB_HOST=your-cloud-mariadb-host.com
DB_USER=your_username
DB_PASSWORD=your_password
DB_PORT=3306All scripts and operations work identically with either setup.
The project manages 5 institutional databases:
- Bishop_State_Community_College (AL)
- California_State_University_San_Bernardino (CSUSB)
- Kentucky_Community_and_Technical_College_System (KCTCS)
- Thomas_More_University (KY)
- University_of_Akron (OH)
Each database contains core tables:
cohort- Student cohort informationcourse- Course enrollment datafinancial_aid- Financial aid recordsllm_recommendations- LLM-generated student recommendationsar_*- Analysis-ready tables (school-specific)
Create all databases and tables:
python dboperations/db_setup.pyTest database connection:
python dboperations/testing/test_db_connection.pyCount records across all databases:
python dboperations/count_records.pyGenerate Excel summary of all databases:
python dboperations/generate_db_summary.pyCreate prediction tables:
python dboperations/create_prediction_tables.pyView database schemas:
python dboperations/view_schema.py # View all databases
python dboperations/view_schema.py --database AL # View specific database
python dboperations/view_schema.py --table cohort # View specific table
python dboperations/view_schema.py --overview # Overview onlySee dboperations/README.md for complete documentation.
Data generation scripts are organized by school in generate_data/schools/:
generate_data/schools/
├── shared/
│ └── config.py # Shared database configuration
├── AL/ # Bishop State Community College
│ ├── cohort.py
│ ├── course.py
│ ├── financial_aid.py
│ └── generate_all.py
├── CSUSB/ # California State University San Bernardino
├── KCTCS/ # Kentucky Community and Technical College System
├── KY/ # Thomas More University
├── OH/ # University of Akron
└── generate_all_schools.py # Master script for all schools
Generate data for all schools:
cd generate_data/schools
python generate_all_schools.pyGenerate data for a specific school:
cd generate_data/schools/AL
python generate_all.pyGenerate specific data types:
cd generate_data/schools/AL
python cohort.py # Cohort records
python course.py # Course records
python financial_aid.py # Financial aid records- Uses LLM (Ollama/Bedrock) for realistic synthetic data
- Falls back to rule-based generation if LLM unavailable
- All records marked with
dataset_type = 'S'(Synthetic) - Includes
schoolcolumn for cross-database joins
Generate AI-powered student readiness assessments using LLM:
Add LLM recommendations table to all databases:
python llm/add_llm_table.pyGenerate student readiness recommendations:
python llm/llm_student_readiness.pyView recommendations:
python llm/view_recommendations.py --database Kentucky_Community_and_Technical_College_System --limit 10Check generation progress:
python llm/check_progress.py- Analyzes student academic performance and risk factors
- Generates personalized readiness scores and recommendations
- Stores results in
llm_recommendationstable - Tracks model version and prompt version for reproducibility
The data/ folder contains original seed data files used as templates:
- Excel files with cohort, course, and financial aid structures
- Analysis-ready file templates
- Prediction schema definitions
These files serve as the foundation for generating realistic synthetic data.
Current Population (Synthetic Data from seed_data01):
- Total Cohorts: 21 (3-5 per school)
- Total Students: 21,153 (500-1,500 per cohort)
- Total Course Enrollments: 105,726 (4-6 per student)
- Total Financial Aid Records: 21,153 (1 per student)
Grand Total: 126,900 records across all databases
All records are marked with dataset_type = 'S' (Synthetic). Future real data will be marked with 'R'.
For detailed table structures, column definitions, and relationships, see DATABASE_SCHEMA.md.
Student-Centric Connections:
- Tables are linked through
student_idorStudent_GUIDfields financial_aid.student_idconnects student financial recordsllm_recommendations.Student_GUIDlinks AI recommendations to students
Institution-Centric Connections:
- All tables include a
schoolcolumn (AL, CSUSB, KCTCS, KY, OH) llm_recommendations.Institution_IDprovides institutional linking- Cross-database joins possible using
schoolcolumn
Example Joins:
-- Get all data for a specific school
SELECT * FROM course c
JOIN cohort co ON c.school = co.school
WHERE c.school = 'AL';
-- Get student financial aid and recommendations
SELECT fa.*, lr.* FROM financial_aid fa
JOIN llm_recommendations lr ON fa.student_id = lr.Student_GUID
WHERE fa.school = 'AL';devcolor-data-gen/
├── .env # Local database configuration
├── .env.example # Example environment configuration
├── requirements.txt # Python dependencies
├── DATABASE_SCHEMA.md # Complete database schema documentation
├── database_summary_*.xlsx # Generated database summary reports
├── data/ # Seed data files
│ └── course_analysis_ready_file_template_Identified_01_27_25.xlsx
├── dboperations/ # Database operations and utilities
│ ├── README.md # Database operations documentation
│ ├── db_setup.py # Creates databases and tables
│ ├── count_records.py # Counts records in all tables
│ ├── generate_db_summary.py # Generates Excel summary of databases
│ ├── view_schema.py # View database schemas and table structures
│ ├── create_complete_seed_structure.py # Seed data structure creation
│ ├── create_prediction_tables.py # Create prediction tables
│ ├── create_kctcs_prediction_tables.py # KCTCS-specific prediction tables
│ └── testing/ # Database testing and verification
│ ├── test_db_connection.py # Tests database connection
│ ├── verify_*.py # Various verification scripts
│ └── display_schema.py # Display database schemas
├── llm/ # LLM-related operations
│ ├── add_llm_table.py # Add LLM recommendations table
│ ├── llm_student_readiness.py # Generate student readiness recommendations
│ ├── view_recommendations.py # View LLM recommendations
│ ├── check_progress.py # Check recommendation progress
│ └── alter_add_school_column.py # Add school column to tables
├── generate_data/ # Synthetic data generation scripts
│ ├── schools/ # School-based generation scripts
│ │ ├── shared/config.py # Shared configuration
│ │ ├── AL/ # Bishop State Community College
│ │ ├── CSUSB/ # California State University San Bernardino
│ │ ├── KCTCS/ # Kentucky Community and Technical College System
│ │ ├── KY/ # Thomas More University
│ │ ├── OH/ # University of Akron
│ │ └── generate_all_schools.py
│ └── archive/ # Old data-type-based scripts (for reference)
├── venv/ # Optional local Python virtualenv (ignored in git)
└── .venv/ # Optional local Python virtualenv (ignored in git)