This project demonstrates how Deep Learning techniques can be effectively applied to tabular data, offering a competitive alternative to traditional machine learning models like Gradient Boosting. The code has been significantly updated in 2025 to incorporate modern best practices, performance optimizations, and a more robust experimental setup.
The core of this project is to predict employee access needs based on the Amazon Employee Access Challenge dataset. It provides a side-by-side comparison of a fine-tuned XGBoost model and a Deep Neural Network built with Keras 3 (using a PyTorch backend).
The latest version includes substantial improvements over the original implementation:
- Modernized Tech Stack: Upgraded to Keras 3 with a PyTorch backend, supporting CPU, CUDA, and Apple's MPS for acceleration.
- Corrected Cross-Validation: Fixed a state-leakage bug in the K-Fold cross-validation loop to establish a reliable and realistic performance baseline.
- Hyperparameter Optimization: Integrated Optuna to perform systematic hyperparameter tuning for the XGBoost model, with the best parameters saved in the script.
- Advanced DNN Architecture: The neural network has been completely refactored for better performance and stability:
  - Architecture: Changed from a simple `256 -> 256` structure to a more effective `512 -> 256` funnel architecture.
  - Regularization: Added `BatchNormalization` layers and increased the dropout rate to `0.4` to prevent overfitting.
  - Intelligent Embeddings: Updated the embedding size calculation to `min(50, (num_categories + 1) / 2)` for better representation of categorical features.
- Dependency Management: All dependencies are now correctly listed in `requirements.txt` and `pyproject.toml`.
- Code Quality: Resolved all `UserWarning`s from libraries for a cleaner execution experience.
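The embedding-size rule listed above can be captured in a small helper. This is a sketch, not the project's actual code; `embedding_size` is a hypothetical name, and the truncation to `int` is an assumption (Keras `Embedding` requires an integer output dimension):

```python
def embedding_size(num_categories: int, cap: int = 50) -> int:
    """Rule of thumb from the README: min(50, (num_categories + 1) / 2).

    Small categorical features get a proportionally small embedding,
    while high-cardinality features are capped at `cap` dimensions.
    """
    return int(min(cap, (num_categories + 1) / 2))


# A feature with 7 distinct categories gets a 4-dimensional embedding;
# a 5000-category feature is capped at 50 dimensions.
print(embedding_size(7))     # 4
print(embedding_size(5000))  # 50
```

The cap keeps high-cardinality identifiers (common in the Amazon access dataset) from dominating the parameter count.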
The architectural and methodological improvements resulted in a significant performance boost for the DNN model:
- Mean AUC: Increased from `0.790` to `0.809`.
- Stability: Cross-fold score standard deviation drastically reduced from `0.125` to `0.022`, indicating a much more reliable model.
- Python 3.12+
- An active virtual environment (e.g., using `venv` or `conda`)
- Clone the repository:

  ```bash
  git clone https://github.com/lmassaron/deep_learning_for_tabular_data.git
  cd deep_learning_for_tabular_data
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To run the full experiment, including training the XGBoost and DNN models with 5-fold cross-validation and generating submission files, execute the main script:
```bash
python run_experiment.py
```

The script will output the cross-validation scores for both models and save the predictions to `tabular_dnn_submission.csv` and `xgboost_submission.csv`.
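The state-leakage fix mentioned in the improvements list comes down to one pattern: construct a fresh, untrained model inside every fold, so no weights or optimizer state carry over between folds. A minimal framework-agnostic sketch of that pattern follows; `build_model`, `kfold_indices`, and `cross_validate` are hypothetical names standing in for the project's Keras/XGBoost constructors and loop:

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) index lists for plain K-Fold CV."""
    base, extra = divmod(n_samples, n_splits)
    idx, start = list(range(n_samples)), 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        valid = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, valid
        start += size


def cross_validate(build_model, X, y, n_splits=5):
    """Score build_model() across folds, re-creating the model each time.

    Reusing one model object across folds lets trained state leak from
    fold to fold, inflating validation scores -- the bug the README
    describes fixing.
    """
    scores = []
    for train_idx, valid_idx in kfold_indices(len(X), n_splits):
        model = build_model()  # fresh, untrained model for this fold
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(model.score([X[i] for i in valid_idx],
                                  [y[i] for i in valid_idx]))
    return scores
```

In the real script the same idea applies to both models: the Keras network is rebuilt (and XGBoost re-instantiated) at the top of each fold rather than once before the loop.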
The original version of this project was presented at GDG Venezia in 2019 and featured in presentations in 2020. It demonstrated how to achieve good results using TensorFlow/Keras integrated with Scikit-learn and Pandas.