The submission uses an ensemble of Gradient Boosting Machines (LightGBM, XGBoost and scikit-learn's Histogram-Based Gradient Boosting) to predict fertility. The classifiers are trained only on the individuals with an available outcome variable. Variables are selected based on the feature importance of simple LightGBM and XGBoost models trained repeatedly on subsets of the data. The household variable from the background dataset is used for grouped train-test splits or cross-validation, so that members of the same household never end up on both sides of a split and leak information.
Moreover, I tried to preprocess all features based on the definitions in the codebook and certain heuristics: all personality variables are treated as continuous, missing value indicators are removed, years are converted to ages (or time differences), and categorical variables are cast to the respective pandas dtypes. During preprocessing I also removed the free-form answers. The feature set used is very large (over 1000 variables), because larger sets gave minor improvements in prediction quality and I lacked the time to identify the relevant variables. My goal was to take a data-driven approach to feature selection to identify currently unknown correlates of fertility, but judging from the selected variables this was not successful. I cannot rule out overfitting, so the large number of variables most likely degrades the performance on the holdout data.
Training and hyperparameter optimization were done with Microsoft's FLAML library. This library offers so-called flamlized versions of common machine learning classifiers (e.g. LightGBM), which enable zero-shot hyperparameter tuning. The hyperparameters are selected based on characteristics of the dataset, so no expensive optimization is needed while iterating on the ideal model. Moreover, the library offers an easy-to-use way to optimize hyperparameters using the information described above.
I tried semi-supervised learning with SelfTraining or TriTraining (with Disagreement) to make use of the large amount of data with a missing outcome, but was not able to reach a better F1-score. Moreover, I tried time-shifting the data, which works better, but it contradicts my goal of a fully data-driven approach.