At the 2015 hackathon organized by Singapore’s Ministry of Defence, one of the tasks was to predict resignation rates in the military using anonymized data on 23,000 personnel. The data included age, military rank, and years in service, as well as performance indicators such as salary increments and promotions. Our team placed third overall. In this post, we elaborate on our methodology.
Stacking is a method that combines the predictions of multiple models to improve accuracy. It was famously used in the Netflix Prize, a competition with a $1 million award. Below is a graphical summary of our stacking model:
We first used a random forest to identify the 20 most important features. Next, we trained 7 models: gradient boosting, extreme gradient boosting, random forest, AdaBoost, neural network, support vector machine, and bagged CART. For each person, we generated 7 cross-validated predictions, one per model. These predictions were then added to the training dataset as meta-features. On the expanded dataset, we trained 2 models, extreme gradient boosting and random forest, and averaged their results to generate the final predictions.
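The pipeline described above can be sketched in a few lines. This is a minimal illustration rather than our competition code: it assumes scikit-learn, uses a synthetic dataset in place of the personnel data, uses three base models instead of seven, and substitutes scikit-learn's gradient boosting for extreme gradient boosting at the second level. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the personnel data (assumption: binary target).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8, random_state=0
)

# Step 1: rank features with a random forest and keep the top 20.
ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top20 = np.argsort(ranker.feature_importances_)[::-1][:20]
X_top = X[:, top20]

# Step 2: base models (three of the seven model types named in the post).
base_models = [
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    AdaBoostClassifier(random_state=0),
]

# Step 3: out-of-fold predicted probabilities become meta-features, so no
# row is scored by a model that saw that row's label during training.
meta = np.column_stack([
    cross_val_predict(m, X_top, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
X_stacked = np.hstack([X_top, meta])

# Step 4: fit two second-level models and average their probabilities.
# GradientBoostingClassifier stands in for extreme gradient boosting here.
final_models = [
    GradientBoostingClassifier(random_state=1),
    RandomForestClassifier(n_estimators=100, random_state=1),
]
# Note: these are in-sample probabilities, shown only to complete the flow.
final_prob = np.mean(
    [m.fit(X_stacked, y).predict_proba(X_stacked)[:, 1] for m in final_models],
    axis=0,
)
print(X_stacked.shape, final_prob[:5])
```

To score unseen rows, each base model would also need to be refit on the full training set so that it can produce meta-features for those rows before the second-level models are applied.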
While examining features, we noticed that personnel ID was a significant predictor of resignation. This led us to postulate that the data order was non-random. When we plotted personnel ID against age, it was clear that personnel who had resigned were clustered at the bottom. We engineered new features to identify these bottom clusters of resigned personnel. The data appeared to have been sorted in a particular order during preparation. Hence, to recover the resigned clusters, we reverse sorted the data.
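One simple way to check whether row order carries information about the outcome is to compare the resignation rate across consecutive blocks of rows. The sketch below is illustrative only: the function `block_rates` and the toy labels are hypothetical, not taken from the competition data.

```python
import random

def block_rates(labels, n_blocks=10):
    """Resignation rate within consecutive blocks of rows, in file order."""
    size = len(labels) // n_blocks
    rates = []
    for b in range(n_blocks):
        start = b * size
        # The last block absorbs any leftover rows.
        end = len(labels) if b == n_blocks - 1 else start + size
        block = labels[start:end]
        rates.append(sum(block) / len(block))
    return rates

# Toy labels (1 = resigned): 900 stayers followed by 100 leavers,
# mimicking resigned personnel clustered at the bottom of the file.
ordered = [0] * 900 + [1] * 100

# The same labels in random order, for comparison.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

ordered_rates = block_rates(ordered)
shuffled_rates = block_rates(shuffled)
print(ordered_rates)
print([round(r, 2) for r in shuffled_rates])
```

In the ordered case the rate jumps from 0 to 1 in the last block, whereas in the shuffled case every block stays close to the overall rate of 10%. A sharp jump of this kind is a hint that position in the file is leaking the outcome.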
When we visualized the dataset in its entirety, we noticed that some features were balanced across the test and training sets (see heat map below). Each feature’s balance score was computed from the L1 norm of its distribution over the test and training sets. We combined this score with the feature’s importance, as given by the classifier, and then sorted the data to derive a sort key.
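To make the balance score concrete, here is one possible reading of it: the L1 distance between a feature's empirical distribution in the training set and in the test set. This is an assumption about the exact formula, and the function name `l1_balance` and the toy features are hypothetical.

```python
from collections import Counter

def l1_balance(train_values, test_values):
    """L1 distance between the empirical distributions of a categorical
    feature in the training and test sets (0 = identical, 2 = disjoint)."""
    train_freq = Counter(train_values)
    test_freq = Counter(test_values)
    n_train, n_test = len(train_values), len(test_values)
    categories = set(train_freq) | set(test_freq)
    # Counter returns 0 for categories missing from one of the sets.
    return sum(
        abs(train_freq[c] / n_train - test_freq[c] / n_test)
        for c in categories
    )

# Hypothetical toy features.
rank_train = ["A", "A", "B", "C"]
rank_test = ["A", "B", "A", "C"]   # same proportions as training
unit_train = ["X", "X", "X", "Y"]
unit_test = ["Y", "Y", "Y", "X"]   # proportions flipped

balanced_score = l1_balance(rank_train, rank_test)
unbalanced_score = l1_balance(unit_train, unit_test)
print(balanced_score, unbalanced_score)  # 0.0 1.0
```

Under this reading, a score near 0 means the feature is distributed the same way in both sets, while a larger score means the two sets differ on that feature.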
Even though the sort key was not a perfect oracle, we were able to use it to improve predictions. Specifically, we could now estimate a person’s probability of resignation from the outcomes of their neighbors in the sorted order. We also derived the size of the public test dataset and used it to calculate how many of these neighbor-based predictions were correct. Predictions verified as correct against the public test dataset were then added to our training dataset, improving our classifier.
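The neighbor idea can be illustrated with a small sketch. It assumes the rows are already in sort-key order and estimates each unlabeled row's probability as the mean of the known labels within a fixed window around it. The window size, the function name `neighbor_probability`, and the toy sequence are all hypothetical choices for illustration.

```python
def neighbor_probability(labels, index, window=2):
    """Mean of the known labels within `window` positions of `index`.

    `labels` is in sorted order; None marks a test row with unknown outcome.
    Returns None when no labelled neighbor falls inside the window.
    """
    lo = max(0, index - window)
    hi = min(len(labels), index + window + 1)
    known = [
        labels[i]
        for i in range(lo, hi)
        if i != index and labels[i] is not None
    ]
    return sum(known) / len(known) if known else None

# Toy sequence in sort-key order: 1 = resigned, 0 = stayed, None = test row.
sorted_labels = [0, 0, None, 0, 0, 1, None, 1, 1, None]

estimates = {
    i: neighbor_probability(sorted_labels, i)
    for i, value in enumerate(sorted_labels)
    if value is None
}
print(estimates)  # {2: 0.0, 6: 0.75, 9: 1.0}
```

Rows deep inside a cluster get an estimate of 0 or 1, while rows near a cluster boundary get an intermediate value, which is where the sort key is least reliable.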
Fun fact: During the reverse sort process, we deduced the size of the private (n = 4600) and public (n = 3488) datasets. Coincidentally or not, the numbers “46” and “88” signify good fortune in Chinese tradition.
Copyright © 2015-Present Algobeans.com. All rights reserved. Be a cool bean.