Predicting Resignation in the Military

In the 2015 hackathon organized by Singapore’s Ministry of Defence, one of the tasks was to predict resignation rates in the military, using anonymized data on 23,000 personnel that included age, military rank, years in service, and performance indicators such as salary increments and promotions. Our team placed 3rd overall. In this post, we elaborate on our methodology.

[Photo: posing with the CDF]

With Singapore’s Chief of Defence Force, Major-General Perry Lim. From left: CDF, Kenneth Soo, Annalyn Ng, Chin Hui Han.


Algorithm Design

Stacking is a method of combining multiple models to improve predictions. It was famously used in the $1 million Netflix Prize competition. Below is a graphical summary of our stacking model:

[Figure: stacking model diagram]

We first used a random forest analysis to identify the top 20 most important features. Next, we trained 7 models on those features: gradient boosting, extreme gradient boosting, random forest, AdaBoost, a neural network, support vector machines, and bagged CART. For each person, we generated 7 cross-validated predictions, one per model, and fed these back into the training dataset as meta-features. On the expanded dataset, we trained 2 models, extreme gradient boosting and random forest, and averaged their results to generate the final predictions.
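A minimal sketch of this two-level stacking scheme, assuming scikit-learn and synthetic data; only 3 of the 7 base models are shown, and all model names and parameters here are illustrative rather than the ones we actually tuned:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, AdaBoostClassifier)
from sklearn.model_selection import cross_val_predict

# Stand-in for the top-20 features selected by the random forest analysis.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    AdaBoostClassifier(random_state=0),
]

# Level 1: out-of-fold predicted probabilities become meta-features,
# so no base model scores the rows it was trained on.
meta = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
X_stacked = np.hstack([X, meta])

# Level 2: train the two final models on the expanded dataset
# and average their predicted probabilities.
level2 = [GradientBoostingClassifier(random_state=0),
          RandomForestClassifier(n_estimators=100, random_state=0)]
probs = np.mean([m.fit(X_stacked, y).predict_proba(X_stacked)[:, 1]
                 for m in level2], axis=0)
```

The key design point is that the meta-features come from cross-validated predictions: feeding in-sample predictions back into the training set would leak the labels and inflate the level-2 models.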

Data Forensics

While examining features, we noticed that personnel ID was a significant predictor of resignation, which led us to postulate that the data order was non-random. Plotting personnel ID against age, it was clear that personnel who had resigned were clustered at the bottom. The data appeared to have been sorted in a particular order during preparation. Hence, we reverse sorted the data to recover these clusters of resigned personnel, and engineered new features to flag them.

Visualizing the dataset in its entirety, we noticed that some features were balanced across the test and training sets (see heat map below). Each feature’s balance score was computed as the L1 norm of the difference between its distribution over the test set and its distribution over the training set. Combining this score with the feature importance given by the classifier, we derived a sort key and re-sorted the data.
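A sketch of how such a balance score can be computed with NumPy; the binning, function name, and sample data are illustrative assumptions, not our original code:

```python
import numpy as np

def balance_score(feature, is_test, bins=10):
    """L1 distance between a feature's normalized histogram over the
    training rows and over the test rows. 0 = perfectly balanced;
    2 = completely disjoint distributions."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    train_hist, _ = np.histogram(feature[~is_test], bins=edges)
    test_hist, _ = np.histogram(feature[is_test], bins=edges)
    train_dist = train_hist / train_hist.sum()
    test_dist = test_hist / test_hist.sum()
    return np.abs(train_dist - test_dist).sum()

# A feature drawn from the same distribution in both sets
# should score close to 0.
rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
is_test = rng.random(1000) < 0.3
score = balance_score(feature, is_test)
```

Features with low scores are distributed almost identically across the two sets, which is what makes them useful for reconstructing the original sort order.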

Heatmap of data before reverse sorting. The first column indicates whether the data was in the test (blue) or training (red) set.

[Figure: heatmap after sorting]

Homogeneous clusters revealed after reverse sorting.

Even though the sort key was not a perfect oracle, we were able to use it to improve predictions. Specifically, we could now estimate a person’s probability of resignation from the labels of their neighbors in the sorted order. We also derived the size of the public test dataset and used it to check how many of these estimated predictions were correct. Predictions verified against the public test set were then added to our training dataset, improving our classifier.
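A toy illustration of the neighbor idea: once the rows are restored to sorted order, a record's probability of resignation can be read off as the fraction of resigned neighbors around it. The window size, function name, and labels below are made up for the example:

```python
def neighbor_probability(labels, idx, window=2):
    """Fraction of known-resigned neighbors within +/-window positions
    of idx, ignoring the record itself and unknown (-1) labels.
    Falls back to 0.5 when no neighbor's label is known."""
    lo, hi = max(0, idx - window), min(len(labels), idx + window + 1)
    neighbors = [labels[i] for i in range(lo, hi)
                 if i != idx and labels[i] >= 0]
    return sum(neighbors) / len(neighbors) if neighbors else 0.5

# -1 = unknown (test record), 0 = stayed, 1 = resigned
labels = [1, 1, -1, 1, 0, 0, -1, 0]
p = neighbor_probability(labels, 2)  # neighbors 1, 1, 1, 0 -> 0.75
```

Because resigned personnel sat in contiguous clusters after sorting, a test record surrounded by resignations is itself very likely a resignation.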

Fun fact: During the reverse sort process, we deduced the size of the private (n = 4600) and public (n = 3488) datasets. Coincidentally or not, the numbers “46” and “88” signify good fortune in Chinese tradition.


Copyright © 2015-Present Algobeans.com. All rights reserved. Be a cool bean.
