This post records my experience in Kaggle's Allstate Claims Severity competition. It has three parts, so feel free to skip to whichever you're interested in!
Hello! I am Yi Xiang, currently employed as a Junior Data Scientist. Feel free to drop me a message if you think I can improve in any way! You can reach me at my LinkedIn.
Alternatively, you could also leave a comment on this blog's GitHub issues page here.
In the many meet-ups I have attended, a common question that aspiring data scientists ask employers is:
What do you look for in a potential hire?
Typical answers would include:
* Communication or storytelling skills
* Coding proficiency
* Learning ability
* Passion
* And the list goes on...
Upon hearing this, the follow-up question would most likely be:
How should I demonstrate this to my potential employer?
And the advice given is usually to:
* Increase online presence (e.g. through github, blogs)
* Find interesting projects to work on
The truth is, I had procrastinated on fulfilling the above two points, and it was time I did something about it. Kaggle seemed like a good fit for both objectives. That was how I joined the Allstate Claims Severity competition on 2nd Nov; the competition ran from Oct to Dec 2016.
Among those familiar with this field, XGBoost (XGB) comes to mind as a popular approach to ANY (structured) machine learning problem.
Hence, I started the competition with an initial goal of learning how to tune XGB, and this was how my goals evolved during the competition over a single month:
The point I am trying to make is that:
Kaggle is really a good place to start with lots of helpful people sharing.
The aim of this competition was to create an algorithm to predict the severity of insurance claims. The evaluation metric used was the Mean Absolute Error (MAE).
If you wish to reproduce my results, refer to the README.md here.
My (current) understanding is that MSE penalises errors more heavily the further they are from the mean, while MAE penalises all errors proportionally. The first thing I learnt about the MAE metric was that it is optimised by the median value, while MSE is optimised by the mean. More information here.
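To see the difference concretely, here is a toy sketch (not competition code) that finds the constant prediction minimising each metric:

import numpy as np

y = np.array([1.0, 2.0, 10.0])        # toy target with an outlier
cands = np.linspace(0.0, 12.0, 1201)  # candidate constant predictions

mse = [np.mean((y - c) ** 2) for c in cands]
mae = [np.mean(np.abs(y - c)) for c in cands]

print(cands[np.argmin(mse)])  # ~4.33, the mean
print(cands[np.argmin(mae)])  # 2.0, the median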
If you had taken undergraduate mathematics, you would know that y = |x| is non-differentiable at x = 0. So when you configure XGBoost with eval_metric = 'mae', the algorithm still descends by MSE, which poses a problem if you are optimising for MAE. To avoid over-penalising values further away from the mean, you could compress the range of your target variable (normalising, scaling, or a log-transform), but that still would not solve the underlying problem.
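For illustration, the compression trick most forum scripts used was a shifted log-transform of the target (a sketch assuming a train data frame with a loss column; the shift of 200 is an assumption matching the np.exp(...) - 200 in the prediction code later on):

import numpy as np

shift = 200  # assumed value, consistent with the -200 seen later
train['log_loss'] = np.log(train['loss'] + shift)
# train on 'log_loss', then invert predictions with np.exp(pred) - shift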
However, it turned out that numerical approximation is very useful (thank you, Taylor series!). This link (worth reading!) describes the intuition behind optimising for MAE. For those who did undergraduate mathematics/statistics, you would remember functions like the Cauchy and Huber losses, which happen to be smooth approximations for MAE problems. More information here.
Basically, this is done via the 'Fair' objective function. Essentially, you still evaluate with MAE, but you define a custom objective that returns the gradient (first derivative) and hessian (second derivative) of the Fair function.
Below, you can observe how the 'Fair' objective function mimics the least-absolute function pretty accurately:
The objective, gradient and hessian are defined as follows (reconstructed from the fair_obj code below, writing x = preds − labels and c for fair_constant):
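$$\mathrm{Fair}(x) = c^2\left(\frac{|x|}{c} - \ln\!\left(1 + \frac{|x|}{c}\right)\right)$$

$$\mathrm{Fair}'(x) = \frac{c\,x}{|x| + c}, \qquad \mathrm{Fair}''(x) = \frac{c^2}{(|x| + c)^2}$$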
Majority of the scripts in the forums used the 'Fair' objective, coded as:
def fair_obj(preds, dtrain):
    # Custom XGBoost objective: gradient and hessian of the 'Fair' loss,
    # a smooth approximation to MAE.
    fair_constant = 2
    labels = dtrain.get_label()
    x = preds - labels                    # residuals
    den = abs(x) + fair_constant          # shared denominator
    grad = fair_constant * x / den        # first derivative
    hess = fair_constant * fair_constant / (den * den)  # second derivative
    return grad, hess
The smaller fair_constant is, the smaller the gradient stays (its magnitude is capped at fair_constant), so the descent is slower but smoother.
This custom objective can then be passed to xgb.train:
clf = xgb.train(params,
                d_train,
                100000,
                watchlist,
                early_stopping_rounds=early_stop,
                obj=fair_obj,
                verbose_eval=n_print,
                feval=xg_eval_mae)
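The call above assumes a custom evaluation function xg_eval_mae. A minimal sketch (assuming the target was trained on the log scale, as in most forum scripts; any additive shift cancels inside the absolute error):

import numpy as np
from sklearn.metrics import mean_absolute_error

def xg_eval_mae(yhat, dtrain):
    # report MAE on the original scale of the target
    y = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))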
Additional information can be found here.
One of the limitations of linear regression is in identifying interactions between features. To address this, an XGBoost model dump parser was developed as a way to find N-way feature interactions, which can be used to improve your XGB model or serve as features themselves.
Fortunately, someone else posted this script, which saved me a bit of time on finding N-way feature interactions.
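As a hypothetical sketch of how such interactions become features, you can concatenate pairs of categorical columns; the pairs would come from the parser's output, and the column names below are only examples:

# cat_pairs is a hypothetical parser output; train_test holds string categoricals
cat_pairs = [('cat80', 'cat79'), ('cat57', 'cat12')]

for a, b in cat_pairs:
    # 2-way interaction feature: the joint category of columns a and b
    train_test[a + '_' + b] = train_test[a] + train_test[b]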
In the raw data, categorical feature values ran from A, B, ..., Z to AA, ..., AZ, which seemed to suggest some significance in how the data was ordered.
Therefore, instead of using label or one hot encoding, I experimented with an alternative function to encode these categorical features:
def encode(charcode):
    # Map a letter code to its rank: 'A' -> 1, ..., 'Z' -> 26, 'AA' -> 27, ...
    r = 0
    ln = len(str(charcode))
    for i in range(ln):
        r += (ord(str(charcode)[i]) - ord('A') + 1) * 26 ** (ln - i - 1)
    return r
Essentially, this function recodes categories based on their rank order:
encode('A') = 1
encode('Z') = 26
encode('AA') = 27
encode('AC') = 29
While this method could be used to complement other encodings such as min/max/mean/counts/tf-idf, I did not manage to test this.
Some machine learning algorithms perform better when features are normally distributed. Unfortunately, figuring out how to transform each feature as such requires a huge effort.
Introducing Box-Cox, a (very) convenient way of transforming these features, using their skew to decide which ones need it.
Here are a couple of good articles explaining boxcox that I came across:
Implementing Box-Cox in code:

from scipy.stats import skew, boxcox

# flag features whose absolute skew exceeds 0.25
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[abs(skewed_feats) > 0.25]
skewed_feats = skewed_feats.index

for feats in skewed_feats:
    train_test[feats] = train_test[feats] + 1   # boxcox needs strictly positive input
    train_test[feats], lam = boxcox(train_test[feats])
Unless you are extremely experienced and have good intuition about which parameter values to use, you will likely have to learn by trial and error.
For this, I recommend hyperopt, a Python library for serial and parallel optimisation over awkward search spaces. It even lets you tune the number of layers in a neural net!
I have some examples in my git repo:
Change the data input to power3 if you want to run hyperopt for a 3-way interaction
Results of the hyperopt can be found in this repo.
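For reference, here is a minimal hyperopt sketch; the search space and the cv_mae helper are hypothetical, and the spaces I actually used are in the repo:

from hyperopt import fmin, tpe, hp, Trials

space = {
    'max_depth': hp.quniform('max_depth', 4, 12, 1),
    'eta': hp.loguniform('eta', -4, -1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
}

def objective(params):
    params['max_depth'] = int(params['max_depth'])  # quniform returns floats
    return cv_mae(params)  # hypothetical helper: k-fold CV returning mean MAE

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)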
As mentioned, this is my first time coding a neural net outside of Coursera (Andrew Ng's Machine Learning and UOW's Machine Learning specialisation).
This script helped me a lot in starting out with neural networks, but running it on an 8-core MacBook Pro would have taken almost 2 days (55 runs * 30 seconds * 10 folds * 10 bags ≈ 45 hours)!
With all the hype over cloud computing, I decided to give AWS GPU-compute spot instances a go, at about US$0.35 per hour. While there are plenty of online resources about how to set this up, none of them seemed complete.
I will write a guide on installing Anaconda, Jupyter, CUDA, Keras and Theano in a separate post soon.
For those who are starting out with Keras like me, there are two things you must note:
In your home directory, run the following:
cd .keras/
nano keras.json #use open/subl depending on your OS
You should see:
{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}
If you want to use Theano, change "tensorflow" to "theano":
{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}
If you are using an Nvidia GPU, installing Theano introduces a .theanorc file in your home directory. If it's not there, create one and paste the following inside:
[global]
floatX = float32
device = gpu
mode = FAST_RUN

[nvcc]
fastmath = True

[cuda]
root = /usr/local/cuda

[lib]
cnmem = 0.95
The line cnmem = 0.95 is very important: it tells Theano's CNMeM library to pre-allocate 95% of GPU memory, which halved the duration of each iteration from 12 to 6 seconds!
To monitor whether your score is improving and to determine early stopping, you first need to specify the metric function:
from keras import backend as K

def mae(y_true, y_pred):
    # undo the log transform before computing MAE on the original scale
    return K.mean(K.abs(K.exp(y_true) - K.exp(y_pred)))
and pass it to the model.compile function:

model.compile(loss='mae', optimizer=optimizer, metrics=[mae])
Use this code to determine early stopping and checkpoints for your model:
callsback_list = [EarlyStopping(patience=10),
                  ModelCheckpoint('keras-regressor-' + str(i+1) + '_' + str(j+1) + '.check',
                                  monitor='val_loss', save_best_only=True, verbose=0)]
Then, specify the validation_data and callbacks arguments:
fit = model.fit_generator(generator=batch_generator(xtr, ytr, 128, True),
                          nb_epoch=nepochs,
                          samples_per_epoch=xtr.shape[0],
                          verbose=0,
                          validation_data=(xte.todense(), yte),
                          callbacks=callsback_list)
where xtr, ytr are the training set and xte, yte the validation set for the given KFold split.
You can then reload the best model (selected by val_loss or val_mae) before making the best prediction with:
fit = load_model('keras-regressor-' + str(i+1) + '_' + str(j+1) + '.check')
# predict, then undo the log(loss + 200) transform of the target
pred += np.exp(fit.predict_generator(generator=batch_generatorp(xte, 800, False),
                                     val_samples=xte.shape[0])[:, 0]) - 200
To see how I used early stopping for my second-level modelling, check out this link. It turns out that saving the model is a good idea if you want to reuse it later. You can try it by commenting out the following lines in the notebook:
#comment from here
callsback_list = [EarlyStopping(patience=10),
                  ModelCheckpoint('keras-regressor-' + str(i+1) + '_' + str(j+1) + '.check',
                                  monitor='val_loss', save_best_only=True, verbose=0)]
model = nn_model(layer1=250, layer2=100,
                 dropout1=0.4, dropout2=0.2,
                 optimizer='adadelta')
fit = model.fit_generator(generator=batch_generator(xtr, ytr, 128, True),
                          nb_epoch=nepochs,
                          samples_per_epoch=xtr.shape[0],
                          verbose=0,
                          validation_data=(xte.todense(), yte),
                          callbacks=callsback_list)
# to here if you just want to use the pre-trained models yourself.
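The nn_model factory above is defined in the linked notebook. For orientation, here is a minimal Keras 1.x sketch of what such a two-layer net might look like (the he_normal initialisations and PReLU activations are assumptions; mae is the custom metric defined earlier):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers.advanced_activations import PReLU

def nn_model(layer1=250, layer2=100, dropout1=0.4, dropout2=0.2,
             optimizer='adadelta'):
    model = Sequential()
    model.add(Dense(layer1, input_dim=xtr.shape[1], init='he_normal'))  # xtr from the k-fold split
    model.add(PReLU())
    model.add(Dropout(dropout1))
    model.add(Dense(layer2, init='he_normal'))
    model.add(PReLU())
    model.add(Dropout(dropout2))
    model.add(Dense(1, init='he_normal'))
    model.compile(loss='mae', optimizer=optimizer, metrics=[mae])
    return model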
After training my neural networks, I randomly assigned weights to my best XGB and Keras predictions to see which combination would fit the public leaderboard best.
When I decided to do stacking, I started reading up:
I did not have any of my out-of-bag (OOB) training-set predictions, which meant I had to rewrite my code and retrain the models. Lesson learnt: always extract the OOB predictions from models run on k-fold validation, even if you are unsure stacking will be needed.
pred_oob = np.zeros(train_x.shape[0])

for train_index, test_index in kfold:
    # split into training and validation folds, train clf, then:
    scores_val = clf.predict(d_valid, ntree_limit=clf.best_ntree_limit)
    pred_oob[test_index] = np.exp(scores_val) - shift

# export the OOB training-set predictions
oob_df = pd.DataFrame(pred_oob, columns=['loss'])
sub_file = 'oob_xgb_fairobj_' + str(score) + '_' + str(
    now.strftime("%Y-%m-%d-%H-%M")) + '.csv'
print("Writing submission: %s" % sub_file)
oob_df.to_csv(sub_file, index=False)
This concept would apply to other models as well.
Additionally, if you are using AWS, disconnections from your Jupyter kernels might disrupt tracking of your code's progress (e.g. which fold or bag it was running). I overcame this problem by adding these lines to my code:
For XGB:
partial_evaluation = open('temp_scores_power2.txt', 'a')
partial_evaluation.write('Fold ' + str(i) + ' - MAE: ' + str(cv_score) + '\n')
partial_evaluation.flush()
For hyperopt:
partial_evaluation = open('extra_trees_bootstrap2.txt', 'a')
partial_evaluation.write('iteration ' + str(iter_count) + ' with params ' +
                         str(space) + ' scored ' + str(score) + '\n')
partial_evaluation.flush()
To store all the parameter sets that hyperopt ran, you can create a data frame and append the results to it inside the objective function:
Df_results = pd.DataFrame()

def objective(space):
    global Df_results
    ...
    # log_files_df: a one-row DataFrame with this iteration's params and score
    Df_results = Df_results.append(log_files_df)
    return(...)

Df_results.to_csv("results.csv", index=None)
The idea of using a weighted average (in my opinion) stems from linear programming. Credit goes to this post for sharing the code on finding optimal weights.
To implement this, define the objective function as follows:
from sklearn.metrics import mean_absolute_error

def mae_func(weights):
    ''' scipy minimize will pass the weights as a numpy array '''
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
        final_prediction += weight * prediction
    return mean_absolute_error(Y_values, final_prediction)
After which, run the following code:
from scipy.optimize import minimize

lls, wghts = [], []
for i in range(100):
    # random restarts to avoid poor local minima
    starting_values = np.random.uniform(size=len(predictions))
    cons = ({'type': 'ineq', 'fun': lambda w: 1.2 - sum(w)})
    bounds = [(0, 1)] * len(predictions)

    # SLSQP, since L-BFGS-B cannot handle the inequality constraint
    res = minimize(mae_func, starting_values, method='SLSQP',
                   bounds=bounds, constraints=cons,
                   options={'disp': False, 'maxiter': 10000})

    lls.append(res['fun'])
    wghts.append(res['x'])

bestSC = np.min(lls)
bestWght = wghts[np.argmin(lls)]
For those familiar with linear programming: cons = ({'type':'ineq','fun':lambda w: 1.2-sum(w)}) constrains the sum of weights to be no greater than 1.2, while bounds = [(0,1)]*len(predictions) constrains each weight to lie between 0 and 1.
With 6 XGB models and 4 Keras models, I generated this result, which would have ranked me 102nd on the private leaderboard. The local CV score was about 1118.34.
This post inspired me to try neural networks as my second level model.
I must also admit I was pretty lucky with my first guess of parameters for a two-layer NN with 250-100 nodes, found here. My 5-fold approach generated the following results:
('Fold ', 1, '- MAE:', 1117.5260825665521)
('Fold ', 2, '- MAE:', 1113.2272453103922)
('Fold ', 3, '- MAE:', 1117.1135764027533)
('Fold ', 4, '- MAE:', 1121.9982577768997)
('Fold ', 5, '- MAE:', 1119.6595219061744)
('Total - MAE:', 1117.9049057391971)
Interestingly, I tried 10 folds but obtained a worse CV result than 1118.34 (although that might not have meant a worse LB score).
Afterwards, I decided to bag my model 5 times:
('Fold ', 1, '- MAE:', 1117.6329549570657)
('Fold ', 2, '- MAE:', 1113.3701316951469)
('Fold ', 3, '- MAE:', 1117.1293409206548)
('Fold ', 4, '- MAE:', 1121.8204992333194)
('Fold ', 5, '- MAE:', 1119.4491190596229)
('Total - MAE:', 1117.880379920515)
You can see the entire output here.
However, the score was only about 0.02 points better. Hence, I decided to weight these together with my best first-level models, and tried the following submissions on the last day:
| | Single 5 fold Keras | W. avg with single fold NN | Bagged 5 fold Keras | W. avg with lvl 1 models and both NNs |
|---|---|---|---|---|
| Local CV | 1117.90490574 | 1117.77760897 | 1117.88037992 | 1117.7181697 |
| Public LB | 1100.90013 | 1100.87763 | 1100.88155 | 1100.86946 |
| Private LB | 1112.84611 | 1112.77244 | 1112.93370 | 1112.73936 |
Surprisingly, the single 5-fold model performed better in the private LB than the bagged model.
The final weighted scores, code and datasets I used can be found here.
Here are additional links to top solutions: