# Low Yi Xiang

## Background of me!

• Attendence Guy for DSSG (I hope to retire one day).
• Junior Data Scientist in Singapore.
• Studied Statistics (my friends would disagree) from NTU.

## Prior to Kaggle

• Kaggle is not practical for industry (i still feel it's somewhat true)
• Will not use Deep Learning (in most cases)
• Will not tune the models to the extreme or stacking for 1% (0.1% gain).

## While that is all true...

• Most employers still look for people who exhibit passion in Data Science or Machine Learning.
• Happens that Kaggle is one great way to build your portfolio.

• People who are in industry (and are much better than i am) still write blogs or present their work in meetups.

• This is my first (serious) Kaggle competition, prior to this i talk to my colleagues or the DSSG organizers about ther competition. (So i am lucky to have met such people who inspire me in their own way.)

• Thankfully, i managed to achieve top 2%.

• It showcases your learning capacity and ability to understand documentation. (I learnt all of these in a month)

• Causes you sleepless nights (waiting for your cv score).

## Logistics ...

- All my code can be found in my github page, Freedom89 and look for the Allstate_kaggle repo.

- The full detailed blog post can be found here or refer to the index.md file on the repo.

- Questions are welcomed, but I might take them offline if i think your question requires a heavier discussion.

- This may feel like i am spamming you with 'tools', in which case i apologize.

- Nevertheless, I hope the content would be useful and we might become kaggle teammates in the future!

# Contents

- Competition

- Features

- Xgboost

- Keras

- Stacking

## Competition - background¶

In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

• MAE (mean absolute error)
• Anonymized Features

## Competition - Data¶

In [2]:
data.head(5)

Out[2]:
id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 ... cont6 cont7 cont8 cont9 cont10 cont11 cont12 cont13 cont14 loss
0 1 A B A B A A A A B ... 0.718367 0.335060 0.30260 0.67135 0.83510 0.569745 0.594646 0.822493 0.714843 2213.18
1 2 A B A A A A A A B ... 0.438917 0.436585 0.60087 0.35127 0.43919 0.338312 0.366307 0.611431 0.304496 1283.60
2 5 A B A A B A A A B ... 0.289648 0.315545 0.27320 0.26076 0.32446 0.381398 0.373424 0.195709 0.774425 3005.09
3 10 B B A B A A A A B ... 0.440945 0.391128 0.31796 0.32128 0.44467 0.327915 0.321570 0.605077 0.602642 939.85
4 11 A B A B A A A A B ... 0.178193 0.247408 0.24564 0.22089 0.21230 0.204687 0.202213 0.246011 0.432606 2763.85

5 rows × 132 columns

## Features - Xgb Interactions¶

• Use Xgbfi by far0n to find N-way interactions for categorical features
col1 col2 new_feat = col1 + col2
A B AB
C D CD
• Use this tool to find feature interactions for Regression.

## Features - Feature Engineering ( boxcox )¶

Other than log the target as well as standardization of input features,

Boxcox which aims to provide an automatic way of transforming your (cont) features to make it more even (normal).

In [4]:
f, axarr = plt.subplots(2, sharex=True)
axarr[0].hist(temp0,20);
axarr[0].set_title('before');
axarr[1].hist(temp,20);
axarr[1].set_title('after');


## Features - Encoding of Categorical features¶

Most common (two) types of encoding (i know),

• Label Encoding
col trans
A 1
B 2
C 3
• one - hot encoding
col is_A is_B is_C
A 1 0 0
B 0 1 0
C 0 0 1

## Features - Lexical Encoding¶

Example:

In [5]:
def encode(charcode):
r = 0
ln = len(str(charcode))
for i in range(ln):
r += (ord(str(charcode)[i]) - ord('A') + 1) * 26 ** (ln - i - 1)
return r

print encode('A')
print encode('B')
print encode('Z')
print encode('AA')
print encode('AB')

1
2
26
27
28


# XGBOOST¶

## XGBOOST - Understanding Learning Task Parameters¶

If you refer to the xgboost parameters documentation here,

Specify the learning task and the corresponding learning objective. The objective options are below:

• objective [ default=reg:linear ]

• "reg:linear" --linear regression
• "reg:logistic" --logistic regression
• "binary:logistic" --logistic regression for binary classification, output probability
• "binary:logitraw" --logistic regression for binary classification, output score before logistic transformation
• "count:poisson" --poisson regression for count data, output mean of poisson distribution
• max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
• "multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)

• ...

## XGBOOST - Understanding Eval¶

• eval_metric [ default according to objective ]
• "rmse": root mean square error
• "mae": mean absolute error
• "logloss": negative log-likelihood
• "error": Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
• ...

## XGBOOST - Problem!¶

The learning objective as 'Linear' while the evaluation metric is 'mae', what can we do?

It turns out that when you are optimizing for Mean Squared Error, finding the Mean optimizes the error. If you are optimizing for mean absolute error, finding the Median optimizes the error. Example:

In [6]:
np.random.seed(1337);size = 5000000
vector = np.random.exponential(size=size);vector = np.ceil(vector*1000)
def mean_mae(x_hat,size = size):return np.sum(np.absolute(vector-x_hat))/size
print 'The mean is: ' , np.mean(vector)
print 'The median is: ' , np.median(vector)
print 'The error by using mean is: ' , mean_mae(np.mean(vector))
print 'The error by using median is: ' ,mean_mae(np.median(vector))
print 'increasing 1 point from median: ' ,mean_mae(np.median(vector)+1)
print 'decreasing 1 point from median: ' ,mean_mae(np.median(vector)-1)

The mean is:  1000.3960478
The median is:  694.0
The error by using mean is:  735.784095754
The error by using median is:  693.1531646
increasing 1 point from median:  693.1539714
decreasing 1 point from median:  693.1533366


## XGBOOST - Gradient Descent of L1 and L2¶

In [7]:
class L1(object):
def __init__(self,vector,size,guess):
self.vector = vector
self.guess = guess
self.size = size
def jacob(self):
return np.sum(np.sign((self.vector-self.guess)))/self.size

class L2(object):
def __init__(self,vector,size,guess):
self.vector = vector
self.guess = guess
self.size = size
def jacob(self):
return np.sum(self.vector-self.guess)/self.size


## XGBOOST - L1 gradient descent¶

Note that it takes close to 50 seconds

In [8]:
initial_guess = 725
learning_rate = 4
L1_descent = L1(vector,size,initial_guess)
start = time.time()
for i in range(1000):

if i % 100 == 0:
print 'The gradient is :', L1_descent.jacob(), ' with value:' , initial_guess
next_guess = initial_guess + learning_rate*L1_descent.jacob()
initial_guess = next_guess
L1_descent = L1(vector,size,next_guess)
print 'The output is: ', next_guess
print 'The time taken is, : ', time.time() - start
#the mean is:  1000.3960478 the median is:  694.0

The gradient is : -0.030957  with value: 725
The gradient is : -0.02076  with value: 714.724892
The gradient is : -0.0137276  with value: 707.806204
The gradient is : -0.0097468  with value: 703.1710776
The gradient is : -0.0067868  with value: 700.0471752
The gradient is : -0.0038096  with value: 697.945108
The gradient is : -0.0028124  with value: 696.5688536
The gradient is : -0.001808  with value: 695.640756
The gradient is : -0.0008068  with value: 694.9616088
The gradient is : -0.0008068  with value: 694.6388888
The output is:  694.3161688
The time taken is, :  49.7573640347


## XGBOOST - L2 gradient descent¶

Much faster!

In [9]:
initial_guess = 725
learning_rate = 0.1
L2_descent = L2(vector,size,initial_guess)
start = time.time()
for i in range(150):
if i % 50 ==  0:
print 'The gradient is :', L2_descent.jacob(), ' with value:' , initial_guess

next_guess = initial_guess + learning_rate*L2_descent.jacob()
initial_guess = next_guess
L2_descent = L2(vector,size,next_guess)
print 'The output is: ', next_guess
print 'The time taken is, : ', time.time() - start
#the mean is:  1000.3960478 the median is:  694.0

The gradient is : 275.3960478  with value: 725
The gradient is : 1.41932932335  with value: 998.976718477
The gradient is : 0.00731490427759  with value: 1000.3887329
The output is:  1000.3960101
The time taken is, :  2.02755618095


## XGBOOST - L2 gradient descent with log-transformed features¶

Recall that L2 penalizes more for points far away from the mean, while mae penalizes equally throughout.

In that case, we can transform the features with log.

In [10]:
np.random.seed(1337);size = 5000000
vector = np.random.exponential(size=size);vector = np.ceil(vector*1000)
print 'The mean is: ' , np.mean(vector)
print 'The median is: ' , np.median(vector)
print 'The log mean is', np.mean(np.log(vector)), 'the exp of the log mean is:' , np.exp(np.mean(np.log(vector)))
print 'The error by using mean is: ' , mean_mae(np.mean(vector))
print 'The error by using median is: ' ,mean_mae(np.median(vector))
vector_log = np.log(vector);mean_vector_log = np.exp(np.mean(vector_log))
print 'The error by using log-mean is: ' ,mean_mae(mean_vector_log)

The mean is:  1000.3960478
The median is:  694.0
The log mean is 6.33421520082 the exp of the log mean is: 563.526973894
The error by using mean is:  735.784095754
The error by using median is:  693.1531646
The error by using log-mean is:  702.006505008


## XGBOOST - L2 gradient descent with log-transformed features¶

In [11]:
initial_guess = 10
learning_rate = 0.1
L2_descent = L2(np.log(vector),size,initial_guess)
start = time.time()
for i in range(150):
if i % 50 ==  0:
print 'The gradient is :', L2_descent.jacob(), ' with value:' , initial_guess

next_guess = initial_guess + learning_rate*L2_descent.jacob()
initial_guess = next_guess
L2_descent = L2(np.log(vector),size,next_guess)

print 'The output is: ', next_guess
print 'The time taken is, : ', time.time() - start
#the mean is:  1000.3960478 the median is:  694.0 the log mean is 6.33421520082

The gradient is : -3.66578479918  with value: 10
The gradient is : -0.0188926308134  with value: 6.35310783164
The gradient is : -9.73683722883e-05  with value: 6.3343125692
The output is:  6.33421570264
The time taken is, :  8.58931088448


## XGBOOST - Another example¶

Notice that calculating the median is almost 20 times slower!

In [12]:
clock1 = time.time()
print 'the mean is: ', np.mean(vector)
print time.time() - clock1

clock2 = time.time()
print 'the median is: ', np.median(vector)
print time.time() - clock2

the mean is:  1000.3960478
0.00365710258484
the median is:  694.0
0.0586512088776


## XGBOOST - Conclusion¶

Optimizing by MAE is still the best option! But there is no obejctive for MAE, what do we do?

For those who are familar, you may know that $y = |x|$ is not differentiable at $x=0$.

As seen from the following graph, when performing gradient descent is going to be very tricky.

In [13]:
x = np.arange(-1,1,0.001)
f, (ax1, ax2,ax3) = plt.subplots(3, 1, sharex=True)
ax1.plot(x,np.abs(x)); ax1.set_title('function')
ax2.plot(x,np.sign(x),color = 'r'); ax2.set_title('Jacob')
ax3.plot(x,1/np.abs(x),color = 'g'); ax3.set_title('Hessian');


## XGBOOST - Numercial approximation¶

I will skip the details but it turns out that there is a very nice function called the fair function that 'smooths' the gradient of the MAE function.

Function = $c^2 [\frac{|x|}{c} - log(1+\frac{|x|}{c} ) ]$, Jacob = $\frac{x}{1+\frac{|x|}{c}}$, Hess = $\frac{1}{1+\frac{|x|}{c}}$

Lets define the code for the fair function:

In [14]:
class fair_obj(object):
def __init__(self,vector,c=2):
self.vector = vector
self.c = c
def func(self):
c = self.c
abs_x_c = np.abs(self.vector) / c
return((c**2)*(abs_x_c - np.log(1+abs_x_c)))
def jacob(self):
c = self.c
abs_x_c = np.abs(self.vector) / c
return(x / (1+abs_x_c))
def hess(self):
c = self.c
abs_x_c = np.abs(self.vector) / c
return(1 / (1+abs_x_c))


## XGBOOST - Visualizing the fair function¶

In [15]:
x = np.arange(-5,5,0.001)
y = fair_obj(x,0.1)
f, (ax1, ax2,ax3) = plt.subplots(3, 1, sharex=True)
ax1.plot(x,y.func()); ax1.set_title('function')
ax2.plot(x,y.jacob(),color = 'r'); ax2.set_title('Jacob')
ax3.plot(x,y.hess(),color = 'g'); ax3.set_title('Hessian')

Out[15]:
<matplotlib.text.Text at 0x110e2b750>

## XGBOOST - Fair function¶

Notice that there is a parameter within the fair function c. The smaller the value of c, the closer it approximates the MAE function. Suppose we increase 0.1 to 1 instead:

You can probably notice that the higher the 'c' parameter is, the smoother the computation and hence lesser training time at the cost of accuracy.

In [16]:
x = np.arange(-1,1,0.001)
y = fair_obj(x,1)
f, (ax1, ax2,ax3) = plt.subplots(3, 1, sharex=True)
ax1.plot(x,y.func()); ax1.set_title('function')
ax2.plot(x,y.jacob(),color = 'r'); ax2.set_title('Jacob')
ax3.plot(x,y.hess(),color = 'g'); ax3.set_title('Hessian');


## XGBOOST - Fair function in xgb¶

So how do i do this with XGB? Start by defining the fair objective:

def fair_obj(preds, dtrain):
fair_constant = 2
labels = dtrain.get_label()
x = (preds - labels)
den = abs(x) + fair_constant
grad = fair_constant * x / (den)
hess = fair_constant * fair_constant / (den * den)


## XGBOOST - Fair function in xgb¶

And feed it inside xgb.train:

xgb.train(params,
d_train,
100000,
watchlist,
early_stopping_rounds=early_stop,
obj=fair_obj,
eval_metric='mae')


## Results so far¶

• With a few variants of my Xgboost and (weighted) averaging them, i (initially) got into top 10%.

• Unfortunately within no time at all i was on the verge of getting kick out of 10%, and there were no further improvements possible (based on the fourms) from Xgb.

• Many people in the fourms suggested Neural Networks which i have absolutely no experience or knowledge other than the ones i did on MOOCS!

• The idea is to use two different models so that they can 'look out' for each other.

# Neural Networks!¶

My first ever Deep neural network! (Outside Moocs)

## Neural Networks - Building one with Keras¶

It turns out to be really easy to build Neural Networks with Keras:

def nn_model():
model = Sequential()

model.add(Dense(400, input_dim = xtrain.shape[1], init = 'he_normal'))

return(model)


## Neural Networks -Training and tuning NN.¶

• Each fold still required ~ 33 minutes approx which also made it very hard to iterate different features.

• One simple improvement (or quick win) is to bag the neural networks.

• However with my mac book pro, it would still take (40*55*10 = 22000 seconds) 6.1 hours just for training one model alone!

• Bagging it 10 times would take close to 3 days which is roughly what the fourms reported.

• Time is not on my side!

## Neural Networks - Training NN on AWS (CUDA)¶

• This was about 10-12 remaining days into the competition.
• No choice but to learn how to set up GPU instances on AWS. (Spend slightly over a day to figured it out)
• Wrote my own small tutorial here.

## Huge Mistake (Oversight)¶

After setting up Theano and CUDA with CUDNN, i forgot to set the [cnmem] parameter.

What it does essentially it forces the system to allocate the gpu memory to the application you are using. Below is basically a summary of run-times:

Type Mac AWS C4.6x Theano Theano w cnmem
Time 40-44 24-28 18 12

## Neural Networks - Some Tricks with NN¶

• In XGB, you have

• early_stopping_rounds = value
• clf.predict(d_valid, ntree_limit=clf.best_ntree_limit)
• Could you do the same with Neural Networks?

• Note that if you use early stopping you could be overfitting to the validation data and hence the test set.

## Neural Networks - Early Stopping of Neural Networks¶

Its easy!

model = nn_model()

callsback_list = [EarlyStopping(patience=10), \
ModelCheckpoint('keras-regressor-' + str(i+1) +'_'+ str(j+1) + '.check',\
monitor='val_loss', save_best_only=True, verbose=0)]

fit = model.fit_generator(generator = batch_generator(xtr, ytr, 128, True),
nb_epoch = nepochs,
samples_per_epoch = xtr.shape[0],
verbose = 0,
validation_data=(xte.todense(),yte),
callbacks=callsback_list)

fit = load_model('keras-regressor-' + str(i+1) + '_'+ str(j+1) + '.check')


## Results so far¶

• With simple averaging / using leaderboard as a feedback, i got into top 5%.
• With that, i changed my goal to top 5%!
• I needed more safety net to ensure that i would be in the top 5%
• Additionally, i got kicked out of top 5% in no time at all.

## Side note¶

• With the best submission of this approach, i would have gotten rank 276 which barely made it to top 10%.
• There were a sudden surge of experts entering the competition (as an earlier one ended).
• Possibly other factors as people started stacking their models.

# Stacking¶

## Stacking - Horrors!¶

The moment i decided to do stacking (with 7days remaining), i realized i do not have my out-of-fold predictions.

For those of you who are not familar with stacking, here is a high level idea:

• Take your data set, split into n-folds.
• Train your model with (n-1) folds and predict on the remaining fold (out of bag) and the test set.
• Combine (stack) the predictions on the remaining folds
• Average the predictions for the test set.
• These will form the meta-features for your second level modeling, which is stacking.

#### Thus, i had to re-train all my previous models, and one additional random forest and extra trees each with out of bag predictions!¶

• I also trained a few other models, such as Neural Network without log-transform

## Stacking - Mistakes!¶

• As my folds were very different in terms of cv-score, i had to use early stopping.
• Try to make cv-scores across folds more consistent. (It turns out that it did not help much, and i still had descent results)
• Majority of the errors came from data points which has high target values, etc 9000 but model predicts 5000 which contributed significantly to the MAE.

## Stacking - Simple Weighted Average¶

Combine all your meta-features to one data frame. Notebook here

In [17]:
from sklearn.metrics import mean_absolute_error

Out[17]:
0 1 2 3 4 5 6 7 8 9 10 11
0 1993.587402 1991.582031 2040.753906 2037.662842 2011.458252 2031.191162 1709.747961 1863.231165 1867.304663 1817.927966 2177.778654 2059.030370
1 1785.814331 1760.319946 1782.887695 1801.068604 1761.099731 1786.076660 1472.165295 1518.536694 1465.007471 1472.942126 1884.687448 1700.372436
2 4657.417480 4471.034180 4556.829590 4361.276367 4409.905762 4420.834961 4131.383081 3913.994287 3883.456030 3798.080249 4037.687796 3863.859438
3 1038.933716 1063.732910 1086.600220 1084.421143 1074.311401 1078.842163 1038.551477 987.047913 952.744666 1040.312915 1134.017057 1163.973155
4 3166.384033 3289.858887 3324.271240 3269.665527 3485.719971 3238.185791 3147.510742 3263.374414 3241.944531 3269.075391 3307.724160 2948.707217

## Stacking - Simple Weighted Average (cont)¶

Printing the loss for each model:

In [18]:
for i in range(train_x.shape[1]):
print mean_absolute_error(train_x.ix[:,i],data.loss)

1124.2522893
1125.11043344
1124.73870933
1125.09915962
1124.87963343
1124.45611729
1130.58667325
1130.40016748
1131.39946641
1132.36796706
1190.29358276
1186.37357485


## Stacking - Simple Weighted Average (cont)¶

Simple Averaging:

(But previously i was using leaderboard feedback)

In [19]:
ratio = 0.5
print mean_absolute_error(ratio*train_x.ix[:,0]+(1-ratio)*train_x.ix[:,7],data.loss)

1119.97406917

In [20]:
ratio = 0.65
print mean_absolute_error(ratio*train_x.ix[:,0]+(1-ratio)*train_x.ix[:,7],data.loss)

1119.7183677

In [21]:
ratio = 0.75
print mean_absolute_error(ratio*train_x.ix[:,0]+(1-ratio)*train_x.ix[:,7],data.loss)

1120.3049399


## Stacking - Simple Weighted Average (cont)¶

Finally lets look at a way of finding optimal ratio for multiple inputs:

Define the following functions

In [22]:
from scipy.optimize import minimize

def mae_func(weights):
''' scipy minimize will pass the weights as a numpy array '''
final_prediction = 0
for weight, prediction in zip(weights, predictions):
final_prediction += weight*prediction

return mean_absolute_error(Y_values, final_prediction)

Y_values = data['loss'].values
predictions = []
for i in range(11):
predictions.append(train_x.ix[:,i])
lls = []
wghts = []


## Stacking - Simple Weighted Average (cont)¶

Finding the optimal weights

In [24]:
start_time = timer(None)
for i in range(5):
starting_values = np.random.uniform(size=len(predictions))
bounds = [(0,1)]*len(predictions)
res = minimize(mae_func, starting_values, method='L-BFGS-B',
bounds=bounds, options={'disp': False, 'maxiter': 10})
lls.append(res['fun'])
wghts.append(res['x'])
# Uncomment the next line if you want to see the weights and scores calculated in real time
#print('Weights: {weights}  Score: {score}'.format(weights=res['x'], score=res['fun']))
bestSC = np.min(lls)
bestWght = wghts[np.argmin(lls)]
print('\n Ensemble Score: {best_score}'.format(best_score=bestSC))
print('\n Best Weights: {weights}'.format(weights=bestWght))
timer(start_time)

 Ensemble Score: 1118.46592553

Best Weights: [ 0.1317671   0.05049913  0.09826599  0.12918516  0.0616241   0.11857319
0.06449542  0.09522439  0.16919281  0.09825817  0.        ]
Time taken: 0 hours 0 minutes and 13.92 seconds.


## Results so far¶

• By now, i was hovering around 5% in the public leaderboard with 2 days left.
• There was no guarantee i would remain in the top 5% in the private leader board.
• To be precise, this score placed me at position 135 at the public leaderboard - borderline top 5%
• Required more safety Net!

It turns out that with the simple weighted average model i would have been placed 104, close to the top 3%.

## Stacking - Unfortunately...¶

• I tried linear regression, but the results is worse than the weighted average approach.
• It made sense, as linear regression uses MSE rather than MAE.

## XGB to the rescue perhaps?¶

• NOPE!
• XGB overfitted to my single best xgb model instead.

## Stacking - With (slightly more than) one day to go,¶

I decided to try Neural Networks instead.

The best local cv with the weighted average was 1118.345

Here is the output with my first neural network:

('Fold ', 1, '- MAE:', 1117.5260825665521)
('Fold ', 2, '- MAE:', 1113.2272453103922)
('Fold ', 3, '- MAE:', 1117.1135764027533)
('Fold ', 4, '- MAE:', 1121.9982577768997)
('Fold ', 5, '- MAE:', 1119.6595219061744)
('Total - MAE:', 1117.9049057391971)

Finally! Some improvement!

And the private LB would have given me position 53!

Clearly i am onto something, right? No.

## Stacking - Whats are the first few things to try?¶

1. Tuning
• Luckily with only 11 features, it is relatively fast (~40minutes) for each model to try new params.
• Unfortunately, no matter what i did, i got results worse than my weighted average cv score.
• With no choice, i resorted to bagging
2. Bagging
• Manage to achieve a lower CV score, lower public score, but a worst private score.

So i guess, sometimes luck does play a part.

## Stacking - Finally..¶

i combine my keras (out of bag) results (consider them third level modeling) and combine with my second level models and ran the weighted average codes.

Below is a summary of what i achieved:

Single Keras W. avg with single NN Bagged 5 Keras W.avg both NN
Local CV 1117.90490574 1117.77760897 1117.88037992 1117.7181697
Public LB 1100.90013 1100.87763 1100.88155 1100.86946
Private LB 1112.84611 1112.77244 1112.93370 1112.73936

What is interesting is that the weighted average with my single Neural Netwrok outperformed the one that is bagged 5 times.

I decided to run weighted average with both single and bagged Neural Network. Turns out i had best Cv results across all measures.

Final position - 46!