- Attendance Guy for DSSG (I hope to retire one day).
- Junior Data Scientist in Singapore.
- Studied Statistics (my friends would disagree) at NTU.

- Kaggle is not practical for industry (I still feel this is somewhat true)
- Will not use Deep Learning (in most cases)
- Will not tune models to the extreme or stack them for a 1% (or 0.1%) gain.

- Most employers still look for people who exhibit passion for Data Science or Machine Learning.
It happens that Kaggle is one great way to build your portfolio.

People who are in industry (and are much better than I am) still write blogs or present their work at meetups.

This is my first (serious) Kaggle competition; prior to this I talked to my colleagues or the DSSG organizers about their competitions.

**(So I am lucky to have met such people who inspire me in their own way.)** Thankfully, I managed to achieve top 2%.

It showcases your learning capacity and ability to understand documentation. (I learnt all of this in a month.)

Causes you sleepless nights (waiting for your CV score).

- All my code can be found on my GitHub page, Freedom89, in the Allstate_kaggle repo.

- The full detailed blog post can be found here, or refer to the index.md file in the repo.

- Questions are welcome, but I might take them offline if I think your question requires a heavier discussion.

- This may feel like I am spamming you with 'tools', in which case I apologize.

- Nevertheless, I hope the content will be useful and we might become Kaggle teammates in the future!

- Competition

- Features

- Xgboost

- Keras

- Stacking

In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate's efforts to ensure a worry-free customer experience.

- MAE (mean absolute error)
- Anonymized Features
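
For context, a minimal sketch of how the data is loaded (the file path is illustrative; `data` is the frame used in the cells below):

```
import pandas as pd

# 132 columns: id, cat1-cat116 (categorical), cont1-cont14 (continuous), loss
data = pd.read_csv('train.csv')
```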

In [2]:

```
data.head(5)
```

Out[2]:

 | id | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 | cont14 | loss
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | A | B | A | B | A | A | A | A | B | ... | 0.718367 | 0.335060 | 0.30260 | 0.67135 | 0.83510 | 0.569745 | 0.594646 | 0.822493 | 0.714843 | 2213.18
1 | 2 | A | B | A | A | A | A | A | A | B | ... | 0.438917 | 0.436585 | 0.60087 | 0.35127 | 0.43919 | 0.338312 | 0.366307 | 0.611431 | 0.304496 | 1283.60
2 | 5 | A | B | A | A | B | A | A | A | B | ... | 0.289648 | 0.315545 | 0.27320 | 0.26076 | 0.32446 | 0.381398 | 0.373424 | 0.195709 | 0.774425 | 3005.09
3 | 10 | B | B | A | B | A | A | A | A | B | ... | 0.440945 | 0.391128 | 0.31796 | 0.32128 | 0.44467 | 0.327915 | 0.321570 | 0.605077 | 0.602642 | 939.85
4 | 11 | A | B | A | B | A | A | A | A | B | ... | 0.178193 | 0.247408 | 0.24564 | 0.22089 | 0.21230 | 0.204687 | 0.202213 | 0.246011 | 0.432606 | 2763.85

5 rows × 132 columns

- Use Xgbfi by far0n to find N-way interactions for categorical features

col1 | col2 | new_feat = col1 + col2
---|---|---
A | B | AB
C | D | CD

- Use this tool to find feature interactions for Regression.
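
For illustration, a hedged sketch of how such an interaction feature could be constructed with pandas once Xgbfi surfaces a promising pair (the column names and the loop are illustrative, not my exact code):

```
# Concatenate two categorical columns, e.g. 'A' + 'B' -> 'AB',
# then treat the new column like any other categorical feature.
for df in (train, test):
    df['cat1_cat2'] = df['cat1'] + df['cat2']
```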

- As an alternative to grid search, I explored using hyperopt.
- Example of how I used hyperopt for my xgb (a sketch follows below).
- Results of hyperopt
- Start with a relatively high eta for hyperopt, find the 'best' params, then lower the eta for the actual training.
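
Below is a hedged sketch of how hyperopt can be wired to xgboost cross-validation (the search space, trial count, and the `d_train` DMatrix are illustrative placeholders, not my exact settings):

```
from hyperopt import fmin, tpe, hp, Trials
import xgboost as xgb

def score(params):
    # Use the last mean test MAE from xgb.cv as the value hyperopt minimizes
    params['max_depth'] = int(params['max_depth'])
    cv = xgb.cv(params, d_train, num_boost_round=500, nfold=5,
                early_stopping_rounds=25, metrics='mae')
    return cv['test-mae-mean'].iloc[-1]

space = {
    'eta': 0.1,  # relatively high eta during the search
    'max_depth': hp.quniform('max_depth', 4, 12, 1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0),
    'min_child_weight': hp.uniform('min_child_weight', 1, 100),
}

best = fmin(score, space, algo=tpe.suggest, max_evals=50, trials=Trials())
```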

Other than taking the log of the target and standardizing the input features, I also applied a Box-Cox transform, which aims to provide an automatic way of transforming the (continuous) features to make them more even (closer to normal).
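
As a hedged illustration (the column `cont1` and the small shift are just examples), `temp0` and `temp` below are the before/after arrays plotted in the next cell:

```
from scipy import stats

temp0 = data['cont1'].values
# Box-Cox requires strictly positive input, hence the small shift;
# the second return value is the fitted lambda.
temp, lmbda = stats.boxcox(temp0 + 1e-4)
```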

In [4]:

```
import matplotlib.pyplot as plt

# Histogram of a continuous feature before and after the Box-Cox transform
f, axarr = plt.subplots(2, sharex=True)
axarr[0].hist(temp0, 20);
axarr[0].set_title('before');
axarr[1].hist(temp, 20);
axarr[1].set_title('after');
```

The two most common types of encoding (that I know of):

- Label Encoding

col | trans
---|---
A | 1
B | 2
C | 3

- One-hot encoding

col | is_A | is_B | is_C
---|---|---|---
A | 1 | 0 | 0
B | 0 | 1 | 0
C | 0 | 0 | 1
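
As a hedged sketch, the two standard encodings with sklearn/pandas (note sklearn's LabelEncoder starts at 0 rather than 1):

```
from sklearn.preprocessing import LabelEncoder
import pandas as pd

col = pd.Series(['A', 'B', 'C', 'A'])

# Label encoding: each category becomes an integer -> array([0, 1, 2, 0])
label_encoded = LabelEncoder().fit_transform(col)

# One-hot encoding: one indicator column per category -> is_A, is_B, is_C
one_hot = pd.get_dummies(col, prefix='is')
```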

An alternative used here is a 'lexical' encoding that maps multi-letter categories (A, B, ..., Z, AA, AB, ...) to ordered integers. Example:

In [5]:

```
def encode(charcode):
    # Treat the label as a base-26 number: 'A' -> 1, 'Z' -> 26, 'AA' -> 27, ...
    r = 0
    ln = len(str(charcode))
    for i in range(ln):
        r += (ord(str(charcode)[i]) - ord('A') + 1) * 26 ** (ln - i - 1)
    return r

print encode('A')
print encode('B')
print encode('Z')
print encode('AA')
print encode('AB')
```

1 2 26 27 28

If you refer to the xgboost parameters documentation here,

Specify the learning task and the corresponding learning objective. The objective options are below:

objective [ default=reg:linear ]

- **"reg:linear"** --linear regression
- "reg:logistic" --logistic regression
- "binary:logistic" --logistic regression for binary classification, output probability
- "binary:logitraw" --logistic regression for binary classification, output score before logistic transformation
- "count:poisson" --poisson regression for count data, output mean of poisson distribution
  - max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
- "multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class (number of classes)

...

- eval_metric [ default according to objective ]
  - "rmse": root mean square error
  - **"mae"**: mean absolute error
  - "logloss": negative log-likelihood
  - "error": Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
  - ...

The default learning objective is linear (squared error) while our evaluation metric is MAE. What can we do?

It turns out that when you are optimizing for mean squared error, predicting the **mean** minimizes the error (setting the derivative of $\sum_i (y_i - \hat{c})^2$ to zero gives $\hat{c} = \text{mean}(y)$).
If you are optimizing for mean absolute error, predicting the **median** minimizes the error (the derivative of $\sum_i |y_i - \hat{c}|$ is zero when as many points lie above $\hat{c}$ as below). Example:

In [6]:

```
import numpy as np

np.random.seed(1337)
size = 5000000

# Simulate a skewed (exponential-like) target, similar to the loss column
vector = np.ceil(np.random.exponential(size=size) * 1000)

def mean_mae(x_hat, size=size):
    return np.sum(np.absolute(vector - x_hat)) / size

print 'The mean is: ', np.mean(vector)
print 'The median is: ', np.median(vector)
print 'The error by using mean is: ', mean_mae(np.mean(vector))
print 'The error by using median is: ', mean_mae(np.median(vector))
print 'increasing 1 point from median: ', mean_mae(np.median(vector) + 1)
print 'decreasing 1 point from median: ', mean_mae(np.median(vector) - 1)
```

In [7]:

```
class L1(object):
    # jacob(): direction to move the guess to reduce the MAE (negative gradient)
    def __init__(self, vector, size, guess):
        self.vector = vector
        self.guess = guess
        self.size = size
    def jacob(self):
        return np.sum(np.sign(self.vector - self.guess)) / self.size

class L2(object):
    # jacob(): direction to move the guess to reduce the squared error (factor of 2 dropped)
    def __init__(self, vector, size, guess):
        self.vector = vector
        self.guess = guess
        self.size = size
    def jacob(self):
        return np.sum(self.vector - self.guess) / self.size
```

Note that the gradient descent with the L1 (MAE) gradient below takes close to 50 seconds:

In [8]:

```
import time

initial_guess = 725
learning_rate = 4
L1_descent = L1(vector, size, initial_guess)

start = time.time()
for i in range(1000):
    if i % 100 == 0:
        print 'The gradient is :', L1_descent.jacob(), ' with value:', initial_guess
    next_guess = initial_guess + learning_rate * L1_descent.jacob()
    initial_guess = next_guess
    L1_descent = L1(vector, size, next_guess)

print 'The output is: ', next_guess
print 'The time taken is: ', time.time() - start
# the mean is: 1000.3960478, the median is: 694.0
```

Much faster!

In [9]:

```
initial_guess = 725
learning_rate = 0.1
L2_descent = L2(vector, size, initial_guess)

start = time.time()
for i in range(150):
    if i % 50 == 0:
        print 'The gradient is :', L2_descent.jacob(), ' with value:', initial_guess
    next_guess = initial_guess + learning_rate * L2_descent.jacob()
    initial_guess = next_guess
    L2_descent = L2(vector, size, next_guess)

print 'The output is: ', next_guess
print 'The time taken is: ', time.time() - start
# the mean is: 1000.3960478, the median is: 694.0
```

Recall that L2 (squared error) penalizes points far from the mean more heavily, while MAE penalizes all errors equally.

In that case, we can transform the target with a log.

In [10]:

```
np.random.seed(1337)
size = 5000000
vector = np.ceil(np.random.exponential(size=size) * 1000)

print 'The mean is: ', np.mean(vector)
print 'The median is: ', np.median(vector)
print 'The log mean is', np.mean(np.log(vector)), 'the exp of the log mean is:', np.exp(np.mean(np.log(vector)))
print 'The error by using mean is: ', mean_mae(np.mean(vector))
print 'The error by using median is: ', mean_mae(np.median(vector))

vector_log = np.log(vector)
mean_vector_log = np.exp(np.mean(vector_log))
print 'The error by using log-mean is: ', mean_mae(mean_vector_log)
```

In [11]:

```
initial_guess = 10
learning_rate = 0.1
L2_descent = L2(np.log(vector), size, initial_guess)

start = time.time()
for i in range(150):
    if i % 50 == 0:
        print 'The gradient is :', L2_descent.jacob(), ' with value:', initial_guess
    next_guess = initial_guess + learning_rate * L2_descent.jacob()
    initial_guess = next_guess
    L2_descent = L2(np.log(vector), size, next_guess)

print 'The output is: ', next_guess
print 'The time taken is: ', time.time() - start
# the mean is: 1000.3960478, the median is: 694.0, the log mean is 6.33421520082
```

Notice that calculating the median is almost 20 times slower than calculating the mean!

In [12]:

```
clock1 = time.time()
print 'the mean is: ', np.mean(vector)
print time.time() - clock1
clock2 = time.time()
print 'the median is: ', np.median(vector)
print time.time() - clock2
```

```
the mean is:  1000.3960478
0.00365710258484
the median is:  694.0
0.0586512088776
```

Optimizing for MAE is still the best option! But there is no built-in objective for MAE in XGBoost, so what do we do?

For those who are familiar, you may know that $y = |x|$ is not differentiable at $x=0$.

As seen from the following graph, performing gradient descent is going to be very tricky.

In [13]:

```
x = np.arange(-1,1,0.001)
f, (ax1, ax2,ax3) = plt.subplots(3, 1, sharex=True)
ax1.plot(x,np.abs(x)); ax1.set_title('function')
ax2.plot(x,np.sign(x),color = 'r'); ax2.set_title('Jacob')
ax3.plot(x,1/np.abs(x),color = 'g'); ax3.set_title('Hessian');
```

I will skip the details, but it turns out that there is a very nice function called the `fair` function that 'smooths' the gradient of the MAE function.

Function = $c^2 \left[\frac{|x|}{c} - \log\left(1+\frac{|x|}{c}\right)\right]$, Jacob = $\frac{x}{1+\frac{|x|}{c}}$, Hess = $\frac{1}{\left(1+\frac{|x|}{c}\right)^2}$

If you are still interested, click here.

Let's define the code for the fair function:

In [14]:

```
class fair_obj(object):
    def __init__(self, vector, c=2):
        self.vector = vector
        self.c = c
    def func(self):
        c = self.c
        abs_x_c = np.abs(self.vector) / c
        return (c**2) * (abs_x_c - np.log(1 + abs_x_c))
    def jacob(self):
        c = self.c
        abs_x_c = np.abs(self.vector) / c
        return self.vector / (1 + abs_x_c)
    def hess(self):
        c = self.c
        abs_x_c = np.abs(self.vector) / c
        return 1 / (1 + abs_x_c)**2
```

In [15]:

```
x = np.arange(-5,5,0.001)
y = fair_obj(x,0.1)
f, (ax1, ax2,ax3) = plt.subplots(3, 1, sharex=True)
ax1.plot(x,y.func()); ax1.set_title('function')
ax2.plot(x,y.jacob(),color = 'r'); ax2.set_title('Jacob')
ax3.plot(x,y.hess(),color = 'g'); ax3.set_title('Hessian')
```


Notice that there is a parameter `c` within the fair function. The smaller the value of `c`, the more closely it approximates the `MAE` function. Suppose we increase `c` from `0.1` to `1` instead:

**You can probably notice that the higher the `c` parameter, the smoother the function (and hence the shorter the training time), at the cost of accuracy.**

In [16]:

```
x = np.arange(-1,1,0.001)
y = fair_obj(x,1)
f, (ax1, ax2,ax3) = plt.subplots(3, 1, sharex=True)
ax1.plot(x,y.func()); ax1.set_title('function')
ax2.plot(x,y.jacob(),color = 'r'); ax2.set_title('Jacob')
ax3.plot(x,y.hess(),color = 'g'); ax3.set_title('Hessian');
```

So how do I do this with XGB? Start by defining the fair objective:

```
def fair_obj(preds, dtrain):
    # Custom objective: return the gradient and hessian of the fair loss
    fair_constant = 2
    labels = dtrain.get_label()
    x = (preds - labels)
    den = abs(x) + fair_constant
    grad = fair_constant * x / den
    hess = fair_constant * fair_constant / (den * den)
    return grad, hess
```

And feed it inside `xgb.train`:

```
# 'eval_metric': 'mae' is set inside params ('eval_metric' is not an argument of xgb.train itself)
xgb.train(params,
          d_train,
          100000,
          watchlist,
          early_stopping_rounds=early_stop,
          obj=fair_obj)
```
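
Alternatively (a hedged sketch, not necessarily my exact code), a custom evaluation function can be passed via `feval`; this is handy if the target was log-transformed and you want the MAE reported on the original scale:

```
import numpy as np
from sklearn.metrics import mean_absolute_error

def xg_eval_mae(preds, dtrain):
    # feval must return a (name, value) pair; exponentiate back so the
    # reported MAE is on the original loss scale
    labels = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(labels), np.exp(preds))

# xgb.train(params, d_train, 100000, watchlist,
#           early_stopping_rounds=early_stop, obj=fair_obj, feval=xg_eval_mae)
```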

With a few variants of my Xgboost and (weighted) averaging of them, I (initially) got into the top 10%.

Unfortunately, within no time at all I was on the verge of getting kicked out of the top 10%, and there were no further improvements possible (based on the forums) from Xgb.

Many people in the forums suggested Neural Networks, **which I had absolutely no experience or knowledge of** other than the ones I did on MOOCs! The idea is to use two different models so that they can 'look out' for each other.

My first ever Deep Neural Network! (Outside MOOCs)

It turns out to be really easy to build Neural Networks with `Keras`:

```
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers.advanced_activations import PReLU
from keras.layers.normalization import BatchNormalization

def nn_model():
    model = Sequential()
    model.add(Dense(400, input_dim = xtrain.shape[1], init = 'he_normal'))
    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.4))
    model.add(Dense(200, init = 'he_normal'))
    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(100, init = 'he_normal'))
    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(50, init = 'he_normal'))
    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(1, init = 'he_normal'))
    model.compile(loss = 'mae', optimizer = 'adadelta', metrics = ['mae'])
    return(model)
```

Each fold still required approximately 33 minutes, which also made it very hard to iterate on different features.

One simple improvement (or quick win) is to bag the neural networks.
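
A hedged sketch of what the bagging loop looks like (`nbags`, `xtest` and the helpers `nn_model` / `batch_generator` follow the snippets in this post; the exact loop is illustrative):

```
nbags = 10
pred_test = np.zeros(xtest.shape[0])

for j in range(nbags):
    model = nn_model()  # fresh random initialisation for each bag
    model.fit_generator(generator=batch_generator(xtr, ytr, 128, True),
                        nb_epoch=nepochs,
                        samples_per_epoch=xtr.shape[0],
                        verbose=0)
    pred_test += model.predict(xtest.todense())[:, 0]

pred_test /= nbags  # average the predictions over the bags
```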

However, with my MacBook Pro it would still take (40 * 55 * 10 = 22,000 seconds) about 6.1 hours just to train one model!

Bagging it 10 times would take close to 3 days, which is roughly what the forums reported.

Time is not on my side!

- This was with about 10-12 days remaining in the competition.
- No choice but to learn how to set up GPU instances on AWS. (Spent slightly over a day to figure it out.)
- Wrote my own small tutorial here.

After setting up Theano and CUDA with cuDNN, I forgot to set the [cnmem] parameter.

What it essentially does is force the system to pre-allocate GPU memory for the application you are using. Below is a summary of run-times:

Type | Mac | AWS C4.6x | Theano | Theano w cnmem
---|---|---|---|---
Time | 40-44 | 24-28 | 18 | 12
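
For reference, a hedged example of enabling cnmem via the Theano config file (section and flag names per the Theano 0.8-era docs; the fraction is illustrative):

```
# ~/.theanorc -- pre-allocate a fraction of GPU memory up front
[global]
device = gpu
floatX = float32

[lib]
cnmem = 0.8
```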

In XGB, you have

`early_stopping_rounds = value`

`clf.predict(d_valid, ntree_limit=clf.best_ntree_limit)`

Could you do the same with Neural Networks?

- Note that if you use early stopping, you could be overfitting to the validation data and hence to the test set.

It's easy!

```
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import load_model

model = nn_model()
callbacks_list = [EarlyStopping(patience=10),
                  ModelCheckpoint('keras-regressor-' + str(i+1) + '_' + str(j+1) + '.check',
                                  monitor='val_loss', save_best_only=True, verbose=0)]
fit = model.fit_generator(generator = batch_generator(xtr, ytr, 128, True),
                          nb_epoch = nepochs,
                          samples_per_epoch = xtr.shape[0],
                          verbose = 0,
                          validation_data = (xte.todense(), yte),
                          callbacks = callbacks_list)
# Reload the best (checkpointed) weights rather than the final epoch
fit = load_model('keras-regressor-' + str(i+1) + '_' + str(j+1) + '.check')
```

- With simple averaging / using the leaderboard as feedback, I got into the top 5%.
- With that, I changed my goal to top 5%!
- I needed more of a safety net to ensure that I would stay in the top 5%.
- Additionally, I got kicked out of the top 5% in no time at all.

- With the best submission of this approach, I would have gotten rank 276, which barely made it into the top 10%.
- There was a sudden surge of experts entering the competition (as an earlier one had just ended).
- Possibly other factors, as people started stacking their models.

The moment I decided to do stacking (with 7 days remaining), I realized I did not have my out-of-fold predictions.

For those of you who are not familiar with stacking, here is the high-level idea (a code sketch follows the list):

- Take your data set and split it into n folds.
- Train your model on (n-1) folds and predict on the **remaining fold (out of bag)** and on the test set.
- Combine (stack) the predictions on the remaining folds.
- Average the predictions for the test set.
- These form the meta-features for your second-level modelling, which is stacking.

- I also trained a few other models, such as a Neural Network without the log transform.

- As my folds were very different in terms of CV score, I had to use early stopping.
- I tried to make the CV scores across folds more consistent. (It turns out that it did not help much, and I still had decent results.)
- The majority of the errors came from data points with high target values, e.g. a true loss of 9000 where the model predicts 5000, which contributed significantly to the MAE.

Combine all your meta-features into one data frame. Notebook here.

In [17]:

```
import pandas as pd
from sklearn.metrics import mean_absolute_error

train_x = pd.read_csv("../kaggle_1/train_second_level_model.csv")
train_x.head(5)
```

Out[17]:

 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1993.587402 | 1991.582031 | 2040.753906 | 2037.662842 | 2011.458252 | 2031.191162 | 1709.747961 | 1863.231165 | 1867.304663 | 1817.927966 | 2177.778654 | 2059.030370
1 | 1785.814331 | 1760.319946 | 1782.887695 | 1801.068604 | 1761.099731 | 1786.076660 | 1472.165295 | 1518.536694 | 1465.007471 | 1472.942126 | 1884.687448 | 1700.372436
2 | 4657.417480 | 4471.034180 | 4556.829590 | 4361.276367 | 4409.905762 | 4420.834961 | 4131.383081 | 3913.994287 | 3883.456030 | 3798.080249 | 4037.687796 | 3863.859438
3 | 1038.933716 | 1063.732910 | 1086.600220 | 1084.421143 | 1074.311401 | 1078.842163 | 1038.551477 | 987.047913 | 952.744666 | 1040.312915 | 1134.017057 | 1163.973155
4 | 3166.384033 | 3289.858887 | 3324.271240 | 3269.665527 | 3485.719971 | 3238.185791 | 3147.510742 | 3263.374414 | 3241.944531 | 3269.075391 | 3307.724160 | 2948.707217

Printing the loss for each model:

In [18]:

```
for i in range(train_x.shape[1]):
    print mean_absolute_error(train_x.ix[:,i], data.loss)
```

Simple Averaging:

(But previously I was using leaderboard feedback.)

In [19]:

```
ratio = 0.5
print mean_absolute_error(ratio*train_x.ix[:,0]+(1-ratio)*train_x.ix[:,7],data.loss)
```

1119.97406917

In [20]:

```
ratio = 0.65
print mean_absolute_error(ratio*train_x.ix[:,0]+(1-ratio)*train_x.ix[:,7],data.loss)
```

1119.7183677

In [21]:

```
ratio = 0.75
print mean_absolute_error(ratio*train_x.ix[:,0]+(1-ratio)*train_x.ix[:,7],data.loss)
```

1120.3049399

Finally, let's look at a way of finding the optimal weights for multiple inputs.

Define the following functions:

In [22]:

```
from scipy.optimize import minimize
def mae_func(weights):
''' scipy minimize will pass the weights as a numpy array '''
final_prediction = 0
for weight, prediction in zip(weights, predictions):
final_prediction += weight*prediction
return mean_absolute_error(Y_values, final_prediction)
Y_values = data['loss'].values
predictions = []
for i in range(11):
predictions.append(train_x.ix[:,i])
lls = []
wghts = []
```

Finding the optimal weights

In [24]:

```
start_time = timer(None)  # timer() is a small helper defined elsewhere in the notebook
for i in range(5):
    # Random restarts: start from random weights and locally minimise the MAE
    starting_values = np.random.uniform(size=len(predictions))
    bounds = [(0, 1)] * len(predictions)
    res = minimize(mae_func, starting_values, method='L-BFGS-B',
                   bounds=bounds, options={'disp': False, 'maxiter': 10})
    lls.append(res['fun'])
    wghts.append(res['x'])
    # Uncomment the next line if you want to see the weights and scores calculated in real time
    #print('Weights: {weights} Score: {score}'.format(weights=res['x'], score=res['fun']))

bestSC = np.min(lls)
bestWght = wghts[np.argmin(lls)]
print('\n Ensemble Score: {best_score}'.format(best_score=bestSC))
print('\n Best Weights: {weights}'.format(weights=bestWght))
timer(start_time)
```

- By now, I was hovering around the top 5% on the public leaderboard with 2 days left.
- There was no guarantee I would remain in the top 5% on the private leaderboard.
- To be precise, this score placed me at position 135 on the public leaderboard - borderline top 5%.
- I needed more of a safety net!

It turns out that with the simple weighted average model I would have been placed 104, close to the top 3%.

I decided to try Neural Networks instead.

The best local CV with the weighted average was `1118.345`.

Here is the output with my first neural network:

```
('Fold ', 1, '- MAE:', 1117.5260825665521)
('Fold ', 2, '- MAE:', 1113.2272453103922)
('Fold ', 3, '- MAE:', 1117.1135764027533)
('Fold ', 4, '- MAE:', 1121.9982577768997)
('Fold ', 5, '- MAE:', 1119.6595219061744)
('Total - MAE:', 1117.9049057391971)
```

**Finally! Some improvement!**

And the private LB would have given me position 53!

Clearly I am onto something, right? **No.**

- Tuning
  - Luckily, with only 11 features, it is relatively fast (~40 minutes) for each model to try new params.
  - Unfortunately, no matter what I did, I got results worse than my weighted-average CV score.
  - With no choice, I resorted to bagging.

- Bagging
  - Managed to achieve a lower CV score and a lower public score, but a worse private score.

So I guess sometimes **luck** does play a part.

I combined my Keras (out-of-bag) results (consider them third-level modelling) with my second-level models and ran the weighted average code.

Below is a summary of what i achieved:

 | Single Keras | W. avg with single NN | Bagged 5 Keras | W. avg both NN
---|---|---|---|---
Local CV | 1117.90490574 | 1117.77760897 | 1117.88037992 | 1117.7181697
Public LB | 1100.90013 | 1100.87763 | 1100.88155 | 1100.86946
Private LB | 1112.84611 | 1112.77244 | 1112.93370 | 1112.73936

What is interesting is that the weighted average with my single Neural Network outperformed the one that was bagged 5 times.

I decided to run a weighted average with both the single and the bagged Neural Network. It turns out I had the best CV results across all measures.

Final position - 46!