Felipe's Data Science Blog

Sat 25 November 2017

Benchmarking the LOOCV

Posted by Felipe in posts   

Benchmarking fastloocv

In this post I will benchmark the fastloocv code I presented in my previous post.
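As a quick recap (the full derivation is in the previous post): for ordinary least squares, each leave-one-out residual can be obtained from a single fit on the whole dataset, using the leverages $h_{ii}$, the diagonal entries of the hat matrix. This is the "magic formula" the fast version relies on, so the whole LOOCV costs roughly one fit instead of $n$ fits:

$$H = X(X^\top X)^{-1}X^\top, \qquad e_{(i)} = \frac{y_i - \hat{y}_i}{1 - h_{ii}}, \qquad \mathrm{LOOCV\ MSE} = \frac{1}{n}\sum_{i=1}^{n} e_{(i)}^{2}.$$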

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

First, I will create 8 datasets with randomly generated data points:

In [2]:
index = 5
sets = []
for i in range(0, 8):
    # Dataset with `index` rows and 5 columns of uniform random numbers
    element = np.random.rand(int(index), 5)
    sets.append(element)
    # Alternate between x2 and x5 to get sizes 5, 10, 50, 100, 500, ...
    if i % 2 == 0:
        index = index * 2
    else:
        index = index * 5

All datasets have 5 variables. The last column will be the target.

In [3]:
for i in range(0,8):
    print(sets[i].shape)
(5, 5)
(10, 5)
(50, 5)
(100, 5)
(500, 5)
(1000, 5)
(5000, 5)
(10000, 5)
In [10]:
import fastloocv

Test run...

In [11]:
lm=LinearRegression()
In [12]:
CV=fastloocv.fastfloo(sets[0][:,:4],sets[0][:,4],lm)
In [7]:
CV
Out[7]:
(1.635132158255255e-23, 1.0)

Once again, I will use the Python module timeit to time the algorithms. First, I generate 8 different setup strings, one for each dataset. I add gc.enable() to the setups in order to turn garbage collection back on, as advised in the timeit documentation.

In [13]:
import timeit
In [14]:
setups = []
str1 = '''gc.enable()
import numpy as np
import fastloocv
from sklearn.linear_model import LinearRegression
element=np.random.rand('''
str3 = ''',5)
X = element[:,:4]
y = element[:,4]
lm=LinearRegression()'''

index = 5
for i in range(0, 8):
    # Insert the dataset size between the two halves of the setup string
    str2 = str(index)
    setups.append(str1 + str2 + str3)
    if i % 2 == 0:
        index = index * 2
    else:
        index = index * 5

Define the code to be timed (the fast LOOCV).

In [15]:
my_code = "fastloocv.fastfloo(X,y,lm)"

Run it and save the measured times.

In [16]:
times = []
for i in range(0, 8):
    times.append(timeit.timeit(setup=setups[i], stmt=my_code, number=1))

Define the second code snippet, which runs the normal LOOCV.

In [18]:
my_code2 = "fastloocv.floo(X,y,lm)"

Run it and save the measured times.

In [19]:
times2 = []
for i in range(0, 8):
    times2.append(timeit.timeit(setup=setups[i], stmt=my_code2, number=1))
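A quick aside: since number=1, each of the timings above is a single run and can be somewhat noisy. If more stable numbers were needed, one possible refinement (a sketch only, not what I used for the plot below) would be to repeat each measurement with timeit.repeat and keep the minimum; the variable name stable_times is just illustrative.

In [ ]:
stable_times = []
for i in range(0, 8):
    # Run the statement 5 separate times and keep the fastest run,
    # which is less sensitive to background load than a single measurement
    runs = timeit.repeat(setup=setups[i], stmt=my_code, number=1, repeat=5)
    stable_times.append(min(runs))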

Let's plot the results for comparison. I will try to emulate the ggplot style.

In [21]:
import matplotlib.pyplot as plt
%matplotlib inline
In [37]:
fig, ax = plt.subplots()
fig.set_size_inches(10, 6, forward=False)
ax.plot(range(1, 9), times, color='black', linestyle='--', linewidth=2, marker='o',
        markerfacecolor='none', markersize=12, mec='black', mew=1.5, label='Magic formula')
ax.plot(range(1, 9), times2, color='black', linestyle='-.', linewidth=2, marker='x',
        markerfacecolor='black', markersize=12, mec='black', mew=1.5, label='Normal')
plt.title('Time [s] vs. size', fontsize=20)
plt.xlabel('Dataset Size', fontsize=18)
plt.ylabel('Time [s]', fontsize=18)
plt.grid(color='white', which='both')
ax.set_facecolor('#e6e6e6')
ax.tick_params(axis='both', which='major', labelsize=17, width=2, length=10)

# Label the x-axis ticks with the actual dataset sizes
labels = [sets[i].shape[0] for i in range(0, 8)]
ax.set_xticks(range(1, 9))
ax.set_xticklabels(labels)
plt.legend()
Out[37]:
[Figure: running time in seconds vs. dataset size, comparing the "magic formula" LOOCV with the normal LOOCV]
It looks like the LOOCV method with the "magic" formula is indeed faster. Nonetheless, the difference only really becomes clear with larger datasets. On the x-axis, I plotted the dataset index instead of the actual size, because the sizes do not increase linearly, which would make for an awkward graph. I did, however, label the x-axis ticks with the actual dataset sizes.

Also, I stopped at the $\mathrm{8^{th}}$ set because, on my computer, I cannot run larger datasets without running into memory problems.

I am also in the process of testing the memory use of both algorithms. The code I have used to do this is the following:

In [ ]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
import sys

# The dataset index to profile is passed as a command-line argument
# and converted to an integer before it is used below.
script, arg = sys.argv
i = int(arg)

print("new run")
print(i)

# Recreate the same 8 datasets as before
index = 5
sets = []
for j in range(0, 8):
    element = np.random.rand(int(index), 5)
    sets.append(element)
    if j % 2 == 0:
        index = index * 2
    else:
        index = index * 5


import fastloocv

lm = LinearRegression()

fastloocv.fastfloo(sets[i][:, :4], sets[i][:, 4], lm)
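Since I am still in the process of testing memory use, here is one option I may end up using for the measurement itself: a hedged sketch, assuming the third-party memory_profiler package is installed, whose memory_usage function can report the peak memory of a single call.

In [ ]:
# Hypothetical sketch, not part of the script above: measure the peak memory
# of one LOOCV call using memory_profiler (pip install memory_profiler)
from memory_profiler import memory_usage

X, y = sets[i][:, :4], sets[i][:, 4]
peak = memory_usage((fastloocv.fastfloo, (X, y, lm)), max_usage=True)
print("peak memory [MiB]:", peak)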