Felipe's Data Science Blog

Thu 02 November 2017

LOO cross-validation with python

Posted by Felipe in posts   

There is a type of cross-validation procedure called leave-one-out cross-validation (LOOCV). It is very similar to the more commonly used $k$-fold cross-validation. In fact, LOOCV can be seen as a special case of $k$-fold CV with $k=n$, where $n$ is the number of data points. In other words, LOOCV trains the statistical model on every possible subset of $n-1$ data points and tests it on the single point that was left out each time.

From this procedure, we can get an estimate of the MSE of the model:

$$ CV = \frac{1}{n} \sum^{n}_{i=1} MSE_i \; , $$

where $MSE_i$ is the test MSE obtained when point $i$ is left out (the squared error on that point), and $CV$ is their average, which is used to estimate the actual MSE of the model.

LOOCV is a very low-bias, randomness-free, but possibly high-variance cross-validation method. Its big disadvantage is that it is computationally expensive, since $n$ models must be trained. As such, it is usually only practical for small data sets.
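As a quick sanity check of the $k=n$ claim, here is a minimal sketch (toy data and placeholder names, not part of the analysis below) that runs scikit-learn's cross_val_score with both LeaveOneOut and KFold(n_splits=n); the two mean MSE estimates coincide.

# Sketch: LOOCV is just k-fold CV with k = n (toy data, placeholder names)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X_toy, y_toy = make_regression(n_samples=20, n_features=3, noise=1.0, random_state=0)
lm_toy = LinearRegression()

mse_loo = -cross_val_score(lm_toy, X_toy, y_toy, cv=LeaveOneOut(),
                           scoring='neg_mean_squared_error').mean()
mse_kfold = -cross_val_score(lm_toy, X_toy, y_toy, cv=KFold(n_splits=len(X_toy)),
                             scoring='neg_mean_squared_error').mean()
print(mse_loo, mse_kfold)  # the two estimates are identical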

Nonetheless, for linear models, there is a "magical formula" that obtains the result of LOOCV by training the model just once on the whole data set:

\begin{equation} CV = \frac{1}{n} \sum^{n}_{i=1} \left[\frac{e_i}{(1-h_i)}\right]^2 \; . \end{equation}

In the above formula, $e_i = y_i - \hat{y}_i$ are the residuals (the difference between the actual response $y_i$ and the response predicted by the model $\hat{y}_i$). $h_i$ is the leverage of point $i$. The $h_i$ are actually the diagonal elements of the so-called hat or projection matrix.
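A side note: since only the diagonal of the hat matrix is needed, the full $n \times n$ matrix never has to be materialized. Below is a minimal numpy sketch (leverages is a hypothetical helper, not part of the code used later, and it expects a plain numpy array):

import numpy as np

def leverages(X):
    # (X^T X)^{-1} is only p x p, so this stays cheap even for large n
    XtX_inv = np.linalg.inv(np.dot(X.T, X))
    # row-wise quadratic form h_i = x_i^T (X^T X)^{-1} x_i, i.e. diag(X (X^T X)^{-1} X^T)
    return np.einsum('ij,jk,ik->i', X, XtX_inv, X)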

The LOOCV method is implemented in R, and the page linked below points to the package that contains this implementation and also shows the algebra involved in obtaining the formula.

https://robjhyndman.com/hyndsight/loocv-linear-models/

As a project, since this shortcut is not available in scikit-learn, I wrote a program that calculates LOOCV for linear models using the special formula.

There is a section on LOOCV on page 178 (page 192 in the pdf version) of the book "Introduction to Statistical Learning".

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

To test the algorithm, I will use a dataset of US housing prices (similar to the famous Boston dataset, but larger). I got the dataset from the Udemy course "Python for Data Science and Machine Learning Bootcamp", which I can recommend.

In [2]:
USAhousing = pd.read_csv('USA_Housing.csv')

Setting the predictors and response variables.

In [3]:
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
In [4]:
lm=LinearRegression()
In [5]:
loocv = LeaveOneOut()

The loocv.split method from scikit-learn can be used to iterate over the data points so we can apply the LOOCV procedure (without the special formula).

In the cell below, I do LOOCV the "dumb" way and compute the MSE, using the loocv.split(X) iterator.

In [6]:
loocv = LeaveOneOut()
CV=0
for train, test in loocv.split(X):
    predictions=lm.fit(X.iloc[train], y.iloc[train]).predict(X.iloc[test])
    CV=CV+metrics.mean_squared_error(y.iloc[test], predictions)
CV=CV/len(X.index)
In [7]:
CV
Out[7]:
10244271123.526987

The code below does the LOOCV using the special formula. The code before the for loop calculates the hat matrix (more on it later).

In [8]:
#transpose X
Xtrans=np.transpose(X)
#compute X^T X
mult=np.dot(Xtrans,X)
#invert the matrix
inv=np.linalg.inv(mult)
#do the other multiplications to get H
Hp=np.dot(X,inv)
H=np.dot(Hp,Xtrans)
#get the predictions from the model, for the whole data set
predictions=lm.fit(X,y).predict(X)

CV=0
for i in range(0,len(X.index)):
    summand=(y.iloc[i]-predictions[i])/(1-H[i][i])
    CV=CV+np.dot(summand,summand)
CV=CV/len(X.index)
In [9]:
CV
Out[9]:
10240115690.867804

We can see that the two MSE estimates are very similar, but not exactly equal. One likely culprit (something I still have to verify) is that LinearRegression fits an intercept by default, while the hat matrix above is built from X without a column of ones.
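A minimal sketch of that check, under the assumption that the intercept really is the cause (I have not rerun it here, so no numbers are claimed): prepend a column of ones to X and reuse the shortcut formula with the augmented hat matrix.

# hypothetical check: rebuild the hat matrix with an explicit intercept column
Xc = np.column_stack([np.ones(len(X.index)), X])
Hc = np.dot(np.dot(Xc, np.linalg.inv(np.dot(Xc.T, Xc))), Xc.T)

CV_c = 0
for i in range(0, len(X.index)):
    summand = (y.iloc[i] - predictions[i]) / (1 - Hc[i][i])
    CV_c = CV_c + summand * summand
CV_c = CV_c / len(X.index)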

I wrote a python module where the above algorithm is implemented (with a few differences on which I will elaborate later).

In [10]:
import fastloocv

Transforming the pandas dataframes into numpy arrays so I can feed them to my function:

In [12]:
X1=pd.DataFrame.as_matrix(X)
In [13]:
y1=pd.DataFrame.as_matrix(y)

Running the cross-validation method:

In [14]:
CV=fastloocv.fastfloo(X1,y1,lm)

The function fastfloo takes three required arguments: i) the predictors, ii) the response variable and iii) an instance of a linear model (lm), plus an optional argument that selects the solver (more on that below). It returns the MSE as well as the $R^2$ in a tuple.

In [15]:
CV
Out[15]:
(10238888148.679834, 0.9999835740359353)

Now, more about the steps used to obtain the hat matrix. Consider a linear regression model:

\begin{equation} Y=X\beta + e \; . \end{equation}

Then the hat-matrix $H$ is defined through the equation \begin{equation} \hat{Y}=X\hat{\beta}=HY \; , \end{equation}

and can be calculated with

\begin{equation} H=X(X^{T}X)^{-1}X^{T} \; . \end{equation}

The hat $\hat{}$ indicates that the quantity is either the predicted response ($\hat{Y}$) or an estimator ($\hat{\beta}$).
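As a quick numerical sanity check (a sketch using the H computed above; the output is not shown here), $H$ should be symmetric and idempotent, and $HY$ should reproduce the fitted values of a regression without an intercept:

# H is a projection matrix: H = H^T and H H = H, up to rounding/conditioning error
print(np.abs(H - H.T).max(), np.abs(np.dot(H, H) - H).max())

# H y equals the fitted values of a linear model fit without an intercept
lm_noint = LinearRegression(fit_intercept=False).fit(X, y)
print(np.abs(np.dot(H, y) - lm_noint.predict(X)).max())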

Notice that the equation for $H$ requires a matrix inversion. I actually spent a lot of time thinking and doing research about what algorithm to use in the matrix inversion. This is because matrix inversion is a delicate numerical procedure. It can be expensive (roughly $O(n^{3})$ for exact methods such as Gauss-Jordan or LU) and, if the matrix is ill-conditioned, it can lead to inaccuracies in the solution.

In fact, in order to avoid explicitly inverting the matrix, many implementations of linear regression do not solve the normal equations, but rather solve the least squares problem directly via a minimization procedure (typically based on a QR or singular value decomposition).
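For example (a sketch, not the code used in this post), the coefficients of a no-intercept fit can be obtained directly from the least squares problem, without ever forming $(X^{T}X)^{-1}$:

import scipy.linalg

# SVD-based least squares: solves min ||X b - y|| without the normal equations
beta_lstsq = np.linalg.lstsq(X1, y1)[0]

# equivalent route via a QR decomposition of X (R is upper triangular)
Q, R = np.linalg.qr(X1)
beta_qr = scipy.linalg.solve_triangular(R, np.dot(Q.T, y1))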

I talked to an acquaintance who does research in numerical mathematics, and he told me his algorithm of choice for inverting large matrices is GMRES.

The default algorithm used for matrix inversion in my code is the GMRES solver implemented in scipy (scipy.sparse.linalg.gmres). However, I also included the possibility of using an LU decomposition:

In [17]:
fastloocv.fastfloo(X1,y1,lm,alg='lu')
Out[17]:
(10240115690.867804, 0.99998357206662336)

GMRES and LU factorization are usually used to solve systems of linear equations, but they can also be used to obtain a matrix inverse by solving one system per column of the inverse. Suppose $A$ is the matrix whose inverse we need. We have:

\begin{equation} AA^{-1}=I \; , \end{equation}

where $I$ is the identity matrix. Now let $a^{'}_1, \ldots, a^{'}_n $ be the column vectors of the inverse matrix and let $k_1, \ldots, k_n $ be the column vectors of the identity matrix (the unit vectors). Then we have to solve $n$ systems of the form

\begin{equation} Aa^{'}_i= k_i\; . \end{equation}

In fact, the solution vectors of these equations will be the column vectors of $A^{-1}$.
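A small sketch of that idea (placeholder names, using the $A$ from this post as an example), with an LU factorization solving against the columns of the identity; a GMRES solve per column would work the same way:

import scipy.linalg

A = np.dot(np.transpose(X1), X1)      # here A = X^T X, a small p x p matrix
eye = np.eye(A.shape[0])

fact = scipy.linalg.lu_factor(A)
A_inv = np.empty_like(A)
for i in range(0, A.shape[0]):
    # column i of A^{-1} solves A a'_i = k_i, with k_i the i-th unit vector
    A_inv[:, i] = scipy.linalg.lu_solve(fact, eye[:, i])

# agrees with numpy's inverse up to rounding/conditioning error
print(np.abs(A_inv - np.linalg.inv(A)).max())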

In our case, $A=(X^{T}X)$. However, the code actually uses GMRES or LU to solve for the column vectors of matrix D

\begin{equation} AD=X^{T} \; , \end{equation}

since this will give

\begin{equation} D=A^{-1}X^{T} = (X^{T}X)^{-1}X^{T}\; . \end{equation}

In other words, the systems with the unit vectors as right-hand sides are replaced by systems whose right-hand sides are the columns of $X^{T}$. This avoids explicitly forming $A^{-1}$ and then multiplying it by $X^{T}$, which makes the code more efficient and accurate.
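In numpy terms (a sketch, not the module's code), this amounts to a single linear solve with the columns of $X^{T}$ as right-hand sides:

# D = (X^T X)^{-1} X^T without an explicit inverse
A = np.dot(np.transpose(X1), X1)
D = np.linalg.solve(A, np.transpose(X1))
H_alt = np.dot(X1, D)    # the same hat matrix as before, up to rounding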

I also included the "dumb" version of the algorithm in the fastloocv module:

In [18]:
CV2=fastloocv.floo(X1,y1,lm)
In [19]:
CV2
Out[19]:
(10244271123.527073, 0.99998356540017808)

Now it's time to benchmark the algorithm. For this purpose I use the Python module timeit.

In [20]:
import timeit

Defining the setup needed for LOOCV to be run:

In [40]:
my_setup='''
import pandas as pd
import numpy as np
import fastloocv
from sklearn.linear_model import LinearRegression
USAhousing = pd.read_csv('USA_Housing.csv')
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
lm=LinearRegression()
X1=pd.DataFrame.as_matrix(X)
y1=pd.DataFrame.as_matrix(y)
'''

And the actual code:

In [36]:
my_code = '''
fastloocv.fastfloo(X1,y1,lm)'''
In [41]:
timeit.timeit(setup = my_setup,
                    stmt = my_code,
                    number = 1)
Out[41]:
1.4960502359999737

Above is the time in seconds for the execution of the code.

Testing the "dumb" LOOCV:

In [42]:
my_setup2='''
import pandas as pd
import numpy as np
import fastloocv
from sklearn.linear_model import LinearRegression
USAhousing = pd.read_csv('USA_Housing.csv')
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
lm=LinearRegression()
X1=pd.DataFrame.as_matrix(X)
y1=pd.DataFrame.as_matrix(y)
'''
In [44]:
my_code2 = '''
fastloocv.floo(X1,y1,lm)'''
In [46]:
timeit.timeit(setup = my_setup2,
                    stmt = my_code2,
                    number = 1)
Out[46]:
6.342293713999879

Since the setup required to run both methods is the same, we can see that LOOCV with the special formula is indeed much faster (roughly four times faster in this single run).
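Since number=1 gives a single, somewhat noisy measurement, a more robust comparison could take the best of several repetitions with timeit.repeat (a sketch, not rerun here):

best_fast = min(timeit.repeat(setup=my_setup, stmt=my_code, repeat=5, number=1))
best_dumb = min(timeit.repeat(setup=my_setup2, stmt=my_code2, repeat=5, number=1))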

The actual code contained in fastloocv is:

In [ ]:
import numpy
import scipy.linalg
import scipy.sparse.linalg


def fastfloo(X, y, lm, alg='gmres'):

    # check whether the LU decomposition is requested
    if alg == 'lu':

        # LU-decompose A = X^T X
        fact = scipy.linalg.lu_factor(numpy.dot(numpy.transpose(X), X))

        a = numpy.empty([X.shape[1], X.shape[0]])

        # solve A D = X^T, saving the solution vectors in a
        for i in range(0, X.shape[0]):
            a[:, i] = scipy.linalg.lu_solve(fact, numpy.transpose(X)[:, i])

    elif alg == 'gmres':

        a = numpy.empty([X.shape[1], X.shape[0]])

        # solve X^T X D = A D = X^T with GMRES, saving the solution vectors in a
        for i in range(0, X.shape[0]):
            a[:, i], info = scipy.sparse.linalg.gmres(numpy.dot(numpy.transpose(X), X),
                                                      numpy.transpose(X)[:, i])
            if info != 0:
                print(f"There was a problem with the gmres algorithm. INFO={info}")

    else:
        raise ValueError(f"{alg} is not a valid value for alg")

    # multiply X by a to get the hat matrix H
    H = numpy.dot(X, a)

    # get the predictions from the model, on the whole dataset
    predictions = lm.fit(X, y).predict(X)

    # calculate the LOOCV MSE and the R^2
    CV = 0
    y_mean = numpy.mean(y)
    den = 0
    for i in range(0, X.shape[0]):
        summand = (y[i] - predictions[i]) / (1 - H[i][i])
        CV = CV + numpy.dot(summand, summand)
        summand2 = (y[i] - y_mean)
        den = den + numpy.dot(summand2, summand2)

    CV = CV / X.shape[0]
    R2 = 1 - CV / den

    return CV, R2


def floo(X, y, lm):

    from sklearn.model_selection import LeaveOneOut
    from sklearn import metrics

    loocv = LeaveOneOut()

    # "dumb" LOOCV: refit the model once for every left-out point
    CV = 0
    for train, test in loocv.split(X):
        predictions = lm.fit(X[train], y[train]).predict(X[test])
        CV = CV + metrics.mean_squared_error(y[test], predictions)
    CV = CV / X.shape[0]

    y_mean = numpy.mean(y)
    den = 0
    for i in range(0, X.shape[0]):
        summand2 = (y[i] - y_mean)
        den = den + numpy.dot(summand2, summand2)

    R2 = 1 - CV / den

    return CV, R2

Next, I will use this method on different datasets.