Felipe's Data Science Blog

Sun 27 August 2017

ENEM project, part 1

Posted by Felipe in posts   

I am going to explore data from Brazil's "Exame Nacional do Ensino Médio" (Enem), a school-leaver's exam (I am Brazilian, even though I've lived most of my life abroad). It is roughly equivalent to the American SAT, the Swiss Matura, the French Bac, etc.

This notebook is a cleaned-up version of the actual project; to get to this point, I had to run a lot of tests on the data. Nonetheless, by following what is here, one should be able to get from the raw data to my final results.

I will be using R (which also works in Jupyter) to do the analysis, mostly playing around with linear regression models. This is not supposed to be the analysis to end all analyses of this huge and rather complex (in my humble opinion) dataset.

I downloaded the data for 2015 from the "Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira" (INEP):

http://portal.inep.gov.br/microdados

The file "microdados_enem2015" containts several subfolders, including an excel file that works as a dictionary for all the columns contained in the main .csv file:

microdados_enem2015\microdados_enem_2015\DICIONÁRIO\Dicionário_Microdados_Enem_2015.xlsx

The actual data is in:

\microdados_enem2015\microdados_enem_2015\DADOS\MICRODADOS_ENEM_2015.csv

This file contains 166 variables (columns) and 7,746,428 rows! The latter number corresponds to the total number of students who took the test in 2015 (the total population of Brazil is around 210 million). Among the variables we find personal student data (age, place of birth, gender, etc.) as well as information on the test itself, namely the grades for 5 different tests: science, language, humanities, maths and writing. These are going to be our dependent variables.

Moreover, 50 out of the 166 variables are part of a socio-economic questionnaire every student had to take. These are going to make the bulk of our predictors.

I started exploring the data by loading the first 500 rows of the file:

read.csv(file="MICRODADOS_ENEM_2015.csv",nrows=500)

The complete .csv file is 5 GB. I actually tried loading everything into R, but my computer froze. This is because R loads everything into RAM, which is limited on my personal computer (it's over 5 years old). To get around this problem I used an R package called "sqldf", which allowed me to select only a subset of students using an SQL "SELECT ... FROM" type command.

I limited myself, at first, to students from the Brazilian state of Paraná (where I was born). I used this subset to explore and clean the data, as well as to create some linear regression models. Unfortunately, after the cleaning, this subset was too small to give any meaningful results.
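
For reference, a query of that type looks roughly like the sketch below. The name of the state-of-residence column (SG_UF_RESIDENCIA) is an assumption on my part; the actual name should be checked in the dictionary file.

library(sqldf)
# Sketch only: select students residing in Paraná ('PR'); the column name is an assumption
enem_pr = read.csv.sql("MICRODADOS_ENEM_2015.csv",
                       sql = "select * from file where SG_UF_RESIDENCIA = 'PR'")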

Long story short, I ended up selecting students (from the whole of Brazil) without special needs and who are white (self-declared). Concerning the former choice, I noticed that there were over 50 variables describing students' special needs, making the dataset even more complex. Moreover, the special needs are very heterogeneous in their nature, which would have made the analysis much harder. As for the choice of race, I decided to pick the largest group (just under 50%) in order to reduce the number of data points by around half. Moreover, I suspect that race and social condition are highly correlated, and it might actually be wise to separate their analysis.

I also eliminated questions Q027-Q033, which asked the student about their motivations for working (a paid job). I did this because I only want to work with complete data points (i.e. students for whom no information is missing), and including these questions would have meant eliminating all students who had never held a job (a big chunk of the dataset). I did keep the question "have you ever worked a paying job?" (Q026).

Finally, I also eliminated questions of the type "was SOME REASON an important factor in deciding to take this test", since they are inherently subjective and I wanted to restrict my analysis to strictly socio-economic factors.

Loading the dataset

In [ ]:
library(sqldf)
enem_br=read.csv.sql("MICRODADOS_ENEM_2015.csv",sql="select * from file
                                                                where (CO_PROVA_MT =243 OR CO_PROVA_MT =244)
                                                                AND  TP_PRESENCA_MT=1
                                                                AND  TP_PRESENCA_CN=1
                                                                AND  TP_PRESENCA_CH=1
                                                                AND  TP_PRESENCA_LC=1
                                                                AND  IN_TREINEIRO =0
                                                                AND  TP_NACIONALIDADE =1
                                                                AND  IN_ESTUDA_CLASSE_HOSPITALAR =0
                                                                AND  IN_BAIXA_VISAO =0
                                                                AND  IN_CEGUEIRA =0
                                                                AND  IN_SURDEZ =0
                                                                AND  IN_DEFICIENCIA_AUDITIVA =0
                                                                AND  IN_SURDO_CEGUEIRA =0
                                                                AND  IN_DEFICIENCIA_FISICA =0
                                                                AND  IN_DEFICIENCIA_MENTAL =0
                                                                AND  IN_DEFICIT_ATENCAO =0
                                                                AND  IN_DISLEXIA =0
                                                                AND  IN_GESTANTE =0
                                                                AND  IN_LACTANTE =0
                                                                AND  IN_IDOSO =0
                                                                AND  IN_DISCALCULIA =0
                                                                AND  IN_AUTISMO =0
                                                                AND  IN_VISAO_MONOCULAR =0
                                                                AND  IN_SABATISTA =0
                                                                AND  IN_OUTRA_DEF =0
                                                                AND  TP_COR_RACA=1")

I saved the subset (much smaller than the initial dataset) obtained with the above command:

In [ ]:
write.csv(enem_br,"enem_proc2.csv")
In [ ]:
rm(enem_br)

In the first command, the "CO_PROVA_MT =243 OR CO_PROVA_MT =244" bit selects only students who took the "yellow" and "grey" math tests (the tests are color-coded and each color has different questions, presumably to avoid cheating). This was simply a method I used to create smaller datasets and avoid overloading my PC.

The same command is repeated below, this time selecting students who took the "blue" and "pink" math tests:

In [ ]:
enem_br=read.csv.sql("MICRODADOS_ENEM_2015.csv",sql="select * from file
                                                                where (CO_PROVA_MT =245 OR CO_PROVA_MT =246)
                                                                AND  TP_PRESENCA_MT=1
                                                                AND  TP_PRESENCA_CN=1
                                                                AND  TP_PRESENCA_CH=1
                                                                AND  TP_PRESENCA_LC=1
                                                                AND  IN_TREINEIRO =0
                                                                AND  TP_NACIONALIDADE =1
                                                                AND  IN_ESTUDA_CLASSE_HOSPITALAR =0
                                                                AND  IN_BAIXA_VISAO =0
                                                                AND  IN_CEGUEIRA =0
                                                                AND  IN_SURDEZ =0
                                                                AND  IN_DEFICIENCIA_AUDITIVA =0
                                                                AND  IN_SURDO_CEGUEIRA =0
                                                                AND  IN_DEFICIENCIA_FISICA =0
                                                                AND  IN_DEFICIENCIA_MENTAL =0
                                                                AND  IN_DEFICIT_ATENCAO =0
                                                                AND  IN_DISLEXIA =0
                                                                AND  IN_GESTANTE =0
                                                                AND  IN_LACTANTE =0
                                                                AND  IN_IDOSO =0
                                                                AND  IN_DISCALCULIA =0
                                                                AND  IN_AUTISMO =0
                                                                AND  IN_VISAO_MONOCULAR =0
                                                                AND  IN_SABATISTA =0
                                                                AND  IN_OUTRA_DEF =0
                                                                AND  TP_COR_RACA=1")
In [4]:
write.csv(enem_br,"enem_proc3.csv")

With the command below, I load the files I previously saved and, at the same time, drop the columns I am not going to use. Loading the data this way also automatically transforms the categorical variables into the R variable type "factor".

In [44]:
library(data.table)
enem1=fread("enem_proc2.csv",drop=c(1:3,7:17,20,30:81,82:93,94:97,102:117,144:158),data.table=FALSE, na.strings=c("","NA"))
Read 859664 rows and 53 (of 167) columns from 0.752 GB file in 00:01:06
In [45]:
enem2=fread("enem_proc3.csv",drop=c(1:3,7:17,20,30:81,82:93,94:97,102:117,144:158),data.table=FALSE, na.strings=c("","NA"))
Read 827055 rows and 53 (of 167) columns from 0.723 GB file in 00:00:42

Putting everything into one dataset:

In [46]:
enem_br=rbind(enem1,enem2)

And eliminating all students who did not answer all questions:

In [48]:
enem_br=enem_br[complete.cases(enem_br), ]

We are left with only 558022 observations. Let's save these into a file so we can load them more easily next time:

In [50]:
write.csv(enem_br,"enem_completo.csv")
In [1]:
enem_br=read.csv("enem_completo.csv")

All the remaining students have the same value for the variables "TP_ST_CONCLUSAO" and "TP_ANO_CONCLUIU":

In [2]:
table(enem_br$TP_ST_CONCLUSAO)
     2 
558022 
In [68]:
table(enem_br$TP_ANO_CONCLUIU)
     0 
558022 

So I eliminate these unnecessary columns:

In [3]:
enem_br$TP_ST_CONCLUSAO=NULL
enem_br$TP_ANO_CONCLUIU=NULL

Combining the science, humanities, language and math grades into one (nota is Portuguese for grade):

In [71]:
NOTA=enem_br$NU_NOTA_CN+enem_br$NU_NOTA_CH+enem_br$NU_NOTA_LC+enem_br$NU_NOTA_MT
In [72]:
NOTA=NOTA/4
In [73]:
enem_br$NOTA=NOTA
In [75]:
enem_br$NU_NOTA_CN=NULL
enem_br$NU_NOTA_CH=NULL
enem_br$NU_NOTA_LC=NULL
enem_br$NU_NOTA_MT=NULL

Time to make a linear regression model. First, let's define the formula:

In [3]:
f=NOTA ~ NU_IDADE + TP_SEXO + TP_ESTADO_CIVIL + 
    Q001 + Q002 + Q003 + Q004 + Q005 + Q006 + Q007 + Q008 + Q009 + 
    Q010 + Q011 + Q012 + Q013 + Q014 + Q015 + Q016 + Q017 + Q018 + 
    Q019 + Q020 + Q021 + Q022 + Q023 + Q024 + Q025 + Q026 + Q042 + 
    Q043 + Q044 + Q045 + Q046 + Q047 + Q048 + Q049 + Q050

Actually, I copied the above formula from my test notebooks. To build it, I used a command resembling the one below:

f = as.formula(paste('NU_NOTA_MT ~',
                     paste(colnames(enem_test)[3:4], collapse='+'), "+",
                     paste(colnames(enem_test)[9:11], collapse='+'), "+",
                     paste(colnames(enem_test)[13:62], collapse='+')))

I will also split the dataset into two: a training set and a test set.

In [7]:
library(caTools)
set.seed(93)
spl = sample.split(enem_br$NOTA, SplitRatio = 0.6)
In [8]:
train = subset(enem_br, spl==TRUE)
test = subset(enem_br, spl==FALSE)

Saving again...

In [9]:
write.csv(train,"train.csv")
In [10]:
write.csv(test,"test.csv")
In [1]:
test=read.csv("test.csv")
train=read.csv("train.csv")

Train the model:

In [4]:
reg=lm(f,data=train)

Get predictions on the test set:

In [5]:
reg_pred = predict(reg, newdata=test)

And calculate the out-of-sample $R^2$:

In [6]:
SSE=sum((reg_pred - test$NOTA)^2) 
SST=sum((mean(train$NOTA) - test$NOTA)^2)
r2=1-SSE/SST
r2
0.424231880065406

Having a background in the physical sciences, I find 0.42 a low $R^2$. The last scientific paper I worked on featured a linear model whose $R^2$ sometimes dropped below 0.9; I thought that wasn't great, but given the inherent uncertainty of the data it was still good.

However, a friend who studies geography told me that in certain fields of human knowledge it's almost impossible to obtain such high values of $R^2$. Moreover, I found this quotation in the well-known book "An Introduction to Statistical Learning":

"... in certain problems in physics... we could expect to see an $R^2$ that is extremely close to 1... On the other hand, typical applications in biology, psychology, marketing and other domains, the linear model is at best an extremely rough approximation to the data... In this setting... $R^2$ value well bellow 0.1 might be more realistic!"

This quelled my worries a bit.

In addition, my out-of-sample $R^2$ was actually negative when I used the much smaller dataset I was running tests on.

Given the circumstances, $0.42$ doesn't look so bad anymore.

Besides calculating the out-of-sample $R^2$ for the test set (hold-out cross-validation), we can use a type of cross-validation called leave-one-out cross-validation (LOOCV). This method would usually be computationally infeasible on a data set with $n=558022$, since it involves fitting the linear model $n$ times. Fortunately, for linear models there is a "shortcut" that makes the cross-validation as computationally cheap as training the model once, and it is implemented in R (in the forecast package):

https://robjhyndman.com/hyndsight/loocv-linear-models/
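
The idea behind the shortcut: the leave-one-out residual of observation $i$ equals $e_i/(1-h_i)$, where $e_i$ is the ordinary residual and $h_i$ the leverage, so the LOOCV mean squared error can be obtained from a single fit. A minimal sketch of that computation (my own illustration with base R functions, not the forecast code itself):

# Sketch: LOOCV mean squared error for an lm fit, via the leverage shortcut
loocv_mse = function(fit) {
    h = hatvalues(fit)                    # leverages h_i from the fitted model
    mean((residuals(fit) / (1 - h))^2)    # mean of squared leave-one-out residuals
}
# loocv_mse(reg2) should give essentially the same number as the CV row of CV(reg2)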

We start by training the model on the whole set:

In [4]:
total=rbind(train,test)
In [9]:
reg2=lm(f,data=total)
In [ ]:
library("forecast")
In [11]:
CV(reg2)
CV
3114.66789081512
AIC
4488640.6944833
AICc
4488640.7632504
BIC
4490190.73168936
AdjR2
0.424236255769567

The first number in the above list is the estimated MSE (mean squared error) for the model, as calculated by LOOCV (the CV row). AdjR2 is the adjusted $R^2$ of the model; it is very similar to what is obtained with reg. The MSE of reg (the model trained on the training set) is:

In [14]:
MSE = sum(reg$residuals^2)/nrow(test)
MSE
4742.83287726525

This is much higher, since reg was trained on a much smaller set; reg2 was trained on the whole set and is cross-validated.

Let's take a look at the coefficients of reg2:

In [15]:
results=summary(reg2)
foo=cbind(results$coefficients[,1],results$coefficients[,4])
In [16]:
foo[order(foo[,1],foo[,2],decreasing=TRUE), ]
(Intercept) 466.78364 0.000000e+00
Q006Q 53.52479 0.000000e+00
Q006P 44.44506 0.000000e+00
Q006O 39.34599 0.000000e+00
Q006N 35.07305 2.720790e-303
Q006M 33.92957 3.587055e-241
Q042F 32.60547 5.585935e-04
Q006L 31.01148 5.232755e-232
Q024E 30.12478 0.000000e+00
Q047E 27.89321 0.000000e+00
Q006K 27.86017 5.220298e-219
Q042E 25.97141 2.003191e-04
Q006J 24.69926 2.675835e-187
Q024D 23.86337 0.000000e+00
Q006I 23.66893 2.409994e-176
Q006H 22.29963 1.531752e-164
Q042C 20.99840 0.000000e+00
Q047D 19.08748 0.000000e+00
Q006G 18.97507 2.830922e-128
Q024C 17.71110 0.000000e+00
Q001G 16.34254 1.003692e-109
Q001F 14.97134 1.635241e-103
Q017E 14.74003 2.204035e-02
Q048D 14.70700 1.820302e-105
TP_SEXOM 14.47265 0.000000e+00
Q013E 14.23821 3.743494e-10
Q006F 14.03593 6.099961e-72
Q013D 13.59287 3.134358e-35
Q006E 13.50812 5.350748e-67
Q042D 12.64642 5.193704e-270
.........
Q046D -5.199401 1.131313e-01
Q019E -5.434678 3.490004e-13
Q016D -5.774210 1.176820e-01
Q043B -5.870863 1.838807e-20
Q010D -5.969771 7.751843e-26
Q048C -6.119877 1.905589e-19
Q026C -6.266650 7.859301e-179
Q016E -6.327406 3.132359e-01
Q049B -6.938049 3.046690e-177
Q045C -7.145751 1.523212e-19
Q042G -7.423303 3.836121e-01
Q009B -7.507605 2.485018e-08
Q012E -7.573540 5.579310e-03
Q009D -8.482336 2.012172e-10
Q009E -8.485656 5.383440e-10
Q014C -8.613471 1.939538e-45
Q045D -8.676882 4.953474e-28
Q010E -8.914279 1.948191e-17
Q009C -8.990476 1.246483e-11
Q011C -10.341189 1.208406e-89
Q044C -10.659357 8.157993e-34
Q014D -11.382535 3.395455e-05
Q011D -11.579681 1.788013e-16
Q012D -11.846309 2.004865e-16
Q049C -12.136516 1.278363e-291
Q050B -13.711649 0.000000e+00
Q011E -15.417711 1.916887e-08
Q043C -16.140150 8.339730e-44
Q045B -25.796192 0.000000e+00
Q043D -28.631359 3.602033e-172

Question 006 is "What is the total monthly income of your family?". Option A, the baseline, stands for no income. The larger the family income, the more likely a student is to do well in the test. This is not surprising, but I suspect monthly income is strongly linked to other variables. In other words, I wonder whether a higher income is really the cause of a student doing well in the test, or whether a larger income simply means a student will have access to better schools, computers, internet, transportation, etc. I will come back to this question later.

Question 042 asks "What kind of primary and middle school did you go to?". Options E and F correspond to "I only attended an Indian school" and "I mostly attended an Indian school", respectively. I had no idea what an "Indian school" was. It turns out it is a school for Native American populations, where children are taught in both Portuguese and a native language (e.g. Guaraní). Q042F has a small p-value and its coefficient is large and positive. Q042E also has a positive coefficient, but its p-value is too large for me to actually interpret it.

Are these kinds of schools doing a better job than average? I'm going to take a look at the average grade of the students who attended them.

In [14]:
ss1=subset(total,total$Q042=='E')
In [15]:
ss2=subset(total,total$Q042=='F')
In [16]:
mean(total$NOTA)
522.613868763239
In [17]:
mean(ss1$NOTA)
559.1515625
In [18]:
mean(ss2$NOTA)
543.145

It does seem that students who attended native schools have better grades, on average. However, if we look at their total number:

In [24]:
nrow(ss1)+nrow(ss2)
99

we can see that it is very small compared to the total number of students in the set:

In [69]:
nrow(total)
558022

This is not surprising, considering I only selected white students. Therefore, the conclusion that students who attended native schools do better than others can hardly be trusted. Nonetheless, it might be worth exploring this in a set with more native students.

I want to try a very simple model where I only consider the student's family income level and see what happens:

In [22]:
reg3=lm("NOTA~Q006",data=total)
In [23]:
CV(reg3)
CV
3896.01228156765
AIC
4613560.28030738
AICc
4613560.28153318
BIC
4613762.45907339
AdjR2
0.279623620082719
In [24]:
summary(reg3)
Call:
lm(formula = "NOTA~Q006", data = total)

Residuals:
    Min      1Q  Median      3Q     Max 
-475.42  -42.92   -4.30   39.36  342.17 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 479.9444     0.7971 602.130  < 2e-16 ***
Q006B        -4.5262     0.8319  -5.441  5.3e-08 ***
Q006C        10.6751     0.8217  12.992  < 2e-16 ***
Q006D        23.9079     0.8349  28.637  < 2e-16 ***
Q006E        30.9867     0.8452  36.663  < 2e-16 ***
Q006F        36.4217     0.8422  43.245  < 2e-16 ***
Q006G        49.6417     0.8423  58.936  < 2e-16 ***
Q006H        61.0480     0.8695  70.211  < 2e-16 ***
Q006I        69.2201     0.8880  77.949  < 2e-16 ***
Q006J        78.8713     0.8935  88.273  < 2e-16 ***
Q006K        88.8195     0.9286  95.644  < 2e-16 ***
Q006L        97.2227     1.0079  96.457  < 2e-16 ***
Q006M       104.3134     1.0855  96.097  < 2e-16 ***
Q006N       109.6957     0.9825 111.654  < 2e-16 ***
Q006O       119.9807     0.9811 122.289  < 2e-16 ***
Q006P       130.6570     0.9908 131.872  < 2e-16 ***
Q006Q       147.0085     0.9330 157.565  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62.42 on 558005 degrees of freedom
Multiple R-squared:  0.2796,	Adjusted R-squared:  0.2796 
F-statistic: 1.354e+04 on 16 and 558005 DF,  p-value: < 2.2e-16

This is interesting: the $R^2$ is still relatively large. Many of the questions (Q007 to Q025) are of the type "is there a car in your home?" or "is there a toilet in your home?" (with different levels indicating how many of the item are present in the student's home). I think Q006 and these other questions depend on each other very strongly. I am going to try a few models with just one of these variables at a time, to compare against the Q006-only model reg3:

In [28]:
reg4=lm(NOTA~Q008,data=train)
In [29]:
summary(reg4)
Call:
lm(formula = NOTA ~ Q008, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-500.34  -47.14   -6.09   42.38  352.76 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  485.909      1.895 256.355  < 2e-16 ***
Q008B         14.426      1.902   7.586 3.31e-14 ***
Q008C         49.757      1.908  26.073  < 2e-16 ***
Q008D         86.842      1.932  44.945  < 2e-16 ***
Q008E        117.993      1.960  60.209  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 67.36 on 335223 degrees of freedom
Multiple R-squared:  0.1713,	Adjusted R-squared:  0.1712 
F-statistic: 1.732e+04 on 4 and 335223 DF,  p-value: < 2.2e-16
In [30]:
CV(reg4)
CV
4537.73787941826
AIC
2822682.03078526
AICc
2822682.03103585
BIC
2822746.36618232
AdjR2
0.171240788723082

A lower adjusted $R^2$ and a higher MSE than reg3. Let's build single-variable models for all of these questions:

In [31]:
qvector=c("Q007",
"Q008",  
"Q009",
"Q010",
"Q011",
"Q012",
"Q013",
"Q014",
"Q015",
"Q016", 
"Q017",
"Q018",
"Q019",
"Q020", 
"Q021",
"Q022",
"Q023", 
"Q024",
"Q025" )
In [32]:
length(qvector)
19
In [44]:
regs=vector("list", 19)
preds=vector("list", 19)
cvs=vector("list", 19)
In [45]:
for (i in 1:19 )
    {
  #  print(qvector[i])
    f = as.formula(paste('NOTA ~ ', paste(qvector[i])))
    regs[[i]]=lm(f,data=total)
    cvs[[i]]=CV(regs[[i]])
}
In [83]:
MSE=vector("numeric", 19)
r2s=vector("numeric", 19)
for (i in 1:19 )
    {
     MSE[i]=cvs[[i]][1]
     r2s[i]=cvs[[i]][5]
    }
In [86]:
min(MSE)
4465.22563177163
In [87]:
max(r2s)
0.174351791703201

None of these models can outperform reg3. What about all of them together?

In [2]:
f=NOTA ~ Q007 + Q008 + Q009 + 
         Q010 + Q011 + Q012 + Q013 + Q014 + Q015 + Q016 + Q017 + Q018 + 
         Q019 + Q020 + Q021 + Q022 + Q023 + Q024 + Q025
In [5]:
reg5=lm(f,data=total)
In [6]:
summary(reg5)
Call:
lm(formula = f, data = total)

Residuals:
    Min      1Q  Median      3Q     Max 
-507.74  -43.09   -4.19   39.39  367.57 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 469.99383    2.00478 234.437  < 2e-16 ***
Q007B        22.45859    0.35929  62.509  < 2e-16 ***
Q007C        31.16650    0.70504  44.205  < 2e-16 ***
Q007D        30.95026    0.44725  69.201  < 2e-16 ***
Q008B         1.13723    1.43048   0.795 0.426614    
Q008C        15.84595    1.44348  10.978  < 2e-16 ***
Q008D        32.18239    1.47027  21.889  < 2e-16 ***
Q008E        46.77519    1.51733  30.827  < 2e-16 ***
Q009B        -9.46246    1.50603  -6.283 3.32e-10 ***
Q009C        -7.33013    1.48312  -4.942 7.72e-07 ***
Q009D        -9.44661    1.48911  -6.344 2.24e-10 ***
Q009E       -14.90373    1.52441  -9.777  < 2e-16 ***
Q010B         7.50810    0.21801  34.440  < 2e-16 ***
Q010C        12.43509    0.33322  37.318  < 2e-16 ***
Q010D         7.59178    0.62919  12.066  < 2e-16 ***
Q010E         2.57428    1.16947   2.201 0.027719 *  
Q011B        -9.30013    0.21675 -42.907  < 2e-16 ***
Q011C       -18.13959    0.57449 -31.575  < 2e-16 ***
Q011D       -21.20366    1.57275 -13.482  < 2e-16 ***
Q011E       -24.18887    3.06976  -7.880 3.29e-15 ***
Q012B         0.55045    0.98231   0.560 0.575233    
Q012C        -9.09455    1.04920  -8.668  < 2e-16 ***
Q012D       -16.76382    1.61119 -10.405  < 2e-16 ***
Q012E       -10.77767    3.05718  -3.525 0.000423 ***
Q013B         8.53706    0.19121  44.647  < 2e-16 ***
Q013C        12.13752    0.48026  25.273  < 2e-16 ***
Q013D        11.53209    1.22732   9.396  < 2e-16 ***
Q013E         7.52744    2.54299   2.960 0.003076 ** 
Q014B         0.38697    0.25192   1.536 0.124526    
Q014C       -14.07835    0.67901 -20.734  < 2e-16 ***
Q014D       -19.95154    3.07249  -6.494 8.39e-11 ***
Q014E        -9.72637    6.14186  -1.584 0.113281    
Q015B        -5.57582    0.22048 -25.290  < 2e-16 ***
Q015C        -5.34129    1.51677  -3.521 0.000429 ***
Q015D         0.04250    5.69516   0.007 0.994046    
Q015E        -3.83820    7.22536  -0.531 0.595272    
Q016B         0.01574    0.22251   0.071 0.943622    
Q016C        -5.43162    0.94045  -5.776 7.67e-09 ***
Q016D        -7.71948    4.13012  -1.869 0.061614 .  
Q016E        -8.27859    7.01631  -1.180 0.238038    
Q017B        11.63894    0.40836  28.502  < 2e-16 ***
Q017C        10.53262    2.97400   3.542 0.000398 ***
Q017D        -2.25708    6.81518  -0.331 0.740505    
Q017E        12.83935    7.20352   1.782 0.074689 .  
Q018B         5.89161    0.20801  28.323  < 2e-16 ***
Q019B         3.51448    0.73199   4.801 1.58e-06 ***
Q019C         6.66269    0.75125   8.869  < 2e-16 ***
Q019D         5.13100    0.78174   6.564 5.26e-11 ***
Q019E         2.05448    0.83453   2.462 0.013823 *  
Q020B        -3.51499    0.19415 -18.104  < 2e-16 ***
Q021B         5.56289    0.20433  27.226  < 2e-16 ***
Q022B        -2.40288    0.74771  -3.214 0.001311 ** 
Q022C         7.27767    0.73149   9.949  < 2e-16 ***
Q022D         6.87700    0.73207   9.394  < 2e-16 ***
Q022E         3.14619    0.74096   4.246 2.18e-05 ***
Q023B         7.76967    0.20211  38.443  < 2e-16 ***
Q024B        13.44693    0.29804  45.117  < 2e-16 ***
Q024C        32.41639    0.37963  85.389  < 2e-16 ***
Q024D        45.66167    0.49290  92.639  < 2e-16 ***
Q024E        57.18610    0.63527  90.018  < 2e-16 ***
Q025B         8.40074    0.29535  28.444  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62.46 on 557961 degrees of freedom
Multiple R-squared:  0.2786,	Adjusted R-squared:  0.2785 
F-statistic:  3592 on 60 and 557961 DF,  p-value: < 2.2e-16
In [91]:
CV(reg5)
CV
3902.26391781274
AIC
4614439.11469145
AICc
4614439.12869247
BIC
4615135.50821881
AdjR2
0.27854508325787

Comparable MSE and $R^2$, despite the much larger number of variables! To simplify the model, I will keep Q006 and drop Q007 to Q025:

In [7]:
f=NOTA ~ NU_IDADE + TP_SEXO + TP_ESTADO_CIVIL + 
    Q001 + Q002 + Q003 + Q004 + Q005 + Q006  + Q026 + Q042 + 
    Q043 + Q044 + Q045 + Q046 + Q047 + Q048 + Q049 + Q050
In [8]:
reg6=lm(f,data=total)
In [94]:
CV(reg6)
CV
3188.35820583082
AIC
4501710.93987035
AICc
4501710.96195863
BIC
4502587.04785638
AdjR2
0.410527913522544

So reg6 has an MSE and $R^2$ comparable to those of reg2, even though the former has about 20 fewer variables. Pretty cool.

Possible next steps:

  • Use automated model selection methods, such as LASSO (see the sketch after this list).
  • Use unsupervised learning to see if I can build different models for different subsets of the dataset
  • Use k-fold cross-validation instead of LOOCV. While LOOCV gives an almost unbiased estimate of the MSE, it might overestimate it a bit; k-fold cross-validation has less variance and might give a more accurate estimate of the true error, so I would like to compare their results.
  • Use geographical data instead of the socio-economic questionnaire as the predictors to build the model (might have been easier to start this way).
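
Regarding the first point, a rough sketch of what a LASSO fit might look like with the glmnet package (an illustration of the idea only, not something I have run on this dataset):

library(glmnet)
# Expand the factors in formula f into dummy variables; drop the intercept column
x = model.matrix(f, data = total)[, -1]
y = total$NOTA
# alpha = 1 gives the LASSO penalty; cv.glmnet chooses lambda by k-fold cross-validation
cv_fit = cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.1se")   # coefficients at the more parsimonious choice of lambda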

Observation: the piece of code below removes infrequent levels from the factor variables in the dataset. I sometimes ran into the problem that very infrequent levels (e.g. a very, very high family income) would only appear in the test set. This was an issue because the model then had no coefficient for that particular level, so I got an error when the level appeared in the test set and I tried to get predictions.

In [ ]:
enem_b=enem_br
# Columns 13 to 62 hold the questionnaire factor variables; for each one,
# drop the rows whose level occurs fewer than 10 times in that column
for(i in 13:62) {
 enem_b = enem_b[!(as.numeric(enem_b[[i]]) %in% which(table(enem_b[[i]])<10)),]
}
enem_b=droplevels(enem_b)   # discard the now-unused factor levels