ENEM project, part 1
I am going to explore data from Brazil's "Exame Nacional do Ensino Médio (Enem)", a school-leaver's exam (I am Brazilian, even though I've lived most of my life abroad). It is roughly equivalent to the American SAT, the Swiss Matura, the French Bac, etc.
This notebook is a cleaned-up version of the actual project; to get to this point, I had to run a lot of tests on the data. Nonetheless, by following what is here, one should be able to get from the raw data to my final results.
I will be using R (which also works in Jupyter) to do the analysis. I will mostly be playing around with linear regression models. This is not meant to be the analysis to end all analyses of this huge and, in my humble opinion, rather complex dataset.
I downloaded the data for 2015 from the "Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira (INEP)":
http://portal.inep.gov.br/microdados
The file "microdados_enem2015" containts several subfolders, including an excel file that works as a dictionary for all the columns contained in the main .csv file:
microdados_enem2015\microdados_enem_2015\DICIONÁRIO\Dicionário_Microdados_Enem_2015.xlsx
The actual data is in:
\microdados_enem2015\microdados_enem_2015\DADOS\MICRODADOS_ENEM_2015.csv
This file contains 166 variables (columns) and 7,746,428 rows! The latter number corresponds to the total number of students who took the test in 2015 (for scale, the total population of Brazil is around 210 million). Among the variables we find personal student data (age, place of birth, gender, etc.) as well as information on the test itself, namely the grades for 5 different tests: science, language, humanities, maths and writing. The grades are going to be our dependent variables.
Moreover, 50 of the 166 variables come from a socio-economic questionnaire that every student had to answer. These are going to make up the bulk of our predictors.
I started exploring the data by loading the first 500 rows of the file:
# A peek at the first 500 rows, to get a feel for the columns
read.csv(file="MICRODADOS_ENEM_2015.csv",nrows=500)
The complete .csv file is 5 GB. I actually tried loading all of it into R, but my computer froze. This is because R loads everything into RAM, which is limited on my personal computer (it's over 5 years old). To get around this problem I used an R package called "sqldf", which allowed me to select only a subset of students using an SQL "SELECT ... FROM" type command.
At first, I limited myself to students from the Brazilian state of Paraná (where I was born). I used this subset to explore and clean the data, as well as to create some first linear regression models. Unfortunately, once cleaned, the subset turned out to be too small to give any meaningful results.
Long story short, I ended up selecting students (from the whole of Brazil) without special needs who self-declared as white. Concerning the former choice, I noticed that there were over 50 variables describing students' special needs, making the dataset even more complex. Moreover, the special needs are very heterogeneous in nature, which would have made the analysis much harder. As for the choice of race, I decided to pick the largest group (just under 50%) in order to cut the number of data points roughly in half. Moreover, I suspect that race and social condition are highly correlated, and it might actually be wise to analyse them separately.
I also eliminated questions Q027-Q033, which asked students about their motivations for working (a paid job). I did this because I only want to work with complete data points (i.e. students for whom no information is missing), and including these questions would have meant eliminating all students who had never held a job (a big chunk of the dataset). I did keep the question "have you ever worked a paying job?" (Q026).
Finally, I also eliminated questions of the type "was SOME REASON an important factor in deciding to take this test", since they are inherently subjective and I wanted to restrict my analysis to strictly socio-economic factors.
Loading the dataset
library(sqldf)
enem_br=read.csv.sql("MICRODADOS_ENEM_2015.csv",sql="select * from file
where (CO_PROVA_MT =243 OR CO_PROVA_MT =244)
AND TP_PRESENCA_MT=1
AND TP_PRESENCA_CN=1
AND TP_PRESENCA_CH=1
AND TP_PRESENCA_LC=1
AND IN_TREINEIRO =0
AND TP_NACIONALIDADE =1
AND IN_ESTUDA_CLASSE_HOSPITALAR =0
AND IN_BAIXA_VISAO =0
AND IN_CEGUEIRA =0
AND IN_SURDEZ =0
AND IN_DEFICIENCIA_AUDITIVA =0
AND IN_SURDO_CEGUEIRA =0
AND IN_DEFICIENCIA_FISICA =0
AND IN_DEFICIENCIA_MENTAL =0
AND IN_DEFICIT_ATENCAO =0
AND IN_DISLEXIA =0
AND IN_GESTANTE =0
AND IN_LACTANTE =0
AND IN_IDOSO =0
AND IN_DISCALCULIA =0
AND IN_AUTISMO =0
AND IN_VISAO_MONOCULAR =0
AND IN_SABATISTA =0
AND IN_OUTRA_DEF =0
AND TP_COR_RACA=1")
I saved the subset (much smaller than the initial one) obtained with the above command:
write.csv(enem_br,"enem_proc2.csv")
rm(enem_br)
In the first command, the "CO_PROVA_MT =243 OR CO_PROVA_MT =244" bit selects only students who took the "yellow" and "grey" math tests (the tests are color-coded and each color has different questions, presumably to avoid cheating). This was simply a method I used to create smaller datasets and avoid overloading my PC.
The same command is repeated below, this time selecting students who took the "blue" and "pink" math tests:
enem_br=read.csv.sql("MICRODADOS_ENEM_2015.csv",sql="select * from file
where (CO_PROVA_MT =245 OR CO_PROVA_MT =246)
AND TP_PRESENCA_MT=1
AND TP_PRESENCA_CN=1
AND TP_PRESENCA_CH=1
AND TP_PRESENCA_LC=1
AND IN_TREINEIRO =0
AND TP_NACIONALIDADE =1
AND IN_ESTUDA_CLASSE_HOSPITALAR =0
AND IN_BAIXA_VISAO =0
AND IN_CEGUEIRA =0
AND IN_SURDEZ =0
AND IN_DEFICIENCIA_AUDITIVA =0
AND IN_SURDO_CEGUEIRA =0
AND IN_DEFICIENCIA_FISICA =0
AND IN_DEFICIENCIA_MENTAL =0
AND IN_DEFICIT_ATENCAO =0
AND IN_DISLEXIA =0
AND IN_GESTANTE =0
AND IN_LACTANTE =0
AND IN_IDOSO =0
AND IN_DISCALCULIA =0
AND IN_AUTISMO =0
AND IN_VISAO_MONOCULAR =0
AND IN_SABATISTA =0
AND IN_OUTRA_DEF =0
AND TP_COR_RACA=1")
write.csv(enem_br,"enem_proc3.csv")
With the command below, I load the files I had previously saved and, at the same time, drop the columns I am not going to use. Loading the data this way also automatically turns the categorical variables into the R type "factor".
library(data.table)
enem1=fread("enem_proc2.csv",drop=c(1:3,7:17,20,30:81,82:93,94:97,102:117,144:158),data.table=FALSE, na.strings=c("","NA"))
enem2=fread("enem_proc3.csv",drop=c(1:3,7:17,20,30:81,82:93,94:97,102:117,144:158),data.table=FALSE, na.strings=c("","NA"))
Putting everything into one dataset:
enem_br=rbind(enem1,enem2)
And eliminating all students who did not answer all questions:
enem_br=enem_br[complete.cases(enem_br), ]
We are left with only 558,022 observations. Let's save this into a file so we can load it more easily next time:
write.csv(enem_br,"enem_completo.csv")
enem_br=read.csv("enem_completo.csv")
All the remaining students have the same value for the variables "TP_ST_CONCLUSAO" and "TP_ANO_CONCLUIU":
table(enem_br$TP_ST_CONCLUSAO)
table(enem_br$TP_ANO_CONCLUIU)
So I eliminate these unnecessary columns:
enem_br$TP_ST_CONCLUSAO=NULL
enem_br$TP_ANO_CONCLUIU=NULL
Combining the science, humanities, language and math grades into a single average ("nota" is Portuguese for grade):
NOTA=enem_br$NU_NOTA_CN+enem_br$NU_NOTA_CH+enem_br$NU_NOTA_LC+enem_br$NU_NOTA_MT
NOTA=NOTA/4
enem_br$NOTA=NOTA
enem_br$NU_NOTA_CN=NULL
enem_br$NU_NOTA_CH=NULL
enem_br$NU_NOTA_LC=NULL
enem_br$NU_NOTA_MT=NULL
Time to make a linear regression model. First, let's define the formula:
f=NOTA ~ NU_IDADE + TP_SEXO + TP_ESTADO_CIVIL +
Q001 + Q002 + Q003 + Q004 + Q005 + Q006 + Q007 + Q008 + Q009 +
Q010 + Q011 + Q012 + Q013 + Q014 + Q015 + Q016 + Q017 + Q018 +
Q019 + Q020 + Q021 + Q022 + Q023 + Q024 + Q025 + Q026 + Q042 +
Q043 + Q044 + Q045 + Q046 + Q047 + Q048 + Q049 + Q050
Actually, I copied the above formula from my test notebooks. To build it, I used a command resembling the one below:
f = as.formula(paste('NU_NOTA_MT ~', paste(colnames(enem_test)[3:4], collapse='+'),"+",paste(colnames(enem_test)[9:11], collapse='+'),"+",paste(colnames(enem_test)[13:62], collapse='+')))
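As an aside, base R's reformulate() builds the same kind of formula a bit more cleanly (assuming the same column layout of enem_test as above):

f = reformulate(colnames(enem_test)[c(3:4, 9:11, 13:62)], response="NU_NOTA_MT")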
I will also split the dataset into two parts: a training set and a test set.
library(caTools)
set.seed(93)
spl = sample.split(enem_br$NOTA, SplitRatio = 0.6)
train = subset(enem_br, spl==TRUE)
test = subset(enem_br, spl==FALSE)
Saving again...
write.csv(train,"train.csv")
write.csv(test,"test.csv")
test=read.csv("test.csv")
train=read.csv("train.csv")
Train the model:
reg=lm(f,data=train)
Get predictions on the test set:
reg_pred = predict(reg, newdata=test)
And calculate the out-of-sample $R^2$:
SSE=sum((reg_pred - test$NOTA)^2)
SST=sum((mean(train$NOTA) - test$NOTA)^2)
r2=1-SSE/SST
r2
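For reference, this is the usual out-of-sample $R^2$, with the baseline prediction being the mean grade of the training set:

$$R^2_{\text{oos}} = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (\bar{y}_{\text{train}} - y_i)^2},$$

where both sums run over the test set.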
Having a background in the physical sciences, 0.42 seems like a low $R^2$ to me. The last scientific paper I worked on featured a linear model whose $R^2$ sometimes dropped below 0.9. I thought that wasn't great, but given the inherent uncertainty of the data it was still good.
However, a friend who studies geography told me that in certain fields of human knowledge it's almost impossible to obtain such high values of $R^2$. Moreover, I found this quotation in the well-known book "An Introduction to Statistical Learning":
"... in certain problems in physics... we could expect to see an $R^2$ that is extremely close to 1... On the other hand, typical applications in biology, psychology, marketing and other domains, the linear model is at best an extremely rough approximation to the data... In this setting... $R^2$ value well bellow 0.1 might be more realistic!"
This quelled my worries a bit.
In addition, my out-of-sample $R^2$ was actually negative when I used the much smaller dataset I was running tests on.
Given the circumstances, $0.42$ doesn't look so bad anymore.
Besides calculating the out-of-sample $R^2$ on the test set (hold-out cross-validation), we can use a type of cross-validation called leave-one-out cross-validation (LOOCV). This method would usually be computationally unfeasible on a dataset with $n = 558022$, since it involves fitting the linear model $n$ times. Fortunately, for linear models there is a "shortcut" that makes this cross-validation scheme only as computationally expensive as training the model once, and it is implemented in R (in the forecast library).
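The shortcut comes from the fact that, for a linear model fitted once on all the data, the leave-one-out prediction errors can be recovered from the ordinary residuals $e_i$ and the leverages $h_i$ (the diagonal entries of the hat matrix):

$$\text{CV} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1-h_i}\right)^2$$

so no model ever needs to be refitted.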
We start by training the model on the whole set:
total=rbind(train,test)
reg2=lm(f,data=total)
library("forecast")
CV(reg2)
The first entry of the above output (the CV row) is the model's estimated MSE (mean squared error), as calculated by LOOCV. AdjR2 is the adjusted $R^2$ of the model; it is very similar to what was obtained with reg. For comparison, the out-of-sample MSE of reg (the model trained on the training set, evaluated on the test set) is:
MSE = SSE/nrow(test)  # SSE over the test set, computed above
MSE
This comes out higher, which makes sense: reg was trained on a smaller set and evaluated on data it had never seen, whereas reg2 was trained on the whole set and is cross-validated.
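As a side note, the AdjR2 reported by CV() is the usual adjusted $R^2$,

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},$$

where $p$ is the number of predictors; with $n$ this large the adjustment is tiny, so it is essentially the plain $R^2$.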
Let's take a look at the coefficients of reg2:
results=summary(reg2)
# Pair each coefficient estimate (column 1) with its p-value (column 4)
foo=cbind(results$coefficients[,1],results$coefficients[,4])
# Sort by estimate (ties broken by p-value), largest first
foo[order(foo[,1],foo[,2],decreasing=TRUE), ]
Question 006 is "What is the total monthly income of your family?". Option A, the baseline, stands for no income. The larger the family income, the more likely a student is to do well in the test. This is not surprising, but I suspect monthly income is strongly linked to other variables. In other words, I wonder if having a higher income is really the cause of a student doing well in the test, or having a larger income just means a student will have access to better schools, to computers, internet, transportation, etc. I will come back to this question later.
Question 042 asks "What kind of primary and middle school did you go to?". Options E and F correspond to "I only attended an Indian school" and "I mostly attended an Indian school", respectively. I had no idea what an "Indian school" was. It turns out it is a school for Native American populations, where children are taught in both Portuguese and a native language (e.g. Guaraní). Q042F has a small p-value and its coefficient is large and positive. Q042E also has a positive coefficient, but its p-value is too large for me to actually interpret it.
Are these kinds of schools doing a better job than average? I'm going to take a look at the average grade of the students who attended them.
ss1=subset(total,total$Q042=='E')
ss2=subset(total,total$Q042=='F')
mean(total$NOTA)
mean(ss1$NOTA)
mean(ss2$NOTA)
It seems to be the case that students who attended native schools have better grades, on average. However, if we look at their total number:
nrow(ss1)+nrow(ss2)
we can see that it is very small compared to the total number of students in the set:
nrow(total)
This is not surprising, considering I only selected white students. Therefore, the conclusion that students who attended native schools do better than others can hardly be trusted. Nonetheless, it might be worth exploring this in a set with more native students.
I want to try a very simple model where I only consider the student's family income level (Q006) and see what happens:
reg3=lm("NOTA~Q006",data=total)
CV(reg3)
summary(reg3)
This is interesting: the cross-validated $R^2$ is still relatively large. Many of the questions (Q007 to Q025) are of the type "is there a car in your home?" or "is there a toilet in your home?" (with the different levels indicating how many of the items are present in the student's home). I think Q006 and these other questions depend on each other very strongly. I am going to try a few single-variable models, each using one of these questions in place of Q006:
reg4=lm(NOTA~Q008,data=total)  # fitted on the full set, for a fair comparison with reg3
summary(reg4)
CV(reg4)
A lower adjusted $R^2$ and a higher MSE than reg3. Let's build single-variable models for all of these questions:
qvector=c("Q007",
"Q008",
"Q009",
"Q010",
"Q011",
"Q012",
"Q013",
"Q014",
"Q015",
"Q016",
"Q017",
"Q018",
"Q019",
"Q020",
"Q021",
"Q022",
"Q023",
"Q024",
"Q025" )
length(qvector)
regs=vector("list", length(qvector))
cvs=vector("list", length(qvector))
for (i in seq_along(qvector))
{
    # Single-predictor model: NOTA ~ Qxxx
    f = as.formula(paste('NOTA ~', qvector[i]))
    regs[[i]]=lm(f,data=total)
    cvs[[i]]=CV(regs[[i]])
}
MSE=vector("numeric", length(qvector))
r2s=vector("numeric", length(qvector))
for (i in seq_along(qvector))
{
    MSE[i]=cvs[[i]][1]  # the "CV" entry: LOOCV estimate of the MSE
    r2s[i]=cvs[[i]][5]  # the "AdjR2" entry: adjusted R^2
}
min(MSE)
max(r2s)
None of these models can outperform reg3. What about all of them together?
f=NOTA ~ Q007 + Q008 + Q009 +
Q010 + Q011 + Q012 + Q013 + Q014 + Q015 + Q016 + Q017 + Q018 +
Q019 + Q020 + Q021 + Q022 + Q023 + Q024 + Q025
reg5=lm(f,data=total)
summary(reg5)
CV(reg5)
The MSE and $R^2$ are comparable to reg3's, despite the much larger number of variables! So, to simplify the full model, I will keep Q006 and drop Q007 to Q025:
f=NOTA ~ NU_IDADE + TP_SEXO + TP_ESTADO_CIVIL +
Q001 + Q002 + Q003 + Q004 + Q005 + Q006 + Q026 + Q042 +
Q043 + Q044 + Q045 + Q046 + Q047 + Q048 + Q049 + Q050
reg6=lm(f,data=total)
CV(reg6)
So reg6 has an MSE and $R^2$ comparable to those of reg2, even though the former has 20 fewer variables. Pretty cool.
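Since reg6 is nested inside reg2 (it just drops Q007 to Q025), the two can also be compared with a formal F-test. A quick sketch, keeping in mind that with $n$ this large even practically negligible improvements tend to come out statistically significant:

anova(reg6, reg2)  # F-test: do Q007-Q025 significantly reduce the residual sum of squares?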
Possible next steps:
- Use automated model selection methods, such as the LASSO (see the sketch after this list).
- Use unsupervised learning to see if I can build different models for different subsets of the dataset.
- Use k-fold cross-validation instead of LOOCV. While LOOCV gives an almost unbiased estimate of the MSE, its estimates have high variance; k-fold cross-validation might give a more accurate estimate of the true error (it has lower variance), and I would like to compare the two.
- Use geographical data instead of the socio-economic questionnaire as the predictors to build the model (might have been easier to start this way).
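As a rough sketch of what the LASSO step could look like (an illustration, not something from the original project; it reuses the total data frame and the last formula f, i.e. the one from reg6, and relies on the glmnet package):

library(glmnet)
# glmnet needs a numeric matrix, so expand the factors into dummy variables
x = model.matrix(f, data=total)[,-1]  # drop the intercept column
y = total$NOTA
# alpha=1 selects the LASSO penalty; cv.glmnet picks the penalty strength by k-fold CV
cvfit = cv.glmnet(x, y, alpha=1)
coef(cvfit, s="lambda.1se")  # coefficients of the most regularized model within 1 SE of the best

The LASSO shrinks some coefficients exactly to zero, so the surviving levels give an automatic, if crude, form of variable selection.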
One final observation: the piece of code below removes infrequent levels from the factor variables in the dataset. I sometimes ran into the problem that very infrequent levels (e.g. a very, very high family income) would appear only in the test set. This was an issue because the trained model had no coefficient for that particular level, so I got an error when I tried to compute predictions on the test set.
enem_b=enem_br
# Columns 13 to 62 hold the questionnaire factors
for(i in 13:62) {
    # Drop rows whose level in column i occurs fewer than 10 times
    # (as.numeric() on a factor returns the integer level codes,
    # which line up with the positions returned by which())
    enem_b = enem_b[!(as.numeric(enem_b[[i]]) %in% which(table(enem_b[[i]])<10)),]
}
# Finally, remove the now-empty levels from the factors
enem_b=droplevels(enem_b)