Predicting students’ performance using survey data

The acquisition of competences for the development of computer programs is one of the main challenges faced by computer science students. When students fail to develop the abilities needed (for example, abstraction), they drop out of the subjects and sometimes even of the course. There is a need to study the causes of student success (or failure) in introductory curricular units, to look for behaviours or characteristics that may be determinant and thus try to prevent and change those causes. The students of one programming curricular unit were invited to answer four surveys. We use machine learning techniques to try to predict the students' grades based on the answers obtained in the surveys. The results obtained enable us to plan the semester accordingly, by anticipating how many students might need extra support. We hope to increase the students' motivation and, with this, their interest in the subject. This way we aim to accomplish our ultimate goal: reducing dropout and increasing the overall average student performance.


I. INTRODUCTION
Universidade Portucalense offers an undergraduate course in computer science. In the course, students take several technologically relevant curricular units (CUs). The first programming CU, Algorithm and Programming (AP) [1], appears in the first semester of the first year for computer science students, with a weekly load of six hours, corresponding to eight credits in the European Credit Transfer System (ECTS).
These types of CUs have a high failure rate, and students often drop out during the semester once they realise they will not pass the CU. It is very important for students to succeed in this specific CU for two main reasons: first, because the contents learned will be needed in CUs of subsequent semesters; second, because success in this CU usually drives motivation and performance during the remainder of the course.
These technologically relevant CUs are therefore essential to good performance in the course, not only because of the contents learned, but also because they are connected to the CUs of the following years.
The acquisition of competences for the development of computer programs in this CU is one of the main challenges faced by the students.
The AP CU grading process includes five individual tests and a project. In this study we cover the first three tests (T1, T2 and T3), each taken at the end of a block of subjects. These subjects include arrays, multidimensional structures, ordering, searching, linked lists using records, string handling, and other introductory subjects.
Grading uses a 0–20 scale. If the student takes the test (does not miss it), a positive result (grade ≥ 9.5) means that the student completes the CU successfully (pass); otherwise, the student fails the CU.
In this CU, the first approach to algorithmic thinking is made simultaneously with the introduction to a specific programming language. The students start by learning how to use the C programming language to solve small problems. Students are also expected to use compilers and to know how to correctly debug programs.
There is a need to study the causes of student success (or failure) in introductory CUs to look for behaviours or characteristics that may be determinant and thus try to prevent and change the causes of failure.
The difficulty in addressing this issue revealed the need to search for alternative ways to approach it. The objective of this study is to profile the students at the beginning and during the semester, evaluating how their knowledge evolves throughout the CU.
For this, the students were invited to answer four surveys: one at the beginning of the semester (S0) and one at the end of each subject (after each test: S1, S2 and S3). However, as the surveys were optional, not all students answered all surveys. These surveys were published on the CU page in Moodle right after each test and were available for the students to answer.

978-1-7281-0930-5/20/$31.00 ©2020 IEEE. IEEE Global Engineering Education Conference (EDUCON), 27–30 April 2020, Porto, Portugal.

At the beginning of the semester, there were 56 students enrolled in the CU, of which 39 answered S0. Surveys S1 and S3 had 21 responses each, and survey S2 had 17 responses. At the end of the semester, only 25 students attended all three tests. This means that, as a result of dropout, the number of students who can still succeed decreases throughout the semester.
Aiming at reducing drop out and increasing the overall average student performance, we use machine learning techniques to try to predict student grades, using the answers given on the surveys as features.
Next, we introduce some concepts needed to understand this work. After that, we describe the surveys offered to the students. Then, we explain the methodology used for this study, followed by the results obtained. Finally, we present our conclusions and future work.

II. BACKGROUND
In this section we start by presenting a summary of data mining, machine learning and the classification task. We then explain educational data mining as a sub-field of data mining, and analyse some works performed in the area.

A. Data mining and machine learning
Data mining (DM) is the process of discovering patterns in data [2]. There are some restrictions: 1) the process must be automatic or, at least, semi-automatic; 2) the patterns found must be useful (i.e., must lead to some advantage); and 3) large quantities of data must be present.
Machine learning (ML) techniques aim at analysing data (commonly represented in tabular form, as datasets) to find meaningful patterns. ML problems can be divided into unsupervised and supervised learning problems. In unsupervised learning there is no dependent variable, while in supervised learning at least one dependent variable is present in the dataset. There are tasks specific to unsupervised learning (e.g., clustering, association), each with its own performance metrics.
Our research focuses on supervised learning problems, where the value of the dependent variable is present on the data and can be considered for building the prediction model, and it can also be used for evaluating the models' performance.
There are several tasks in supervised ML (e.g., classification, regression). The one used in this study is classification, which consists of creating a model that tries to fit the function that best approximates the true value of the dependent variable. In classification, the dependent variable assumes a value from a finite set of values (classes).
There are many algorithms for classification [3], [4], including those used in this work: top-down induction of decision trees (for simplicity referred to as decision trees, DT, implemented in R with rpart [5]) and random forest (RF, implemented in R with randomForest [6]).
The predictive performance of classification methods can be evaluated with several metrics. In this work we use prediction accuracy, a commonly used performance metric for classification. The evaluation is based on leave-one-out cross-validation: for each problem (dataset), the learning process is repeated once per instance, each time using a different instance as the test set and the remaining instances as the training set. For example, for a dataset with ten instances, the process is repeated ten times, each time with one instance as the test set and the remaining nine as the training set.
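As a concrete illustration of this procedure: the models in this paper are built in R (rpart, randomForest), but the same leave-one-out evaluation can be sketched in Python with scikit-learn. The dataset below is synthetic and purely illustrative.

```python
# Leave-one-out cross-validation accuracy for a classifier, sketched
# with scikit-learn on synthetic data (the paper itself uses R's
# rpart and randomForest; all names here are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=30, n_features=8, random_state=0)

# One model fit per instance: each instance serves as the test set
# exactly once, with the remaining n-1 instances as the training set.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=LeaveOneOut(), scoring="accuracy")
loocv_accuracy = scores.mean()  # fraction of held-out instances predicted correctly
print(f"LOOCV accuracy over {len(scores)} folds: {loocv_accuracy:.2f}")
```

Each fold's score is 0 or 1 (one test instance), so the mean is the overall fraction of correct held-out predictions.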

B. Educational Data Mining
Educational data mining is the application of data mining and ML techniques to educational data [7] and has lately received increased attention from the scientific community [8], [9]. Surveys of educational data mining are presented in [9] and, more recently, in [10]. Since then, several reviews have been published on the same subject [8], [11]–[16]. Next, we describe two recent works in educational data mining that are similar to the work presented here.
In [7] the authors use two sets of data (one obtained before the school year began and another obtained two months after the beginning of classes) from several schools in a district in Brazil to try to predict students' outcomes. The most relevant features were found to be 'grades' and 'absence' (obtained after the school year began), although features obtained before the school year began, such as 'neighbourhood', 'school' and 'age', were also found to be good indicators of the students' outcomes. The study was conducted using classification models based on gradient boosting trees (GBT).
In [17] the authors use data mining algorithms to support decisions that prevent student dropout. The data was collected during the first six semesters of Computer Science, Information Systems and Computer Engineering courses. With the objective of evaluating different prediction models, several supervised learning techniques were used. Random forest and decision trees were the best performing algorithms over the three courses considered.
In this study, we aim at predicting the students' performance (grades) similarly to what is done in [7]: we also use classification techniques, and the data is also collected before and during the semester. However, we only use data from students attending one of the CUs at the University (Algorithm and Programming), and we do not use sociodemographic data such as 'neighbourhood'. Our work is also similar to the one performed in [17], since the study is conducted with students attending Computer Science (or similar) courses, and we use data mining techniques as an aid to predict (and, thus, prevent) student dropout.

III. DESCRIPTION OF THE SURVEYS
Answers to the survey questions are either yes/no (Y/N) or given on a scale ([1][2][3][4][5] for values between 1 and 5). In any case, the student can choose not to answer a particular question, in which case the answer is recorded as "NA" (no answer). If a student does not answer a full survey, then all of that survey's answers are considered to be "NA".
At the beginning of the semester, before the first test, the students were invited to answer survey S0. The questions are presented on Table I. With the answers obtained on survey S0, we aim at predicting the students' performance on the first test (T1).
The second survey (S1) was applied after the first test (T1). The questions are presented on Table II.
The answers obtained in the surveys performed before the second test (first and second surveys, S0 and S1) together with the grades obtained on the first test (n Ex1) are used to try to predict the grade obtained on the second test (T2).
After the second test, the students were invited to answer the third survey (S2). The questions are presented on Table III. All the answers obtained before the last test (on the first three surveys, S0, S1 and S2), together with the grades obtained on the first two tests (n Ex1 obtained on T1 and n Ex2 obtained on T2), can be used to try to predict the students' performance on the third test (T3).
Finally, after all the tests, at the end of the semester, the students were invited to answer a last survey (S3).

IV. METHODOLOGY
We are trying to predict students' grades based on their answers to the several surveys conducted during the semester. As we can only use past information to predict the grade of each test, we consider three ML problems (P1, P2 and P3), as depicted on Figure 1:
• P1: use the answers provided in the first survey (S0) to predict the grade on the first test (T1);
• P2: use the answers provided in the first (S0) and second (S1) surveys, together with the students' grade on the first test (T1, stored in the variable n Ex1), to predict the grade on the second test (T2);
• P3: use the answers provided in the first three surveys (S0, S1 and S2), together with the students' grades on the first two tests (T1, with the variable n Ex1, and T2, with the variable n Ex2), to predict the grade on the third test (T3).
For each problem, we consider the dependent variable in two different ways (types):
• Predict pass, fail or miss: predict whether the student passes, fails or misses the test. The possible values are pass (P), fail (F) and miss (M), as presented on Table V.
• Predict the grade level: predict which of five levels (N1 to N5) the student's grade falls into, or whether the student misses the test (M).
For each problem and type, we create models to try to make the predictions with DT (only used for P1, predicting the grades on the first test) and RF (on all problems).
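The two target encodings can be sketched as simple mapping functions. Note that the N1–N5 boundaries below are an assumption (uniform four-point bins), chosen to be consistent with the N2 (4–7) and N4 (12–15) ranges cited later in the results; the exact boundaries are not listed in full here.

```python
# Encoding a test result as each of the two dependent variables used
# in this study. The N1-N5 cut points are an assumption: uniform
# four-point bins, consistent with N2 = 4-7 and N4 = 12-15 in the text.
def pass_fail_miss(grade):
    """Grade on the 0-20 scale, or None if the student missed the test."""
    if grade is None:
        return "M"  # missed the test
    return "P" if grade >= 9.5 else "F"

def grade_level(grade):
    """Map a 0-20 grade (or None for a missed test) to a level label."""
    if grade is None:
        return "M"
    if grade <= 3:
        return "N1"
    if grade <= 7:
        return "N2"
    if grade <= 11:
        return "N3"
    if grade <= 15:
        return "N4"
    return "N5"

print(pass_fail_miss(13), grade_level(13))  # a grade of 13 is a pass at level N4
```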
When using RF, we also apply Recursive Feature Elimination (RFE, implemented in R's caret package [19]) to select the most important features, instead of using them all.
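caret's RFE is an R facility; the same idea, recursively dropping the least important features as ranked by a random forest, can be sketched in Python with scikit-learn (synthetic data, illustrative feature count):

```python
# Recursive Feature Elimination sketch. The paper uses R's caret; this
# scikit-learn equivalent repeatedly drops the features ranked least
# important by a random forest until the target count remains.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=40, n_features=12,
                           n_informative=4, random_state=0)

selector = RFE(RandomForestClassifier(random_state=0),
               n_features_to_select=4, step=1)  # drop one feature per round
selector.fit(X, y)
kept = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", kept)
```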
We evaluate the models in terms of their accuracy (percentage of correct predictions) using Leave-one-out Cross-Validation and compare them to some baseline models.
We use different baseline models (BL) for each type of problem:
• Predict pass, fail or miss: we consider three different baseline models, as described on Table VII:
P: model that always predicts that the student will pass the test
F: model that always predicts that the student will fail the test
M: model that always predicts that the student will miss the test
• Predict the grade level: we consider six different baseline models, as described on Table VIII:
N1: model that always predicts that the student will obtain an N1 level grade on the test
N2: model that always predicts that the student will obtain an N2 level grade on the test
N3: model that always predicts that the student will obtain an N3 level grade on the test
N4: model that always predicts that the student will obtain an N4 level grade on the test
N5: model that always predicts that the student will obtain an N5 level grade on the test
M: model that always predicts that the student will miss the test
The accuracies of the baseline models are presented on Table IX, and also in the next section together with the models' results.
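Each baseline's accuracy is simply the relative frequency of its constant prediction among the observed outcomes. A minimal sketch, with a made-up outcome list (not the study's data):

```python
# Accuracy of the constant-prediction baselines (always P, always F,
# always M) on a hypothetical list of observed outcomes. A baseline's
# accuracy equals the relative frequency of the class it predicts.
outcomes = ["M", "F", "P", "M", "F", "M", "P", "F", "M", "F"]  # illustrative only

for cls in ["P", "F", "M"]:
    acc = outcomes.count(cls) / len(outcomes)
    print(f"baseline '{cls}': accuracy = {acc:.2f}")
```

The same computation applies to the six grade-level baselines, with classes N1 to N5 and M.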


V. RESULTS
Here we present the accuracy results obtained for each type of problem (A: predict pass, fail or miss; B: predict the grade level), compared to the baseline models' accuracies presented on Table IX.

A. Predict pass, fail or miss
With this we try to predict whether a student will pass, fail or miss a test based on the answers given on the surveys. We approach the three problems presented on Figure 1 with different algorithms: we use DT for problem P1 and RF for all three problems (P1, P2 and P3).
First, we analyse the decision tree created for P1 (predict pass, fail or miss on the first test), presented on Figure 2. At the top node, the diagram shows the overall percentages of the outcomes: when considering the whole dataset (100%), 36% of the students fail, 39% miss and 25% pass the test. This means that, without considering any of the variables, the most probable outcome is for the student to miss the test.
Then, we can see that the tree splits into two branches, according to the value on variable i0 X6 (How much do you like technology?). If the student does not answer (NA) this question (the answer to i0 X6=NA represents 43% of the students), then there is a probability of 25% that the test outcome will be fail, 67% that it will be miss and 8% that it will be pass. In this case the most probable outcome is that the student will miss the test.
If the answer to question i0 X6 (How much do you like technology?) is not NA (the remaining 57% of students), then the right branch of the tree is taken. In this case, 44% of the students fail the test, 19% miss it and 38% pass. At this point, without considering any more variables, the most probable outcome is for the student to fail the test.
However, the tree splits again into two branches, this time according to variable i0 X42 (Do you often use digital libraries?). If the answer to this question is No (N) or if the student did not answer (NA) this question (30% of the students), then the left branch is taken. From within the students in this case, 59% fail, 24% miss and 18% pass the test. In this case the most probable outcome is that the student will fail the test.
Otherwise, if the answer to i0 X42 (Do you often use digital libraries?) is Yes (Y), meaning that the student often uses digital libraries (27% of the students), the right branch is taken, and in this case, 27% of the students fail, 13% miss and 60% pass the test. In this case the most probable outcome is that the student will pass the test.
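The tree in Figure 2 was grown with R's rpart, which handles NA answers through surrogate splits. As a rough analogue in Python with scikit-learn (made-up answers; "NA" is simply encoded as one more category, a simplification of rpart's behaviour):

```python
# Fitting and printing a small classification tree, loosely analogous
# to the rpart tree of Figure 2. The survey answers below are made up;
# "NA" is treated as just another category via ordinal encoding.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

answers = [["NA", "N"], ["4", "Y"], ["5", "Y"], ["NA", "NA"],
           ["3", "N"], ["5", "Y"], ["2", "NA"], ["4", "N"]]
outcome = ["M", "P", "P", "M", "F", "P", "M", "F"]  # pass/fail/miss per student

X = OrdinalEncoder().fit_transform(answers)  # categories -> integer codes
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, outcome)
print(export_text(tree, feature_names=["i0_X6", "i0_X42"]))
```

Inspecting the printed splits this way mirrors the walk through the branches described above, with class proportions at each node.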
Besides using DT for problem P1, we also use RF to address all the problems (P1, P2 and P3). We considered RF in two different ways: 1) RF: considering all the variables; 2) RF(RFE): considering only the variables determined by Recursive Feature Elimination (RFE, that selects the variables with higher impact on the dependent variable). The variables determined by RFE are presented on Table X. Note that RFE found variable i0 X3 (age) important for all the problems considered. Besides that, variable n Ex1 (grade on the first test) is found important for predicting the grade of the second test, and the same happens for variables n Ex1 (grade on the first test) and n Ex2 (grade on the second test) for predicting the grade obtained on the third test. It is interesting that, as expected, the grades obtained previously can be an indicator of the students' performance. Another interesting observation is that when there are grades available, the model does not need many more variables to accurately predict the student's performance.

B. Predict the grade level
With this we try to predict a student's grade level in the test, by considering the 0–20 scale divided into five levels: N1, N2, N3, N4 and N5. This prediction is also based on the answers given on the surveys. We approach the three problems presented on Figure 1 with different algorithms: we use DT for problem P1 and RF for all three problems (P1, P2 and P3).
First, we analyse the decision tree created for P1 (predict the grade level on the first test), presented on Figure 3. The diagram of the decision tree presented lets us see, at the top node, the overall percentage of the outcomes: when considering the whole dataset (100%), 39% of the students miss, 5% obtain N1 level, 23% obtain N2, 12% obtain N3, 18% N4 and 2% N5. This means that, without considering any of the variables, the most probable outcome is for the student to miss the test.
Then, we can see that the tree splits into two branches, according to the value on variable i0 X18 (Do you often use the web for Social Networks?). If the answer to this question is No (N) or if the student did not answer (NA) this question (43% of the students), then the probability of the outcome being miss is 67% and the probabilities of obtaining levels N1, N2, N3, N4 or N5 are, respectively, 8%, 17%, 0%, 8% and 0%.
If the answer to question i0 X18 (Do you often use the web for Social Networks?) is not N or NA (the remaining 57% of students), then the right branch of the tree is taken. In this case, the probability of the outcome being miss is 19% and the probabilities of obtaining levels N1, N2, N3, N4 or N5 are, respectively, 3%, 28%, 22%, 25% and 3%. At this point, without considering any more variables, the most probable outcome is for the student to obtain level N2 (a grade between 4 and 7) in the test.
However, the tree splits again into two branches, this time according to variable i0 X42 (Do you often use digital libraries?). If the answer to this question is No (N) or if the student did not answer (NA) this question (30% of the students), then the left branch is taken. From within the students in this case, the probability of the outcome being miss is 24% and the probabilities of obtaining levels N1, N2, N3, N4 or N5 are, respectively, 6%, 35%, 24%, 6% and 6%. In this case, the most probable outcome is for the student to obtain an N2 level grade.
Otherwise, if the answer to i0 X42 (Do you often use digital libraries?) is Yes (Y), meaning that the student often uses digital libraries (27% of the students), the right branch is taken, and in this case, the probability of the outcome being miss is 13% and the probabilities of obtaining levels N1, N2, N3, N4 or N5 are, respectively, 0%, 20%, 20%, 47% and 0%. This means that, in this case, the most probable outcome is for the student to obtain an N4 level (between 12 and 15) grade.
The information obtained here is similar to the one obtained in the previous decision tree (Figure 2), since N2 level students have failed the test, and N4 level means that the students have passed the test.
In the same way as for predicting whether the student will pass, fail or miss the test, besides using DT for problem P1, we also use RF and RF(RFE) for problems P1, P2 and P3. The variables determined by RFE are presented on Table XII. By analysing the selected variables on Table XII, we identified a set considered as having high impact on the dependent variable in both problems P2 and P3; these are shown on Table XIII. The variables selected suggest that age (i0 X3) is an important factor to consider when trying to predict the students' performance. The motivation to use technology (i0 X25) is also important, together with some previously acquired IT skills (i0 X33, i0 X34, i0 X35, i0 X38). Finally, and even more importantly, the students' interest in the CU (i1 X5, i1 X16, i1 X18) also needs to be considered.
The results presented on the table show that our models often have higher accuracies than the baseline models. However, the accuracies obtained were not as high as those obtained when predicting pass, fail or miss, which suggests that this problem is more difficult to approach than the previous one. For P1, the best result was obtained with DT; for P2 and P3, RF with RFE was the best approach.
Despite being a more difficult problem category, leading to lower accuracies, the analysis of these results suggests that, as expected, the students' interest in the CU is a determinant factor for achieving good results. With this, we can plan ahead and try to make the students more motivated and interested in the CU, in order to reduce dropout and increase the overall average student performance.

VI. CONCLUSIONS AND FUTURE WORK
With this study, we try to predict students' performance based on answers provided to surveys. The students were invited to reply to the surveys, but responses were not mandatory. This led to a limited number of answers, since not every student replied.
We used machine learning techniques (decision trees and random forest) to approach the problem. Results suggest that machine learning can be used for the task, and we obtained models with high accuracies when compared to some baseline models.
The model obtained for the first category of problems (predict if the student will pass, fail or miss a test) can help us anticipate how many students will need extra support.
The model obtained for the second category of problems (predicting the level of the grade obtained by the student on the test), despite achieving lower accuracies, suggests that, as expected, the interest of the students on the CU is a determinant factor for achieving good results.
Both results enable us to plan the semester accordingly, by anticipating how many students might need extra support. We wish to motivate the students and, with this, increase their interest in the CU. This way we aim to accomplish our ultimate goal: reducing dropout and increasing the overall average student performance.
As future work, we intend to extend the datasets by using data obtained in more than one semester. We also plan to use different machine learning techniques to try to obtain better-performing models.