PREDICTING STUDENTS’ PERFORMANCE IN INTRODUCTORY PROGRAMMING COURSES: A LITERATURE REVIEW

The teaching-learning process in programming with university freshmen is often associated with high failure and dropout rates. These outcomes frustrate both students and teachers, and there is a need to verify the causes of these failures. By predicting the causes of these problems, we can try to control them, or at least plan the courses so as to avoid failure in the identified cases. The purpose of this paper is to analyze the scientific production concerning the prediction of students’ performance in introductory programming courses. The analysis covers articles indexed in Clarivate Analytics’ Web of Science and Elsevier’s Scopus. The sample includes a total of 30 articles. The results obtained by bibliometric analysis show when and where those documents were published, who the authors are and what the focus of the articles is. We also analyzed the most cited documents and summarized the articles. We obtained a global overview of the theme that is useful for teachers in the process of helping students achieve success in introductory programming courses at universities.


I. INTRODUCTION
Teaching and learning programming at the beginning of university courses worry many researchers and teachers [1] [2] [3]. The results reported are often disastrous: they tend to be associated with unwanted failure and dropout rates [4] [5] [6]. The propaedeutic character of courses such as introduction to programming causes the students' paths to be altered by their performance: if they manage to reach high levels of success, the rest of the course is facilitated; on the contrary, failure in these curricular units demotivates students, which often leads them to drop out of the course [7]. Are there students' characteristics that may indicate the level of success? Are there elements of the students' past career that make them more, or less, likely to drop out? What kind of motivation, habits and interests make students achieve lower or higher grades? These are just some examples of questions that can be asked, but there are many more [8]. By using students' characteristics (e.g. past career, motivation, habits, interests, among others) to predict their success or failure, we can foresee problems and try to guide the students' performance (and even the teachers' behaviour) onto a better path [9].
It is possible to try to predict students' performance using data mining (the process of discovering patterns in data [10]) and machine learning techniques, which aim at analysing data to find meaningful patterns. There is a subfield of data mining, called educational data mining, which consists of the application of data mining and machine learning techniques to educational data [11]. It has lately received increased attention from the scientific community [12] [13]. Surveys of educational data mining are presented in [11] and, more recently, in [14]. After that, several reviews have been published on the same subject [12] [15] [16] [17] [18] [19] [20]. All these studies are important to give clues on how research in the area has been carried out and, eventually, to predict its future. The purpose of this paper is to analyse the scientific production that concerns the prediction of students' performance in higher education, specifically for introductory programming courses. We consider articles indexed in Clarivate Analytics' Web of Science and Elsevier's Scopus. The sample includes a total of 30 articles. The results obtained with the bibliometric analysis show when and where those documents were published, who the authors are and what the focus of the research is.
Bibliometric analysis [21] is the quantitative study of bibliographic material: it provides a general picture of a research field that can be classified by papers, authors and journals. Bibliometric methods employ a quantitative approach for the description, evaluation, and monitoring of published research. These methods have the potential to introduce a systematic, transparent and reproducible review process and thus improve the quality of reviews [22]. Bibliometric analysis provides objective criteria that can assess the research development in a field and act as a valuable tool for measuring scholarship quality and productivity [23]. Bibliometric methods offer systematization and replication processes that can improve understanding of the dissemination of knowledge in a field and can highlight gaps and opportunities that may contribute to the advancement of the discipline [24]. We also analysed the most cited documents and made a summary of the articles. This document is subdivided into several sections. First, we describe the research questions, followed by the methodology and the bibliometric results. Then we show the results of the article content analysis, ending with the discussion of the results obtained, the conclusions and suggested future work.

II. THE RESEARCH QUESTIONS
The research question, together with the purpose of the review, the intended deliverables and the intended audience, determines how the data is identified, collected and presented [25]. As already mentioned, we wish to study documents, concerning the prediction of students' performance in introductory programming courses in higher education, published in high quality journals. Accordingly, this work aims at answering bibliometric and content related questions.
The bibliometric questions considered are:
BQ1: When were the articles published?
BQ2: What is the type of these publications?
BQ3: Where were the articles published?
BQ4: What is the focus of the articles?
BQ5: Who publishes on the subject?
BQ6: Are there clusters of authors who publish together?
BQ7: What are the most cited articles?
The content related questions considered are:
CQ1: Which papers use machine learning as a technique? (The remaining content related questions refer only to the papers that use machine learning as a technique.)
CQ2: What kind of data was used?
CQ3: What is the aim of the publication?
CQ4: Which machine learning task was considered?
CQ5: Which algorithms were considered?
CQ6: What were their findings?

III. METHODOLOGY
The term bibliometrics was first used in 1969 by Alan Pritchard, who hoped that the term would be used explicitly in all studies which seek to quantify the processes of written communication and would quickly gain acceptance in the field of information science [26]. Moed mentioned the possibilities of this type of study, which reveal the enormous potential of quantitative, bibliometric analysis of the scholarly literature for a deeper understanding of its activity and performance, and highlight their policy relevance [27]. In scientific research, it is important to get a wider perspective of research already being conducted concerning a relevant subject matter [28] and a bibliometric analysis profile of the research trajectory and dynamics of the research activities across the globe [29]. This is a bibliometric study that systematically analyses the literature using articles indexed in Elsevier's Scopus (Scopus) and Clarivate Analytics' Web of Science (WoS) databases. This study conducts a bibliometric analysis of international journal papers that we expect will provide a useful reference for future research. The search strategy was TITLE-ABS-KEY (predict*) AND TITLE-ABS-KEY ("learn to program" OR "learning programming" OR "introductory programming" OR "novice programming" OR "introduction to programming") AND TITLE-ABS-KEY (university* OR "higher education") PUBYEAR: < 2020.
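Because the same article can be indexed in both Scopus and WoS, the two exported result sets have to be merged and duplicate records discounted before any counting is done. The following is a minimal sketch of that step, not the procedure actually used in this study: it assumes each exported record is a dictionary with hypothetical `doi` and `title` fields, and the example records are invented for illustration.

```python
import re


def record_key(record):
    """Return a key that identifies the same article across databases."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    # No DOI available: fall back to a normalised title
    # (lowercase, runs of non-alphanumeric characters collapsed to a space).
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return ("title", title)


def merge_records(*result_sets):
    """Merge result sets, keeping the first record seen for each key."""
    merged = {}
    for records in result_sets:
        for record in records:
            merged.setdefault(record_key(record), record)
    return list(merged.values())


# Hypothetical example records, not taken from the sample of this review.
scopus = [
    {"doi": "10.1000/a1", "title": "Predicting Novice Success"},
    {"doi": "", "title": "Early Prediction in CS1"},
]
wos = [
    {"doi": "10.1000/A1", "title": "Predicting novice success."},
    {"doi": None, "title": "early prediction in CS1"},
    {"doi": "10.1000/b2", "title": "Motivation and Programming Aptitude"},
]

unique = merge_records(scopus, wos)
print(len(unique))
```

One limitation of this sketch is that a record carrying a DOI in one database and none in the other is not matched, so in practice the merged list would still deserve a manual check.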

IV. BIBLIOMETRIC RESULTS
A set of 20 published papers was collected from WoS and 26 from Scopus. The search returned a total of 30 documents after discounting the duplicate results. The first article in Scopus was published in 1986 and the second in 2006 (Fig. 1. Annual evolution of published documents.). Of the documents, 20 (67%) are conference papers and 10 (33%) are journal articles, as we can see in Fig. 2. The next table (Table 1) shows the conferences where the papers were published, including the number of publications (n), h-index (H), SCImago Journal Rank (SJR 2018), and the country where the conference was held. As can be observed in the figure, the clusters found were: C1: CS1, data processing, data programming and introductory computing. C2: motivation, programming aptitude, self-efficacy and teaching programming. C3: educational data mining, introductory programming and prediction.
Seven documents (23%) are written by two authors, six (20%) by three authors and another six (20%) by four authors (Fig. 5. Number of authors by document.). The affiliation of these eleven authors is: one from the Czech Republic, three from Portugal, three from Turkey and four from the United States. We found three clusters (Fig. 6. Author network visualization.). We can see that there is a great diversity of institutions that publish and work on the subject: Stanford University [30], Federal University of Alagoas [31], University of Durham [32], University of Auckland [33], Washburn University [34], University of Washington [35], University of Otago [36] and University of Helsinki [37]. There are 17 organizations with more than two documents. We found one cluster of the countries of origin of the most cited authors (Australia and New Zealand).

V. CONTENT RESULTS
From the full set of  [45], 2012 [30], 2013 [32], 2017 [31] [39] [41] [42], 2018 [40] [43] and 2019 [38] [44], and regard studies performed in different universities from different countries.
The second content related research question (CQ2: What kind of data was used?) concerns the data used in each study. To answer it, we need to analyse the sample size and also the type of information present in the data. As for the sample size, the studies analysed considered samples ranging from 41 students in [39] to 505 in [40]. One of these studies [31] even considers two different samples of students: a sample of [41] 1 students enrolled in regular classes and another one of 262 students enrolled in online classes. As for the information present in the data, it included sociodemographic data [39] [42] [45], psychometric data [38] [41] [44], statistical data [41] [44], data related to course activity and course statistics [31] [39] [43] [45], previous grades [42], and data concerning automatically evaluated programming exercises [30] [32] [40].
For the third content related research question (CQ3: What is the aim of the publication?), the aims stated for each study include improving each student's programming skills [44], predicting students' performance [32] [38] [39] [41] [42] or program correctness [40], identifying at-risk students [43], exploring the effects of an instructional intervention for increasing student motivation [45], graphically modelling students' progress [30], and comparing the effectiveness of different educational data mining techniques [31].
All these 11 studies used machine learning to try to predict students' grades. To answer the fourth content related research question (CQ4: Which machine learning task was considered?), we analysed the publications' content and found that the machine learning tasks considered were unsupervised learning [30] and supervised learning, namely regression [32] [38] [43] [45] and classification [31] [39] [40] [41] [42] [44].

Fig. 6. Author network visualization. From this figure, we can see that three clusters were found: C1: A. Gomes, F. B. Correia and P. H. Abreu. C2: R. Anderson, M. D. Ernst, R. Ordóñez, P. Pham, S. A. Wolfman and B. Tribelhorn. C3: E. Deveci, D. Aydin, K. S. Benli and F. B. Tek. Considering the first author, there are 18 different countries: the United States has five documents, while Australia, Brazil, New Zealand, Portugal, Spain and Turkey have two documents each (Fig. 7. Countries of the first author's address.).

Table 1. Conference papers. As the table shows, the ACM Technical Symposium on Computer Science Education and the Frontiers in Education Conference each have three articles published on the subject. Of the 13 conferences, nine are in the United States (Table 1).
The next table (Table 2. Journal papers.) shows the journals that have published work on this theme, including h-index (H), SCImago Journal Rank (SJR), journal quartile (Q), and the country where the journal is based.
As can be observed in the table, 10 journals have published articles on the subject, one article each (Table 2. Journal papers.). We found 136 different keywords, 168 in total. The most frequent keywords are CS1, programming Education & Educational Research, Educational Data Mining, Introductory programming, and programming aptitude, as presented in the next table (Table 3. Most frequent keywords.).
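The keyword figures above (distinct keywords versus keywords in total, and the most frequent ones) amount to a frequency count over the keyword lists of the documents. A minimal sketch of such a count follows. The keyword lists are hypothetical and are not the data of this review, and the case and whitespace normalisation is an assumption, since the way keyword variants were merged is not described here.

```python
from collections import Counter


def keyword_counts(documents):
    """Count in how many documents each normalised keyword appears."""
    counts = Counter()
    for keywords in documents:
        # Lowercase and collapse whitespace; the set ensures a keyword
        # is counted at most once per document.
        counts.update({" ".join(k.lower().split()) for k in keywords})
    return counts


# Hypothetical keyword lists, one list per document.
documents = [
    ["CS1", "Educational Data Mining"],
    ["cs1", "Prediction"],
    ["Introductory programming", "educational  data mining", "CS1"],
]

counts = keyword_counts(documents)
distinct = len(counts)           # number of different keywords
total = sum(counts.values())     # number of keyword occurrences overall
print(distinct, total)
print(counts.most_common(2))
```

With real data, `counts.most_common(n)` would give the rows of a table of most frequent keywords, while `distinct` and `total` correspond to the two totals reported above.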