Predicting Student Performance Using Discussion Forums' Participation Data
Education lacks reliable mechanisms for the early detection of potentially at-risk students. An earlier prediction of student performance gives instructors ample time to meet with and assist under-achieving students. As with any predictive modeling problem, there are many candidate predictors to choose from when formulating a model. Previous work has shown limited success in predicting course performance from students' personal and socioeconomic traits. Because students learn by asking clarifying questions, discussion boards have long been a staple of learning at the university level. This paper therefore uses participation in discussion forums to predict final student performance. Given students' course grades at roughly the halfway point of the term and various discussion forum predictors, our model predicts each student's final percentage score. Using the model's prediction, instructors can speak with at-risk students and discuss ways to improve. The student grade and discussion board participation datasets were gathered from a graduate-level Electrical and Computer Engineering (ECE) course at Duke University. We explore various classical machine learning models, with random forest yielding the highest accuracy. This random forest model, trained on discussion forum participation data, surpasses other similarly trained state-of-the-art models. Furthermore, related research addresses the classification problem of predicting which discrete letter grade a student will earn [1]. A letter grade does not accurately represent a student's performance, so we instead address the regression problem of predicting the exact percentage a student will earn. A significant finding of this paper is that our random forest model predicts student performance with an average error of approximately 2.3%.
Additionally, our random forest model generalizes to a different graduate-level course, where it makes performance predictions with an average error of 3.3%. The final important finding is that a model that includes discussion board predictors outperforms one whose sole predictor is the students' halfway-point grade. This indicates that discussion forum participation holds significant value in determining final performance. We envision that our findings and our best-performing random forest model can enable instructors to preemptively identify and support potentially at-risk students.
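The comparison described above, between a model that uses discussion forum predictors and one that uses only the halfway-point grade, can be illustrated with a minimal sketch. It assumes scikit-learn's RandomForestRegressor and measures error as mean absolute error on a held-out split. The abstract names neither the forum predictors, the hyperparameters, nor the error metric, so the feature names (`posts`, `replies`), the settings, and the metric are hypothetical choices. The data is synthetic and does not reproduce the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; NOT the course data from the paper).
rng = np.random.default_rng(0)
n_students = 400
midterm_pct = rng.uniform(60, 100, n_students)  # grade at the halfway point
posts = rng.poisson(5, n_students)              # assumed forum predictor
replies = rng.poisson(3, n_students)            # assumed forum predictor

# Assumed relationship, chosen only so that forum activity carries signal.
noise = rng.normal(0, 1.5, n_students)
final_pct = np.clip(
    0.7 * midterm_pct + 1.5 * posts + 1.0 * replies + 12 + noise, 0, 100
)

# Full feature set versus the halfway-point grade alone.
X_full = np.column_stack([midterm_pct, posts, replies])
X_base = midterm_pct.reshape(-1, 1)

# One shared split so both models are scored on the same students.
train_idx, test_idx = train_test_split(
    np.arange(n_students), test_size=0.25, random_state=0
)


def held_out_mae(X, y):
    """Fit a random forest on the training rows and return test-set MAE."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return mean_absolute_error(y[test_idx], model.predict(X[test_idx]))


mae_full = held_out_mae(X_full, final_pct)
mae_base = held_out_mae(X_base, final_pct)
print(f"MAE with forum predictors:   {mae_full:.2f}")
print(f"MAE with midterm grade only: {mae_base:.2f}")
```

Because the synthetic final score is constructed to depend on forum activity, the full model has the lower error here by design. On real course data the size and direction of the gap would have to be measured rather than assumed.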