Accepted Papers
A Combined Strategy for the Pedagogical Evaluation of Automated Feedback: Generation, Decision and Fairness
Badmavasan Kirouchenassamy1, Amel Yessad1, Sébastien Jolivet2, Sébastien Lallé1, Vanda Luengo1
1 LIP6, Sorbonne Université, France 2 IUFE & TECFA, Université de Genève, Switzerland
Automated feedback is increasingly used in digital learning environments, yet its pedagogical evaluation is too often reduced to a single accuracy or satisfaction metric. We argue that pedagogy is intrinsically multidimensional and that evaluating automated feedback therefore requires a combined strategy that addresses, separately and explicitly, (i) the pedagogical validity of the feedback, (ii) the pedagogical alignment of the decision of which feedback to deliver and when, and (iii) the fairness of the deployed system across learner populations. We describe each layer, illustrate it with our ongoing work on multiple online Python practice platforms deployed in French high schools, and outline how the three layers can be operationalised together using a reference pedagogical model, large language models and demographic indicators of equity. Our aim is not to propose a single new metric but to argue for a layered evaluation pipeline that mirrors the multidimensional nature of feedback itself.
Beyond Surface Human-Likeness: AI–Mentor Feedback Alignment, Pedagogical Adaptation, and Student Engagement in Longitudinal Learning Data
Jing Fan1, Jiseon Kim2, Kimia Abedini1, Daniel Koch-Truhponen1, Charles Koutcheme1, Juho Leinonen1
1 Aalto University, Finland 2 KAIST Global Institute for Talented Education, South Korea
Automated feedback is often evaluated by comparing it with human feedback, but surface resemblance may be an insufficient proxy for pedagogical quality. This extended abstract examines AI–mentor feedback alignment in the Anonymized program, a 10-week online science program for underserved gifted elementary students. The dataset includes 920 student-week observations from 92 students, with AI-generated feedback, AI next-session guidance, mentor-refined feedback, task scores, and participation records. We compared AI and mentor feedback using string similarity, token overlap, TF-IDF cosine similarity, sentence-embedding semantic similarity, and traditional lexicon-based sentiment analysis with TextBlob and VADER. Results show low surface similarity but moderate-to-high semantic similarity, suggesting that mentors often rewrote AI feedback while preserving its core meaning. TextBlob indicated a positive mentor-minus-AI shift, while VADER suggested that both AI and mentor feedback were already highly positive. Adaptation also differed by instructional group: Basic group students received feedback that was less similar to AI output and more positively reframed than Advanced group students. These findings suggest that automated feedback evaluation should move beyond surface human-likeness and consider semantic preservation, pedagogical adaptation, affective tone, learner level, and engagement-related indicators.
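As an illustration of the surface-level measures named in the abstract above, the following minimal sketch computes a character-level string similarity and a token-overlap score for one pair of texts. It is not the authors' pipeline: the function names, the choice of `difflib.SequenceMatcher` for string similarity, the Jaccard definition of token overlap, and the two example sentences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Token overlap as Jaccard similarity over lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical feedback pair, invented for this example only.
ai_text = "Great work on the experiment. Next week, record your measurements."
mentor_text = "You did a wonderful job with the experiment! Please write down what you measure next week."

print(f"string similarity: {string_similarity(ai_text, mentor_text):.2f}")
print(f"token overlap:     {token_jaccard(ai_text, mentor_text):.2f}")
```

A rewritten but meaning-preserving pair such as this one would be expected to score low on both measures, which is why the abstract contrasts them with embedding-based semantic similarity.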
Evaluating and Interpreting Gender Bias in LLM Feedback: Span-Level Embedding-Based Evidence from Automated Essay Feedback
Yishan Du1, Maria Perez Ortiz1, Mutlu Cukurova1
1 University College London, London, UK
As large language models (LLMs) are used to generate feedback on student writing at scale, concerns are growing that they may reproduce or amplify gender bias in pedagogically consequential ways. Recent embedding-based benchmarking work has shown that counterfactual gender cues can produce significant semantic divergence in LLM-generated essay feedback, especially under implicit gender conditions. However, such bias is difficult to evaluate because it increasingly emerges through subtle, context-dependent differences in tone, questioning, evaluation, revision guidance, and learner positioning. Thus, a key challenge remains: how can such bias be localised, interpreted, and connected to its downstream pedagogical meaning? This paper addresses this challenge by proposing a span-level embedding-based evaluation framework for analysing gender bias in LLM-generated essay feedback. Using 300 essays from the AES corpus, this work analysed 600 feedback responses generated by GPT-4o mini under an original male-associated condition and a male-to-female counterfactual condition. Feedback responses are segmented into local spans, embedded, and aligned across counterfactual pairs using cosine similarity. The study then estimates each matched span pair’s contribution to global cross-condition semantic separation through a leave-one-out cosine influence statistic and assesses significance using one-sided permutation tests. Analysis of the significant span pairs (n=217, p<.05) reveals a systematic shift in pedagogical framing: (1) male-associated feedback is longer, more evaluative, and more strategy-oriented, focusing on argument, organisation, historical context, proofreading, and coherence; (2) counterfactual female feedback is shorter, more interrogative, and more focused on experiential, relational, and affective details.
We argue that this pattern represents pedagogical framing bias, where gender cues influence not only what feedback says, but what kinds of learning opportunities it provides. This study contributes an interpretable NLP-based method for connecting embedding-level bias detection with pedagogically sensitive evaluation of automated feedback.
Evaluating the Pedagogical Quality of LLM-Generated Feedback: A Criterion-Based and Comparative Study
Harvey Ngoe Kolle1, Carrie Demmans Epp1, Amna Liaqat2, Maria Cutumisu3
1 University of Alberta, Canada 2 George Mason University, United States 3 McGill University, Canada
Evaluating automated feedback on pedagogical grounds requires more than a single holistic judgment. We developed a multidimensional 14-item rating instrument grounded in formative feedback theory and used it to compare feedback from a multi-agent system, a single-agent system, and an instructor. Eighteen evaluators (instructors and pre-service teachers) rated and ranked 54 feedback instances that were tied to adult English language learner writing. Both automated conditions received substantially higher scores than human feedback. The only significant difference between the two automated conditions was on supportive tone, where the multi-agent condition received higher ratings than the single-agent condition. These results show that a multidimensional approach identifies differences in feedback quality that a single overall rating could miss.
Revision-Loop Behavior and Learning Outcomes under Voluntary AI Formative Feedback in an Undergraduate Statistics Course
Lifeng Han1
1 Tulane University, United States
We report a single-section pilot (n = 63) of an instructor-built AI grader-and-tutor that lets students upload draft worksheets and receive itemized rubric feedback before submission to a teaching assistant. Across one semester the system logged 913 substantive conversations, roughly half of them continued by the student past the first AI response. Cross-sectional regressions of exam outcomes on usage measures are null after controlling for prior performance, consistent with strong selection into voluntary opt-in. A within-student panel on per-week recitation worksheet scores tells a different story: in weeks a student opts in, their recitation score rises by 0.90 points out of 10 relative to their own typical week, and each additional revision-loop conversation that week adds a further 0.68 points. Clustering reveals three behavioral phenotypes (light, skimmer, and iterator) not captured by usage counts alone: high-volume skimmers gain little, whereas iterators, who revise most despite a lower baseline, are the only group to outperform what their baseline predicts. We argue that pedagogical evaluation of automated feedback should look past usage volume to behavioral telemetry of revision, which is passively collected by any platform that supports re-submission.
Supporting Tutors in the Gig Economy with Automated Feedback: A Case Study on Ringle
Yeon Su Park1, Sieun Kim2, Keighley Overbay3, Seoyoung Kim1, Sewook Wee4, Daho Jung1, Juho Kim1
1 Korea Advanced Institute of Science and Technology, South Korea 2 University of Michigan, United States 3 Samsung Research, South Korea 4 Ringle, United States
The rise of online tutoring platforms in the gig economy has made education more scalable, flexible, and on-demand. These platforms rely on learner evaluations as the primary feedback for tutors and platforms. However, such feedback offers limited guidance for tutors’ improvement and makes it difficult to monitor tutor quality at scale. To this end, we explored AI-powered automated feedback and how tutors perceive and respond to it. We deployed a research probe on Ringle, a popular online English tutoring platform, providing automated feedback by analyzing tutors’ lessons, and surveyed 36 tutors. Our findings reveal that while tutors perceived automated feedback more negatively than learner feedback, they found it useful for self-monitoring and understanding platform expectations, though discrepancies between the two sources of feedback often caused confusion. Based on these insights, we propose design considerations for feedback systems for online educational gig platforms at scale.
The Correct Answer Trap: Pedagogically-Grounded Detection and Feedback for Hidden Misconceptions
Moiz Imran1, Sahan Bulathwela1
1 University College London, United Kingdom
Automated feedback systems that rely on answer correctness will reinforce, rather than address, misconceptions when students reach the correct answer through flawed reasoning. We study this failure mode using 20,964 real student responses from the Eedi mathematics platform. Fine-tuned classifiers detect only 57% of these hidden misconceptions, and standard ML interventions do not improve on this. An open-weight reasoning model detects 84%, but at realistic prevalence, false alarms outnumber genuine detections roughly 8 to 1. We present a graduated assessment grounded in the mark/method distinction from educational assessment, and a detect-verify-escalate pipeline that routes ambiguous cases to diagnostic follow-up questions before teacher escalation. Two deployment modes adapt the pipeline to teacher dashboards (where false positives cost human time) and autonomous tutors (where every flag triggers a low-cost formative interaction).
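The base-rate effect behind the false-alarm claim in the abstract above can be sketched with a short calculation. This is only an illustration of the general arithmetic, not the authors' analysis: the function name is hypothetical, and the false-positive rate and prevalence values below are invented for the example, since the abstract reports neither. Only the 84% detection rate comes from the abstract.

```python
def false_alarms_per_detection(recall: float,
                               false_positive_rate: float,
                               prevalence: float) -> float:
    """Expected number of false alarms for each genuine detection.

    recall: share of true misconceptions that get flagged.
    false_positive_rate: share of sound responses that get flagged anyway.
    prevalence: share of responses that actually hide a misconception.
    """
    if not (0.0 < prevalence <= 1.0) or recall <= 0.0:
        raise ValueError("prevalence must be in (0, 1] and recall must be > 0")
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return false_positives / true_positives


# 0.84 is the detection rate stated in the abstract; the false-positive rate
# and the prevalence values are hypothetical and chosen only to show the trend.
for prevalence in (0.20, 0.10, 0.05):
    ratio = false_alarms_per_detection(0.84, 0.10, prevalence)
    print(f"prevalence {prevalence:.2f}: {ratio:.2f} false alarms per detection")
```

With the detector held fixed, the ratio rises as prevalence falls, which is why a high detection rate alone does not settle whether flags are worth a teacher's time.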
Using a Learning Progression Framework to Guide LLM-Based Formative Assessment in STEM Education
Karen D. Wang1, Jialin Li2, Carl Wieman2, Leonora Kaldaras3
1 San Jose State University, United States 2 Stanford University, United States 3 University of Houston, United States
This study examines how a learning progression (LP) framework can adapt LLMs for formative assessment of open-ended math-science sensemaking responses. We developed LP-aligned prompts for scoring and feedback generation, then evaluated scoring performance on 191 student responses and feedback quality on a stratified subsample. GPT-5.4 achieved 85.9% agreement and a weighted kappa of 0.79 with human scores. Feedback evaluation across five pedagogical dimensions indicated strong overall quality. Findings demonstrate the promise of grounding LLM-based formative assessment in LP frameworks.
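For readers unfamiliar with the two agreement statistics reported in the abstract above, the sketch below shows how percent agreement and a weighted kappa can be computed for ordinal scores. It is a minimal illustration under stated assumptions: the abstract does not say which weighting scheme was used, so quadratic weights are assumed here (the `power` argument switches to linear weights with `power=1`), the function names are hypothetical, and the toy score lists are invented.

```python
from collections import Counter


def percent_agreement(human: list[int], model: list[int]) -> float:
    """Share of responses where both raters gave exactly the same score."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def weighted_kappa(human: list[int], model: list[int], power: int = 2) -> float:
    """Cohen's weighted kappa with |i - j| ** power disagreement weights."""
    n = len(human)
    # Observed mean disagreement across paired scores.
    observed = sum(abs(h - m) ** power for h, m in zip(human, model)) / n
    # Disagreement expected by chance from the two raters' score distributions.
    human_counts, model_counts = Counter(human), Counter(model)
    expected = sum(
        hc * mc * abs(i - j) ** power
        for i, hc in human_counts.items()
        for j, mc in model_counts.items()
    ) / (n * n)
    if expected == 0:
        return 1.0  # both raters used one identical score for every response
    return 1.0 - observed / expected


# Invented ordinal scores (levels 0 to 3), for illustration only.
human_scores = [0, 1, 2, 3, 2, 1, 0, 3]
model_scores = [0, 1, 2, 2, 2, 1, 1, 3]

print(f"agreement:      {percent_agreement(human_scores, model_scores):.3f}")
print(f"weighted kappa: {weighted_kappa(human_scores, model_scores):.3f}")
```

Unlike raw agreement, the weighted statistic penalises a two-level miss more than a one-level miss and corrects for the agreement expected by chance.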
Does Caring Cost Precision? Evaluating Anxiety-Framed LLM Misconception Feedback for Undergraduate Mathematics
Amanda La Hadi1, Muhammad Johan Alibasa1, A. Taufiq Asyhari1
1 BSD Campus, Monash University, Indonesia
Large language models (LLMs) can generate fluent mathematics feedback, but pedagogical quality requires more than diagnostic accuracy. This study audits LLM-generated misconception feedback for emerging adult learners by comparing a strict diagnostic prompt with personalised, anxiety-framed variants. Across 810 English-language outputs, with 270 generated under each of three prompt conditions, diagnostic accuracy was high but varied by prompt design. The strict diagnostic prompt (P3) achieved 90.4% accuracy, the anxiety-framed prompt (P5) achieved 89.6%, and the fully integrated prompt (P6) achieved 97.8%. These findings indicate that incorporating learner context can alter misconception diagnosis, with its effect depending on how affective and diagnostic instructions are integrated. In this study, the observed improvement was substantially greater than the reduction in accuracy.