Behind the Paper

Generative artificial intelligence for automated writing evaluation: A systematic review of trends, efficacy, and challenges

This systematic review shows that Generative AI in Automated Writing Evaluation supports grammar and rubric-based assessment but remains inconsistent in evaluating creativity and argumentation. While it offers pedagogical benefits, concerns such as academic integrity limit its use in high-stakes assessment.

Shadi Abudalfa Jun 11, 2026

Generative Artificial Intelligence (GenAI) is rapidly transforming educational practices, particularly in the domain of writing assessment. As large language models (LLMs) such as GPT, Gemini, Claude, and DeepSeek continue to advance, their ability to generate, evaluate, and provide feedback on written text has opened new possibilities for Automated Writing Evaluation (AWE). These technologies are increasingly being integrated into educational environments to support writing instruction, reduce teacher workload, and provide immediate, scalable feedback to learners. However, alongside these opportunities come important questions regarding reliability, validity, fairness, and pedagogical effectiveness.

This study presents a systematic review of the emerging literature on the use of Generative AI in Automated Writing Evaluation. Although traditional AWE systems have been studied for decades, the introduction of GenAI represents a significant shift from rule-based and statistical approaches toward context-aware, language-rich systems capable of producing detailed feedback and rubric-based evaluations. Despite the growing adoption of these technologies, the evidence on their effectiveness, educational value, and associated risks has not yet been comprehensively synthesized.

To address this gap, we conducted an extensive review of 96 empirical studies published between 2022 and early 2025. Following PRISMA guidelines and employing the Context–Intervention–Mechanism–Outcome (CIMO) framework, our study systematically examined how GenAI has been utilized in writing assessment, the contexts in which it has been deployed, the mechanisms through which it influences teaching and learning, and the outcomes reported across different educational settings.

Our analysis reveals that GenAI-based AWE systems are primarily used in three key areas: qualitative feedback generation, automated essay scoring, and language editing. Among these applications, feedback generation emerged as the most common use case. The reviewed studies demonstrate that GenAI can effectively identify grammatical errors, improve sentence structure, enhance coherence, and support organization in student writing. Furthermore, many studies report that AI-generated feedback encourages revision, increases learner engagement, and promotes self-regulated learning.

One of the key findings of this review is that the effectiveness of GenAI in writing evaluation is highly dependent on context. While these systems consistently perform well in surface-level writing tasks such as grammar correction, spelling, vocabulary enhancement, and rubric-guided scoring, they remain less reliable when evaluating higher-order writing skills. Assessing argumentation quality, creativity, critical thinking, rhetorical effectiveness, and disciplinary reasoning continues to present significant challenges for current AI systems. As a result, many studies characterize GenAI as conditionally effective rather than universally transformative.

The review also highlights important pedagogical implications. On the positive side, GenAI enables immediate and scalable feedback, making writing support more accessible to large numbers of learners. Teachers frequently report reduced workload and faster feedback cycles when AI tools assist with routine evaluation tasks. Students often experience increased motivation and confidence due to the availability of instant feedback and revision support. These benefits suggest that GenAI can play a valuable role in formative assessment environments.

However, the findings also reveal several limitations and concerns. Many studies report that AI-generated feedback can be overly generic, vague, or lacking in pedagogical depth. While GenAI excels at identifying language-related issues, it often struggles to provide meaningful guidance on complex writing constructs. Furthermore, concerns regarding scoring consistency, transparency, bias, and fairness remain unresolved. A small change to a prompt, or an update to the underlying model, can produce a different evaluation of the same text, raising questions about reliability and reproducibility.
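One simple way to make the reproducibility concern concrete is to score the same essay several times (for example, under slightly different prompts or model versions) and summarize how much the scores move. The sketch below is only an illustration under stated assumptions: the function name `score_spread`, the essay identifiers, and the scores are all hypothetical, and none of the numbers come from the reviewed studies.

```python
from statistics import mean, pstdev


def score_spread(scores_by_essay):
    """Summarize how much repeated scores for the same essay vary.

    scores_by_essay maps a (hypothetical) essay id to the list of scores
    returned for that essay across prompt variants or model versions.
    """
    summary = {}
    for essay_id, scores in scores_by_essay.items():
        summary[essay_id] = {
            "mean": mean(scores),
            "std": pstdev(scores),               # population standard deviation
            "range": max(scores) - min(scores),  # largest disagreement between runs
        }
    return summary


# Made-up illustrative scores on a 1-6 scale, not data from the review.
runs = {
    "essay_a": [4, 4, 4, 4],  # stable across runs
    "essay_b": [3, 5, 4, 2],  # unstable across runs
}
report = score_spread(runs)
```

A range of zero means every run agreed; a wide range on an unchanged text is the kind of instability that would matter in a high-stakes setting.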

Another major challenge involves ethical and educational considerations. Researchers have expressed concerns about academic integrity, student overreliance on AI-generated feedback, privacy issues, and the potential erosion of learner agency. Excessive dependence on AI tools may reduce critical thinking, creativity, and independent writing development. Additionally, the opaque decision-making processes of many large language models make it difficult for educators to understand or justify automated scores and feedback.

Our review also uncovers several important research gaps. First, there is a need for stronger theoretical foundations, as a majority of existing studies lack explicit theoretical frameworks. Second, most research has been conducted in higher education settings, leaving primary, secondary, and professional education contexts relatively underexplored. Third, there is a pressing need for standardized evaluation frameworks that move beyond simple agreement with human raters and instead assess validity, fairness, and long-term learning outcomes. Finally, more research is needed on multilingual and cross-cultural applications of GenAI-based AWE, as current evidence remains concentrated in a limited number of countries and languages.
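For readers unfamiliar with how agreement with human raters is usually quantified, the sketch below computes quadratic weighted kappa, a metric commonly used in automated essay scoring research. It is offered as an assumption about what "simple agreement" typically looks like in practice; the review itself does not prescribe this metric, and the function name and parameters are illustrative.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means chance-level agreement,
    and negative values mean worse than chance.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score scale needs at least two levels")

    # Confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    total = len(human)
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)  # larger gaps cost more
            expected = hist_human[i] * hist_model[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A high kappa only shows that the model reproduces human scores. It says nothing about whether those scores are valid, whether they are fair across groups of learners, or whether the feedback improves writing over time, which is why the review calls for broader evaluation frameworks.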

Looking forward, the successful integration of GenAI into writing assessment will require a balanced approach that combines technological innovation with human expertise. Rather than replacing teachers, GenAI should be viewed as a complementary tool that supports educators by handling routine feedback tasks while allowing human instructors to focus on higher-order aspects of writing development. Human oversight remains essential to ensure fairness, contextual sensitivity, and meaningful pedagogical guidance.

In conclusion, this review demonstrates that Generative AI has substantial potential to reshape Automated Writing Evaluation. Its strengths in providing rapid, scalable feedback, which is most consistent on surface-level features of writing, make it a valuable resource for writing instruction. Nevertheless, significant challenges remain regarding validity, reliability, fairness, transparency, and the assessment of higher-order writing skills. As educational institutions continue to adopt AI-powered assessment tools, future research must focus on developing robust evaluation frameworks, improving model transparency, and ensuring that these technologies are deployed in ways that enhance rather than diminish educational quality.

The future of writing assessment will likely involve collaborative partnerships between human educators and intelligent systems. By carefully addressing the challenges identified in this review, researchers and practitioners can help ensure that GenAI becomes a powerful and responsible tool for supporting the next generation of writing education.