
The future of AI in high stakes testing: the fairness question

By IOE Blog Editor, on 3 April 2025

[Image: Backs of rows of students using computers in a classroom. Credit: .shock via Adobe Stock.]


By Sandra Leaton Gray

Artificial intelligence (AI) is transforming high-stakes testing. But how do candidates experience these tests? Are they trusted as fair and reliable measures? In 2023 a major research collaboration between UCL IOE and Pearson explored these crucial questions, focusing on the AI-driven PTE Academic test, a computer-based English language proficiency test often used to support study and visa applications and professional registration. Nearly two years on from the publication of our research report, AI-led assessment has continued to expand, raising new challenges around equity, bias and transparency. While AI has the potential to improve efficiency and standardisation, its role in shaping test-takers’ futures demands ongoing scrutiny.

AI in language testing: findings from the UCL-Pearson study

PTE Academic is one of the world’s most widely used AI-powered English proficiency tests, offering fully automated scoring. Unlike traditional exams, it involves no human examiners, which ensures consistency, rapid results and scalability. However, the shift from human to AI assessment raises questions about trust and fairness, and about how test-takers adapt to this new reality.

Using surveys, interviews and micro-analytic techniques, IOE researchers explored how candidates interact with AI-based testing. The study found:

Trust in AI varied

Many candidates valued AI’s ability to remove human bias. However, some were sceptical about how AI assessed spoken language, particularly nuances such as intonation, hesitations and cultural variation in speech. The study suggested that even greater transparency in AI scoring mechanisms could improve trust, particularly in high-stakes exams.

Challenges in test preparation

Some candidates struggled to prepare for an AI-marked test, feeling unsure how best to optimise their responses. Pearson provides preparation materials and mock tests, but the shadow test preparation industry has capitalised on this uncertainty, offering expensive AI-specific coaching. A key question is whether AI testing will widen social inequalities if wealthier candidates can afford targeted preparation while others cannot.

Mixed emotional responses

Some candidates appreciated the objectivity of AI marking, while others found the lack of human interaction unsettling. Speaking tests were particularly divisive, with some missing the natural back-and-forth of human conversation. More interactive or responsive elements that better replicate real-world communication may come with time.

Comparisons with human-scored tests

While some candidates felt AI eliminated bias, others questioned whether it could fully assess spoken language proficiency beyond pronunciation and fluency. This aligns with concerns about AI’s limitations in assessing complex human expression.

What has changed since we published the report?

The role of AI in assessment has evolved significantly over the past two years, with new developments reinforcing the need for careful oversight.

Advances in AI reasoning

In early 2025, researchers announced that AI models had made significant advances in logical reasoning, allowing them to handle complex assessment tasks with greater accuracy. This has fuelled debates over whether AI should assess critical thinking and higher-order skills, or if it risks reducing nuanced responses to formulaic patterns.

AI in professional assessments

The use of AI in high-stakes testing is no longer limited to language proficiency. Top law firms like Linklaters are now making AI sit legal exams to evaluate its ability to provide legal advice. This raises wider questions about whether AI should take on a greater role in professional certification, and how human judgement should be integrated into automated assessments.

Universities rethinking assessment

As one example, the University of South Australia has introduced oral examinations (viva voce) to counter concerns about AI-generated work in written assessments. This reflects a growing trend of human-centred assessments re-emerging as a safeguard against AI influence, challenging the assumption that AI should be fully autonomous in grading.

AI’s expanding role in language assessment

AI-powered speech recognition is now embedded in many assessment systems. Universities and employers increasingly rely on automated tools to evaluate language proficiency, providing instant feedback on grammar, pronunciation and fluency. However, research from Berkeley is exploring whether these systems reinforce linguistic biases, disadvantaging speakers with diverse accents or non-standard speech patterns.

Where do we go from here?

AI-driven assessment is not going away, but how it is implemented will shape the future of fairness in testing. Key areas for improvement include:

Greater transparency in AI scoring

Candidates should have clearer insights into how AI evaluates responses and why certain scores are given. At present, many AI-driven assessments function as ‘black boxes’, where test-takers receive scores without a clear explanation of how their responses were analysed. Transparency could be improved by providing detailed scoring rubrics, exemplars of high- and low-scoring responses, and, where feasible, interpretability tools that allow candidates to see how their performance aligns with assessment criteria. This would not only increase trust in AI assessments but also help candidates better prepare by understanding what is being measured.
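
To make the idea of a more interpretable score report concrete, here is a minimal sketch in Python. The criterion names, weights, 0–90 scale and exemplar notes are purely illustrative assumptions, not Pearson’s actual rubric or scoring mechanism; the point is simply what a criterion-level breakdown with exemplars could look like from a candidate’s perspective.

```python
from dataclasses import dataclass

# Hypothetical per-criterion breakdown a more transparent scoring report might expose.
# Criterion names, weights and the 0-90 scale are illustrative, not an actual rubric.
@dataclass
class CriterionScore:
    name: str           # e.g. "fluency", "pronunciation", "content"
    weight: float       # contribution of this criterion to the overall score
    score: float        # candidate's score on this criterion (0-90, as an example)
    exemplar_note: str  # short pointer to what a high-scoring response looks like

def overall_score(criteria: list[CriterionScore]) -> float:
    """Weighted aggregate of the criterion-level scores."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * c.score for c in criteria) / total_weight

def explain(criteria: list[CriterionScore]) -> str:
    """Produce the kind of plain-language breakdown candidates currently lack."""
    lines = [f"Overall: {overall_score(criteria):.1f}"]
    for c in sorted(criteria, key=lambda c: c.weight, reverse=True):
        lines.append(f"- {c.name} (weight {c.weight:.0%}): {c.score:.0f}. {c.exemplar_note}")
    return "\n".join(lines)

print(explain([
    CriterionScore("content", 0.40, 72, "High scorers address all parts of the prompt."),
    CriterionScore("fluency", 0.35, 65, "High scorers keep an even pace with few long pauses."),
    CriterionScore("pronunciation", 0.25, 80, "High scorers are intelligible to most listeners."),
]))
```

A report along these lines, paired with published rubrics and exemplar responses, would let candidates see not just a number but which criteria drove it.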

Ensuring accessibility for all learners

AI testing should be designed with diverse learners in mind, ensuring no group is disadvantaged due to socio-economic background, language variety or access to resources. Research has shown that digital assessment platforms can inadvertently disadvantage candidates who are less familiar with technology, raising equity concerns. In addition, AI speech recognition systems may not always accommodate different accents, dialects or speech patterns, leading to potential biases in scoring. To address this, developers should implement rigorous fairness auditing, expand training datasets to encompass a broader range of linguistic and cultural backgrounds, and provide accessible test interfaces with accommodations for neurodivergent and disabled candidates.
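
As a rough illustration of what routine fairness auditing might involve, the sketch below compares mean automated scores across self-reported accent groups and flags large gaps. The group labels, sample scores and the five-point flag threshold are assumptions made for the example, not findings or standards from the study.

```python
import statistics
from collections import defaultdict

# Illustrative fairness audit: compare automated scores across candidate groups.
# Group labels, data and the flag threshold are assumptions for this sketch only.
def audit_score_gaps(records, flag_gap=5.0):
    """records: iterable of (group, score) pairs. Returns per-group mean scores and
    flags any group whose mean falls more than `flag_gap` points below the best group."""
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)

    means = {g: statistics.mean(scores) for g, scores in by_group.items()}
    best = max(means.values())
    flagged = {g: round(best - m, 1) for g, m in means.items() if best - m > flag_gap}
    return means, flagged

records = [
    ("Accent group A", 71), ("Accent group A", 68), ("Accent group A", 74),
    ("Accent group B", 62), ("Accent group B", 60), ("Accent group B", 66),
]
means, flagged = audit_score_gaps(records)
print(means)    # mean automated score per group
print(flagged)  # groups whose gap to the best-scoring group exceeds the threshold
```

In practice such audits would need far richer data and statistical care, but even a simple comparison of this kind makes systematic gaps visible so they can be investigated.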

Balancing AI efficiency with human oversight

While AI offers strengths in consistency, scalability and rapid processing of assessments, human judgement remains crucial in evaluating complex responses, particularly in areas such as spoken language and critical reasoning. AI systems currently struggle with nuanced aspects of communication, such as irony, cultural references and rhetorical sophistication. A hybrid model, where AI handles initial scoring but human examiners review borderline cases or particularly complex responses, could ensure fairness while maintaining efficiency. Institutions should also consider periodic audits of AI-generated scores by human assessors to detect systematic biases and refine scoring algorithms accordingly.
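
The routing logic behind such a hybrid model could be quite simple in principle. The sketch below is one illustrative way to express it; the cut score, borderline band and confidence threshold are hypothetical values chosen for the example, not features of any real test.

```python
# Sketch of the hybrid routing idea: the AI scores everything, but responses near the
# pass mark, or those scored with low model confidence, go to a human examiner.
# All thresholds below are illustrative assumptions.
PASS_MARK = 59          # hypothetical cut score
BORDERLINE_BAND = 4     # responses within this many points of the cut score get reviewed
MIN_CONFIDENCE = 0.80   # below this, the automated score is treated as unreliable

def route_response(ai_score: float, ai_confidence: float) -> str:
    """Decide whether an automated score stands or is sent for human review."""
    if ai_confidence < MIN_CONFIDENCE:
        return "human review (low model confidence)"
    if abs(ai_score - PASS_MARK) <= BORDERLINE_BAND:
        return "human review (borderline score)"
    return "automated score stands"

for score, confidence in [(85, 0.95), (61, 0.92), (40, 0.70)]:
    print(score, confidence, "->", route_response(score, confidence))
```

The same structure also supports the periodic audits mentioned above: a random sample of responses routed to human assessors provides a running check on whether the automated scores are drifting or systematically biased.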

At UCL, researchers continue to explore these issues, examining how AI can make assessments more effective, reliable and fair. Our collaboration with Pearson has provided valuable insights into how test-takers experience AI-led exams and where further refinements are needed.
