Inclusive Learning Practices through Multiple Choice and Short Answer Questions.
By Admin, on 22 July 2025
Authors: Michelle Lai, Luke Dickens, Bonnie Buyuklieva, Janina Dewitz, Karen Stepanyan

Close-up of a traditional fountain pen with an iridium nib, by Peter Milosevic, courtesy of Wikimedia Commons.
Introduction
Accommodating neurodiversity in higher education often requires reasonable adjustments, such as additional time to complete work, alternative assessment or examination settings, and the use of assistive technology, among other options. An alternative approach is to ensure that the material given to all students is universally accessible. Here we examine AI-based tools for improving the language used in tests to better support those with neurodiverse characteristics.
A significant proportion of the student body has neurodiverse characteristics. Statistics reported by HESA on self-disclosed learning disabilities such as dyslexia, dyspraxia, or AD(H)D show an increase of 26% between 2014-15 and 2021-22 (Tang et al., 2024). The total number of students with such learning differences in 2022-23 stands at 140,745, corresponding to roughly 6.5% of the total student population (HESA, 2025). Moreover, others have argued that this number is likely to underestimate the issue (Devine, 2024).
Neurodiverse characteristics, such as those associated with AD(H)D, autism, and dyslexia, have been linked to learning challenges stemming from differences in cognitive function, including executive function (Southon, 2022), test anxiety (Mahak, 2024), and sensory sensitivities resulting in susceptibility to distraction (Irvine et al., 2024). This has led to calls for a revised approach to learning and assessment that takes neurodiversity into account (e.g. Sewell & Park, 2021). Beyond academia, industry recruitment practices have shifted towards specifically considering the neurodiversity of graduates as a way of gaining competitive advantage (Borrett, 2024). Many universities now offer greater support; in practice this takes the form of technology-based interventions, comprehensive support programmes, and support for transitions into university and then into employment (McDowall & Kiseleva, 2024). As a result of these changes, the education sector is undergoing a paradigm shift away from traditional teacher-centred education towards inclusive environments in Higher Education (Tang, Griffiths & Welch, 2024).
One approach to making education environments more inclusive is to apply Universal Design principles to teaching materials. These principles advocate developing environments that can be accessed, understood, and used to the greatest extent possible by all people, regardless of ability or disability. This suggests that learning environments should strive to be as inclusive as possible, benefiting all stakeholders and reducing the need for specific individual adjustments (UCL-ARENA, 2024; Center for Universal Design, n.d.). Within this framework, we prioritise making the language of assessments more ‘neurodiversity-friendly’, and specifically explore the relevance of generative AI approaches for reformulating questions identified as problematic for neurodiverse cohorts. This project was funded by the UCL Centre for Humanities Education, which enabled us to conduct a case study looking at current examples of assessment, specifically a sample of Multiple Choice Questions (MCQs), and the potential for making these more accessible to all students. In this case study, we look at two AI products that market themselves as generating textual outputs in neurodiversity-friendly language, and investigate how their suggested changes compare with the original questions set by two lecturers of technical subjects at UCL. The study was carried out by a funded postgraduate student and supported by academic staff at the Department of Information Studies, UCL, who co-authored this report.
Can AI Tools Make Assessment More Accessible?
Informed by the current body of research and by surveying available AI tools for diverse classroom applications (see Appendix 1), we shortlisted three tools based around two AI products (see Table 1). We then assessed whether these tools could make multiple-choice and short-answer questions written for assessment in technical modules more neurodiversity-friendly.
Our focus was on AI tools that purport to make reading and language easier for neurodivergent people. We adopted Hemingway Editor and Goblin Tools Formaliser in our study; the latter was tested under two modes, ‘More accessible’ and ‘More to the point’ (see Table 1).
| AI Tool | AI Technology Type | Short Description |
| --- | --- | --- |
| Hemingway Editor | Rule-based readability and style analyser; light AI proofreading. | Live colour‑coding of “hard‑to‑read” sentences, passive voice, and over‑used adverbs; grade‑level calculation; AI pass for spelling/grammar and optional paraphrases (paid tier only). |
| Goblin: More accessible | LLM‑powered text simplification. | Rewrites any passage into plainer language at a lower reading level while preserving meaning. |
| Goblin: To the point | LLM‑based concision / summarisation. | Strips filler, tangents, and redundancies, returning an essence‑only version of the text. |
Table 1: AI tools examined in this research, selected from the surveyed tools with the greatest potential to support the language-processing needs of neurodiverse students. Note: the full list of AI tools considered for examination is presented in Appendix 1.
Evaluation and Results
We combined the questions into sets based on topic and word length, to accommodate the input limitations of the tools. There were 7 sets of questions:
- Web Tech. 1 comprises 7 theoretical, introductory multiple-choice questions (MCQs) on the topic of the Internet and Web technologies;
- Web Tech. 2 comprises 7 MCQs containing HTML code;
- Web Tech. 3 comprises 7 theoretical MCQs on accessibility;
- Stats. 1 comprises a single Statistics MCQ, which included a longer scenario and longer texts for the MCQ options;
- Stats. 2 comprises 4 MCQs on fundamental concepts in Statistics;
- Stats. 3 comprises 2 MCQs that required simple statistical calculations;
- Stats. 4 comprises 3 MCQs attributable to set theory, a foundational area of Mathematics.
The question bank comprises 31 questions, taken from an introductory web technologies module taught at undergraduate level and an introductory statistics module taught to master’s students, both at UCL.
| Set | Name | Count | Length | HE: Red | HE: Orange | HE: Green | HE: No change | GMA: Red | GMA: Orange | GMA: Green | GMA: No change | GTP: Red | GTP: Orange | GTP: Green | GTP: No change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Web Tech. 1 | 7 | 12.7 | 1 | 4 | 2 | 0 | 1 | 0 | 4 | 2 | 0 | 0 | 2 | 5 |
| 2 | Web Tech. 2 | 7 | 19.4 | 1 | 1 | 4 | 1 | 0 | 1 | 5 | 1 | 0 | 1 | 5 | 1 |
| 3 | Web Tech. 3 | 7 | 28.4 | 2 | 0 | 4 | 0 | 0 | 2 | 4 | 0 | 1 | 2 | 4 | 0 |
| 4 | Stats. 1 | 1 | 99 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5 | Stats. 2 | 4 | 45.8 | 1 | 2 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 4 | 0 |
| 6 | Stats. 3 | 2 | 59.5 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 |
| 7 | Stats. 4 | 3 | 45 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 3 | 0 |
Table 2: Summary of the studied question sets, including the number of questions in each set (Count), the average text length in words (Length), and the frequency of accuracy flags (Red, Orange, Green) or unchanged outputs (No change) for each AI tool. HE = Hemingway Editor; GMA = Goblin: More accessible; GTP = Goblin: To the point.
We passed each of the questions into the three chosen AI tools, giving us up to 93 suggested revisions alongside the original set of 31 questions. Table 2 shows the number of questions (Count) and average word count (Length) for each set, along with the accuracy flags discussed below.
One indication of how straightforward a text is to read is captured by the FORCAST (FORmula for Forecasting Average Single-Syllable word Count) readability formula (Scott, n.d.), which computes a grade level as 20 − (N / 10), where N is the number of single-syllable words in a 150-word sample. We calculated FORCAST scores for the original questions as well as for the questions modified by the AI tools (Figure 1). A higher FORCAST score indicates more difficult text.
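To make the metric concrete, below is a minimal Python sketch of the FORCAST calculation. This is not the scoring system we used (Scott, n.d.); in particular, the vowel-group syllable counter is a crude stand-in assumption, so exact scores will differ from those of a dictionary-based tool.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def forcast(text: str) -> float:
    """FORCAST grade level: 20 - N/10, where N is the number of
    single-syllable words in a 150-word sample."""
    words = re.findall(r"[A-Za-z']+", text)[:150]
    n = sum(1 for w in words if count_syllables(w) == 1)
    if words and len(words) < 150:
        n = n * 150 / len(words)  # scale shorter texts to a 150-word sample
    return 20 - n / 10

print(forcast("Which tag is used to create a hyperlink in HTML?"))
```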

Figure 1: FORCAST scores by tool for each question set in the question bank: Original, Hemingway, Goblin: More accessible, and Goblin: To the point.
While FORCAST scores are useful, they do not capture everything about whether an output is clearer or more concise. There is no widely accepted framework for assessing this, so we developed our own method to do so consistently. Our assessment of clarity and conciseness was inspired by earlier work by Bischof & Eppler (2011), though for simplicity we adopted a single Likert-scale score for the human judgement of each aspect.
| Score | Description |
| --- | --- |
| 1 | Language used is unclear, inappropriate, grammatically incorrect, or awkward. |
| 3 | Appropriate language and sentence structure. Certain words or turns of phrase are potentially awkward or incorrect, but do not impact overall readability and understanding. |
| 5 | Appropriate language and sentence structure. Exceptionally easy to understand, with no errors. |
Table 3: Likert scale used in evaluating clarity of language.
| Score | Description |
| --- | --- |
| 1 | Overly wordy and convoluted. Severely impacts or delays the reader’s understanding. |
| 3 | Uses words of an appropriate difficulty for the topic at hand, with a standard, acceptable number of words to convey the intended meaning. |
| 5 | Uses the simplest and fewest words needed to convey the intended meaning. |
Table 4: Likert scale used in evaluating conciseness.
All original and AI-generated question sets were assessed by one of the researchers using these scales: each question was rated first for clarity (Table 3) and then for conciseness (Table 4). Lower scores therefore indicate a lack of clarity or conciseness.
Mean clarity and conciseness scores for all seven question sets are presented in Figures 2 and 3 respectively.

Figure 2: Mean clarity scores by tool for each question set in the question bank: Original, Hemingway, Goblin: More accessible, and Goblin: To the point.

Figure 3: Mean conciseness scores by tool for each question set in the question bank: Original, Hemingway, Goblin: More accessible, and Goblin: To the point.
Finally, the original authors of the test questions were asked, for each question, whether the AI-generated version retained the intended meaning, using three judgement flags: green (the question retained the original meaning), orange (there were some issues with the meaning), and red (the question was no longer accurate given its context). A fourth flag, grey (‘No change’ in Table 2), was given to questions that the tool left unchanged. As indicated above, Table 2 presents the accuracy flag counts of the evaluated tools for each set; for readability, these data are also plotted in Figure 4.

It is clear from these results that the tools can perform differently across different types of questions, and that no tool performed without issues. Hemingway performed particularly badly on theoretical questions (Sets 1 and 3), while Goblin: To the point performed well on questions containing code (Set 2). Overall accuracy, regardless of question type, does appear to be higher for both Goblin tools. However, this result should be viewed with caution given the very small sample of questions used in the study, and individual questions merit closer examination to elicit why some proved “tricky” for the AI tools.

Figure 4: Accuracy flag counts by tool for each of the question sets.
Preliminary Findings
The study highlighted that adopting generative AI tools to make text accessible to neurodiverse audiences does not necessarily offer improvements. FORCAST scores for some sets increased (Sets 2 and 4), meaning the AI tools potentially made those questions more complex; in other cases (Sets 1, 2 and 6, Figure 1), AI-generated content improved FORCAST scores. Similarly, some original questions scored worse on clarity and conciseness than their AI-generated alternatives (Set 4, Figures 2 and 3), and improvements in clarity were evident in other instances (Sets 2, 4 and 5, Figure 2). Most importantly, however, the noted improvements were small, with the range of variation in scores remaining minimal across all adopted metrics.
Goblin ‘More to the point’ had a higher conciseness score overall (Figure 3), but a higher FORCAST score as well, meaning a higher grade or education level was required to read its output. The FORCAST formula bases its score on the number of single-syllable words in a sample, which may explain why a more concise output that cuts ‘extra’ words, most of which are short and monosyllabic, generates a higher FORCAST score.
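A hypothetical worked example, with invented counts purely for illustration, shows the effect: trimming mostly monosyllabic filler lowers the proportion of single-syllable words in the sample and so raises the FORCAST grade level.

```python
def forcast_score(total_words: int, monosyllabic: int) -> float:
    # FORCAST grade level: 20 - N/10, with N the single-syllable
    # word count scaled to a 150-word sample.
    n = monosyllabic * 150 / total_words
    return 20 - n / 10

print(forcast_score(150, 120))  # original passage: 8.0
print(forcast_score(120, 90))   # concise passage (30 short filler words cut): 8.75
```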
Tool-Specific Findings
Hemingway Editor
Hemingway Editor is a tool that offers readability statistics, with AI rewrite suggestions available through the premium membership. It draws on popular AI services, namely OpenAI, Anthropic, and Together. The interface is easy to navigate and customisable. When we fed the questions in, the system recognised that the text was meant for a university-level reader and adjusted its suggestions accordingly. However, questions containing HTML/CSS code were not processed correctly: the tool failed to escape the tags, changing the meaning of the question as a result.
Goblin Tools Formaliser: More Accessible
Goblin Tools is a set of single-task tools that aim to break down and simplify tasks for neurodivergent people, using “models from different providers, both open and closed source” (goblin-tools, n.d.). The Formaliser tool converts text using one of up to 15 preset prompts, such as ‘more accessible’ and ‘more to the point (unwaffle)’. A ‘spiciness level’ controls how strongly the generated text comes across; for this study, it was set to the lowest level.
The ‘more accessible’ option often changed words to more commonly used and ‘lower grade’ words, as well as changing sentence structures. While this generally improved the readability and clarity of the outputs, the accuracy was sometimes affected, given the highly technical language of many questions.
Goblin Tools Formaliser: More to the Point
The ‘More to the point’ Goblin Tools option often made little to no change to the shorter questions, which suggests that in most cases the original author had already set the questions in concise language. Where the wording was unchanged, accuracy was naturally unaffected, but clarity did not improve either. For longer questions there was a greater change in wording, resulting in greater variation in clarity and higher conciseness scores.
In addition, while the generative AI tools did not consistently improve the FORCAST, clarity, and conciseness scores, the alternative wordings they generated were at times viewed as useful prompts for formulating other ways of phrasing the questions.
Additional Reflections and Summary
The literature often refers to clarity and conciseness as factors that can affect a neurodivergent student’s understanding of a question, relating them to processing speed and cognitive load (e.g. Fakhoury et al., 2018).
In summary, our study suggests that generative AI tools can be useful as scaffolding for inclusive assessment design when paired with informed human judgement. Hemingway and the Goblin Tools occasionally improved clarity, conciseness, or readability, but also introduced new difficulties, mishandling code snippets or altering meaning. Generative AI can support the production of neurodiversity-friendly assessments, but it should only be adopted in an assistive role.
***
Dr Bonnie Boyana Buyuklieva FHEA, FRGS is a Lecturer (research) in Data Science for Society at UCL’s Department of Information Studies. Bonnie’s background is in architecture and computation (Foster and Partners; Bauhaus University Weimar), and she holds a PhD from the Bartlett Centre for Advanced Spatial Analysis (CASA, UCL).
Janina Dewitz has worked for over a decade as an Innovations Officer at UCL, having previously worked as a Learning Technologist and Learning Facilitator at Barking and Dagenham College. Janina earned a Bachelor’s in Humanities from the University of Hertfordshire in 1998 and a Higher National Certificate in Performing Arts from Barking & Dagenham College in 2004.
Dr Luke Dickens is an Associate Professor at UCL’s Department of Information Studies. Luke is also a Founding member and co-lead of the Knowledge Information and Data Science (KIDS) research group at UCL, and a Founding member of the cross-institutional Structured and Probabilistic Intelligent Knowledge Engineering (SPIKE) research group based at Imperial College.
Michelle Lai recently completed a Masters of Arts in Special and Inclusive Education (Autism) at UCL, having previously earned a BSc in Psychology with Education, Educational Psychology. Michelle is interested in how psychology can inform and enhance inclusive practices in special education and currently works as a Specialist Teaching Assistant at Ambitious about Autism.
Dr Karen Stepanyan is an Associate Professor (Teaching) in Computing and Information Systems. Based at the Department of Information Studies, he leads the delivery of the Web Technologies (INST0007) and Database Systems (INST0001) modules. He contributed to the development of the BSc Information in Society programme at UCL East. His research spans the interrelation of information technologies and the concepts of knowledge.
***
References:
Bischof, N., & Eppler, M. J. (2011). Caring for clarity in knowledge communication. Journal of Universal Computer Science, 17(10), 1455-1473.
Borrett, A. (2024, December 19). UK employers eye “competitive advantage” in hiring neurodivergent workers. Financial Times. Retrieved July 17, 2025, from https://www.ft.com/content/e692c571-b56b-425a-a7a0-3d8ae617080b
Center for Universal Design. (n.d.). College of Design, NC State University. Retrieved April 8, 2025, from https://design.ncsu.edu/research/center-for-universal-design/
Devine, J. (2024). Comment: Neurodiversity and belongingness. Buckinghamshire New University. Retrieved July 18, 2025, from https://www.bucks.ac.uk/news/comment-neurodiversity-and-belongingness
Fakhoury, S., Ma, Y., Arnaoudova, V., & Adesope, O. (2018, May). The effect of poor source code lexicon and readability on developers’ cognitive load. In Proceedings of the 26th Conference on Program Comprehension (pp. 286-296).
goblin-tools. (n.d.). About. Retrieved July 18, 2025, from https://goblin.tools/About
HESA (Higher Education Statistics Agency). (2025, April 3). Who’s studying in HE?: Personal characteristics. Retrieved July 18, 2025, from https://www.hesa.ac.uk/data-and-analysis/students/whos-in-he/characteristics#breakdown
McDowall, A., & Kiseleva, M. (2024). A rapid review of supports for neurodivergent students in higher education: Implications for research and practice. Neurodiversity, 2. https://doi.org/10.1177/27546330241291769
Scott, B. (n.d.). Readability Scoring System PLUS. Retrieved from https://readabilityformulas.com/readability-scoring-system.php
Tang, E. S. Y., Griffiths, A., & Welch, G. F. (2024). The impact of three key paradigm shifts on disability, inclusion, and autism in higher education in England: An integrative review. Trends in Higher Education, 3(1), 122-141. https://doi.org/10.3390/higheredu3010007
UCL-ARENA. (2024). Inclusive education: Get started by making small changes to your education practice. UCL Teaching & Learning. Retrieved April 1, 2025, from https://www.ucl.ac.uk/teaching-learning/news/2024/nov/inclusive-education-get-started-making-small-changes-your-education-practice
***
Appendix 1:
List of AI tools considered for this study, classified by category:
- EF – Executive Function (planning, prioritising, time management, task initiation)
- WM – Working Memory / information overload
- LP – Language Processing / writing / reading load
- SN – Sensory / modality flexibility (visual, auditory, text, speech)
- ANX – Reducing Anxiety (performance, maths, participation)
- SOC – Social Communication load (participate without speaking live, review later)
- MATH – Maths conceptual / step scaffolding
- ACC – General accessibility (alternative input, captions, screen-reader friendly, etc.)
| Tool | Short description | AI Tool Categories |
| --- | --- | --- |
| Conker.ai | One‑click generator of differentiated quizzes & question banks | ANX, WM |
| Consensus AI | Answers research questions by ranking peer‑reviewed evidence | WM |
| Desmos | Web graphing calculator with interactive sliders & tables | MATH, SN |
| DreamBox | Adaptive K‑8 maths lessons that adjust every 60 seconds | MATH, ANX |
| Elicit AI | Semantic‑searches papers and auto‑extracts key findings | WM |
| Explain Paper | Click‑highlight any PDF sentence to get plain‑English explainer | LP |
| fireflies.ai | Live meeting recorder that transcribes, timestamps and summarises team calls | WM, SOC, ACC |
| GeoGebra | Dynamic geometry & graphing suite—visual proofs, 3‑D, AR | MATH, SN |
| goblin.tools Chef | Generates grocery lists & cooking steps from meal ideas | EF |
| goblin.tools Compiler | Condenses chat or notes into tidy bullet points | WM |
| goblin.tools Estimator | Predicts how long a task list will really take | EF |
| goblin.tools Formaliser | Rewrites text into more formal / academic register | LP |
| goblin.tools Judge | Rates whether your instructions are “clear enough” | EF |
| goblin.tools MagicToDo | Turns a vague task into an ordered, timed action plan | EF |
| goblin.tools Professor | Explains any concept at a chosen complexity level | WM |
| Grammarly | Real‑time grammar, tone and clarity checker inside browsers & docs | LP |
| Hemingway Editor | A readability tool that color‑codes dense or passive sentences, flags adverbs, and shows grade level so writers can simplify and clarify their prose. | LP |
| Heuristica | Generates interactive concept maps you can chat with | WM |
| IXL | Skills‑driven practice that levels up as students gain accuracy | MATH |
| Julius AI | Chat‑style data analyst that cleans, queries and plots spreadsheets | WM |
| Knewton Alta | Mastery‑based, adaptive courseware for math & science in HE | MATH |
| Litmaps | Builds visual citation maps to reveal research connections | WM |
| MathGPTPro / Mathos AI | LLM tutor that OCR‑reads handwritten maths and explains steps | MATH |
| MATHia | Carnegie Learning’s AI tutor that coaches each step of algebra problems | MATH, ANX |
| MathPapa | Symbolic algebra solver that shows stepwise solutions | MATH |
| Motion | AI calendar that auto‑schedules tasks against deadlines and reshuffles as priorities change | EF |
| Nuance Dragon Speech | High‑accuracy speech‑to‑text dictation across OS‑level apps | ACC, LP |
| Otter.ai | Live captioning & searchable transcripts for meetings and lectures | WM, SOC, ACC |
| PhET Interactive Simulations | Free, click‑and‑drag science & maths models (HTML5) | SN |
| SciSpace | All‑in‑one platform to search 200 M papers and ask PDFs questions | WM |
| Scite.AI | Shows whether later studies support or contradict a cited paper | WM |
| Scribe (To‑do) | Breaks complex goals into step‑by‑step checklists and sets reminders | EF |
| Scribe (Writing helper) | Generates SOPs and how‑to docs from screen recordings and prompts | LP, WM |
| Semantic Scholar (TLDR) | Gives one‑sentence abstract of any research paper | WM |
| SmartSparrow | Learner‑authored adaptive modules with branching feedback | WM |
| Socratic (Google) | Mobile camera‑based homework helper with brief video explainers | MATH, WM |
| Symbolab | Multi‑step calculator covering calculus, series, matrices, proof | MATH |
| Synthesia | Turns typed scripts into captioned avatar videos in 120+ languages | SN, ACC |
| Tavily | Real‑time search API built for LLMs & agents, returns source snippets | WM |
| Topmarks | Curated hub of short interactive literacy & numeracy games | SN, MATH |
| TXYZ.ai | AI “research OS” that finds, organises and chats over papers | WM |
| Unity ML‑Agents Toolkit | Open‑source SDK for building reinforcement‑learning–driven 3‑D sims and games | SN |
| Writer | Enterprise‑grade generative writing assistant with style‑guide enforcement | LP |