EMNLP 2025: George Drayson
By Claire Hudson, on 15 January 2026
In November I had the opportunity to attend the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) in Suzhou, China, an ancient water town about an hour from Shanghai. With its beautiful canals, classical gardens (including the famous Humble Administrator's Garden), and historic architecture, Suzhou provided a lovely backdrop to this year's conference. I particularly enjoyed visiting Hanshan Temple and walking along Pingjiang Road and Shantang Street, with their narrow lanes that run parallel to the waterways.
At the conference I presented our work “Machine-generated text detection prevents language model collapse.” This research, conducted with my PhD supervisors Professor Emine Yilmaz and Dr. Vasileios Lampos, addresses a growing concern in the era of large language models: as AI-generated content becomes increasingly prevalent online, future models risk being trained on an unknown portion of synthetically generated data. This creates a feedback loop that can result in model collapse, a degenerative process where models gradually lose linguistic diversity and degrade in performance over successive generations. 
Our work first explored how different model decoding strategies affect the severity of collapse, revealing that certain sampling methods accelerate degradation more than others. Building on this, we introduced a prevention method that uses a machine-generated text detector to estimate the likelihood that each training sample is synthetic. We then apply detector-guided resampling to upsample high-confidence human samples and downsample likely AI-generated text. The results were encouraging: not only does this approach prevent model collapse, but it can also improve performance compared to training on human data alone. This underscores both the value of synthetic data in model training, an area gaining considerable traction in the field, and the importance of careful data curation.
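To make the idea concrete, here is a minimal sketch of detector-guided resampling. It assumes you already have, for each training sample, a detector score giving the estimated probability that the sample is machine-generated; the function name and weighting scheme are illustrative, not the exact procedure from our paper.

```python
import random

def detector_guided_resample(samples, synthetic_probs, size=None, seed=0):
    """Resample a corpus so that texts a detector judges likely human-written
    are upsampled and likely machine-generated ones are downsampled.

    synthetic_probs[i] is the detector's estimated probability that
    samples[i] is machine-generated (a hypothetical detector output).
    """
    # Weight each sample by the detector's confidence that it is human-written.
    weights = [1.0 - p for p in synthetic_probs]
    rng = random.Random(seed)
    k = size if size is not None else len(samples)
    # Weighted sampling with replacement: high-confidence human text is
    # drawn more often; likely synthetic text is drawn rarely or never.
    return rng.choices(samples, weights=weights, k=k)

# Example with made-up detector scores:
corpus = ["human essay", "model output", "human news article"]
probs = [0.05, 0.95, 0.10]
resampled = detector_guided_resample(corpus, probs, size=6)
```

A sample the detector is certain is synthetic (probability 1.0) receives zero weight and is excluded entirely, while borderline samples are merely downweighted rather than discarded.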
The conference offered a rich program of talks, tutorials, and workshops, exposing me to many ideas and research directions I hadn't previously encountered, such as pixel language modelling and novel Mixture of Experts architectures. I particularly enjoyed the Allen AI keynote on the Olmo 3 model series and appreciated how transparent they are in their foundation model development. I also really valued the opportunity to connect with other researchers exploring similar questions around synthetic data, model collapse, and continual learning, areas that are central to both my PhD work and my role as Chief AI Officer at Locai Labs.
At Locai, I'm applying these concepts to develop large language models in the UK without the extensive cost of training from scratch, by building on existing open-source models through continual learning. This work directly extends the themes of our EMNLP paper: we explore how models can avoid catastrophic forgetting, the tendency to lose performance on old tasks when trained on new data, by training on a carefully curated portion of their own generated outputs. This mitigates excessive distribution shift, and hence forgetting, during further fine-tuning. We have applied these methods to develop our first model, Locai L1 Large, which is available at locai.chat. I wrote about the model release in a technical blog, with the full paper coming soon!
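The replay idea above can be sketched simply: mix a fixed fraction of curated self-generated examples into the new fine-tuning data, so the model keeps seeing its original distribution while it adapts. The function name and the single `replay_fraction` knob are illustrative assumptions, not the exact recipe used for Locai L1 Large.

```python
import math
import random

def mix_replay(new_data, replay_data, replay_fraction=0.2, seed=0):
    """Build a fine-tuning set in which roughly `replay_fraction` of the
    examples are curated self-generated outputs (replay_data), anchoring
    the model's original distribution and reducing catastrophic forgetting.
    """
    rng = random.Random(seed)
    # Number of replay examples so they make up replay_fraction of the mix.
    n_replay = math.ceil(len(new_data) * replay_fraction / (1.0 - replay_fraction))
    replay = [rng.choice(replay_data) for _ in range(n_replay)]
    mix = new_data + replay
    rng.shuffle(mix)  # interleave new and replayed examples
    return mix

# Example: 80 new-domain examples plus 20% replay -> 100-example mix.
mixed = mix_replay([f"new_{i}" for i in range(80)], ["replay_sample"],
                   replay_fraction=0.2)
```

In practice the replay pool would be filtered for quality first (for instance, with the same kind of machine-generated text detector discussed above), since replaying low-quality generations would reintroduce the collapse dynamics the method is meant to avoid.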
I'm grateful to the UKRI Foundational AI CDT for supporting my attendance at EMNLP. Conferences like this are invaluable for connecting research with real-world impact, and I'm already looking forward to the next one!
