Characterizing the Confidence of LLM-based Evaluators

June 5 @ 1:45 pm - 3:15 pm

Rickard Stureborg
Abstract:
Considerable research effort has been put into improving Large Language Models (LLMs) to evaluate NLP tasks automatically. This work generally tries to achieve high correlations with human judgements on the same task. However, it is still unclear what level of correlation is good enough for practical applications of LLM-based automatic evaluation systems. This paper characterizes these LLM evaluators’ confidence in ranking candidate NLP models and develops a configurable Monte Carlo simulation method. We show that even automatic metrics with low correlation with human judgements can reach high-confidence rankings of candidate models with reasonable evaluation set sizes (hundreds of examples). Further, we describe tradeoff curves between LLM evaluator performance (i.e., correlation with humans) and evaluation set size; a loss in correlation can be compensated for with a modest increase in the evaluation set size. We validate our results on RoSE, a text summarization dataset, and find that our estimates of confidence align with empirical observations.
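To make the idea in the abstract concrete, here is a minimal Python sketch of a Monte Carlo estimate of ranking confidence. It is not the speaker's method: the function name `ranking_confidence`, the Gaussian score model, the unit variances, and the `gap` parameter (the true quality difference between two candidate models) are all assumptions chosen for illustration. Each trial draws human scores for two models on `n` examples, derives evaluator scores that correlate with the human scores at roughly `rho`, and checks whether the evaluator's mean score puts the truly better model first.

```python
import math
import random


def ranking_confidence(rho, n, gap=0.2, trials=1000, seed=0):
    """Estimate how often a noisy evaluator ranks the better of two models first.

    Assumed toy model (not taken from the talk): human scores are Gaussian
    with unit variance, and the evaluator score is a linear mix of the human
    score and independent Gaussian noise, so their correlation is rho.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - rho * rho)

    def mean_evaluator_score(true_mean):
        # Average evaluator score over an evaluation set of n examples.
        total = 0.0
        for _ in range(n):
            human = rng.gauss(true_mean, 1.0)
            total += rho * human + noise_scale * rng.gauss(0.0, 1.0)
        return total / n

    correct = 0
    for _ in range(trials):
        # The better model has true mean `gap`; the weaker one has mean 0.
        if mean_evaluator_score(gap) > mean_evaluator_score(0.0):
            correct += 1
    return correct / trials
```

Calling `ranking_confidence(rho=0.3, n=n)` for increasing `n` shows, within this toy model, how a larger evaluation set can offset a weakly correlated evaluator; the exact numbers depend entirely on the assumed `gap` and score distributions.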
Bio:
Rickard is a PhD candidate in Computer Science whose research focuses on high-subjectivity tasks in natural language processing, including applications to misinformation and the automatic evaluation of machine-generated text. He works as a researcher at Grammarly, where he helps build the next generation of tools to integrate AI into people’s writing workflows. Rich’s interdisciplinary research has been featured in peer-reviewed venues across several fields, including natural language processing (*ACL conferences), artificial intelligence (NeurIPS and AAAI workshops), human-computer interaction (CHI), optics (SPIE, Journal of Biomedical Optics), and public health (Vaccine). Rich is serving a three-year term on Duke’s Board of Trustees, where he is a member of the Graduate Education and Research Committee.
Co-sponsored by: Jose Delpiano
Speaker(s): Rich Stureborg
Room: Sala I-103, Bldg: Edificio de Ingeniería, San Carlos de Apoquindo 2500, Las Condes, Universidad de los Andes, Santiago, Region Metropolitana, Chile