Speaker
Title
General Evaluation of AI Agents
Abstract
Artificial intelligence (AI), and machine learning (ML) in particular, have emerged as scientific disciplines concerned with understanding and building single- and multi-agent systems that can act and perform as humans do in a variety of contexts.
As is true for any scientific discipline, it is critically important to identify and measure scientific progress in AI and ML. However, overall progress in AI and ML is often measured indirectly by evaluating tangible research artifacts, such as models, agents/algorithms, and architectures, on specific tasks (e.g., datasets, benchmarks, or suites).
In particular, the field has designed and constructed thousands of tasks, benchmarks, and datasets for testing wide-ranging capabilities.
This proliferation of evaluation tasks has also given rise to a wide range of evaluation methodologies, often influenced by community-driven dynamics and the particularities of each area.
Unsurprisingly, AI and ML evaluation methods and practices have undergone numerous critique-review cycles. Nevertheless, there has been steady progress toward gaining a foundational understanding of evaluation in recent years.
Techniques from statistics, game theory, and social choice theory have offered more principled approaches.
Today, with the deployment of increasingly complex models, agents, and systems that tackle ever more challenging tasks, there is a growing need to conduct well-grounded and transparent evaluations.
This tutorial, based on an AAMAS 2025 tutorial of the same name, covers the fundamentals of the AI evaluation problem. We review existing methodologies drawn from statistics, probabilistic choice models, game theory, and social choice theory.
The learning outcomes of this tutorial include 1) an understanding of some of the challenges and pitfalls that arise in the evaluation of AI systems, 2) an introduction to methodologies for the evaluation problem, and 3) an appreciation of the pros and cons of each methodology, including insights into when and how to apply them.
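As a small illustration of why the choice of evaluation methodology matters, the sketch below aggregates the same agent-by-task scores in two ways: a plain average and a Borda count, one of the simplest rules from social choice theory. The agent names, task names, scores, and function names are all hypothetical and chosen only for illustration; this is not taken from the tutorial material itself.

```python
from statistics import mean

# Hypothetical scores[agent][task]; higher is better. Illustrative values only.
scores = {
    "agent_a": {"task_1": 0.90, "task_2": 0.10, "task_3": 0.20},
    "agent_b": {"task_1": 0.30, "task_2": 0.35, "task_3": 0.40},
    "agent_c": {"task_1": 0.20, "task_2": 0.30, "task_3": 0.45},
}


def rank_by_mean(scores):
    """Rank agents by their average score across tasks, best first."""
    return sorted(scores, key=lambda a: mean(scores[a].values()), reverse=True)


def borda_points(scores):
    """Treat each task as a voter: on every task, an agent earns one point
    for each other agent it strictly outscores."""
    agents = list(scores)
    tasks = list(next(iter(scores.values())))
    points = {a: 0 for a in agents}
    for t in tasks:
        for a in agents:
            points[a] += sum(scores[a][t] > scores[b][t] for b in agents if b != a)
    return points


def rank_by_borda(scores):
    """Rank agents by total Borda points, best first."""
    points = borda_points(scores)
    return sorted(points, key=points.get, reverse=True)


print("mean ranking: ", rank_by_mean(scores))
print("Borda ranking:", rank_by_borda(scores))
```

With these made-up numbers, the average favors agent_a because of one very high score, while the Borda count favors agent_b, which is never the worst agent on any task. Neither answer is inherently correct; the two rules encode different assumptions about what it means for one agent to be better than another.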