AI agents are gaining popularity as a new research avenue with potential real-world applications. These agents use foundation models such as large language models (LLMs) and vision-language models (VLMs) to interpret natural language instructions and pursue complex goals autonomously or semi-autonomously. They can draw on tools like browsers, search engines, and code compilers to verify their actions and reason about their objectives.
However, a recent study by researchers at Princeton University has uncovered several limitations in current agent benchmarks and evaluation practices that hinder their usefulness in real-world applications. One major issue highlighted in the study is the lack of cost control in agent evaluations. AI agents can be much more expensive to run than a single model call, because they typically make many calls to stochastic language models, which may yield different results when given the same query multiple times.
To improve accuracy, some agentic systems generate multiple responses and employ methods like voting or external verification tools to select the best answer. While this approach can enhance performance, it comes with a significant computational cost. In research settings, where the aim is to maximize accuracy, inference costs may not be a concern. However, in practical applications, there is a budget constraint for each query, making it essential for agent evaluations to be cost-controlled.
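As a rough illustration of why repeated sampling drives up cost, the sketch below shows a simple majority-voting loop with a per-token cost estimate. The `sample_response` stub, the token counts, and the price are placeholders chosen for illustration, not the systems or figures examined in the study.

```python
# Minimal sketch of majority voting over repeated samples, with cost tracking.
# `sample_response` is a hypothetical stand-in for a single LLM call; in practice
# it would wrap whatever API the agent uses and return (answer, tokens_used).
import random
from collections import Counter

def sample_response(query: str) -> tuple[str, int]:
    # Stub for illustration: a stochastic model may return different answers
    # (and consume a different number of tokens) for the same query.
    answer = random.choice(["42", "42", "43"])
    return answer, random.randint(200, 400)

def majority_vote(query: str, n_samples: int = 5, price_per_token: float = 1e-6):
    """Draw n_samples responses, pick the most common answer, and report cost."""
    answers, total_tokens = [], 0
    for _ in range(n_samples):
        answer, tokens = sample_response(query)
        answers.append(answer)
        total_tokens += tokens
    best_answer, _ = Counter(answers).most_common(1)[0]
    return best_answer, total_tokens * price_per_token

answer, cost = majority_vote("What is 6 * 7?")
print(f"answer={answer}, estimated cost=${cost:.6f}")  # cost grows linearly with n_samples
```

The accuracy gain may be worth it in a leaderboard setting, but in production the same loop multiplies the per-query bill by the number of samples drawn.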
The researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost, and using techniques that jointly optimize the agent for both metrics. They evaluated the accuracy-cost tradeoffs of different prompting techniques and agentic patterns introduced in various papers.
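A minimal sketch of what such a curve is built from: given a set of (cost, accuracy) results, keep only the points that are not dominated by a cheaper and at least as accurate alternative. The data points and names below are invented for illustration, not results from the study.

```python
# Sketch of extracting the Pareto frontier from (inference cost, accuracy) results.
# Each point would correspond to one agent design or prompting technique
# evaluated on a benchmark; these numbers are made up.
results = [
    {"name": "single call",        "cost": 0.01, "accuracy": 0.62},
    {"name": "5-sample voting",    "cost": 0.05, "accuracy": 0.71},
    {"name": "verifier + retries", "cost": 0.20, "accuracy": 0.73},
    {"name": "expensive pipeline", "cost": 0.50, "accuracy": 0.70},  # dominated
]

def pareto_frontier(points):
    """Keep points not dominated by a cheaper-or-equal, more-accurate point."""
    frontier = []
    for p in sorted(points, key=lambda p: (p["cost"], -p["accuracy"])):
        if not frontier or p["accuracy"] > frontier[-1]["accuracy"]:
            frontier.append(p)
    return frontier

for p in pareto_frontier(results):
    print(f'{p["name"]}: cost=${p["cost"]:.2f}, accuracy={p["accuracy"]:.2f}')
```

Plotting the frontier makes it obvious when an expensive agent design buys little or no additional accuracy over a cheaper one.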
Another issue brought to light by the researchers is the disparity between evaluating models for research purposes and developing downstream applications. While research often prioritizes accuracy, inference costs play a pivotal role in choosing models and techniques for real-world applications. Evaluating inference costs for AI agents can be challenging due to fluctuating model prices and varying API call costs.
To address this issue, the researchers created a website that adjusts model comparisons based on token pricing. They also conducted a case study on NovelQA, a benchmark for question answering over lengthy texts, which revealed that benchmarks intended for model evaluation can be misleading when used for downstream evaluation.
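The kind of adjustment such a comparison relies on can be sketched as a simple recomputation of dollar cost from token counts and per-token prices. The prices, token counts, and model labels below are hypothetical placeholders, not figures from the researchers' website.

```python
# Sketch of recomputing an agent run's dollar cost under different token prices,
# which is why model comparisons shift as providers change their pricing.
def run_cost(input_tokens: int, output_tokens: int,
             price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Dollar cost of one run given per-million-token prices."""
    return (input_tokens * price_in_per_1m + output_tokens * price_out_per_1m) / 1e6

# The same agent trace, priced under two hypothetical models:
trace = {"input_tokens": 120_000, "output_tokens": 8_000}
print(run_cost(**trace, price_in_per_1m=5.00, price_out_per_1m=15.00))  # "model A": $0.72
print(run_cost(**trace, price_in_per_1m=0.50, price_out_per_1m=1.50))   # "model B": $0.072
```

The same agent trace can look affordable or prohibitively expensive depending purely on which model's prices are plugged in, which is why static leaderboard numbers age quickly.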
Overfitting, where ML models find shortcuts to perform well on benchmarks but fail in real-world applications, is a significant concern for agent benchmarks. The researchers recommend that benchmark developers create and maintain holdout test sets made up of samples that agents cannot memorize during training, guarding against this kind of overfitting.
In conclusion, the study emphasizes the need for rethinking benchmarking practices for AI agents, as they differ significantly from models. As AI agents continue to evolve and potentially become integral to everyday applications, understanding their limitations and testing their boundaries are crucial for progress in this emerging field.