Artificial intelligence (AI) has come a long way in recent years and can now perform tasks once considered achievable only by humans. AI systems like GPT-4 and PaLM 2 have even surpassed human performance on various benchmarks: standardized tests that measure an AI system's performance on specific tasks, enabling researchers and developers to compare and improve different models and algorithms. However, a new paper published in Science challenges the validity and usefulness of many existing benchmarks for evaluating AI systems.
The paper argues that benchmarks often fail to capture the real capabilities and limitations of AI systems, which can lead to false or misleading conclusions about their safety and reliability. This makes it harder to decide where these systems are safe to use. To build AI systems that are safe and fair, researchers and developers need a clear picture of what a system can do and where it fails.
One of the key problems the paper points out is the use of aggregate metrics that summarize an AI system's overall performance on a category of tasks. These metrics are convenient because of their simplicity, but they obscure important patterns in the data: patterns that could reveal biases or safety concerns, or simply help researchers understand how the system works. Moreover, the lack of instance-by-instance evaluation reporting makes it difficult for independent researchers to verify or corroborate the results published in papers.
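To see how an aggregate score can mask such patterns, consider a minimal sketch (the numbers and the `subgroup` labels below are invented for illustration) in which a model's headline accuracy looks respectable while one slice of the data fails badly:

```python
from collections import defaultdict

# Hypothetical per-instance evaluation records: each has a subgroup label
# and whether the model answered that instance correctly.
results = [
    {"subgroup": "short_questions", "correct": True},
    {"subgroup": "short_questions", "correct": True},
    {"subgroup": "short_questions", "correct": True},
    {"subgroup": "short_questions", "correct": True},
    {"subgroup": "long_questions", "correct": True},
    {"subgroup": "long_questions", "correct": False},
    {"subgroup": "long_questions", "correct": False},
    {"subgroup": "long_questions", "correct": False},
]

# Aggregate metric: a single number that looks acceptable in isolation.
overall = sum(r["correct"] for r in results) / len(results)
print(f"overall accuracy: {overall:.0%}")  # about 62%

# Granular breakdown: the same data, split by a feature of the problem
# space, reveals that one slice performs far worse than the headline
# number suggests.
by_group = defaultdict(list)
for r in results:
    by_group[r["subgroup"]].append(r["correct"])

for group, outcomes in by_group.items():
    accuracy = sum(outcomes) / len(outcomes)
    print(f"{group}: {accuracy:.0%}")  # short: 100%, long: 25%
```

The single overall number hides the fact that the model fails on most of the long questions, which is exactly the kind of pattern the paper argues gets lost in aggregate reporting.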
To address the problem of better understanding and evaluating AI systems, the paper offers several guidelines. These include publishing granular performance reports, breaking results down by specific features of the problem space, and developing new benchmarks that test specific capabilities rather than aggregating various skills into a single measure. Researchers should also be more transparent in documenting their tests and making them available to the community.
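One way to act on these guidelines is to publish instance-level results rather than only summary scores. The sketch below shows what such a record might look like; the field names and file format are assumptions, not a scheme prescribed by the paper:

```python
import json

# Hypothetical instance-level records for a single evaluation run: the
# model's raw output, the expected answer, and features of each test item
# are kept alongside the pass/fail judgment, not just a summary score.
records = [
    {
        "instance_id": "qa-0001",
        "input": "What is the boiling point of water at sea level?",
        "expected": "100 C",
        "model_output": "100 degrees Celsius",
        "correct": True,
        "features": {"domain": "science", "length": "short"},
    },
    {
        "instance_id": "qa-0002",
        "input": "Summarize the causes of the 2008 financial crisis.",
        "expected": "(reference summary)",
        "model_output": "(model summary)",
        "correct": False,
        "features": {"domain": "finance", "length": "long"},
    },
]

# Writing one JSON object per line lets independent researchers reload the
# file and compute their own breakdowns instead of relying on the
# published aggregate.
with open("evaluation_results.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Releasing a file like this alongside a paper makes the reported numbers reproducible and lets others slice the results by whatever feature of the problem space they care about.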
While the academic community is moving in the right direction, with conferences and journals recommending or requiring that code and data be uploaded alongside submitted papers, some companies are moving away from sharing and transparency. This shift incentivizes companies to hide the limitations and failures of their AI models and to cherry-pick evaluation results that make their models appear highly capable and reliable.
In conclusion, the article emphasizes the importance of rethinking AI benchmarks and developing new evaluation methods in order to build AI systems that are transparent and fair. It highlights the need to make evaluation results public so that they can be independently verified and scrutinized, warns of the risks posed by the growing trend toward secrecy among commercial AI companies, and points to regulatory solutions as a way to address these concerns.