In recent years, artificial intelligence (AI) has rapidly transformed various sectors, from healthcare to entertainment and education. However, alongside the standard performance benchmarks that gauge what AI systems can do, a trend has emerged in which bizarre, unofficial benchmarks have captivated a broader audience. Notably, one of these unconventional tests involves the surreal task of generating a video of actor Will Smith devouring spaghetti. This phenomenon not only illustrates the peculiar relationship between AI and popular culture but also reveals a deeper critique of the conventional metrics used to measure AI efficacy.
The trend of evaluating AI through a Will Smith spaghetti video didn’t arise in a vacuum; instead, it reflects humanity’s intrinsic desire for humor and absurdity. By parodying this unusual application, Smith himself contributed to the viral phenomenon, demonstrating that even celebrities are not immune to the whims of internet culture. Videos of Smith munching on noodles have now become a benchmark reference, not for their technical rigor, but for their entertainment value. This instance raises questions about the quality of AI benchmarks and challenges the idea that performance should always be measured through rigorous academic or professional criteria.
Recorded performances that amuse and engage, such as AI-generated Minecraft creations or AI-versus-AI board games, are emerging as popular yardsticks for measuring AI capabilities. While traditional metrics focus on academic knowledge and complex problem-solving, these unconventional benchmarks gauge accessibility and relatability—qualities that resonate with the average user. They allow ordinary people to interact with and understand AI in a way that traditional measures fail to achieve.
Mathematically rigorous AI benchmarks, such as those based on Math Olympiad problems or Ph.D.-level questions, often alienate the general user. Most people use AI for more mundane tasks, such as drafting emails, compiling lists, or generating simple text, which makes such demanding benchmarks largely irrelevant to them. This disconnect between academic standards and real-world use highlights a significant gap in how AI effectiveness is assessed. As a result, conventional metrics lack the charm and the accessibility needed to attract public interest.
Crowdsourced measures, like Chatbot Arena, attempt to provide a more democratic evaluation of AI performance. However, these platforms often draw unrepresentative samples dominated by tech-savvy individuals whose voting preferences might not reflect the views of the average user. This sampling bias strengthens the case for more relatable benchmarks that account for the subjective experiences of everyday users.
Ethan Mollick, a management professor at Wharton, sheds light on the shortcomings of traditional AI benchmarks. He argues that the scarcity of varied benchmarks, especially in fields like medicine and law, limits our understanding of how AI systems perform in real-world applications. In this view, the focus should shift from abstract performance scores to measurable impacts on society. By prioritizing downstream effects, we can better judge whether AI contributes positively to the many areas of life where it is increasingly used.
While more elaborate measures of performance may be essential, they often fall flat compared to the quirky benchmarks that entice users to engage with AI. The contrasting dynamic reveals a key insight: usability and relatability in AI might outweigh technical proficiency in the growth and adoption of these technologies.
As we venture deeper into 2025, unconventional benchmarks like Will Smith eating spaghetti appear to be here to stay. Their amusing nature captures public attention, creating a bridge between AI and users who might otherwise remain indifferent. As the tech industry grapples with the intricacies of AI, entertaining yet simplistic assessment methods are likely to keep flourishing.
The emergence of these bizarre benchmarks calls for a reassessment of how we evaluate AI. Rather than relying solely on rigid academic definitions of success, the AI community should embrace creativity in testing. After all, engaging users is key to allowing AI to realize its transformative potential. Who knows what oddities lie on the horizon? Perhaps a future benchmark will involve a celebrity debating an AI on the merits of various pasta dishes. The mind boggles, and the memes await!