Large Language Models Benchmarks

Towards domain-adapted large language models for water and wastewater management: methods, datasets and benchmarking

Large language models (LLMs) have shown significant promise for water and wastewater management. However, current foundation models are not yet reliable. This Perspective outlines a pathway for ...

Nature

Benchmarking large language model-based agent systems for clinical decision tasks

Clinical decision-making entails complex, data-intensive, and often uncertain judgments, resulting in excessive workload and exceeding the cognitive limits of many clinicians. For more than two ...

Medical Xpress

Multilingual benchmark evaluates how well AI interprets clinical text and health records in nine languages

Researchers at Mass General Brigham recently developed BRIDGE, a multilingual benchmark that evaluates how well large language models (LLMs) understand clinical patient care text, including language ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

ascopubs.org

RadOncRAG: A Novel Retrieval-Augmented Generation Framework Improves Large Language Model Benchmark Performance in Radiation Oncology

Large language models (LLMs) show promise in assisting knowledge-intensive fields such as oncology, where up-to-date information and multidisciplinary expertise are critical. Traditional LLMs risk ...

13d on MSN

China's Z.ai GLM-5.2 tops OpenAI’s GPT 5.5 model on key benchmarks

Chinese startup Z.ai has launched GLM-5.2, a powerful AI model for complex coding projects. This new large language model ...

STAT

OpenAI leaps into health care with AI benchmark to evaluate models

OpenAI on Monday released a large dataset for evaluating how well large language models answer questions related to health care. Experts lauded the open-source data and detailed evaluation rubrics, ...

Small Language Models Outperform Frontier AI On Cost, Speed And Accuracy

Bigger has defined AI from day one. New data says task-specific small models beat frontier LLMs on accuracy, cost and speed — ...

13d

Z.ai’s open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost

It allows engineering teams to host frontier-level AI on their own sovereign infrastructure, entirely eliminating vendor lock ...

Geeky Gadgets

How to Build Custom LLM Benchmarks for Your AI Applications

Have you ever wondered why off-the-shelf large language models (LLMs) sometimes fall short of delivering the precision or context you need for your specific application? Whether you’re working in a ...

PsyPost on MSN

Artificial intelligence models show massive gaps on traditional human intelligence tests

Artificial intelligence programs designed to process and generate text show remarkably high verbal reasoning abilities, but ...
