OpenAI on December 16 announced FrontierScience, a new benchmark designed to evaluate artificial intelligence systems on expert-level scientific reasoning across physics, chemistry, and biology, as AI ...
“I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they ...
The "Petri" tool deploys AI agents to evaluate frontier models, though AI's ability to discern harm remains highly imperfect. Early tests showed Claude Sonnet 4.5 and GPT-5 to be the safest. Anthropic has ...
The code can improve itself, but humans will still be responsible for understanding why something changed and whether it ...
Is the inside of a vision model at all like that of a language model? Researchers argue that as the models grow more powerful, they ...