LLM Benchmarks & Leaderboards
Each benchmark below ranks large language models on a specific task and pairs every score with the model's cost per million tokens, so you can see which model offers the best value, not just the highest score.
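As a rough illustration of what "best value" can mean, the sketch below compares models on two axes at once: benchmark score and cost per million tokens. It is not how these leaderboards are computed. The `Entry` type, the model names, and every number in it are made-up placeholders, and both value measures (the Pareto frontier and a plain score-per-dollar ratio) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str            # hypothetical model name
    score: float          # benchmark score, higher is better
    cost_per_mtok: float  # USD per million tokens, lower is better


def pareto_frontier(entries):
    """Keep entries that no other entry beats on both score and cost.

    An entry is dominated when another entry is at least as good on
    both axes and strictly better on at least one of them.
    """
    frontier = []
    for e in entries:
        dominated = any(
            o.score >= e.score
            and o.cost_per_mtok <= e.cost_per_mtok
            and (o.score > e.score or o.cost_per_mtok < e.cost_per_mtok)
            for o in entries
        )
        if not dominated:
            frontier.append(e)
    # Cheapest first, so the list reads as "pay more, get more".
    return sorted(frontier, key=lambda e: e.cost_per_mtok)


def score_per_dollar(entry):
    """One simple value measure (an assumption): score points per USD."""
    return entry.score / entry.cost_per_mtok


# Placeholder data for illustration only, not real leaderboard results.
entries = [
    Entry("model-a", 80.0, 10.0),
    Entry("model-b", 70.0, 2.0),
    Entry("model-c", 60.0, 4.0),
]

frontier = pareto_frontier(entries)
best_ratio = max(entries, key=score_per_dollar)

for e in frontier:
    print(f"{e.model}: score={e.score}, ${e.cost_per_mtok}/Mtok")
print("Best score per dollar:", best_ratio.model)
```

With this placeholder data, `model-c` drops out because `model-b` scores higher and costs less. `model-a` stays on the frontier despite its price, since nothing cheaper matches its score. A ratio alone would hide that trade-off, which is why the sketch shows both views.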
LMArena (Chatbot Arena) Elo
192 models · human-preference
View leaderboard →
LiveBench
71 models · reasoning/coding
View leaderboard →
SWE-bench Verified
49 models · coding
View leaderboard →
Aider Polyglot
40 models · coding
View leaderboard →
Looking for the underlying datasets? Browse our benchmark datasets (MMLU, GPQA, HumanEval and more).