LLM benchmark scores provide a standardized way to compare language models across reasoning, coding, math, and knowledge tasks. This reference covers 20+ models including GPT-4o, Claude 3.5, Gemini 1.5, and open-source models like Llama 3 and Mistral.

Data as of April 2026. Scores are taken from official provider disclosures and research papers.

The comparison table lists, for each model: Provider, MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA, GPQA, MT-Bench, Context window, and Open Source status.

Scores in each benchmark column are color-coded by rank within that column: top 3, mid-range, bottom third, or N/A (not evaluated).

All benchmark scores are percentages (0-100), except MT-Bench, which uses a 0-10 scale. N/A = not publicly available.
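
As an illustration of the color-coding rule above, here is a minimal Python sketch of per-column bucketing. It is a hypothetical example, not the script behind this table: the function name `color_buckets`, its labels, and the sample scores are all invented, and `None` stands in for N/A.

```python
# Hypothetical sketch of the per-column color coding described above;
# not the code behind this page. Input maps model names to scores,
# with None standing for N/A (not evaluated).

def color_buckets(column: dict[str, float | None]) -> dict[str, str]:
    """Label each model 'top3', 'mid', 'bottom', or 'na' within one column."""
    scored = {m: s for m, s in column.items() if s is not None}
    ranked = sorted(scored, key=scored.get, reverse=True)  # best score first
    top3 = set(ranked[:3])                        # "Top 3 in column"
    cutoff = len(ranked) - len(ranked) // 3       # start of the bottom third
    bottom = set(ranked[cutoff:])
    labels = {}
    for model, score in column.items():
        if score is None:
            labels[model] = "na"
        elif model in top3:          # top-3 wins over bottom on tiny columns
            labels[model] = "top3"
        elif model in bottom:
            labels[model] = "bottom"
        else:
            labels[model] = "mid"
    return labels

# Invented example scores for a single MMLU-style column:
print(color_buckets({"A": 88.7, "B": 86.8, "C": 85.9, "D": 79.0,
                     "E": 70.1, "F": 68.5, "G": None}))
# -> {'A': 'top3', 'B': 'top3', 'C': 'top3', 'D': 'mid',
#     'E': 'bottom', 'F': 'bottom', 'G': 'na'}
```

One judgment call any real implementation has to make explicit: ties at the top-3 or bottom-third boundary. The sketch simply takes sort order.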