Benchmark
Definition
Standardized tests and datasets used to evaluate and compare AI model performance across different tasks.

In-Depth Explanation
Benchmarks provide objective metrics for comparing models. Popular LLM benchmarks include MMLU (knowledge), HumanEval (coding), and GSM8K (math). While useful for comparison, benchmarks may not reflect real-world performance and can be gamed through training data contamination.
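Under the hood, benchmark scores like the ones discussed here typically reduce to accuracy: the fraction of items a model answers correctly. A minimal sketch of that scoring step, assuming hypothetical multiple-choice items and predictions (not the API of any official evaluation harness):

```python
# Minimal sketch of benchmark-style scoring: compare a model's answers
# against gold labels on multiple-choice items, as MMLU-style evaluation does.
# The items and predictions below are illustrative, not real benchmark data.

def accuracy(predictions, gold):
    """Fraction of items where the predicted choice matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical 5-item run: the model gets 4 of 5 choices right.
preds = ["B", "C", "A", "D", "B"]
gold  = ["B", "C", "B", "D", "B"]
print(f"accuracy: {accuracy(preds, gold):.0%}")
```

Real harnesses add prompt formatting, answer extraction, and per-subject aggregation on top of this, but the headline number is still a percentage of correct items.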
Real-World Example
GPT-4 scores 86.4% on MMLU, significantly higher than GPT-3.5's 70%.