Benchmark
Definition
Standardized tests and datasets used to evaluate and compare AI model performance across different tasks.

In-Depth Explanation
Benchmarks provide objective metrics for comparing models. Popular LLM benchmarks include MMLU (knowledge), HumanEval (coding), and GSM8K (math). While useful for comparison, benchmarks may not reflect real-world performance and can be gamed through training data contamination.
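Under the hood, benchmark scores like the ones discussed here typically reduce to accuracy: the fraction of items a model answers correctly. A minimal sketch of that scoring step, assuming hypothetical multiple-choice items and predictions (not the API of any official evaluation harness):

```python
# Minimal sketch of benchmark-style scoring: compare a model's answers
# against gold labels on multiple-choice items, as MMLU-style evaluation does.
# The items and predictions below are illustrative, not real benchmark data.

def accuracy(predictions, gold):
    """Fraction of items where the predicted choice matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical 5-item run: the model gets 4 of 5 choices right.
preds = ["B", "C", "A", "D", "B"]
gold  = ["B", "C", "B", "D", "B"]
print(f"accuracy: {accuracy(preds, gold):.0%}")
```

Real harnesses add prompt formatting, answer extraction, and per-subject aggregation on top of this, but the headline number is still a percentage of correct items.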
Real-World Example
GPT-4 scores 86.4% on MMLU, significantly higher than GPT-3.5's 70%.