We Need More Realistic Benchmarks for AI Models in Medicine
Large language models have shown impressive capabilities, but their medical knowledge has so far been tested only on medical licensing exams. We find that on more realistic tasks, such as clinical decision-making, they lag behind medical experts, which highlights the need for more realistic benchmarks.