TutorBench: Grading the Next Generation of AI Tutors

https://scale.com/blog/tutorbench (scale.com)

TutorBench is a new benchmark that evaluates how well AI models teach high school and AP-level STEM subjects. It comprises 1,500 multimodal conversations, many of which include images of handwritten notes, and tests a model's skill in providing adaptive explanations, feedback, and active learning support. Responses are scored against a rubric by an LLM judge that closely aligns with human experts, on dimensions such as truthfulness, visual reasoning, and emotional awareness. Across the 15 frontier models tested, none has mastered tutoring: the top-performing model scored only 55.65%.

0 points • by hdt • 9 months ago
