VisualToolBench: Testing the Limits of AI Vision

https://scale.com/blog/visualtoolbench (scale.com)

Scale AI's VisualToolBench (VTB) is a new benchmark that tests how well models can "think with images" by actively manipulating them to solve problems. Even top-tier multimodal large language models performed poorly: no model achieved more than 19% correctness. Most failures stemmed not from flawed logic but from fundamental errors in visual perception, where models could not correctly interpret or extract information from images. The study also found that access to image manipulation tools did not guarantee that models could use them effectively or efficiently.

0 points•by chrisf•8 months ago
