In recent years, AI systems have made significant strides in recognizing and analyzing complex images. However, a recent study by researchers at Auburn University and the University of Alberta shows that many state-of-the-art vision language models (VLMs) struggle with basic visual analysis tasks that are trivial for humans.
The researchers created eight straightforward visual acuity tests with objectively correct answers to assess four VLMs: GPT-4o, Gemini-1.5 Pro, Claude 3 Sonnet, and Claude 3.5 Sonnet. The tests range from counting the intersections of colored lines to counting nested shapes in an image. Surprisingly, none of the four models achieved 100 percent accuracy on these simple tasks, highlighting the limitations of AI in low-level visual reasoning.
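To make the setup concrete, here is a minimal, illustrative sketch (not the authors' benchmark code) of how a test of this kind can be generated with Pillow: two colored line segments are drawn at known coordinates, so the ground-truth number of intersections is computed by construction and a model's answer can be graded automatically.

```python
from PIL import Image, ImageDraw

def make_intersection_test(p1, p2, q1, q2, size=512):
    """Draw two colored line segments and return the image plus the ground truth."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    draw.line([p1, p2], fill="red", width=4)
    draw.line([q1, q2], fill="blue", width=4)

    # Ground truth via the standard segment-intersection orientation test
    # (ignores collinear edge cases for brevity).
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    intersections = 1 if (d1 * d2 < 0 and d3 * d4 < 0) else 0
    return img, intersections

img, truth = make_intersection_test((50, 400), (460, 80), (60, 60), (450, 450))
img.save("intersection_test.png")
print("expected answer:", truth)  # the model is then asked how many times the lines cross
```

Because the answer is known exactly, any deviation by a model is an unambiguous error rather than a matter of interpretation.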
Interestingly, performance varied significantly by task. While Gemini-1.5 Pro identified circled letters with nearly 93 percent accuracy, it achieved less than 60 percent accuracy when counting the rows and columns of a blank grid. This discrepancy suggests that AI models have blind spots when it comes to abstract images that do not align with their training data.
Moreover, small variations in the tasks had a noticeable impact on the results. For instance, while all models correctly counted five overlapping hollow circles, their accuracy dropped below 50 percent when faced with six to nine circles. This led the researchers to suspect that the models are biased toward well-known patterns, such as the five-ring Olympic logo.
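The variation itself is trivial to produce. Below is a rough, hypothetical sketch (again using Pillow, and not the paper's exact stimuli) that renders a row of N interlocking hollow circles, the parameter the researchers swept from five up to nine.

```python
from PIL import Image, ImageDraw

def interlocking_circles(n, radius=60, overlap=0.4, size=(900, 240)):
    """Render n hollow circles in a row, each overlapping its neighbor."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    step = int(2 * radius * (1 - overlap))   # horizontal spacing so neighbors overlap
    x, y = radius + 20, size[1] // 2
    for _ in range(n):
        draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                     outline="black", width=5)
        x += step
    return img

for n in range(5, 10):                       # five circles through nine circles
    interlocking_circles(n).save(f"circles_{n}.png")
```

The images differ only in the count, which is what makes the sharp drop in accuracy beyond five circles so telling.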
The study also found that fine-tuning the models on images drawn from the tasks did not significantly improve their performance. The researchers suggested that an “early fusion” training approach, which integrates visual encoding with language training from the start, could help models handle low-level visual tasks more effectively.
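As a rough, hedged illustration of what “early fusion” means architecturally (a toy PyTorch module, not any specific production model), image patch embeddings and text token embeddings are projected into the same space and processed by one shared transformer from the first layer, rather than bolting a separately trained vision encoder onto a finished language model.

```python
import torch
import torch.nn as nn

class ToyEarlyFusion(nn.Module):
    def __init__(self, vocab=32000, dim=512, patch_dim=768, layers=4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, dim)        # text tokens -> shared space
        self.patch_proj = nn.Linear(patch_dim, dim)      # image patches -> shared space
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, token_ids, patch_features):
        text = self.tok_embed(token_ids)                 # (B, T, dim)
        vision = self.patch_proj(patch_features)         # (B, P, dim)
        fused = torch.cat([vision, text], dim=1)         # one sequence, one backbone
        return self.backbone(fused)
```

In this arrangement the backbone sees raw visual detail during training instead of a compressed summary from a frozen encoder, which is the intuition behind the researchers' suggestion.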
Overall, the findings underscore the limitations of current AI models in basic visual reasoning and the need for more robust training methods. While AI systems have made remarkable progress in complex image analysis, they still stumble on fundamental visual tasks that humans find effortless. This research opens new avenues for improving the visual capabilities of AI models across a wide range of tasks.