The academic world recently introduced the BabyVision benchmark, and the results were jaw-dropping: current large multimodal models can rival PhD students in language comprehension, yet in visual reasoning they fall short of even a three-year-old. The contrast is like watching a literature professor struggle with kindergarten puzzles: full of theory but clumsy in practice.
When designing the test, the researchers deliberately modeled it on the cognitive developmental trajectory of human infants. Tasks included asking models to recognize partially hidden toys or to reason about the physics of balancing blocks. Surprisingly, challenges that seem effortless for human toddlers tripped up even the most advanced models. One experiment was particularly telling: asked "What happens if this block tower is pushed over?", a model could accurately describe the collapse, yet it showed none of the embodied instinct that makes a two-year-old reach out to steady the tower.
This disconnect points to a deeper issue in AI development. We have taught machines to quote the classics but haven't equipped them with basic everyday intuition, like raising a child who can solve calculus problems but can't tie shoelaces. Perhaps AI research should take its cues from human infants: crawl before walking, experience the world before philosophizing about it. After all, an intelligent system that can't grasp object permanence still has light-years to go before achieving true artificial general intelligence.