
New research has revealed another set of tasks most humans can do with ease that artificial intelligence (AI) stumbles over — reading an analogue clock or figuring out the day on which a date will fall.
AI may be able to write code, generate lifelike images, create human-sounding text and even pass exams (with varying degrees of success), yet it routinely misinterprets the position of hands on everyday clocks and fails at the basic arithmetic needed for calendar dates.
Researchers revealed these unexpected flaws in a presentation at the 2025 International Conference on Learning Representations (ICLR). They also published their findings March 18 on the preprint server arXiv, meaning the work has not yet been peer-reviewed.
“Most people can tell the time and use calendars from an early age. Our findings highlight a significant gap in the ability of AI to carry out what are quite basic skills for people,” study lead author Rohit Saxena, a researcher at the University of Edinburgh, said in a statement. “These shortfalls must be addressed if AI systems are to be successfully integrated into time-sensitive, real-world applications, such as scheduling, automation and assistive technologies.”
To investigate AI’s timekeeping abilities, the researchers fed a custom dataset of clock and calendar images into various multimodal large language models (MLLMs), which can process visual as well as textual information. The models used in the study include Meta’s Llama 3.2-Vision, Anthropic’s Claude-3.5 Sonnet, Google’s Gemini 2.0 and OpenAI’s GPT-4o.
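The study's exact evaluation setup is not reproduced here, but a minimal sketch shows how a single clock-reading question might be put to one of these models. The example below assumes OpenAI's Python SDK and a placeholder image file ("analogue_clock.png"), and the prompt wording is illustrative rather than taken from the paper.

```python
import base64
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path; the study used its own custom dataset of clock images.
with open("analogue_clock.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # one of the models named in the study
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What time does this analogue clock show? Answer as HH:MM."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```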
The results were poor: the models failed to identify the correct time from an image of a clock, or the day of the week for a sample date, more than half the time.
Related: Current AI models a ‘dead end’ for human-level intelligence, scientists agree
However, the researchers have an explanation for AI’s surprisingly poor time-reading abilities.
“Early systems were trained based on labelled examples. Clock reading requires something different — spatial reasoning,” Saxena said. “The model has to detect overlapping hands, measure angles and navigate diverse designs like Roman numerals or stylized dials. AI recognizing that ‘this is a clock’ is easier than actually reading it.”
Dates proved just as difficult. When the models were given a challenge like “What day will the 153rd day of the year be?” the failure rate was similarly high: AI systems read clocks correctly only 38.7% of the time and answered calendar questions correctly only 26.3% of the time.
This shortcoming is similarly surprising because arithmetic is a cornerstone of computing, but as Saxena explained, AI approaches it differently. “Arithmetic is trivial for traditional computers but not for large language models. AI doesn’t run math algorithms, it predicts the outputs based on patterns it sees in training data,” he said. “So while it may answer arithmetic questions correctly some of the time, its reasoning isn’t consistent or rule-based, and our work highlights that gap.”
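For comparison, a conventional program answers the calendar question above with explicit date arithmetic rather than pattern prediction. The snippet below is purely illustrative, not part of the study, and picks 2025 (a non-leap year) as the example year.

```python
from datetime import date, timedelta

# Day 1 of the year is Jan. 1, so the 153rd day is Jan. 1 plus 152 days.
year = 2025  # a non-leap year, chosen for illustration
day_153 = date(year, 1, 1) + timedelta(days=152)

print(day_153.isoformat())     # 2025-06-02
print(day_153.strftime("%A"))  # Monday
```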
The project is the latest in a growing body of research highlighting the differences between the way AI “understands” and the way humans do. Models derive answers from familiar patterns and excel when there are enough examples in their training data, yet they fail when asked to generalize or use abstract reasoning.
“What for us is a very simple task like reading a clock may be very hard for them, and vice versa,” Saxena said.
The research also reveals the problems AI has when it’s trained on limited data, in this case comparatively rare phenomena such as leap years or obscure calendar calculations. Even though LLMs have plenty of examples that explain leap years as a concept, that doesn’t mean they make the connections needed to complete a visual task.
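By contrast, the leap-year rule itself is a fixed piece of logic that conventional code applies the same way every time. The brief sketch below is not from the study; it simply shows the rule written out explicitly and checked against Python's standard library.

```python
import calendar

def is_leap(year: int) -> bool:
    # Gregorian rule: divisible by 4, except century years not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

for y in (1900, 2000, 2024, 2025):
    assert is_leap(y) == calendar.isleap(y)  # matches the standard library
    print(y, is_leap(y))  # 1900 False, 2000 True, 2024 True, 2025 False
```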
The research highlights both the need for more targeted examples in training data and the need to rethink how AI handles the combination of logical and spatial reasoning, especially in tasks it doesn’t encounter often.
Above all, it reveals one more area where placing too much trust in AI output comes at our peril.
“AI is powerful, but when tasks mix perception with precise reasoning, we still need rigorous testing, fallback logic, and in many cases, a human in the loop,” Saxena said.