The MIRA benchmark is highly challenging for all Multimodal Large Language Models (MLLMs): even the most advanced closed-source models achieve an overall accuracy of no more than 20% under direct input (D). Text-based Chain-of-Thought (T) provides only limited benefit, and even hurts strong models such as Gemini 2.5 Pro and o3. In contrast, providing human-annotated intermediate visual cues (V) significantly boosts performance, yielding an average relative gain of 33.7% and highlighting the critical role of visual information in complex reasoning.
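For clarity, the relative gain quoted above is computed as the improvement over the direct-input baseline divided by that baseline. The sketch below shows the arithmetic; the accuracy values are hypothetical placeholders, not numbers reported by MIRA.

```python
def relative_gain(acc_baseline: float, acc_new: float) -> float:
    """Relative gain (%) of acc_new over acc_baseline."""
    return (acc_new - acc_baseline) / acc_baseline * 100.0

# Placeholder accuracies for illustration only (not MIRA results):
acc_direct = 0.15      # D: direct input
acc_visual_cot = 0.20  # V: with human-annotated intermediate visual cues
print(f"relative gain: {relative_gain(acc_direct, acc_visual_cot):.1f}%")
```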
Expanding the model's decoding search space (e.g., using Pass@k) brings only limited performance gains on the MIRA task, with improvements quickly saturating beyond Pass@4. For stronger models, the benefits of Pass@k or majority voting are minimal, indicating that their failures stem from a fundamental lack of capability rather than mere random reasoning errors.
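MIRA's exact evaluation scripts are not restated here, but a standard way to estimate Pass@k from n sampled generations (of which c are correct) is the unbiased estimator of Chen et al. (2021), paired with simple majority voting over sampled answers. The sketch below is a minimal illustration of those two metrics, not the benchmark's official implementation.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled generations."""
    return Counter(answers).most_common(1)[0][0]
```

When c is small relative to n, pass_at_k grows slowly as k increases, which is consistent with the saturation beyond Pass@4 described above.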
Replacing the generic Text-CoT prompt with task-specific prompts that better simulate the guidance provided by Visual-CoT yields consistent but relatively marginal improvements (an average gain of about 1.4% to 1.5%). This limited improvement, in contrast to the substantial gains from Visual-CoT, highlights the inherent limitation of purely textual guidance: it struggles to capture the visual information required for certain reasoning steps.
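To make the distinction concrete, a task-specific Text-CoT prompt can ask the model to verbalize the intermediate visual state that Visual-CoT would otherwise supply as an image. The templates below are hypothetical wordings for illustration, not the actual prompts used in the benchmark.

```python
# Hypothetical prompt templates (not MIRA's exact wording), contrasting the
# generic Text-CoT condition with a task-specific variant.
GENERIC_TEXT_COT = (
    "Look at the image and answer the question. "
    "Think step by step before giving your final answer."
)

TASK_SPECIFIC_TEXT_COT = (
    "Look at the image and answer the question. "
    "First describe, in words, the intermediate visual state you would need "
    "to draw or imagine (e.g., an auxiliary line or an updated board state), "
    "then reason over that description to reach your final answer."
)
```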
We showcase several representative examples for each category below.
@article{zhou2025mira,
title={When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought},
author={Zhou, Yiyang and Tu, Haoqin and Wang, Zijun and Wang, Zeyu and Muennighoff, Niklas and Nie, Fan and Choi, Yejin and Zou, James and Deng, Chaorui and Yan, Shen and Fan, Haoqi and Xie, Cihang and Yao, Huaxiu and Ye, Qinghao},
journal={arXiv preprint arXiv:2511.xxxx},
year={2025},
url={https://mira-benchmark.github.io/}
}