Our proposed Re-Align supports image synthesis conditioned on flexible image-text interleaved prompts: a) in-context image generation, also referred to as subject-driven image generation, and b) in-context image editing, also referred to as reference-based image editing. c) An inference example from Re-Align, including an aligned reasoning–image pair; the reasoning text is converted from XML to JSON for clearer visualization.
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a structured reasoning-guided alignment framework for in-context image generation and editing.
Re-Align adopts a structured reasoning mechanism, In-Context Chain-of-Thought (IC-CoT), which explicitly decomposes the reasoning process into semantic guidance and reference association and is applied uniformly to both image generation and editing. The former provides a clear textual target for image generation, partly reducing the image-text interleaved task to text-to-image generation; the latter analyzes the role of each reference image within the multi-image context to prevent reference confusion.
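To make the decomposition concrete, the snippet below sketches what an IC-CoT record could look like once converted from XML to JSON for visualization; the field names and contents are illustrative assumptions, not the exact schema used by Re-Align.

```python
# Illustrative IC-CoT record rendered as JSON (the reasoning text is converted
# from XML to JSON for visualization). Field names and values are assumptions
# made for this example, not the exact Re-Align schema.
import json

ic_cot = {
    # Semantic guidance: a single textual target, which partly reduces the
    # interleaved prompt to a text-to-image description.
    "semantic_guidance": (
        "A plush fox toy sitting on a wooden desk beside a steaming mug, "
        "soft morning light, shallow depth of field."
    ),
    # Reference association: the role each reference image plays in the
    # multi-image context, so subjects are not confused with one another.
    "reference_association": [
        {"image": "reference_1", "role": "subject", "detail": "the plush fox toy"},
        {"image": "reference_2", "role": "style", "detail": "warm, film-like color grading"},
    ],
}

print(json.dumps(ic_cot, indent=2, ensure_ascii=False))
```

Keeping the two parts separate lets the generator follow one clear textual target while still knowing which reference image supplies which subject or attribute.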
To further enhance the model's performance on complex interleaved prompts, we employ Group Relative Policy Optimization (GRPO) with a surrogate reward that measures the correspondence between the CoT context and the resulting image. A reasoning-induced diversity strategy is proposed to improve the diversity of samples across groups, thereby stabilizing GRPO training.
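As a rough illustration, the sketch below shows how group-relative advantages are typically computed in GRPO from per-sample rewards; the rewards here are stand-in values, since the surrogate CoT-image correspondence scorer itself is not reproduced.

```python
# Minimal GRPO-style advantage computation over a group of images generated
# for the same interleaved prompt. The rewards are stand-ins for the surrogate
# CoT-image correspondence score; this is an illustrative sketch, not the
# paper's implementation.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one group so that advantages are zero-mean."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical per-sample rewards from the surrogate alignment scorer.
rewards = np.array([0.62, 0.71, 0.40, 0.55, 0.83, 0.49, 0.66, 0.58])
advantages = group_relative_advantages(rewards)
# Each advantage then weights the clipped policy-gradient objective, as in PPO.
print(advantages.round(3))
```

If all samples in a group receive nearly identical rewards, the advantages collapse toward zero; encouraging diversity among the sampled reasoning-image pairs keeps the training signal informative, which is the motivation for the reasoning-induced diversity strategy.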
To support model training, we develop an automated data construction and filtering pipeline, yielding Re-Align-410K, a high-quality ICGE dataset with IC-CoT annotations spanning multiple in-context image generation and editing tasks.
The two-stage training pipeline of Re-Align. Upper: First, we perform supervised fine-tuning on carefully curated training data to enable the model to generate images guided by IC-CoT reasoning. Lower: Next, we apply policy optimization to further enhance reasoning–generation consistency, using an alignment score between the structured IC-CoT and the corresponding generated image.
As shown below, to support model training, we introduce Re-Align-410K, a high-quality collection covering multiple task types. The dataset is constructed with an automated pipeline that integrates advanced MLLMs and state-of-the-art image generation models.
The data construction pipeline of Re-Align-410K and its task distribution: a) reference image preparation, b) adaptive instruction generation, c) reasoning text generation, d) target image generation, e) data filtering, and f) the task distribution of Re-Align-410K.
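The sketch below strings stages a)-e) together to show the shape of the pipeline; every helper is a trivial stub standing in for the MLLMs and image-generation models referenced above, so treat it as a structural outline rather than the actual construction code.

```python
# Structural sketch of the Re-Align-410K construction pipeline (stages a-e).
# All helpers are hypothetical stubs; the real pipeline uses advanced MLLMs
# and state-of-the-art image generators for these steps.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    references: List[str]  # a) prepared reference images
    instruction: str       # b) adaptive instruction
    ic_cot: dict           # c) IC-CoT reasoning text
    target: str            # d) generated target image

def prepare_references(raw_images: List[str]) -> List[str]:
    return raw_images[:4]  # a) e.g., select up to four usable references

def generate_instruction(refs: List[str]) -> str:
    return f"Compose a new scene using the {len(refs)} reference image(s)."  # b) stub

def generate_reasoning(refs: List[str], instruction: str) -> dict:
    return {  # c) stub: semantic guidance + reference association
        "semantic_guidance": instruction,
        "reference_association": [{"image": r, "role": "subject"} for r in refs],
    }

def generate_target_image(refs: List[str], instruction: str, ic_cot: dict) -> str:
    return "target.png"  # d) stub for an image-generation model call

def consistency_score(sample: Sample) -> float:
    return 1.0  # e) stub for an automated quality / consistency check

def build_sample(raw_images: List[str], threshold: float = 0.8) -> Optional[Sample]:
    refs = prepare_references(raw_images)
    instruction = generate_instruction(refs)
    ic_cot = generate_reasoning(refs, instruction)
    target = generate_target_image(refs, instruction, ic_cot)
    sample = Sample(refs, instruction, ic_cot, target)
    return sample if consistency_score(sample) >= threshold else None  # e) filtering
```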
Qualitative comparisons of the proposed Re-Align with BAGEL, OmniGen2, Echo-4o, Qwen-Image-Edit (2509), and DreamOmni2 on in-context image generation and editing tasks.
Figure (a) provides additional in-context image generation and editing examples, demonstrating that the model produces accurate and highly consistent images when conditioned on one to four reference inputs. Figure (b) showcases in-context image editing capabilities, where the first, second, and third rows illustrate object addition, object replacement, and attribute modification with reference images, respectively. These results underscore the strong versatility and effectiveness of Re-Align across a broad range of creative generation tasks.
@misc{he2026realign,
title={Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing},
author={Runze He and Yiji Cheng and Tiankai Hang and Zhimin Li and Yu Xu and Zijin Yin and Shiyi Zhang and Wenxun Dai and Penghui Du and Ao Ma and Chunyu Wang and Qinglin Lu and Jizhong Han and Jiao Dai},
year={2026},
eprint={2601.05124},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.05124},
}