RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

Yucheng Chen¹ Yang Yu² Yufei Shi¹ Conghao Xiong³ Xulei Yang² Si Yong Yeo¹^(✉)

¹MedVisAI Lab, Lee Kong Chian School of Medicine, Nanyang Technological University (NTU), Singapore; Centre of AI in Medicine, Singapore ²Department of Machine Intellection, Institute for Infocomm Research (I²R), A*STAR, Singapore ³Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China

Abstract

Radiology report generation remains challenging because it requires precise alignment between complex visual patterns and long-form clinical narratives. Existing methods often treat reports as flat sequences, overlooking their inherent hierarchical structure and limiting fine-grained visual-text correspondence. RIHA addresses this challenge through hierarchical cross-modal alignment at paragraph, sentence, and word levels. It introduces a Visual Feature Pyramid (VFP) and a Text Feature Pyramid (TFP) to represent multi-scale visual information and multi-granularity textual semantics, and aligns them through a Cross-modal Hierarchical Alignment (CHA) module based on optimal transport. In addition, Relative Positional Encoding (RPE) is incorporated into the decoder to improve token-level alignment and contextual consistency. This design enables RIHA to better capture the structured nature of clinical reasoning and generate more accurate and coherent radiology reports.

Key Contributions

We propose RIHA, a hierarchical cross-modal alignment framework for radiology report generation, aligning visual and textual features at paragraph, sentence, and word levels.
We design a Visual Feature Pyramid (VFP) and a Text Feature Pyramid (TFP) to extract multi-scale visual features and multi-granularity textual representations.
We introduce a Cross-modal Hierarchical Alignment (CHA) module based on optimal transport, enabling distribution-level alignment across modalities and semantic levels.
We incorporate Relative Positional Encoding (RPE) into the decoder to improve token-level alignment and enhance semantic consistency in generated reports.

Motivation

Radiology reports are inherently hierarchical: paragraph-level impressions provide overall clinical context, sentence-level descriptions detail specific anatomical findings, and word-level terminology captures precise clinical terms and measurements. Existing radiology report generation (RRG) methods typically treat reports as flat token sequences and therefore do not exploit this hierarchy. RIHA is motivated by the need to establish cross-modal correspondences at all three semantic levels simultaneously, mirroring the hierarchical reasoning process of clinical radiologists.

Figure 1. A chest X-ray image and its corresponding radiology report, illustrating the hierarchical structure of paragraph-, sentence-, and word-level information. This highlights the need for multi-level visual-textual alignment.
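
To make the three levels concrete, the sketch below splits a report into paragraphs, sentences, and words using simple regular expressions. It is only an illustration of the hierarchy described above: the function name `split_report` and the splitting heuristics are assumptions, not the tokenization used by RIHA, and the sentence rule will mis-split abbreviations such as "Dr.".

```python
import re

def split_report(report):
    """Split a report into paragraphs, each holding sentences, each holding words."""
    hierarchy = []
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", report.strip()):
        # Heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
        hierarchy.append({
            "paragraph": " ".join(paragraph.split()),
            "sentences": [
                {
                    "sentence": s,
                    # Keep hyphenated terms and decimals such as "3.5" as one word.
                    "words": re.findall(r"[a-z0-9]+(?:[-'.][a-z0-9]+)*", s.lower()),
                }
                for s in sentences
            ],
        })
    return hierarchy

report = (
    "The heart size is normal. No pleural effusion is seen.\n\n"
    "No acute cardiopulmonary abnormality."
)
for level in split_report(report):
    print(level["paragraph"])
    for item in level["sentences"]:
        print("  ", item["sentence"], item["words"])
```

Each nesting level of the returned structure corresponds to one granularity at which text features could be extracted.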

Method Overview

RIHA is built around the idea that radiology reports are inherently hierarchical, with paragraph-level summaries, sentence-level findings, and word-level clinical terms reflecting different stages of diagnostic reasoning. To model this structure, RIHA first extracts multi-scale visual features with a Visual Feature Pyramid (VFP) and multi-granularity textual features with a Text Feature Pyramid (TFP). These representations are then aligned through the Cross-modal Hierarchical Alignment (CHA) module, which formulates visual-text matching as an optimal transport problem across different semantic levels. Finally, the aligned visual features are decoded into reports with the help of Relative Positional Encoding (RPE), which improves token-level contextual modeling and strengthens the coherence of generated sentences. This design allows RIHA to bridge the gap between structured clinical language and heterogeneous visual evidence more effectively than conventional flat-sequence approaches.
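
As a rough illustration of how visual-text matching can be posed as an optimal transport problem, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations and uses the resulting transport cost as an alignment loss, summed over semantic levels. Everything specific here is an assumption made for illustration: the cosine cost, the uniform marginals, the Sinkhorn solver, the value of `epsilon`, the way levels are paired, and all function names. The actual CHA formulation may differ.

```python
import math

def cosine_cost(visual, text):
    """Cost matrix with C[i][j] = 1 - cosine similarity of visual[i] and text[j]."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [
        [1.0 - sum(a * b for a, b in zip(v, t)) / (norm(v) * norm(t)) for t in text]
        for v in visual
    ]

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each visual feature
    b = [1.0 / m] * m  # mass on each text feature
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_alignment_loss(visual, text, epsilon=0.1):
    """Transport cost <plan, cost>; lower means the two feature sets align better."""
    cost = cosine_cost(visual, text)
    plan = sinkhorn(cost, epsilon)
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))

def hierarchical_alignment_loss(levels, epsilon=0.1):
    """Sum the OT loss over levels given as {name: (visual_feats, text_feats)}."""
    return sum(ot_alignment_loss(v, t, epsilon) for v, t in levels.values())

# Toy 2-D features; real features would come from the VFP and TFP extractors.
levels = {
    "word": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "sentence": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
print(hierarchical_alignment_loss(levels))
```

In the toy example the "word" level is perfectly matched and contributes almost nothing to the loss, while the "sentence" level is poorly matched and dominates it.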

Figure 2. The architecture of RIHA. An image is fed into the VFP extractor to obtain shallow-, middle-, and high-level visual features, while the TFP extractor produces paragraph-, sentence-, and word-level text features. The multi-scale visual features and multi-granularity textual features are then passed to CHA for hierarchical alignment, and the refined features are fed into a transformer encoder-decoder for report generation.
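
The description above does not specify which variant of Relative Positional Encoding the decoder uses. The sketch below shows one common form, in which a learned scalar bias per clipped relative offset is added to the attention logits of a causal, single-head decoder. The function name `relative_attention`, the clipping scheme, and the bias table `rel_bias` are assumptions for illustration only.

```python
import math

def relative_attention(queries, keys, values, rel_bias, max_distance):
    """Causal single-head attention with a bias indexed by relative offset.

    rel_bias must have max_distance + 1 entries; rel_bias[k] is added to the
    logit between a token and the token k positions before it (clipped).
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        logits = []
        for j in range(i + 1):  # causal mask: token i sees tokens 0..i
            offset = min(i - j, max_distance)
            score = sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            logits.append(score + rel_bias[offset])
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ])
    return outputs

# With zero queries and keys, only the relative bias shapes the attention.
zeros = [[0.0, 0.0]] * 3
values = [[1.0], [2.0], [3.0]]
print(relative_attention(zeros, zeros, values, [0.0, 0.0, 0.0], 2))
print(relative_attention(zeros, zeros, values, [0.0, -2.0, -4.0], 2))
```

Because the bias depends only on the distance between two tokens rather than on their absolute positions, the same local pattern is treated the same way wherever it occurs in the report.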

Results

We evaluate RIHA on two benchmark datasets, IU-Xray and MIMIC-CXR, using both natural language generation (BLEU, METEOR, ROUGE-L, CIDEr) and clinical efficacy metrics.
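
For readers unfamiliar with the n-gram metrics named above, the following is a simplified sketch of sentence-level BLEU with a single reference and no smoothing. It is meant only to show what the metric measures; reported scores are typically computed with standard evaluation toolkits at the corpus level, and the function names here are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU: brevity penalty times the geometric
    mean of clipped n-gram precisions for n = 1..max_n."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        precisions.append(overlap / total)
    # Penalize candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "no pleural effusion is seen".split()
print(bleu(reference, reference))
print(bleu("no pleural effusion".split(), reference, max_n=2))
```

The second call shows the brevity penalty at work: every n-gram in the short candidate is correct, yet the score drops because two reference words are missing.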

The results show that RIHA consistently outperforms existing state-of-the-art methods across both datasets. The improvements are attributed to the proposed hierarchical alignment mechanism, which enables more precise mapping between visual features and structured report components.

Figure 3 presents qualitative comparisons of generated reports on the MIMIC-CXR dataset. Compared with the baseline model, RIHA generates reports that more faithfully reflect the ground-truth clinical findings, as indicated by the higher overlap of highlighted terms. In particular, RIHA reduces both omission errors, where important findings are missed, and hallucination errors, where unsupported observations are introduced. This improvement is closely related to the proposed hierarchical alignment mechanism, which enables the model to associate visual evidence with textual content at paragraph, sentence, and word levels, leading to more structured and clinically consistent reports.

Figure 3. Examples of generated reports from the MIMIC-CXR testing subset using the baseline model and our proposed RIHA method. Identical findings in the ground truth (GT) and generated reports are highlighted with matching colors, demonstrating the superior performance of our approach.

Figure 4 provides further insight into how this improvement is achieved. The attention map visualizations show that RIHA attends more precisely to clinically relevant regions corresponding to each keyword, whereas the baseline model often exhibits diffuse or misaligned attention. This suggests that the proposed alignment strategy improves the model's ability to ground textual tokens in specific visual evidence. Such fine-grained visual-text correspondence is particularly important in clinical report generation, where interpretability and reliability depend on whether textual descriptions are supported by the appropriate image regions.

Figure 4. Attention map visualizations for various keywords from the baseline and RIHA models. RIHA attends to more precisely localized regions for each keyword.

Together, these results demonstrate that RIHA not only improves linguistic quality, but also strengthens clinically meaningful alignment between images and reports, which is essential for trustworthy radiology report generation.

Citation

@article{chen2026riha,
  title={RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation},
  author={Chen, Yucheng and Yu, Yang and Shi, Yufei and Xiong, Conghao and Yang, Xulei and Yeo, Si Yong},
  journal={IEEE Journal of Biomedical and Health Informatics},
  year={2026},
  doi={10.1109/JBHI.2026.3670023}
}