ECHO
Efficient Chest X-ray Report Generation
with One-step Block Diffusion

1 AIRC, Midea Group  ·  2 Beijing Jiaotong University  ·  3 Dalian University of Technology
*Equal Contribution, Corresponding Author, Project Leader.

Demonstration of ECHO generating a radiology report from a chest X-ray input.

About ECHO

ECHO is a discrete diffusion vision–language model for automated chest X-ray report generation. Existing diffusion language models face an inherent quality–speed trade-off, where generating coherent outputs requires multiple denoising steps that substantially increase inference latency. ECHO overcomes this limitation by introducing non-factorized distillation targets constructed from the teacher's on-policy denoising trajectory, enabling coherent one-step-per-block inference that prior distillation methods could not achieve. A response-asymmetric adaptation strategy further reduces the training cost of AR-to-diffusion conversion by eliminating redundant processing of long vision token contexts. On three public CXR benchmarks, ECHO consistently outperforms autoregressive and diffusion-based state-of-the-art models, improving RaTEScore and SemScore by 64.33% and 60.58% respectively over AR baselines, while achieving up to an 8× inference speedup with marginal quality degradation.

Motivation

Discrete diffusion language models approximate the joint token distribution through token factorization, treating each position as conditionally independent. This approximation ignores inter-token dependencies, requiring multi-step remasking to progressively recover output coherence. Each additional denoising step, however, incurs an extra model forward pass, increasing inference latency and creating a fundamental quality–speed dilemma. ECHO resolves this through Direct Conditional Distillation (DCD), which constructs non-factorized supervision from the teacher's on-policy multi-step trajectories, enabling the student to capture joint token dependencies in a single forward pass per block — achieving multi-step quality at single-step speed.
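The multi-step remasking loop described above can be illustrated with a minimal sketch. This is not ECHO's implementation: `model` is a hypothetical callable returning per-position logits, and the linear commit schedule is an assumption. Each step predicts every masked position in parallel, commits only the most confident tokens, and re-masks the rest; setting `steps=1` recovers the fully factorized one-step decode whose incoherence motivates DCD.

```python
import torch

def decode_block(model, ctx, block_len, steps, mask_id):
    """Block decoding with confidence-heuristic remasking (toy sketch).

    model    -- hypothetical callable: token ids (seq,) -> logits (seq, vocab)
    ctx      -- already-decoded context tokens
    steps    -- number of denoising steps; each one is a full forward pass
    """
    block = torch.full((block_len,), mask_id, dtype=torch.long)
    for s in range(steps):
        # Predict all positions in parallel (the factorized approximation).
        logits = model(torch.cat([ctx, block]))[-block_len:]
        conf, pred = logits.softmax(-1).max(-1)          # per-token confidence
        masked = block == mask_id
        # Linear schedule: how many tokens should be committed after this step.
        n_commit = max(1, round(block_len * (s + 1) / steps)) - int((~masked).sum())
        if n_commit > 0:
            conf = conf.masked_fill(~masked, -1.0)       # only fill masked slots
            keep = conf.topk(n_commit).indices           # most confident predictions
            block[keep] = pred[keep]                     # commit; rest stays masked
    return block
```

Every extra value of `steps` buys coherence at the cost of one more forward pass per block, which is exactly the latency DCD removes.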

Fig. 1: Motivation and performance of ECHO
(a) Decoding all tokens simultaneously in one step produces incoherent outputs, as standard diffusion models predict each position independently. Our Direct Conditional Distillation (DCD) distills from a non-factorized target, yielding coherent one-step-per-block outputs. (b) Compared to both autoregressive and diffusion-based baselines, ECHO achieves a favorable trade-off between generation quality (SemScore) and decoding throughput (tokens per forward pass, TPF). SDAR* denotes SDAR implemented in the multimodal large language model setting.

Method

ECHO is built through three successive training stages, from a domain-specialized autoregressive model to a multi-step block diffusion backbone, and finally to the distilled one-step-per-block model. Response-Asymmetric Diffusion (RAD) adapts the autoregressive model into a block diffusion model by duplicating only the response portion of each training sequence, which eliminates the redundant processing of long vision token contexts imposed by prior two-stage conversion methods. Direct Conditional Distillation (DCD) then distills this multi-step teacher into a single-step student by constructing joint, non-factorized supervision targets from the teacher's confidence-heuristic remasking trajectory, allowing the student to capture inter-token dependencies that single-step decoding would otherwise lose.
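The response-asymmetric duplication in RAD can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the random Bernoulli masking stands in for whatever noise schedule the actual adaptation uses, and the token layout is hypothetical.

```python
import torch

def rad_sequence(vision, prompt, response, mask_id, mask_ratio=0.5):
    """Build a Response-Asymmetric Diffusion training sequence (toy sketch).

    Full-sequence conversion recipes duplicate the entire input, vision
    context included, into a noised copy plus a clean copy. RAD keeps the
    long vision/prompt context single and duplicates only the response:
    a partially masked copy for the diffusion loss, followed by the clean
    copy it can attend to.
    """
    noised = response.clone()
    noised[torch.rand(len(response)) < mask_ratio] = mask_id   # toy noise schedule
    # Context appears once; only the (short) response is duplicated.
    return torch.cat([vision, prompt, noised, response])
```

With thousands of vision tokens from a high-resolution chest X-ray and a report of only a few hundred tokens, the sequence length saved by not duplicating the context is what drives the training-FLOP reduction reported for RAD.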

Fig. 2: ECHO training pipeline overview
Overview of the ECHO training pipeline. ECHO is built in three successive stages: continued pre-training (Stage 1) produces ECHOAR; Response-Asymmetric Diffusion adaptation (Stage 2) converts it into the block diffusion model ECHOBase; and Direct Conditional Distillation (Stage 3) distills ECHOBase into the final one-step-per-block model ECHO. (a) RAD duplicates only the response portion of the training sequence, avoiding the redundant duplication of long vision token contexts required by prior two-stage conversion methods. (b) DCD proceeds in two phases: the teacher's confidence-heuristic remasking trajectory is collected to form a joint, non-factorized supervision target, which is then used to align the student's single-step prediction via KL divergence.
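The two-phase DCD procedure in panel (b) can be reduced to a sketch like the one below. This is one plausible minimal form, not the paper's exact objective: `target` is assumed to be a block already decoded by the teacher's multi-step remasking rollout (its on-policy trajectory), and both models are hypothetical callables mapping token ids to logits.

```python
import torch
import torch.nn.functional as F

def dcd_loss(teacher, student, ctx, target, mask_id):
    """Direct Conditional Distillation objective (toy sketch).

    The teacher's conditionals, evaluated on its own on-policy output
    `target`, serve as a joint (non-factorized) supervision signal.
    The student sees only a fully masked block and must match those
    conditionals in a single forward pass, aligned via KL divergence.
    """
    block_len = len(target)
    with torch.no_grad():
        # Phase 1: teacher conditionals on its own decoded trajectory.
        t_logits = teacher(torch.cat([ctx, target]))[-block_len:]
    # Phase 2: student predicts the whole block in one step from masks.
    masked = torch.full((block_len,), mask_id, dtype=torch.long)
    s_logits = student(torch.cat([ctx, masked]))[-block_len:]
    return F.kl_div(F.log_softmax(s_logits, -1),
                    F.log_softmax(t_logits, -1),
                    log_target=True, reduction="batchmean")
```

Because the target comes from a full multi-step rollout rather than a single factorized teacher step, the student is supervised toward the joint distribution that one-step decoding would otherwise lose.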

Highlights

🏥 State-of-the-Art CXR Report Generation. ECHO consistently outperforms both autoregressive and diffusion-based state-of-the-art models on clinical fidelity metrics, with improvements of 64.33% on RaTEScore and 60.58% on SemScore over strong AR baselines, including larger models.

⚡ Up to 8× Inference Speedup. ECHO achieves up to an 8× decoding speedup over multi-step diffusion baselines with only marginal quality degradation, offering a more favorable quality–speed trade-off than all existing distillation methods across block size configurations.

🧩 Training-Efficient AR-to-Diffusion Adaptation. By duplicating only the response tokens rather than the full sequence during adaptation, ECHO reduces training FLOPs by 72.3% relative to prior two-stage conversion methods, making the AR-to-diffusion conversion substantially more practical at the scale of high-resolution medical images.

Experiments

ECHO is comprehensively evaluated on multiple public CXR benchmarks against a wide range of baselines. On clinical fidelity metrics, ECHO outperforms MedGemma-27B, the current open-source state-of-the-art medical VLM, by 17–40%. Among diffusion-based methods, ECHO also achieves superior quality–speed trade-offs, reaching up to 274 tokens per second and a throughput of 8 tokens per forward pass with only marginal quality degradation relative to the multi-step teacher.

ECHO experimental results
Comparison with autoregressive medical VLMs, general-purpose proprietary models, and diffusion-based distillation methods on three public CXR benchmarks. Metrics cover linguistic quality (ROUGE-L, CIDEr), clinical fidelity (RaTEScore, SemScore), and decoding throughput (TPF: tokens per forward pass; TPS: tokens per second).

BibTeX

@article{chen2026echoefficientchestxray,
  title={ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion},
  author={Lifeng Chen and Tianqi You and Hao Liu and Zhimin Bao and Jile Jiao and Xiao Han and Zhicai Ou and Tao Sun and Xiaofeng Mou and Xiaojie Jin and Yi Xu},
  journal={arXiv preprint arXiv:2604.09450},
  year={2026}
}