LeapAlign: Post-Training Flow Matching Models
at Any Generation Step by Building Two-Step Trajectories

Zhanhao Liang1,2,* Tao Yang2,*,† Jie Wu2 Chengjian Feng2 Liang Zheng1
1The Australian National University 2ByteDance Seed
*Equal contribution †Project lead

TL;DR: LeapAlign constructs a two-step leap trajectory from a full generation trajectory for efficient reward gradient backpropagation. It enables fine-tuning at any generation step without incurring excessive memory cost or gradient explosion. LeapAlign shows strong performance in aligning flow matching models with general human preferences and improving image-text alignment.

Images generated by Flux fine-tuned with LeapAlign using HPSv3 as the reward model. More examples appear in the gallery below.

Overview

Text-to-image flow matching models can be fine-tuned by directly backpropagating reward gradients through differentiable sampling trajectories. However, full-trajectory backpropagation is memory-intensive and prone to gradient explosion. Existing methods either update only late generation steps [1], [2], missing early steps that control image layout, or stop gradients at the model input, discarding gradient terms that reflect interaction between steps [3].

LeapAlign constructs a two-step leap trajectory from a full trajectory and backpropagates reward gradients through this shortened path. This enables updates at any generation step, while gradient discounting retains gradient terms that reflect interaction between steps and stabilizes training.

Our contributions are as follows:

- A two-step leap trajectory that lets reward gradients reach any generation step at the memory cost of backpropagating through only two model evaluations.
- A gradient discounting scheme that moderates the magnitude of the nested gradient term without discarding the cross-step interaction it encodes.
- A hinge-style objective with trajectory-similarity weighting that reduces reward hacking and emphasizes leap trajectories faithful to the original generation dynamics.
- Consistent gains over policy-gradient and direct-gradient baselines on general preference alignment and compositional image-text alignment.

Method

Overview of LeapAlign. Reward gradients are backpropagated through a two-step leap trajectory instead of the full generation trajectory.

Leap Trajectory Construction

We randomly select two timesteps \(k\) and \(j\) from the generation trajectory, where \(k > j\). We then construct the leap trajectory as:

\[ \begin{aligned} \hat{x}_{j \mid k} &= x_k - (k-j)v_\theta(x_k), && \color{#4f71be}{\text{One-Step Leap Prediction}} \\ x_j &= \hat{x}_{j \mid k} + \operatorname{stop\_gradient}(x_j-\hat{x}_{j \mid k}), && \color{#e18341}{\text{Latent Connector}} \\ \hat{x}_{0 \mid j} &= x_j - jv_\theta(x_j), && \color{#4f71be}{\text{One-Step Leap Prediction}} \\ x_0 &= \hat{x}_{0 \mid j} + \operatorname{stop\_gradient}(x_0-\hat{x}_{0 \mid j}), && \color{#e18341}{\text{Latent Connector}} \end{aligned} \]

This process forms a two-step leap trajectory:

\[ x_k \;{\color{#4f71be}{\rightarrow}}\; (\hat{x}_{j \mid k} \;{\color{#e18341}{\dashrightarrow}}\; x_j) \;{\color{#4f71be}{\rightarrow}}\; (\hat{x}_{0 \mid j} \;{\color{#e18341}{\dashrightarrow}}\; x_0) \]

The solid blue arrows denote one-step leap predictions by the flow matching model, while the dashed orange arrows denote latent connectors that align predicted latents with real latents. Reward gradients are backpropagated through the leap trajectory instead of the full generation trajectory. Because \(k\) and \(j\) are randomly selected, LeapAlign can fine-tune any generation step.
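As a concrete illustration, the construction above can be sketched in PyTorch (a minimal sketch: `build_leap_x0` and its arguments are hypothetical names, and a real velocity model also takes a timestep and text conditioning; `detach` plays the role of stop_gradient):

```python
import torch

def build_leap_x0(v_theta, x_k, x_j, x_0, k, j):
    """Differentiable two-step leap from x_k to x_0.

    x_k, x_j, x_0 are latents from a full sampling pass, stored
    without gradients; k > j are their timesteps.
    """
    # One-step leap prediction from timestep k to j.
    x_hat_j = x_k - (k - j) * v_theta(x_k)
    # Latent connector: the value equals the real x_j, but gradients
    # flow back through x_hat_j (straight-through).
    x_j = x_hat_j + (x_j - x_hat_j).detach()
    # One-step leap prediction from timestep j to 0.
    x_hat_0 = x_j - j * v_theta(x_j)
    # Second latent connector onto the real final latent x_0.
    return x_hat_0 + (x_0 - x_hat_0).detach()
```

The returned latent equals the real `x_0` in the forward pass, so the reward model scores the actual generated image, while backpropagation sees only the two leap predictions.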

Gradient Discounting

Backpropagating through the leap trajectory gives the following gradient:

\[ \begin{aligned} \frac{\partial x_0}{\partial \theta} &= \underbrace{ -\,j\frac{\partial v_\theta(x_j)}{\partial \theta} - (k-j)\frac{\partial v_\theta(x_k)}{\partial \theta} }_{\text{single-step gradients at } k \text{ and } j} \\ &\quad+ \underbrace{ j(k-j)\frac{\partial v_\theta(x_j)}{\partial x_j} \frac{\partial v_\theta(x_k)}{\partial \theta} }_{\text{nested gradient}}. \end{aligned} \]
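To see where these terms come from, apply the chain rule along the leap trajectory: the latent connectors pass gradients straight through, so \(\partial x_0 / \partial \theta = \partial \hat{x}_{0 \mid j} / \partial \theta\) and \(\partial x_j / \partial \theta = \partial \hat{x}_{j \mid k} / \partial \theta\), while \(x_k\) is sampled without gradients:

\[ \begin{aligned} \frac{\partial \hat{x}_{0 \mid j}}{\partial \theta} &= \frac{\partial x_j}{\partial \theta} - j\left(\frac{\partial v_\theta(x_j)}{\partial \theta} + \frac{\partial v_\theta(x_j)}{\partial x_j}\,\frac{\partial x_j}{\partial \theta}\right), \\ \frac{\partial \hat{x}_{j \mid k}}{\partial \theta} &= -(k-j)\frac{\partial v_\theta(x_k)}{\partial \theta}. \end{aligned} \]

Substituting the second equation into the first recovers the two single-step gradients and the nested gradient.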

The nested gradient is useful for capturing interactions across different generation steps, but it can have a large magnitude. To control it, we introduce gradient discounting. With a discounting factor \(\alpha \in [0, 1]\), we modify the second one-step leap prediction:

\[ \hat{x}_{0 \mid j} = x_j - jv_\theta\!\left(\alpha x_j + (1-\alpha)\operatorname{stop\_gradient}(x_j)\right). \]

The gradient then becomes:

\[ \begin{aligned} \frac{\partial x_0}{\partial \theta} &= -\,j\frac{\partial v_\theta(x_j)}{\partial \theta} - (k-j)\frac{\partial v_\theta(x_k)}{\partial \theta} \\ &\quad+ {\color{#d62728}{\alpha}}j(k-j)\frac{\partial v_\theta(x_j)}{\partial x_j} \frac{\partial v_\theta(x_k)}{\partial \theta}. \end{aligned} \]

By adjusting \(\alpha\), we can moderate the gradient magnitude without discarding any component of the gradient flow. This, together with the leap trajectory design, stabilizes optimization while retaining full learning signals.
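In code, the discounting amounts to a one-line rescaling of the input to the second leap prediction (a sketch; `discount_gradient` is a hypothetical helper, with `detach` standing in for stop_gradient):

```python
import torch

def discount_gradient(x, alpha):
    # The value is exactly x; only the gradient flowing back through
    # x is scaled by alpha, matching alpha*x + (1-alpha)*sg(x).
    return alpha * x + (1 - alpha) * x.detach()
```

Because the forward value is unchanged, the single-step gradient terms are untouched; only the nested term, which flows through the input of \(v_\theta(x_j)\), is scaled by \(\alpha\).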

Fine-Tuning Objective

To reduce reward hacking and prevent unstable optimization toward excessively high or misleading reward values, we use a simple hinge-style objective:

\[ \mathcal{L}_{\text{raw}} = \max(0, \lambda - r(x_0)). \]

We evaluate the reward using the generated image \(x_0\), which directly reflects the output quality of the full generation trajectory. This allows the reward model to make more faithful assessments of visual and semantic quality, providing reliable supervision signals for fine-tuning.

Trajectory-Similarity Weighting

To emphasize leap trajectories that better match the original generation dynamics and provide reliable training signals, we introduce trajectory-similarity weighting. We measure similarity by the average absolute difference between predicted states \(\hat{x}\) and actual states \(x\) at the two connection points:

\[ d_j = \operatorname{mean}\!\left(|x_j-\hat{x}_{j \mid k}|\right), \qquad d_0 = \operatorname{mean}\!\left(|x_0-\hat{x}_{0 \mid j}|\right). \]

To avoid overemphasizing near-identical pairs, we clamp each distance with a minimum value \(\tau\) and define the weighting factor as:

\[ w_{\text{sim}} = \frac{1}{\max(d_j, \tau) + \max(d_0, \tau)}. \]

The final objective is formulated as:

\[ \mathcal{L} = \operatorname{stop\_gradient}(w_{\text{sim}})\,\mathcal{L}_{\text{raw}}. \]
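Putting the hinge objective and the similarity weighting together, the full loss might look like the following sketch (the function name is hypothetical, and the default values of `lam` and `tau` are placeholders rather than values from the paper):

```python
import torch

def leapalign_loss(reward, x_j, x_hat_j, x_0, x_hat_0, lam=1.0, tau=0.05):
    # Hinge-style objective: zero loss (and gradient) once the reward
    # already exceeds the threshold lam.
    raw = torch.clamp(lam - reward, min=0.0)
    # Mean absolute gaps at the two connection points.
    d_j = (x_j - x_hat_j).abs().mean()
    d_0 = (x_0 - x_hat_0).abs().mean()
    # Clamp from below by tau so near-identical pairs are not overweighted.
    w_sim = 1.0 / (torch.clamp(d_j, min=tau) + torch.clamp(d_0, min=tau))
    # The weight acts as a constant during backpropagation.
    return w_sim.detach() * raw
```

Detaching the weight ensures it only rescales the reward gradient: leap trajectories close to the original sampling dynamics contribute more, without the weighting itself becoming an optimization target.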

Experimental Results

General Preference Alignment

General preference alignment on Flux with HPSv2.1 as the reward model. MixGRPO is fine-tuned jointly with HPSv2.1, PickScore, and ImageReward as reward models.

| Method | HPSv2.1 (in-domain) | HPSv3 | PickScore | UnifiedReward Alignment | UnifiedReward Image-Quality | ImageReward |
| --- | --- | --- | --- | --- | --- | --- |
| Flux (pretrained) | 0.3078 | 13.5020 | 22.7902 | 3.4514 | 3.5708 | 1.0455 |
| DanceGRPO (policy gradient) | 0.3451 | 14.8336 | 23.1186 | 3.4660 | 3.6199 | 1.2347 |
| MixGRPO (policy gradient) | 0.3692 | 14.7530 | 23.5184 | 3.4393 | 3.6241 | 1.6155 |
| ReFL (direct gradient) | 0.3852 | 15.5127 | 23.6299 | 3.4786 | 3.6870 | 1.3468 |
| DRaFT-LV (direct gradient) | 0.3859 | 15.3699 | 23.6437 | 3.4868 | 3.6887 | 1.3384 |
| DRTune (direct gradient) | 0.3882 | 15.5606 | 23.5185 | 3.4793 | 3.6679 | 1.3562 |
| LeapAlign (direct gradient) | 0.4092 | 15.7678 | 23.7137 | 3.4984 | 3.7244 | 1.5104 |

HPSv2.1 is the in-domain metric (the training reward); the remaining metrics are out-of-domain.

LeapAlign achieves the best scores on HPSv2.1, HPSv3, PickScore, UnifiedReward-Alignment, and UnifiedReward-Image-Quality. Although MixGRPO jointly uses three reward models, LeapAlign is trained only with HPSv2.1 and still obtains higher HPSv2.1 and PickScore scores while remaining competitive on ImageReward. Overall, LeapAlign delivers consistent in-domain and out-of-domain gains in human preference alignment, image-text consistency, and image quality.

Compositional Alignment

Compositional alignment on the GenEval benchmark with Flux as the base model and HPSv2.1 as the reward model.

| Method | Overall | Single Object (%) | Two Objects (%) | Counting (%) | Colors (%) | Position (%) | Attribute Binding (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Flux (pretrained) | 0.6535 | 99.38 | 86.62 | 66.88 | 74.47 | 19.50 | 45.25 |
| DanceGRPO (policy gradient) | 0.6775 | 99.38 | 90.15 | 69.38 | 76.33 | 22.25 | 49.00 |
| MixGRPO (policy gradient) | 0.7232 | 99.69 | 93.69 | 80.00 | 80.05 | 24.25 | 56.25 |
| ReFL (direct gradient) | 0.7011 | 99.38 | 92.68 | 69.06 | 75.80 | 26.75 | 57.00 |
| DRaFT-LV (direct gradient) | 0.7024 | 99.69 | 92.42 | 74.06 | 75.53 | 24.00 | 55.75 |
| DRTune (direct gradient) | 0.7101 | 99.38 | 93.69 | 73.12 | 76.86 | 27.50 | 55.50 |
| LeapAlign (direct gradient) | 0.7420 | 99.38 | 96.46 | 72.50 | 80.59 | 30.25 | 66.00 |

LeapAlign achieves the highest overall GenEval score (0.7420), compared with 0.7232 for MixGRPO, the strongest policy-gradient baseline, and 0.7101 for DRTune, the strongest direct-gradient baseline. It is particularly strong in the 'two objects', 'colors', 'position', and 'attribute binding' categories. MixGRPO can update early steps with policy gradients, and DRTune can also fine-tune early steps but discards critical gradient terms; LeapAlign's gap to both therefore indicates that fine-tuning early steps helps and that LeapAlign does so more effectively.

Comparison of reward improvement during fine-tuning on the compositional alignment task.
Comparison of GenEval score improvement during fine-tuning among ReFL, DRaFT-LV, DRTune, and LeapAlign.
Qualitative comparison on the GenEval benchmark.

Image Gallery

Conclusion

This paper introduces LeapAlign, a new post-training method that constructs two-step leap trajectories for efficient and stable reward gradient backpropagation. We find it useful to down-scale the large-magnitude gradient term and up-weight leap trajectories that are more similar to the original trajectories. LeapAlign addresses the challenge of propagating reward gradients to early generation steps without incurring excessive memory cost or sacrificing useful gradient terms. This is reflected by consistent improvements over existing post-training methods across a wide range of metrics, including general image preference and image-text alignment.

BibTeX

@misc{liang2026leapalign,
      title={LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories}, 
      author={Zhanhao Liang and Tao Yang and Jie Wu and Chengjian Feng and Liang Zheng},
      year={2026},
      eprint={2604.15311},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.15311}, 
}

References

[1] Xu, Jiazheng, et al. "ImageReward: Learning and evaluating human preferences for text-to-image generation." Advances in Neural Information Processing Systems 36 (2023): 15903-15935.

[2] Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." arXiv preprint arXiv:2309.17400 (2023).

[3] Wu, Xiaoshi, et al. "Deep reward supervisions for tuning text-to-image diffusion models." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.