OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

¹University of Science and Technology of China    ²Peking University    ³JD Explore Academy

Project leader, Corresponding author

Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective, multi-modal joint audio-video generation remains unexplored. Our analysis reveals that the primary obstacles to applying RL in this setting are: (i) multi-objective advantage inconsistency, where the advantages of the multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into the shallow audio layers responsible for intra-modal generation; and (iii) uniform credit assignment, where regions that matter for fine-grained cross-modal alignment are not explored efficiently.

To address these challenges, we propose OmniNFT, a modality-aware online diffusion RL framework with three key innovations: (1) modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches; (2) layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining them on cross-modal interaction layers; and (3) region-wise loss reweighting, which steers policy optimization toward regions critical to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone show consistent improvements over the base model; on JavisBench, for example, VQ rises from 2.038 to 3.326 and DeSync drops from 0.569 to 0.269.
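
Innovations (1) and (3) both act on the loss, so they can be pictured together. The sketch below is a toy illustration, not the OmniNFT implementation: it assumes group-normalized advantages (reward minus group mean, divided by group standard deviation), one reward per modality, and a simple mean-normalized boost for critical positions. All function names, reward values, the mask, and the `boost` parameter are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def routed_advantages(rewards):
    """One advantage list per modality, each computed from its own reward."""
    return {m: group_advantages(r) for m, r in rewards.items()}

def global_advantages(rewards):
    """Baseline: a single advantage from the summed reward, shared by all branches."""
    totals = [sum(per_sample) for per_sample in zip(*rewards.values())]
    shared = group_advantages(totals)
    return {m: shared for m in rewards}

def region_weights(critical_mask, boost=2.0):
    """Up-weight critical positions, then rescale so the weights average to 1."""
    raw = [boost if flag else 1.0 for flag in critical_mask]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Made-up rewards for 4 samples generated from the same prompt.
rewards = {"video": [0.9, 0.7, 0.3, 0.1], "audio": [0.2, 0.8, 0.4, 0.6]}
routed = routed_advantages(rewards)
shared = global_advantages(rewards)

# Sample 0 has the best video reward but the worst audio reward in the group.
print(routed["audio"][0] < 0)  # True: the routed audio advantage penalizes it
print(shared["audio"][0] > 0)  # True: a shared advantage would reinforce it

# Per-position loss weights for sample 0's audio branch:
# routed advantage times region weight (mask marks assumed sync-critical positions).
mask = [False, True, True, False]
weights = region_weights(mask)
audio_token_weights = [routed["audio"][0] * w for w in weights]
```

In this toy group, a single global advantage would push the audio branch toward a sample whose audio is the weakest, which is the inconsistency that routing is meant to avoid. How rewards that depend on both modalities are routed is not covered by this sketch.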

Core Idea

OmniNFT Framework

Figure: Core pipeline of OmniNFT, with credit assignment at the modality, layer, and region levels.

OmniNFT performs fine-grained credit assignment at three levels to overcome the limitations of vanilla RL fine-tuning with a single global advantage.

  • Modality-wise Advantage Routing: routes independent per-reward advantages to their respective modality branches.
  • Layer-wise Gradient Surgery: selectively detaches video-branch gradients on shallow audio layers.
  • Region-wise Loss Reweighting: concentrates optimization on critical audio-video sync regions.
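
Layer-wise gradient surgery can be pictured with plain per-layer gradient lists. The sketch below is an illustration under stated assumptions, not the OmniNFT code: in an autograd framework the same effect would typically come from detaching tensors, whereas here the video-loss contribution is simply dropped on the layers marked as shallow. The layer names, the choice of which layers count as shallow, and the gradient values are all hypothetical.

```python
def apply_gradient_surgery(audio_loss_grads, video_loss_grads, shallow_audio_layers):
    """Combine per-layer gradients for the audio branch.

    On layers listed in `shallow_audio_layers`, the contribution flowing back
    from the video loss is dropped (the "detach"); elsewhere both are summed.
    """
    combined = {}
    for layer, g_audio in audio_loss_grads.items():
        if layer in shallow_audio_layers:
            combined[layer] = list(g_audio)
        else:
            g_video = video_loss_grads.get(layer, [0.0] * len(g_audio))
            combined[layer] = [a + v for a, v in zip(g_audio, g_video)]
    return combined

# Hypothetical layer names and gradient values, for illustration only.
audio_loss_grads = {
    "audio.block0": [0.1, -0.2],
    "audio.block1": [0.0, 0.3],
    "audio.cross_attn": [0.5, 0.5],
}
video_loss_grads = {
    "audio.block0": [1.0, 1.0],
    "audio.block1": [-1.0, 2.0],
    "audio.cross_attn": [0.25, -0.25],
}
shallow = {"audio.block0", "audio.block1"}

combined = apply_gradient_surgery(audio_loss_grads, video_loss_grads, shallow)
print(combined["audio.block0"])      # [0.1, -0.2]: video-loss gradient removed
print(combined["audio.cross_attn"])  # [0.75, 0.25]: both losses contribute
```

The shallow audio layers are updated only by the audio objective, while the cross-modal layer still receives signal from both losses.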

Performance

Evaluation on JavisBench. Best results highlighted in blue, second-best underlined. (↑: higher is better; ↓: lower is better).

                        AV-Quality      Text-Consistency                  AV-Consistency       AV-Synchrony
Model             Size  VQ ↑    AQ ↑    TV-IB ↑  TA-IB ↑  CLIP ↑  CLAP ↑  AV-IB ↑  AVHScore ↑  JavisScore ↑  DeSync ↓
T2A + A2V
TempoTkn          1.3B  --      --      0.084    --       0.205   --      0.139    0.122       0.103         1.532
TPoS              1.0B  --      --      0.201    --       0.229   --      0.124    0.129       0.095         1.493
T2V + V2A
ReWaS             0.6B  --      --      --       0.123    --      0.280   0.110    0.104       0.079         1.071
See&Hear          0.4B  --      --      --       0.129    --      0.263   0.160    0.143       0.112         1.099
FoleyCrafter      1.2B  --      --      --       0.149    --      0.383   0.193    0.186       0.151         0.952
MMAudio           0.1B  --      --      --       0.160    --      0.407   0.198    0.182       0.150         0.849
T2AV
JavisDiT          3.1B  1.291   4.478   0.263    0.143    0.302   0.391   0.197    0.179       0.154         1.039
UniVerse-1        6.4B  1.357   4.839   0.272    0.111    0.309   0.245   0.104    0.098       0.077         0.929
JavisDiT++        2.1B  1.462   5.049   0.282    0.164    0.316   0.424   0.198    0.184       0.159         0.832
LTX-2             19B   2.038   5.197   0.272    0.170    0.311   0.412   0.232    0.223       0.192         0.569
T2AV + RL
LTX-2 + GDPO      19B   3.209   5.523   0.265    0.184    0.308   0.428   0.233    0.223       0.185         0.412
LTX-2 + OmniNFT   19B   3.326   5.715   0.261    0.189    0.310   0.445   0.262    0.257       0.220         0.269
Our RL Δ          --    +1.288  +0.518  -0.011   +0.019   -0.001  +0.033  +0.030   +0.034      +0.028        -0.300

Table 1.

  • Benchmark results on JavisBench across different settings with improvement (Δ, compared with base LTX-2)
  • VQ: Visual Quality, AQ: Audio Quality, TV-IB/TA-IB: Text-Video/Audio ImageBind, AV-IB: Audio-Video ImageBind
  • OmniNFT achieves the best result on 8 of the 10 metrics, with substantial gains in AV-Quality, AV-Consistency, and AV-Synchrony; TV-IB and CLIP dip slightly relative to base LTX-2 (-0.011 and -0.001)
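
The Δ row can be reproduced directly from the two LTX-2 rows of the table. The short check below copies those values; the variable names are illustrative, and treating DeSync as the only lower-is-better metric follows the arrows in the table header.

```python
METRICS = ["VQ", "AQ", "TV-IB", "TA-IB", "CLIP", "CLAP",
           "AV-IB", "AVHScore", "JavisScore", "DeSync"]
LOWER_IS_BETTER = {"DeSync"}

# Values copied from the "LTX-2" and "LTX-2 + OmniNFT" rows of Table 1.
base = [2.038, 5.197, 0.272, 0.170, 0.311, 0.412, 0.232, 0.223, 0.192, 0.569]
ours = [3.326, 5.715, 0.261, 0.189, 0.310, 0.445, 0.262, 0.257, 0.220, 0.269]

# Delta = OmniNFT minus base, rounded to the table's three decimals.
delta = {m: round(o - b, 3) for m, b, o in zip(METRICS, base, ours)}

# A metric improves if it moves in its preferred direction.
improved = [m for m in METRICS
            if delta[m] != 0 and (delta[m] < 0) == (m in LOWER_IS_BETTER)]
regressed = [m for m in METRICS if delta[m] != 0 and m not in improved]

print(delta["VQ"], delta["DeSync"])  # 1.288 -0.3
print(regressed)                     # ['TV-IB', 'CLIP']
```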

Generation Demos

Side-by-side comparison: LTX-2 (Baseline) vs. LTX-2 + OmniNFT (Ours). Prompts shown are abbreviated for display.

Paper Cases

LTX-2 (Baseline)
OmniNFT (Ours)
Prompt (Abbreviated)
In a medium close-up, a young woman with blonde shoulder-length hair stands in a lavender field under a twilight sky. She says, "告诉我这条光滑的绿色带子见证了多少年的沉重。" The audio features gentle, atmospheric singing establishing a calm and wistful mood.
A white rooster with a bright red comb stands on a weathered wooden beam. The brown chicken takes a few steps forward. The white rooster lifts its head and lets out a crow. The audio shows rural morning sounds with a rooster's sharp crow and softer clucking.
Two men face each other nose-to-nose in a tense confrontation within a dimly lit bank. The older man says, "我们才不会害怕残忍的流氓." The clown replies, "你知道吗?你让我想起了我的父亲,我恨我的父亲." The audio shows suffocating silence broken by tense dialogue.
SpongeBob and Patrick sit on a rock underwater. Patrick says: "Knowledge cannot replace friendship, SpongeBob!" SpongeBob responds: "I agree, but I also really want that Krabby Patty!" Cinematic 2D animation, soft underwater lighting, 8k resolution.
A professional basketball game in a brightly lit arena. Player number 30 dribbles up court, executes a crossover and fake, then leaps and shoots. The ball passes cleanly through the net. The crowd roars as the scoreboard updates. Audio: dribbling, shoe squeaks, "swish", and crowd cheering.

More Cases

LTX-2 (Baseline)
OmniNFT (Ours)
Prompt (Abbreviated)
A monk in orange robes kneels beside a man on a thin mat. The monk intones, "Beings who have not yet obtained liberation have unfixed natures and consciousnesses. Their bad habits reap karma; their good habits bring rewards." Soft, serene ambient soundscape creating a meditative atmosphere.
A young woman with long black hair in a wooden boat drifting on calm water surrounded by misty hills. She murmurs, "The orioles are singing on the island in the river. A beautiful lady is the ideal match for the gentleman." Audio: gentle water ripples and distant birdsong.
A woman with short brown hair in a gray wool coat holds a vintage camera outdoors. She says, "Photography isn't merely capturing moments—it's about seeing beyond surfaces, finding stories hidden within everyday life. Each click tells its own tale." Audio: gentle wind and distant city ambience.
An astronaut in a full white EMU spacesuit drifts weightlessly above Earth's cloud formations. The astronaut says, "我现在感觉很好。我们的星球是如此美丽。" Audio: soft clicks of a lighter, faint drag from smoking, and a quiet sigh.
A small anthropomorphic rodent-like creature with large ears dashes along an autumn forest path. It calls out, "Hedra has brought us to life with the most advanced character." Audio: a single line of clear dialogue with no background sounds.
A male anime character with spiky black hair in ornate armor stands in an outdoor arena under a turbulent sky with lightning. He declares, "You think you've seen pain? You think you know suffering? Try beat it." Audio: rolling thunder with deep, resonant voice.
Extreme close-up of a tabby cat's face. The cat's mouth opens slightly as if meowing, head shifting gently from side to side. Audio: soft and persistent cat mews repeating gently against a quiet background.
Extreme close-up of a cicada clinging to a warm orange-red wall, its patterned wings spread in sharp detail. Audio: continuous, steady cicada chirps with a clear, high-pitched, rhythmic buzzing quality.
A person in a beige hooded garment steps out of a white SUV in a forest clearing. Heavy rain begins to pour as they walk toward the tree line past a moss-covered rock near a stream. Audio: dense patter of heavy rain and steady rush of stream water.
High-angle extreme close-up of two hands with manicured nails pressing round, matte gray keys of a mint-green retro keyboard. Fingers move briskly with practiced rhythm. Audio: soft, steady keyboard typing with distinct, crisp clicks in a continuous pattern.
A man in a dark sweater and fingerless gloves glides on a longboard down a paved road through a rugged mountain landscape. He shifts weight in fluid motions, accelerating. Audio: sharp, rushing wind sound that grows in intensity with increasing speed.
In a hazy, rain-soaked city street, a rider on a motorcycle moves forward. The camera pushes in, descends to track the rear wheel displacing water, then rises and circles the motorcycle, revealing towering buildings disappearing into thick mist.
A tranquil seascape with a small boat, dark rock formations, and a crescent moon. Text: "DJI MAVIC 4 PRO, 28mm 哈苏相机". The camera zooms in, pans across the sunset-streaked sky, then circles and ascends to a bird's-eye view of the vast sea.
A female anime character with white hair streaked with blue holds a dark flask emitting cool blue glow in a dim library. She lowers her head with determination, waves her hand causing the glow to expand, then strides into shadows. Audio: delicate melody with whispers of wind and turning pages.
In a dense, misty forest, a character with braided hair crouches beside a glowing blue object with game UI overlays. She rises, moves stealthily, unleashes a lightning bolt at a mechanical beast, and a teammate ambushes from the opposite side. Audio: forest ambience interrupted by energy crackling and metallic screeches.
Five superhero figures stand amidst twisted metal and rubble under golden backlighting. The man in red-and-gold says, "I can't shake the feeling that this isn't over yet." The muscular man replies, "As long as we're still standing together, we can get through anything." Audio: soft wind through ruins.
An East Asian man on a rooftop says, "事情到这个地步,你还有退路吗?" The second man replies with quiet confidence, "退路?你觉得我会需要吗?" The coastal city skyline and sea blur behind them. Audio: quiet open-air ambiance with clear, tense dialogue.
In a hazy encampment, a man in dark robes grips a weapon beside a yellow banner with '安'. He speaks firmly: "那是我至亲至爱的师弟,得加钱。" An armored man on horseback responds: "好,你开个价吧。" Audio: steady flapping of banners with a low musical undertone.

Citation

@article{zhang2026omninft,
  title={OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation},
  author={Zhang, Guohui and Ma, XiaoXiao and Huang, Jie and Xu, Hang and Yu, Hu and Fu, Siming and Li, Yuming and Xue, Zeyue and Song, Lin and Huang, Haoyang and Duan, Nan and Zhao, Feng},
  journal={arXiv preprint arXiv:2605.12480},
  year={2026}
}