DanceOPD: Composing Multiple AI Image Generation Capabilities Into One Model

The Multi-Capability Problem in Image Generation

Modern image generation models must handle diverse tasks: text-to-image (T2I) synthesis, local attribute editing, and global style transformations. These capabilities rarely align naturally, and training for one often degrades another.

For example, fine-tuning a model for editing tasks tends to degrade its T2I performance. Similarly, global and local editing operations interfere with each other, creating a fundamental tension in multi-capability model training.

Existing approaches struggle with this composition challenge. Joint training across all capabilities leads to gradient conflicts, while model merging techniques produce inconsistent results. The field needs a principled framework for composing multiple generative capabilities into a single model without mutual degradation.

What is DanceOPD?

DanceOPD is a framework for on-policy generative field distillation in flow-matching models. Its core idea is to treat each capability as a velocity field defined over a shared flow state space.

Instead of forcing the student model to learn multiple conflicting objectives simultaneously, DanceOPD routes each training sample to exactly one capability field. The student then learns from fields queried on its own rollout states, composing expert capabilities without interference.
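The routing idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the field functions, the `task` key, and the uniform fallback are all hypothetical names chosen for the example, and real capability fields would be neural networks rather than toy functions.

```python
import random

# Hypothetical teacher fields. Each maps (state, t, condition) -> velocity,
# so all capabilities share one signature over the same flow state space.
def t2i_field(x, t, cond):
    return [-xi for xi in x]

def edit_field(x, t, cond):
    return [1.0 - xi for xi in x]

FIELDS = {"t2i": t2i_field, "edit": edit_field}

def route(sample, rng):
    """Hard routing: return exactly one capability field for this sample."""
    # Assumption: a sample may carry a task label; unlabeled samples are
    # drawn uniformly, matching the uniform routing the article mentions.
    name = sample.get("task") or rng.choice(sorted(FIELDS))
    return name, FIELDS[name]

rng = random.Random(0)
batch = [{"task": "t2i"}, {"task": "edit"}, {}]
routed = [route(sample, rng)[0] for sample in batch]
```

Each entry of `routed` names a single field, so no sample ever contributes a loss term for more than one capability.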

This formulation differs fundamentally from prior work like DiffusionOPD and Flow-OPD. While those methods use off-policy distillation, DanceOPD queries states from the student’s own trajectory, ensuring consistency between training and inference distributions.

Key Design Choices

Hard-Routed Sample-Wise Field Matching

DanceOPD uses hard routing rather than soft mixing when assigning samples to capability fields. Each training sample is assigned to exactly one field, preventing gradient conflicts between competing objectives.

This approach achieves a 15.2% improvement over soft mixing strategies. The key advantage is that each field receives clean, unambiguous gradient signals, allowing the student to learn each capability in isolation before composing them.
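A toy comparison shows why a blended target can send a muddier signal than a routed one. The sketch below assumes that "soft mixing" means regressing onto a weighted average of the field targets; the article does not specify the exact form, so treat this as one plausible reading. All function names are illustrative.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def hard_routed_loss(student_v, teacher_vs, field_index):
    """Loss against the single field this sample was routed to."""
    return mse(student_v, teacher_vs[field_index])

def soft_mixed_loss(student_v, teacher_vs, weights):
    """Loss against a weighted blend of all field targets (assumed form)."""
    dim = len(student_v)
    blended = [sum(w * v[i] for w, v in zip(weights, teacher_vs))
               for i in range(dim)]
    return mse(student_v, blended)

# Two toy fields that disagree completely at the queried state.
teacher_vs = [[1.0, 0.0], [-1.0, 0.0]]
student_v = [1.0, 0.0]  # the student already matches field 0 exactly

hard = hard_routed_loss(student_v, teacher_vs, field_index=0)
soft = soft_mixed_loss(student_v, teacher_vs, weights=[0.5, 0.5])
```

Here the routed loss is zero because the student matches its assigned field, while the blended target averages the two opposing velocities to zero and penalizes a student that is already correct for field 0.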

On-Policy Student-State Querying

Off-policy distillation uses teacher-generated states that may not align with the student’s actual generation trajectory. DanceOPD addresses this by querying states from the student’s own rollout.

This ensures the teacher’s guidance is always relevant to the student’s current capabilities, reducing distribution mismatch and improving learning efficiency.
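The following sketch illustrates on-policy querying with scalar toy fields. It assumes the common flow-matching convention in which t = 1 is pure noise and t = 0 is the clean sample, and it uses a plain Euler integrator; `make_point_field`, the step count, and the query time are hypothetical choices for illustration. In a real training loop the rollout would typically run without gradient tracking.

```python
def euler_rollout(velocity_fn, x_start, t_start, t_end, num_steps):
    """Integrate dx/dt = v(x, t) from t_start down to t_end with Euler steps."""
    x = list(x_start)
    dt = (t_start - t_end) / num_steps
    t = t_start
    for _ in range(num_steps):
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # step toward the clean end
        t -= dt
    return x

def make_point_field(x0):
    """Toy velocity field whose flow ends at the single data point x0."""
    def field(x, t):
        return [(xi - x0) / t for xi in x]
    return field

student = make_point_field(0.5)  # hypothetical student
teacher = make_point_field(1.0)  # hypothetical capability teacher

noise = [2.0]
t_query = 0.25

# The query state comes from the student's own rollout, not from the teacher.
x_query = euler_rollout(student, noise, t_start=1.0, t_end=t_query, num_steps=3)

# Both fields are evaluated at that student-generated state.
target = teacher(x_query, t_query)
pred = student(x_query, t_query)
loss = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

The teacher is only ever asked about states the student actually reaches, which is the property the on-policy formulation relies on.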

Semantic-Side Single Query

The framework issues a single query per sample, at a low-noise state near the clean end of the flow trajectory (low t). These states carry the most capability-specific information, because semantic content is better preserved at low noise levels.

Low-t querying yields a 23.7% improvement over high-noise alternatives, confirming that semantic-side states are critical for effective capability transfer.
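As a rough sketch of what a low-t query schedule could look like, the snippet below draws one timestep from a band near the clean end and builds the corresponding state on a linear interpolation path. The upper bound `t_max=0.3` and the uniform distribution are assumptions made for illustration; the article does not state the actual sampling range.

```python
import random

def sample_query_time(rng, t_max=0.3):
    """Draw one query timestep from the low-noise band (0, t_max]."""
    # 1 - rng.random() lies in (0, 1], which keeps t strictly positive.
    return t_max * (1.0 - rng.random())

def noisy_state(x0, eps, t):
    """Linear path: t = 0 is the clean sample, t = 1 is pure noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]

rng = random.Random(0)
t = sample_query_time(rng)

# With zero noise the state is just the clean sample scaled by (1 - t),
# so at least 70% of the clean signal survives anywhere in this band.
x_t = noisy_state([1.0, -1.0], [0.0, 0.0], t)
```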

The Velocity MSE Objective

DanceOPD uses a simple velocity MSE objective rather than KL-weighted alternatives. This choice is surprisingly effective and offers several practical advantages.

The plain MSE loss naturally absorbs operator-defined fields like classifier-free guidance (CFG). When the teacher uses CFG during training, the student learns to internalize this guidance directly into its velocity field, eliminating the need for external guidance at inference time.

This absorption capability means the distilled student can match teacher performance without requiring the same guidance configuration during sampling.
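To make the absorption argument concrete, here is a minimal sketch that builds a CFG-guided teacher target with the standard guidance formula and scores a student against it with plain velocity MSE. The vectors and the guidance scale are made-up toy values, and the function names are illustrative.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance applied to velocity predictions."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def velocity_mse(v_student, v_target):
    """Plain mean squared error between two velocity vectors."""
    return sum((s - t) ** 2 for s, t in zip(v_student, v_target)) / len(v_student)

# Toy teacher outputs at one state.
v_cond = [1.0, 2.0]
v_uncond = [0.0, 1.0]

# The guided field is the regression target; the student sees only this
# final velocity, not the two branches that produced it.
target = cfg_velocity(v_cond, v_uncond, scale=3.0)

loss_unguided_student = velocity_mse(v_cond, target)  # has not absorbed CFG
loss_absorbed_student = velocity_mse(target, target)  # reproduces guided field
```

Because the loss compares only final velocities, any operator applied to the teacher output (guidance here) becomes part of what the student regresses onto, so a single student forward pass can stand in for the two teacher passes.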

Experimental Results

T2I + Editing Composition

When composing text-to-image generation with editing capabilities, DanceOPD achieves a +8.1% improvement on GEditBench over the best OPD baseline. The model maintains T2I quality while gaining editing proficiency.

Qualitative results show the model can generate high-quality images from text prompts and apply diverse edits — from local attribute changes to global style transformations — without performance degradation in either task.

Local + Global Editing

The largest gains appear when composing local and global editing tasks. DanceOPD outperforms baselines by +16.1%, which suggests that hard-routed field matching handles conflicting editing objectives effectively.

The model can precisely modify local attributes (e.g., changing a dress color) while simultaneously supporting global transformations (e.g., converting a daytime scene to nighttime) on the same source image.

Realism-Field Absorption

DanceOPD absorbs a photorealism teacher field into the student model, achieving a +9.9% realism reward improvement. Remarkably, this closes 85.3% of the student-to-teacher gap.

This result demonstrates that the velocity field formulation can effectively transfer complex reward-based improvements without requiring the reward model at inference time.
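The two reported percentages can be related with simple arithmetic. The sketch below assumes that the +9.9% figure is a relative gain over the base student and that "gap closed" means the fraction of the base-student-to-teacher difference recovered; neither definition is spelled out in the article, and the normalized base score of 1.0 is purely illustrative.

```python
def gap_closed(base, distilled, teacher):
    """Fraction of the base-to-teacher reward gap recovered by distillation."""
    return (distilled - base) / (teacher - base)

base = 1.0                 # normalized base-student reward (illustrative)
distilled = base * 1.099   # +9.9% relative improvement, as reported

# If that gain closes 85.3% of the gap, the implied teacher score is:
implied_teacher = base + (distilled - base) / 0.853
teacher_lead = implied_teacher / base - 1.0
```

Under these assumptions the teacher would sit roughly 11.6% above the base student, so the distilled student ends up within about two percentage points of it.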

CFG Absorption

The framework can absorb classifier-free guidance during training and compose it with external guidance at inference. The effective guidance strength approximately follows the product of training and evaluation scales.

This compositional property enables flexible control over guidance strength without retraining, as the absorbed CFG field becomes part of the student’s learned velocity field.
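The product rule can be checked in an idealized setting. The sketch assumes that the student's conditional branch reproduces the teacher's guided velocity exactly and that its unconditional branch stays equal to the teacher's unconditional velocity. Both are simplifying assumptions; in practice the match is approximate, which is consistent with the article's wording that the relationship holds only approximately.

```python
def cfg(v_cond, v_uncond, scale):
    """Standard classifier-free guidance on scalar velocities."""
    return v_uncond + scale * (v_cond - v_uncond)

# Toy teacher velocities at one state.
v_cond, v_uncond = 2.0, 0.5
w_train, w_eval = 3.0, 1.5

# Idealized student: conditional branch has absorbed guidance at w_train,
# unconditional branch is unchanged (assumption).
student_cond = cfg(v_cond, v_uncond, w_train)
student_uncond = v_uncond

# External guidance applied on top of the student at inference time.
composed = cfg(student_cond, student_uncond, w_eval)

# Teacher guided directly at the product of the two scales.
direct = cfg(v_cond, v_uncond, w_train * w_eval)
```

Algebraically, u + w_eval * ((u + w_train * (c - u)) - u) simplifies to u + w_eval * w_train * (c - u), which is why `composed` and `direct` coincide in this idealized case.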

Ablation Studies

Comprehensive ablations validate each design choice in DanceOPD. Rollout steps, timestep queries, and trajectory queries all significantly affect performance.

The studies confirm that hard routing consistently outperforms soft mixing across all metrics. Longer training rollouts refine clean-side coverage but don’t automatically improve results — the rollout serves as a query-state generator, not a trajectory-compression target.

Timestep selection is critical: low-t queries significantly outperform mid-t and high-t alternatives, as semantic information is best preserved near the clean end of the flow trajectory.

Limitations and Future Work

DanceOPD assumes a shared field architecture across all capabilities. While this simplifies composition, it may limit the model’s ability to learn highly specialized representations for each task.

The current framework uses predefined routing with uniform probabilities. Adaptive routing strategies that dynamically assign samples based on difficulty or curriculum could further improve composition quality.

Future work could explore extending this framework to video generation, 3D synthesis, and other modalities where multi-capability composition is essential.

Conclusion

DanceOPD establishes a practical route for multi-capability generative model composition. By treating capabilities as velocity fields and using on-policy distillation with hard routing, the framework effectively composes T2I, local editing, and global editing without mutual degradation.

The key contributions — hard-routed sample-wise field matching, on-policy student-state querying, and the velocity MSE objective — provide a principled foundation for building unified generative models. The framework’s ability to absorb operator-defined fields like CFG and photorealism rewards further extends its practical utility.

With strong empirical results across multiple benchmarks, DanceOPD demonstrates that generative field distillation in flow-matching models is a viable path toward unified, multi-capability image generation systems.