Training a large language model is expensive. But here’s the part most people overlook: it’s not just the GPU-hours on the final model that burn through your budget. Choosing the right data mixture — the proportions of training data from different sources — can consume hundreds of GPU-hours before you even start the real training run.
A new paper from Tencent’s Hunyuan LLM team, FastMix, proposes a radical simplification: optimize data mixture weights using gradient descent on a single proxy model. No grid search, no hundreds of proxy runs — just one model learning the mixture as it trains.
The Data Mixture Problem
LLM training data is a cocktail of sources: web crawls, academic papers, code repositories, dialogue transcripts, and more. The ratio at which you blend these sources has an outsized impact on the final model’s capabilities. Get it wrong, and your model excels at code but fails at reasoning, or vice versa.
The current state of the art involves expensive proxy-based search:
- RegMix: Trains 512 proxy models under different mixtures, fits a regression model, and extrapolates the optimal ratios. Total cost: ~720 GPU-hours.
- CLIMB: Iteratively narrows the search space, training 64 proxy models. Total cost: ~72 GPU-hours.
- Manual tuning: Experts guess, train, evaluate, repeat. Slow, non-reproducible, and doesn’t scale.
FastMix asks: can we do this with one model?
What FastMix Does
FastMix is not a new LLM architecture. It’s an optimization process for discovering optimal data mixture ratios. Published at ICLR 2026, it comes from researchers at the University of Hong Kong, Tencent’s Hunyuan LLM team, and the Chinese University of Hong Kong.
The key insight is a mathematical reparameterization. Instead of treating mixture ratios as non-differentiable sampling probabilities (which you can’t backpropagate through), FastMix recasts them as continuous loss weights under uniform sampling. This makes the entire optimization problem differentiable — and amenable to standard gradient-based methods.
Code is available at github.com/hrtan/fastmix.
How It Works: Bilevel Optimization
FastMix frames mixture selection as a bilevel optimization problem with two interleaved loops:
Inner Loop: Train the Model
Given current mixture weights α, update the model’s parameters by minimizing a weighted sum of per-source training losses. Each source contributes its loss scaled by its current weight:
loss = Σᵢ αᵢ · L_train(Dᵢ, w)
In expectation, this matches sampling each source in proportion to its mixture ratio (up to a constant factor), but it is computed under uniform source sampling: each source is selected with equal probability, and the mixture ratio acts as a differentiable loss weight.
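To make the equivalence concrete, here is a minimal sketch with made-up numbers. The per-source losses, the weights, and all function names are hypothetical and are not taken from the FastMix code. It compares the weighted sum from the formula above with two Monte Carlo estimates: one that samples sources in proportion to α, and one that samples sources uniformly and scales each loss by its weight.

```python
import random

# Hypothetical per-source losses L_train(D_i, w) at a fixed w (made-up numbers).
source_losses = [2.0, 1.0, 4.0]
# Hypothetical mixture weights alpha (non-negative, summing to 1).
alpha = [0.5, 0.3, 0.2]


def weighted_objective(alpha, losses):
    """Sum_i alpha_i * L_i: the inner-loop loss from the formula above."""
    return sum(a * l for a, l in zip(alpha, losses))


def ratio_sampling_estimate(alpha, losses, n_draws, rng):
    """Draw source i with probability alpha_i and average the raw losses."""
    draws = rng.choices(range(len(losses)), weights=alpha, k=n_draws)
    return sum(losses[i] for i in draws) / n_draws


def uniform_weighted_estimate(alpha, losses, n_draws, rng):
    """Draw sources uniformly and scale each loss by k * alpha_i.

    The factor k (number of sources) only undoes the 1/k from uniform
    sampling; dropping it rescales the objective by a constant.
    """
    k = len(losses)
    draws = [rng.randrange(k) for _ in range(n_draws)]
    return sum(k * alpha[i] * losses[i] for i in draws) / n_draws


rng = random.Random(0)
exact = weighted_objective(alpha, source_losses)
by_ratio = ratio_sampling_estimate(alpha, source_losses, 200_000, rng)
by_uniform = uniform_weighted_estimate(alpha, source_losses, 200_000, rng)
print(exact, by_ratio, by_uniform)
```

Both estimates converge to the same weighted sum, but only the second keeps α inside the loss expression, which is what lets a gradient flow back to the mixture weights.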
Outer Loop: Update Mixture Weights
After training for a few steps, evaluate the model on a validation target and update the mixture weights via gradient descent. The gradient for each source’s weight is proportional to the alignment between:
- The validation gradient ∇_w L_val(w)
- The training gradient from source Dᵢ: ∇_w L_train(Dᵢ, w)
If a source’s training gradients point in the same direction as the validation gradient (positive dot product), its weight increases. If they conflict, its weight decreases. Sources that help the model improve on the target get more data; sources that hurt get less.
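The alignment rule above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: it assumes the mixture is parameterized by logits passed through a softmax, and it uses a first-order, one-step approximation in which the sensitivity of the validation loss to αᵢ is minus the inner learning rate times the dot product of the two gradients. The gradient vectors, the learning rates, and the function names are all made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def outer_step(logits, val_grad, source_grads, inner_lr, outer_lr):
    """One gradient-descent update on the mixture logits.

    Assumption: after one inner step w' = w - inner_lr * sum_i alpha_i * g_i,
    a first-order expansion gives
        dL_val/d alpha_i ~= -inner_lr * <val_grad, g_i>.
    """
    alpha = softmax(logits)
    d_alpha = [-inner_lr * dot(val_grad, g) for g in source_grads]
    # Chain rule through the softmax:
    # dL/d theta_j = alpha_j * (d_alpha_j - sum_i alpha_i * d_alpha_i)
    mean = dot(alpha, d_alpha)
    d_logits = [a * (d - mean) for a, d in zip(alpha, d_alpha)]
    return [t - outer_lr * g for t, g in zip(logits, d_logits)]


# Made-up gradients: source 0 aligns with the validation gradient,
# source 1 is orthogonal to it, and source 2 conflicts with it.
val_grad = [1.0, 0.0]
source_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

logits = [0.0, 0.0, 0.0]
for _ in range(50):
    logits = outer_step(logits, val_grad, source_grads, inner_lr=0.1, outer_lr=5.0)

alpha = softmax(logits)
print(alpha)
```

In this toy setup the aligned source ends up with the largest weight and the conflicting source with the smallest. In a real run the gradients change after every inner step, so the weights keep adapting as the model trains.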
Regularization: Preventing Collapse
Two tricks keep the optimization stable:
- Entropy regularization: A penalty term λ · Σᵢ αᵢ log αᵢ (the negative entropy of the mixture) discourages the mixture from collapsing to a narrow subset of sources. The coefficient λ is kept small (e.g., 10⁻⁵).
- Training loss as auxiliary target: The search objective combines validation loss with a fraction of training loss (weighted by β ≈ 0.1), reducing overfitting to the validation set.
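As a rough illustration of how the two regularizers might fit together, the sketch below adds both terms to a validation loss. The exact way the paper combines them is not spelled out here, so the additive form, the function names, and the loss values are assumptions.

```python
import math


def neg_entropy(alpha):
    """Sum_i alpha_i * log(alpha_i); most negative for the uniform mixture."""
    return sum(a * math.log(a) for a in alpha if a > 0.0)


def search_objective(val_loss, train_loss, alpha, beta=0.1, lam=1e-5):
    """Assumed form: L_val + beta * L_train + lam * sum_i alpha_i log alpha_i."""
    return val_loss + beta * train_loss + lam * neg_entropy(alpha)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

# Same (made-up) losses for both mixtures, so only the penalty differs.
print(search_objective(2.0, 3.0, uniform))
print(search_objective(2.0, 3.0, peaked))
```

With identical losses, the uniform mixture scores slightly lower than the peaked one, so minimizing the objective nudges the weights away from collapse. Because λ is tiny, the nudge only matters when the loss terms are close to indifferent between mixtures.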
Results: Faster and Better
The numbers speak for themselves. Here’s how FastMix compares on pre-training mixture optimization (1B model, 25B tokens, evaluated on 14 benchmarks):
| Method | Avg Score | Rank | GPU-Hours | Proxy Models |
|---|---|---|---|---|
| RegMix | 47.2 | 3 | 720.5 | 512 |
| CLIMB | 47.5 | 2 | 71.9 | 64 |
| FastMix | 48.2 | 1 | 1.3 | 1 |
FastMix uses roughly 550× fewer GPU-hours than RegMix and 55× fewer than CLIMB, while achieving the highest average score across the 14 benchmarks (and the best individual score on 9 of them).
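The speedup figures follow directly from the GPU-hours column of the table. A quick check, using only the numbers reported above:

```python
# GPU-hours from the pre-training table above.
gpu_hours = {"RegMix": 720.5, "CLIMB": 71.9, "FastMix": 1.3}

speedup_vs_regmix = gpu_hours["RegMix"] / gpu_hours["FastMix"]
speedup_vs_climb = gpu_hours["CLIMB"] / gpu_hours["FastMix"]

print(f"vs RegMix: {speedup_vs_regmix:.0f}x")
print(f"vs CLIMB:  {speedup_vs_climb:.0f}x")
```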
In post-training (SFT), the results are even more striking. Using a math-tuned mixture on Qwen2.5-Math-Instruct 7B:
| Method | Avg Score | GPU-Hours |
|---|---|---|
| CLIMB | 59.9 | 117.4 |
| RegMix | 58.3 | 115.9 |
| FastMix | 65.4 | 2.2 |
That’s a +5.5 point lead over CLIMB, and it generalizes beyond math to coding (LiveCodeBench) and STEM QA (GPQA-Diamond) — even though the optimization target was purely mathematical benchmarks.
Practical Tips from the Authors
The paper includes hard-won lessons from industrial deployment:
- Non-differentiable targets: When your metric is discrete (e.g., accuracy), use a differentiable proxy like SFT loss. Black-box gradient estimators (finite differences, SPSA) don’t converge reliably on real data.
- Small proxy models: Models under 0.5B parameters can be unstable and produce noisy mixture ratios. Use larger proxies when possible.
- Search target data length: Pre-training sequences are long; SFT data is short. Concatenate multiple SFT sequences to match pre-training sequence lengths, or the gradients will diverge and the optimization fails.
- Keep n₂ = 1: The outer-loop update horizon should be 1 step. Longer horizons require backpropagation through time, which is memory-prohibitive and unstable.
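The sequence-length tip can be illustrated with a simple greedy packer. This is a sketch of the general idea only: the separator token, the greedy policy, and the function name are assumptions, and the repository's own preprocessing may work differently.

```python
def pack_sequences(sequences, target_len, sep_token):
    """Greedily concatenate token-id lists into chunks of at most target_len.

    Each example is followed by sep_token (a hypothetical separator id).
    A single example longer than target_len is truncated.
    """
    packed, current = [], []
    for seq in sequences:
        piece = seq + [sep_token]
        # Start a new chunk if this example would overflow the current one.
        if current and len(current) + len(piece) > target_len:
            packed.append(current)
            current = []
        current.extend(piece[:target_len])
    if current:
        packed.append(current)
    return packed


# Four short made-up "SFT examples" packed toward a length of 8 tokens.
short_examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(short_examples, target_len=8, sep_token=0)
print(packed)
```

Packing this way brings the search-target sequences closer to the length the proxy model sees during pre-training, which is the mismatch the tip warns about.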
Quick Start
The repo includes scripts for both pre-training and SFT mixture optimization. Here’s the basic setup:
```bash
# Install dependencies
bash scripts/setup_env.sh

# Download sample data
cd preprocess && python download_dataset.py --dataset_name sail/regmix-data-sample
cd ..

# Preprocess into packed .bin shards
bash preprocess/run_preprocess.sh

# Run FastMix (validation target variant)
TRAIN_DATA_DIR=/path/to/welldata CUDA_VISIBLE_DEVICES=0 bash scripts/run_val.sh
```
The learned mixture weights are saved to checkpoints/<out_name>/FastMixtureOut/probs_module_step*.pt. Apply softmax to recover the sampling distribution over data sources.
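The softmax step itself is small. The sketch below uses hypothetical logits and source names in place of a real checkpoint, because the internal layout of the saved .pt file is not described here. In practice you would presumably load the file with PyTorch first and extract the per-source values from it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical stand-ins for the saved per-source parameters.
source_names = ["web", "code", "papers"]
saved_logits = [0.0, math.log(2.0), math.log(2.0)]

mixture = dict(zip(source_names, softmax(saved_logits)))
for name, prob in mixture.items():
    print(f"{name}: {prob:.3f}")
```

The resulting probabilities sum to 1 and can be used directly as sampling ratios for the full training run.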
Why This Matters
Data mixture optimization has been a bottleneck that only the largest labs could afford. In the pre-training comparison above, RegMix required 512 proxy models and about 720 GPU-hours; CLIMB needed 64 proxy models and about 72 GPU-hours.
FastMix reduces this to a single proxy model and about 1.3 GPU-hours in the same setup. That's not just faster: it makes mixture optimization accessible to teams without massive compute budgets. A single GPU and an afternoon can be enough to find a strong data mixture.
The approach works for both pre-training and post-training, and the authors note it could extend to data curriculum design and data source attribution. If you’re training an LLM and still hand-tuning your data mix, FastMix is worth a serious look.
Paper: arXiv:2606.14971 · Code: github.com/hrtan/fastmix · ICLR 2026