Introduction
Imagine dropping ink onto wet paper. At first, it’s just a blurry, random blob. But as you spread and guide it, shapes begin to emerge. A letter. A word. A sentence. That’s how Sumi-7B works — not writing left-to-right like conventional language models, but starting from pure noise and gradually shaping it into coherent text.
This is not your typical LLM. Sumi-7B is the first Uniform Diffusion Language Model (UDLM) trained from scratch at a meaningful scale: 7 billion parameters on 1.5 trillion tokens. And unlike autoregressive (AR) models that can only write in one direction without ever looking back, Sumi can revisit and revise any token at any step during generation. The paper, released by researchers at Tohoku University, marks a real milestone in non-autoregressive text generation.
Autoregressive vs. Diffusion vs. Uniform Diffusion
To understand why Sumi is special, let’s look at three different ways to generate text.
Autoregressive (AR) models like GPT-4 or Llama write like you’re drafting a letter by hand — once a word lands on the page, it stays. You cannot go back and fix a typo three sentences ago. Each new word depends only on everything you’ve written so far. This is simple and effective, but it’s inherently sequential and rigid.
Masked Diffusion models (like MDLM or LLaDA) work like paint-by-numbers. You start with a canvas where some spots are already filled in (the prompt) and others are blank (masked tokens). The model fills in the blanks over several steps. But here’s the catch: once a masked position is filled, it is sealed, and you can’t go back and change it.
Uniform Diffusion (Sumi’s approach) is different. Imagine spreading ink on paper again. Initially, every position on the canvas is filled with random noise. Then, over many steps, the model iteratively refines all positions at once, gradually reducing noise until clear text emerges. The key insight? Any token can be updated at any step. The model can start with a rough sketch and refine details everywhere simultaneously.
In technical terms: Sumi initializes a fixed-length canvas of random tokens, then applies a learned denoising process (the reverse diffusion) to transform noise into meaningful text.
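To make the idea concrete, here is a toy sketch of uniform-noise diffusion over token ids. It is not Sumi's sampler and not the GIDD posterior: `toy_denoiser` is a hypothetical stand-in that cheats by knowing the target text, and the "predict the clean text, then re-noise to a lower level" loop is one simple way to run a reverse process, chosen here only for illustration. What it does show is the property described above: every position starts as a random token, and every position can change at every step.

```python
import random

def corrupt_uniform(tokens, noise_level, vocab_size, rng):
    """Forward process: replace each token by a uniformly random one with
    probability `noise_level` (0.0 keeps the text, 1.0 gives pure noise)."""
    return [rng.randrange(vocab_size) if rng.random() < noise_level else tok
            for tok in tokens]

def toy_denoiser(canvas, noise_level, target, vocab_size):
    """Hypothetical stand-in for the trained network. Returns one probability
    distribution over the vocabulary per position. It cheats by knowing
    `target`; a real model would have to infer it from `canvas`."""
    rows = []
    for position in range(len(canvas)):
        row = [noise_level / vocab_size] * vocab_size
        row[target[position]] += 1.0 - noise_level
        rows.append(row)
    return rows

def denoise(target, vocab_size, num_steps, rng):
    """Reverse process: start from random tokens and revisit every position
    at every step. Use num_steps >= 2 so the last step is informative."""
    canvas = [rng.randrange(vocab_size) for _ in target]  # pure noise
    history = [list(canvas)]
    for step in range(num_steps, 0, -1):
        probs = toy_denoiser(canvas, step / num_steps, target, vocab_size)
        if step == 1:
            # Final step: commit to the most likely token at each position.
            canvas = [row.index(max(row)) for row in probs]
        else:
            # Guess the clean text, then re-noise it to the next, lower level.
            guess = [rng.choices(range(vocab_size), weights=row)[0] for row in probs]
            canvas = corrupt_uniform(guess, (step - 1) / num_steps, vocab_size, rng)
        history.append(list(canvas))
    return canvas, history

rng = random.Random(0)
target = [3, 1, 4, 1, 5, 9, 2, 6]
final, history = denoise(target, vocab_size=10, num_steps=8, rng=rng)
for step, snapshot in enumerate(history):
    print(step, snapshot)
```

Printing the history shows tokens flipping in early steps and settling as the noise level drops, which is the "ink spreading into shapes" picture in miniature.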
Who Built Sumi?
Sumi comes from Tohoku University in Japan, built by a team of six researchers. The training infrastructure? 288 NVIDIA H100 GPUs, running for a total of 43,308 GPU-hours. That’s roughly $400K-$500K in cloud compute costs at retail pricing, an estimate that implies a rate of about $9-$12 per GPU-hour.
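The compute figures are easy to sanity-check. The short calculation below uses only the numbers quoted above and rests on two assumptions: that all 288 GPUs ran concurrently for the whole run, and that the dollar range is a rough estimate rather than a quoted price.

```python
gpus = 288
gpu_hours = 43_308

# Assumption: all GPUs were busy for the full duration of training.
wall_clock_hours = gpu_hours / gpus
wall_clock_days = wall_clock_hours / 24

# Hourly rate implied by the $400K-$500K estimate.
implied_rate_low = 400_000 / gpu_hours
implied_rate_high = 500_000 / gpu_hours

print(f"wall clock: {wall_clock_hours:.1f} h (~{wall_clock_days:.1f} days)")
print(f"implied rate: ${implied_rate_low:.2f}-${implied_rate_high:.2f} per GPU-hour")
```

Under those assumptions the run takes a little over six days of wall-clock time.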
Everything is fully open: Apache 2.0 license, model weights, intermediate checkpoints, data mixture details, and the full training recipe.
Architecture at a Glance
Sumi-7B uses a familiar LLaMA-style backbone with:
- 7 billion parameters across 36 transformer layers, hidden size 4096
- SwiGLU activation (gated linear unit with Swish)
- Grouped-Query Attention (GQA): 32 attention heads, 8 key-value heads
- Rotary Position Embeddings (RoPE) with extended theta = 500,000
- RMSNorm instead of LayerNorm
- OLMo 3 tokenizer with 100,278 vocabulary size
The model uses the GIDD framework with SNR reparameterization and Megatron-LM for distributed training.
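A few derived quantities follow directly from the numbers in the list. The sketch below is not the model's actual config file: `BackboneConfig` and `rope_inv_freq` are hypothetical names, and the head dimension assumes the usual LLaMA convention of `hidden_size / num_attention_heads`, which the article does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackboneConfig:
    # Values quoted in the article.
    num_layers: int = 36
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_kv_heads: int = 8
    rope_theta: float = 500_000.0
    vocab_size: int = 100_278

    @property
    def head_dim(self) -> int:
        # Assumption: LLaMA-style head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # GQA: how many query heads share one key-value head.
        return self.num_attention_heads // self.num_kv_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of each key (or value) projection under GQA.
        return self.num_kv_heads * self.head_dim

def rope_inv_freq(cfg: BackboneConfig) -> list:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    half = cfg.head_dim // 2
    return [cfg.rope_theta ** (-2 * i / cfg.head_dim) for i in range(half)]

cfg = BackboneConfig()
freqs = rope_inv_freq(cfg)
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.kv_proj_width)
print(freqs[0], freqs[-1])
```

With 8 key-value heads serving 32 query heads, the key and value projections are a quarter of the width they would be under full multi-head attention.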
Training Data — The Education-Heavy Mix
Sumi was trained on roughly 1.5 trillion tokens (1.3 trillion plus 250 billion) in two phases:
- Pre-training: 1.3 trillion tokens from llm-jp-corpus-v4, heavily filtered for educational content using FineWeb-Edu scoring
- Mid-training: 250 billion additional tokens with a targeted composition
Mid-training breakdown: 81.4B code (32.5%), 74.3B math (29.7%), 52.4B general (21.0%), 42.0B reasoning (16.8%).
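The percentages can be reproduced from the token counts. This snippet uses only the figures quoted above; the dictionary keys are just labels chosen for the example.

```python
# Mid-training token counts in billions, as quoted in the article.
mid_training_b = {"code": 81.4, "math": 74.3, "general": 52.4, "reasoning": 42.0}

mid_total_b = sum(mid_training_b.values())
shares = {name: round(100 * count / mid_total_b, 1)
          for name, count in mid_training_b.items()}

pretraining_b = 1300.0  # 1.3 trillion tokens
overall_b = pretraining_b + mid_total_b

print(f"mid-training total: {mid_total_b:.1f}B")
print(shares)
print(f"overall: {overall_b / 1000:.2f}T tokens")
```

Code and math together make up a bit over 62% of the mid-training mix, which fits the benchmark profile discussed next.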

Benchmark Results
Here’s Sumi-7B compared against other 7B-class models:
| Benchmark | Sumi-7B | Falcon-7B | Llama 2-7B | OLMo-7B |
|---|---|---|---|---|
| MMLU (5-shot) | 51.1 | 27.2 | 46.0 | 28.0 |
| HumanEval (0-shot) | 22.6 | 0.0 | 12.8 | 13.4 |
| GSM8K (4-shot) | 32.8 | 5.3 | 13.5 | 3.8 |
| PIQA (0-shot) | 66.4 | 80.5 | 78.7 | 79.8 |
| HellaSwag (0-shot) | 60.0 | 76.3 | 76.2 | 75.6 |
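The pattern is clear. Sumi leads on MMLU, HumanEval, and GSM8K, beating the best baseline by 5.1, 9.2, and 19.3 points respectively; on GSM8K it more than doubles Llama 2-7B’s score. But it trails the best baseline by 14.1 points on PIQA and 16.3 points on HellaSwag, both commonsense benchmarks.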
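For readers who want the gaps spelled out, here is a small script that computes Sumi's margin over the strongest of the three baselines on each benchmark. The scores are copied from the table; `margin_vs_best_baseline` is a helper written for this example.

```python
# Scores copied from the benchmark table.
scores = {
    "MMLU (5-shot)":      {"Sumi-7B": 51.1, "Falcon-7B": 27.2, "Llama 2-7B": 46.0, "OLMo-7B": 28.0},
    "HumanEval (0-shot)": {"Sumi-7B": 22.6, "Falcon-7B": 0.0,  "Llama 2-7B": 12.8, "OLMo-7B": 13.4},
    "GSM8K (4-shot)":     {"Sumi-7B": 32.8, "Falcon-7B": 5.3,  "Llama 2-7B": 13.5, "OLMo-7B": 3.8},
    "PIQA (0-shot)":      {"Sumi-7B": 66.4, "Falcon-7B": 80.5, "Llama 2-7B": 78.7, "OLMo-7B": 79.8},
    "HellaSwag (0-shot)": {"Sumi-7B": 60.0, "Falcon-7B": 76.3, "Llama 2-7B": 76.2, "OLMo-7B": 75.6},
}

def margin_vs_best_baseline(row, model="Sumi-7B"):
    """Score of `model` minus the best score among the other models."""
    best_other = max(value for name, value in row.items() if name != model)
    return round(row[model] - best_other, 1)

margins = {bench: margin_vs_best_baseline(row) for bench, row in scores.items()}
for bench, margin in margins.items():
    print(f"{bench:20s} {margin:+.1f}")
```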
The Canvas Problem
One quirk of Sumi is its canvas length. Unlike AR models, which keep appending tokens until they decide to stop, Sumi allocates a fixed-length canvas at the start and fills it. The researchers found that a canvas length of 2048 is the sweet spot. At other lengths, perplexity rises sharply, especially on GSM8K.
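What a fixed-length canvas means in practice can be sketched in a few lines. Everything here is an assumption made for illustration, not Sumi's documented behavior: the helper names are hypothetical, the prompt is assumed to occupy the first positions of the canvas, and the output is assumed to be cut at the first end-of-sequence token.

```python
import random

def init_canvas(prompt_ids, canvas_len, vocab_size, rng):
    """Allocate the whole canvas up front: prompt tokens, then random tokens.
    Assumption: the prompt sits at the start of the canvas."""
    if len(prompt_ids) > canvas_len:
        raise ValueError("prompt does not fit on the canvas")
    noise = [rng.randrange(vocab_size) for _ in range(canvas_len - len(prompt_ids))]
    return list(prompt_ids) + noise

def trim_output(canvas, prompt_len, eos_id):
    """Assumption: the answer ends at the first EOS after the prompt."""
    generated = canvas[prompt_len:]
    if eos_id in generated:
        return generated[:generated.index(eos_id)]
    return generated

rng = random.Random(0)
prompt = [11, 42, 7]
canvas = init_canvas(prompt, canvas_len=2048, vocab_size=100_278, rng=rng)
print(len(canvas), canvas[:3])
```

The trade-off is visible even in this toy: the canvas size is committed before the model knows how long the answer should be, so a canvas that is too short cuts the answer off and one that is too long leaves many positions to fill.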

Self-Correction? Not Really
The researchers ran a revision budget study. Between 58% and 100% of edit steps resulted in token overwrites, but net token change was only 0.1-1% and accuracy remained flat. The model flips tokens back and forth (A→B→A round trips) without real semantic improvement.
However, the model exhibits a self-organized commitment order during denoising — some positions commit to final values before others.
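The gap between "many overwrites" and "little net change" is easier to see with a concrete trajectory. The sketch below defines the quantities in one plausible way; the paper's exact definitions are not given here, so treat `revision_stats` and its metrics as illustrative assumptions. It also records a commitment step per position, meaning the last step at which that position changed.

```python
def revision_stats(history):
    """`history` is a list of canvases (lists of token ids), one per step."""
    overwrites = 0   # a position holds a different token than one step earlier
    round_trips = 0  # A -> B -> A over three consecutive steps
    for prev, cur in zip(history, history[1:]):
        overwrites += sum(a != b for a, b in zip(prev, cur))
    for first, second, third in zip(history, history[1:], history[2:]):
        round_trips += sum(x == z != y for x, y, z in zip(first, second, third))

    # Net change: positions whose final token differs from the initial one.
    net_changed = sum(a != b for a, b in zip(history[0], history[-1]))

    # Commitment step: the last step at which each position changed (0 = never).
    commit_step = []
    for pos in range(len(history[0])):
        last_change = 0
        for step in range(1, len(history)):
            if history[step][pos] != history[step - 1][pos]:
                last_change = step
        commit_step.append(last_change)

    return {"overwrites": overwrites, "round_trips": round_trips,
            "net_changed": net_changed, "commit_step": commit_step}

history = [
    [5, 7, 2, 9],
    [5, 8, 2, 9],  # position 1: 7 -> 8
    [5, 7, 2, 4],  # position 1: 8 -> 7 (round trip), position 3: 9 -> 4
    [5, 7, 2, 4],
]
stats = revision_stats(history)
print(stats)
```

In this toy trajectory there are three overwrites but only one position ends up different from where it started, because position 1 flips to another token and back again.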

How to Use Sumi
You can run Sumi using Hugging Face Transformers:

    from transformers import AutoModel, AutoTokenizer

    model_id = 'tohoku-nlp/sumi-7b'
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    # Assumption: the repository's custom code registers under AutoModel;
    # check the model card for the exact auto class to use.
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    # generate() uses canvas-based denoising, not left-to-right decoding

Note: trust_remote_code=True is required. The generate() method denoises a fixed-length canvas, not left-to-right.
What’s Next?
The team has put a supervised fine-tuned (SFT) version on the roadmap. Open challenges include instruction tuning for diffusion LMs, adaptive canvases, and better data mixing to improve commonsense without sacrificing knowledge and coding performance.
Sumi proves that the dominant autoregressive paradigm isn’t the only path. The ink is still spreading.
